Is Contrastive Distillation Enough for Learning Comprehensive 3D Representations?

December 12, 2024 ยท Declared Dead ยท ๐Ÿ› arXiv.org

๐Ÿ’€ CAUSE OF DEATH: 404 Not Found
Code link is broken/dead
Authors Yifan Zhang, Junhui Hou arXiv ID 2412.08973 Category cs.CV: Computer Vision Cross-listed cs.AI Citations 1 Venue arXiv.org Repository https://github.com/Eaphan/CMCR Last Checked 2 months ago
Abstract
Cross-modal contrastive distillation has recently been explored for learning effective 3D representations. However, existing methods focus primarily on modality-shared features, neglecting the modality-specific features during the pre-training process, which leads to suboptimal representations. In this paper, we theoretically analyze the limitations of current contrastive methods for 3D representation learning and propose a new framework, namely CMCR (Cross-Modal Comprehensive Representation Learning), to address these shortcomings. Our approach improves upon traditional methods by better integrating both modality-shared and modality-specific features. Specifically, we introduce masked image modeling and occupancy estimation tasks to guide the network in learning more comprehensive modality-specific features. Furthermore, we propose a novel multi-modal unified codebook that learns an embedding space shared across different modalities. Besides, we introduce geometry-enhanced masked image modeling to further boost 3D representation learning. Extensive experiments demonstrate that our method mitigates the challenges faced by traditional approaches and consistently outperforms existing image-to-LiDAR contrastive distillation methods in downstream tasks. Code will be available at https://github.com/Eaphan/CMCR.
Community shame:
Not yet rated
Community Contributions

Found the code? Know the venue? Think something is wrong? Let us know!

๐Ÿ“œ Similar Papers

In the same crypt โ€” Computer Vision

Died the same way โ€” ๐Ÿ’€ 404 Not Found