DINOv2 Meets Text: A Unified Framework for Image- and Pixel-Level Vision-Language Alignment

December 20, 2024 · Declared Dead · 🏛 Computer Vision and Pattern Recognition

👻 CAUSE OF DEATH: Ghosted
No code link whatsoever

"No code URL or promise found in abstract"

Evidence collected by the PWNC Scanner

Authors: Cijo Jose, Théo Moutakanni, Dahyun Kang, Federico Baldassarre, Timothée Darcet, Hu Xu, Daniel Li, Marc Szafraniec, Michaël Ramamonjisoa, Maxime Oquab, Oriane Siméoni, Huy V. Vo, Patrick Labatut, Piotr Bojanowski
arXiv ID: 2412.16334
Category: cs.CV (Computer Vision)
Citations: 46
Venue: Computer Vision and Pattern Recognition
Last Checked: 3 months ago
Abstract
Self-supervised visual foundation models produce powerful embeddings that achieve remarkable performance on a wide range of downstream tasks. However, unlike vision-language models such as CLIP, self-supervised visual features are not readily aligned with language, hindering their adoption in open-vocabulary tasks. Our method, named dino.txt, unlocks this new ability for DINOv2, a widely used self-supervised visual encoder. We build upon the LiT training strategy, which trains a text encoder to align with a frozen vision model but leads to unsatisfactory results on dense tasks. We propose several key ingredients to improve performance on both global and dense tasks, such as concatenating the [CLS] token with the patch average to train the alignment and curating data using both text and image modalities. With these, we successfully train a CLIP-like model with only a fraction of the computational cost compared to CLIP while achieving state-of-the-art results in zero-shot classification and open-vocabulary semantic segmentation.
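The abstract's central ingredient, concatenating the [CLS] token with the patch average before aligning it with text, can be illustrated with a minimal sketch. This is not the authors' implementation: the function names (`mean_pool`, `image_descriptor`, `cosine_similarity`), the toy 2-dimensional tokens, and the choice of cosine similarity as the matching score are all assumptions made for illustration. The frozen vision encoder and the trained text encoder are replaced by hand-written vectors.

```python
import math

def mean_pool(patch_tokens):
    """Average a list of patch embeddings (each a list of floats)."""
    n = len(patch_tokens)
    dim = len(patch_tokens[0])
    return [sum(tok[d] for tok in patch_tokens) / n for d in range(dim)]

def image_descriptor(cls_token, patch_tokens):
    """Concatenate the [CLS] embedding with the patch average.

    The result has twice the dimension of a single token, so the text
    embedding it is compared with must have that doubled dimension too.
    """
    return list(cls_token) + mean_pool(patch_tokens)

def l2_normalize(vec):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return list(vec) if norm == 0.0 else [x / norm for x in vec]

def cosine_similarity(a, b):
    """Cosine similarity, assumed here as the image-text matching score."""
    a, b = l2_normalize(a), l2_normalize(b)
    return sum(x * y for x, y in zip(a, b))

# Toy stand-ins for the outputs of a frozen vision encoder (hypothetical values).
cls_token = [1.0, 0.0]
patch_tokens = [[0.0, 2.0], [2.0, 0.0]]

image_vec = image_descriptor(cls_token, patch_tokens)

# Toy stand-in for a text embedding of matching (doubled) dimension.
text_vec = [1.0, 0.0, 1.0, 1.0]
score = cosine_similarity(image_vec, text_vec)
```

In a LiT-style setup as described in the abstract, only the text side would be trained to raise this score for matching image-text pairs while the vision features stay fixed; the training loop itself is omitted here.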

📜 Similar Papers

In the same crypt: Computer Vision

🌅 Old Age

Fast R-CNN

Ross Girshick

cs.CV · 🏛 ICCV · 📚 27.7K cites · 11 years ago

Died the same way: 👻 Ghosted