VILLS -- Video-Image Learning to Learn Semantics for Person Re-Identification

November 27, 2023 · Declared Dead · 🏛 IEEE Workshop/Winter Conference on Applications of Computer Vision

"No code URL or promise found in abstract"

Evidence collected by the PWNC Scanner

Authors: Siyuan Huang, Ram Prabhakar, Yuxiang Guo, Rama Chellappa, Cheng Peng
arXiv ID: 2311.17074
Category: cs.CV (Computer Vision)
Citations: 3
Venue: IEEE Workshop/Winter Conference on Applications of Computer Vision
Last Checked: 3 months ago

Abstract

Person re-identification is a research area with significant real-world applications. Despite recent progress, existing methods struggle with robust re-identification in the wild, e.g., because they focus on a single modality or rely on unreliable patterns such as clothing. A generalized method is highly desired but remains elusive due to issues such as the trade-off between spatial and temporal resolution and imperfect feature extraction. We propose VILLS (Video-Image Learning to Learn Semantics), a self-supervised method that jointly learns spatial and temporal features from images and videos. VILLS first designs a local semantic extraction module that adaptively extracts semantically consistent and robust spatial features. Then, VILLS designs a unified feature learning and adaptation module to represent image and video modalities in a consistent feature space. By leveraging self-supervised, large-scale pre-training, VILLS establishes a new state of the art that significantly outperforms existing image-based and video-based methods.