Enhancing Modality Representation and Alignment for Multimodal Cold-start Active Learning

December 12, 2024 · Declared Dead · 🏛 ACM Multimedia Asia

"No code URL or promise found in abstract"

Evidence collected by the PWNC Scanner

Authors: Meng Shen, Yake Wei, Jianxiong Yin, Deepu Rajan, Di Hu, Simon See
arXiv ID: 2412.09126
Category: cs.MM: Multimedia
Cross-listed: cs.AI, cs.LG
Citations: 1
Venue: ACM Multimedia Asia
Last Checked: 3 months ago

Abstract

Training multimodal models requires a large amount of labeled data. Active learning (AL) aims to reduce labeling costs. Most AL methods employ warm-start approaches, which rely on sufficient labeled data to train a well-calibrated model that can assess the uncertainty and diversity of unlabeled data. However, when assembling a dataset, labeled data are often scarce initially, leading to a cold-start problem. Additionally, most AL methods seldom address multimodal data, highlighting a research gap in this field. Our research addresses these issues by developing a two-stage method for Multi-Modal Cold-Start Active Learning (MMCSAL). Firstly, we observe the modality gap, a significant distance between the centroids of representations from different modalities, when only cross-modal pairing information is used as the self-supervision signal. This modality gap affects the data selection process, as we calculate both uni-modal and cross-modal distances. To address this, we introduce uni-modal prototypes to bridge the modality gap. Secondly, conventional AL methods often falter in multimodal scenarios where alignment between modalities is overlooked. Therefore, we propose enhancing cross-modal alignment through regularization, thereby improving the quality of the multimodal data pairs selected in AL. Finally, our experiments demonstrate MMCSAL's efficacy in selecting multimodal data pairs across three multimodal datasets.
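
The abstract describes the modality gap as the distance between the centroids of representations from different modalities. A minimal sketch of that measurement follows; it is an illustration under stated assumptions, not the authors' implementation. The function names, the L2 normalization step, the choice of Euclidean distance, and the synthetic embeddings are all hypothetical and do not come from the paper.

```python
import numpy as np


def l2_normalize(x, eps=1e-12):
    """Scale each row to unit length (assumed preprocessing, not from the paper)."""
    norms = np.linalg.norm(x, axis=1, keepdims=True)
    return x / np.maximum(norms, eps)


def modality_gap(emb_a, emb_b):
    """Euclidean distance between the centroids of two embedding sets."""
    centroid_a = l2_normalize(emb_a).mean(axis=0)
    centroid_b = l2_normalize(emb_b).mean(axis=0)
    return float(np.linalg.norm(centroid_a - centroid_b))


# Synthetic stand-ins for two modalities' embeddings (hypothetical data):
# the two sets are shifted in opposite directions to mimic a gap.
rng = np.random.default_rng(0)
image_emb = rng.normal(size=(256, 16)) + 2.0
text_emb = rng.normal(size=(256, 16)) - 2.0

gap_shifted = modality_gap(image_emb, text_emb)  # separated centroids
gap_same = modality_gap(image_emb, image_emb)    # identical sets, no gap

print(f"gap between shifted modalities: {gap_shifted:.3f}")
print(f"gap between identical sets:     {gap_same:.3f}")
```

A large value relative to typical within-modality distances would suggest that cross-modal distances are dominated by the offset between modalities rather than by sample-level similarity, which is the concern the abstract raises for data selection.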