R.I.P.
Ghosted
Look, Listen and Segment: Towards Weakly Supervised Audio-visual Semantic Segmentation
March 23, 2026 · Grace Period · ICASSP 2026
Authors
Chengzhi Li, Heyan Huang, Ping Jian, Yanghao Zhou
arXiv ID
2603.21948
Category
cs.MM: Multimedia
Citations
0
Venue
ICASSP 2026
Abstract
Audio-Visual Semantic Segmentation (AVSS) aligns audio and video at the pixel level but requires costly per-frame annotations. We introduce Weakly Supervised Audio-Visual Semantic Segmentation (WSAVSS), which uses only video-level labels to generate per-frame semantic masks of sounding objects. We decompose WSAVSS into looking, listening, and segmentation, and propose Progressive Cross-modal Alignment for Semantics (PCAS) with two modules: *Looking-before-Listening* and *Listening-before-Segmentation*. PCAS builds a classification task to train the audio-visual encoder using video labels, injects visual semantic prompts to enhance frame-level audio understanding, and then applies progressive contrastive alignment to map audio categories to image regions without mask annotations. Experiments show PCAS achieves state-of-the-art performance among weakly supervised methods on AVS and remains competitive with fully supervised baselines on AVSS, validating its effectiveness.
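As a rough illustration of the contrastive-alignment idea the abstract mentions (mapping an audio category to image regions without mask annotations), the following is a minimal sketch and not the authors' PCAS implementation. The function names `info_nce` and `similarity_mask`, the temperature, the threshold, and the toy 2-D embeddings are all assumptions made for illustration; a real system would obtain these embeddings from trained audio and visual encoders.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v + 1e-8)

def info_nce(audio_emb, region_embs, positive_idx, temperature=0.1):
    """InfoNCE-style loss (assumed form): low when the audio embedding is
    closer to the positive region than to the other regions."""
    logits = [cosine(audio_emb, r) / temperature for r in region_embs]
    m = max(logits)  # subtract the max for a numerically stable log-sum-exp
    log_denom = m + math.log(sum(math.exp(l - m) for l in logits))
    return log_denom - logits[positive_idx]

def similarity_mask(audio_emb, pixel_feats, threshold=0.5):
    """Binary mask: 1 where a pixel feature aligns with the audio embedding."""
    return [
        [1 if cosine(audio_emb, feat) > threshold else 0 for feat in row]
        for row in pixel_feats
    ]

# Toy data (hypothetical): one audio embedding and a 2x2 grid of pixel features.
audio = [1.0, 0.0]
pixels = [
    [[1.0, 0.0], [0.0, 1.0]],
    [[0.8, 0.2], [0.1, 0.9]],
]
mask = similarity_mask(audio, pixels)  # [[1, 0], [1, 0]]
```

In this toy setup the left column of pixels is aligned with the audio embedding, so only those pixels are marked as sounding. Under weak supervision, the positive region for `info_nce` would have to come from video-level labels rather than from ground-truth masks.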
Similar Papers
In the same crypt: Multimedia
Old Age
Quality Assessment of In-the-Wild Videos
R.I.P.
Ghosted
Viewport-Adaptive Navigable 360-Degree Video Delivery
R.I.P.
Ghosted
A Comprehensive Survey on Cross-modal Retrieval
R.I.P.
Ghosted
An Overview of Cross-media Retrieval: Concepts, Methodologies, Benchmarks and Challenges
R.I.P.
Ghosted