R.I.P.
👻
Ghosted
Diffusion-Link: Diffusion Probabilistic Model for Bridging the Audio-Text Modality Gap
October 13, 2025 · Declared Dead · 🏛 arXiv.org
Authors
KiHyun Nam, Jongmin Choi, Hyeongkeun Lee, Jungwoo Heo, Joon Son Chung
arXiv ID
2510.11330
Category
cs.SD: Sound
Cross-listed
cs.AI, cs.CL, cs.LG, eess.AS
Citations
1
Venue
arXiv.org
Repository
https://github.com/DevKiHyun/Diffusion-Link
⭐ 1
Last Checked
1 month ago
Abstract
Contrastive audio-language pretraining yields powerful joint representations, yet a persistent audio-text modality gap limits the benefits of coupling multimodal encoders with large language models (LLMs). We present Diffusion-Link, a diffusion-based modality-bridging module that generatively maps audio embeddings into the text-embedding distribution. The module is trained on the output embeddings of the frozen multimodal encoder and implemented as a lightweight network with three residual MLP blocks. To assess the effect of Diffusion-Link on multimodal encoder-LLM coupling, we evaluate on Automatic Audio Captioning (AAC); to our knowledge, this is the first application of diffusion-based modality bridging to AAC. We report two results. (1) Modality-gap analysis: on similarity and geometric criteria, Diffusion-Link reduces the modality gap the most among prior diffusion-based methods and shows a collective migration of audio embeddings toward the text distribution. (2) Downstream AAC: attaching Diffusion-Link to the same multimodal LLM baseline achieves state-of-the-art on AudioCaps in both zero-shot and fully supervised captioning without external knowledge, with relative gains of up to 52.5% and 7.5%, respectively. These findings show that closing the modality gap is pivotal for effective coupling between multimodal encoders and LLMs, and that diffusion-based modality bridging offers a promising direction beyond knowledge-retrieval-centric designs. Code will be released upon acceptance: https://github.com/DevKiHyun/Diffusion-Link
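The abstract describes Diffusion-Link as a lightweight denoiser built from three residual MLP blocks that maps a frozen encoder's audio embedding toward the text-embedding distribution. The following is a minimal NumPy sketch of that shape of model, not the authors' implementation: the embedding dimension, hidden width, additive time-step conditioning, and random stand-in weights are all assumptions made only to illustrate the three-block residual structure.

```python
import numpy as np

rng = np.random.default_rng(0)
DIM = 512      # embedding dimension (assumed; not stated in the abstract)
HIDDEN = 1024  # hidden width of each MLP block (assumed)


def relu(x):
    return np.maximum(x, 0.0)


class ResidualMLPBlock:
    """One residual block: x + MLP(x). Weights here are random stand-ins."""

    def __init__(self, dim, hidden):
        self.w1 = rng.normal(0.0, 0.02, (dim, hidden))
        self.b1 = np.zeros(hidden)
        self.w2 = rng.normal(0.0, 0.02, (hidden, dim))
        self.b2 = np.zeros(dim)

    def __call__(self, x):
        return x + relu(x @ self.w1 + self.b1) @ self.w2 + self.b2


class DiffusionLinkSketch:
    """Denoiser with three residual MLP blocks, as the abstract describes.

    The diffusion time step is folded in as a simple additive embedding;
    the paper's actual conditioning scheme is not given here, so this is
    purely illustrative.
    """

    def __init__(self, dim=DIM, hidden=HIDDEN, n_steps=50):
        self.blocks = [ResidualMLPBlock(dim, hidden) for _ in range(3)]
        self.t_emb = rng.normal(0.0, 0.02, (n_steps, dim))

    def denoise(self, z_t, t):
        h = z_t + self.t_emb[t]  # condition on the diffusion step
        for block in self.blocks:
            h = block(h)
        return h  # predicted clean (text-like) embedding


audio_emb = rng.normal(size=DIM)  # dummy frozen-encoder audio embedding
model = DiffusionLinkSketch()
bridged = model.denoise(audio_emb, t=10)
print(bridged.shape)  # (512,)
```

At inference time the real module would run this denoiser over many diffusion steps, starting from noise conditioned on the audio embedding; the single call above only shows one denoising pass through the three blocks.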
📜 Similar Papers
In the same crypt — Sound
CNN Architectures for Large-Scale Audio Classification
Conv-TasNet: Surpassing Ideal Time-Frequency Magnitude Masking for Speech Separation
Deep Convolutional Neural Networks and Data Augmentation for Environmental Sound Classification
WaveGlow: A Flow-based Generative Network for Speech Synthesis
Multi-talker Speech Separation with Utterance-level Permutation Invariant Training of Deep Recurrent Neural Networks
Died the same way — ⚰️ The Empty Tomb
DSFD: Dual Shot Face Detector
InstanceCut: from Edges to Instances with MultiCut
FLNet: Landmark Driven Fetching and Learning Network for Faithful Talking Facial Animation Synthesis