R.I.P.
👻
Ghosted
Audio-Vision Contrastive Learning for Phonological Class Recognition
July 23, 2025 · Declared Dead
Authors
Daiqi Liu, Tomás Arias-Vergara, Jana Hutter, Andreas Maier, Paula Andrea Pérez-Toro
arXiv ID
2507.17682
Category
cs.SD: Sound
Cross-listed
cs.CV, cs.MM, eess.AS
Citations
0
Repository
https://github.com/DaE-plz/AC_Contrastive_Phonology
Last Checked
2 months ago
Abstract
Accurate classification of articulatory-phonological features plays a vital role in understanding human speech production and developing robust speech technologies, particularly in clinical contexts where targeted phonemic analysis and therapy can improve disease diagnosis accuracy and personalized rehabilitation. In this work, we propose a multimodal deep learning framework that combines real-time magnetic resonance imaging (rtMRI) and speech signals to classify three key articulatory dimensions: manner of articulation, place of articulation, and voicing. We perform classification on 15 phonological classes derived from the aforementioned articulatory dimensions and evaluate the system with four audio/vision configurations: unimodal rtMRI, unimodal audio signals, multimodal middle fusion, and contrastive learning-based audio-vision fusion. Experimental results on the USC-TIMIT dataset show that our contrastive learning-based approach achieves state-of-the-art performance, with an average F1-score of 0.81, representing an absolute increase of 0.23 over the unimodal baseline. The results confirm the effectiveness of contrastive representation learning for multimodal articulatory analysis. Our code and processed dataset will be made publicly available at https://github.com/DaE-plz/AC_Contrastive_Phonology to support future research.
Similar Papers
In the same crypt: Sound
R.I.P.
👻
Ghosted
CNN Architectures for Large-Scale Audio Classification
R.I.P.
👻
Ghosted
Conv-TasNet: Surpassing Ideal Time-Frequency Magnitude Masking for Speech Separation
R.I.P.
👻
Ghosted
Deep Convolutional Neural Networks and Data Augmentation for Environmental Sound Classification
R.I.P.
👻
Ghosted
WaveGlow: A Flow-based Generative Network for Speech Synthesis
R.I.P.
👻
Ghosted
Multi-talker Speech Separation with Utterance-level Permutation Invariant Training of Deep Recurrent Neural Networks
Died the same way: 404 Not Found
R.I.P.
404 Not Found
Deep High-Resolution Representation Learning for Visual Recognition
R.I.P.
404 Not Found
HuggingFace's Transformers: State-of-the-art Natural Language Processing
R.I.P.
404 Not Found
CCNet: Criss-Cross Attention for Semantic Segmentation
R.I.P.
404 Not Found