R.I.P.
๐ป
Ghosted
Multimodal Self-Attention Network with Temporal Alignment for Audio-Visual Emotion Recognition
March 11, 2026 ยท Grace Period ยท ๐ ICASSP 2026
Authors
Inyong Koo, yeeun Seong, Minseok Son, Jaehyuk Jang, Changick Kim
arXiv ID
2603.11095
Category
cs.MM: Multimedia
Cross-listed
cs.SD,
eess.SP
Citations
0
Venue
ICASSP 2026
Abstract
Audio-visual emotion recognition (AVER) methods typically fuse utterance-level features, and even frame-level attention models seldom address the frame-rate mismatch across modalities. In this paper, we propose a Transformer-based framework focusing on the temporal alignment of multimodal features. Our design employs a multimodal self-attention encoder that simultaneously captures intra- and inter-modal dependencies within a shared feature space. To address heterogeneous sampling rates, we incorporate Temporally-aligned Rotary Position Embeddings (TaRoPE), which implicitly synchronize audio and video tokens. Furthermore, we introduce a Cross-Temporal Matching (CTM) loss that enforces consistency among temporally proximate pairs, guiding the encoder toward better alignment. Experiments on CREMA-D and RAVDESS datasets demonstrate consistent improvements over recent baselines, suggesting that explicitly addressing frame-rate mismatch helps preserve temporal cues and enhances cross-modal fusion.
Community Contributions
Found the code? Know the venue? Think something is wrong? Let us know!
๐ Similar Papers
In the same crypt โ Multimedia
๐
๐
Old Age
Quality Assessment of In-the-Wild Videos
R.I.P.
๐ป
Ghosted
Viewport-Adaptive Navigable 360-Degree Video Delivery
R.I.P.
๐ป
Ghosted
A Comprehensive Survey on Cross-modal Retrieval
R.I.P.
๐ป
Ghosted
An Overview of Cross-media Retrieval: Concepts, Methodologies, Benchmarks and Challenges
R.I.P.
๐ป
Ghosted