CAST: Cross-Attention in Space and Time for Video Action Recognition

November 30, 2023 · Entered Twilight · 🏛 Neural Information Processing Systems

💤 TWILIGHT: Eternal Rest
Repo abandoned since publication

Repo contents: LICENSE, README.md, annotations, dataset, engine_for_compomodel.py, engine_for_onemodel.py, figs, models, run_bidirection.py, run_bidirection_compo.py, scripts, util_tools

Authors: Dongho Lee, Jongseo Lee, Jinwoo Choi
arXiv ID: 2311.18825
Category: cs.CV (Computer Vision)
Citations: 31
Venue: Neural Information Processing Systems
Repository: https://github.com/KHU-VLL/CAST ⭐ 55
Last checked: 1 month ago
Abstract
Recognizing human actions in videos requires spatial and temporal understanding. Most existing action recognition models lack a balanced spatio-temporal understanding of videos. In this work, we propose a novel two-stream architecture, called Cross-Attention in Space and Time (CAST), that achieves a balanced spatio-temporal understanding of videos using only RGB input. Our proposed bottleneck cross-attention mechanism enables the spatial and temporal expert models to exchange information and make synergistic predictions, leading to improved performance. We validate the proposed method with extensive experiments on public benchmarks with different characteristics: EPIC-KITCHENS-100, Something-Something-V2, and Kinetics-400. Our method consistently shows favorable performance across these datasets, while the performance of existing methods fluctuates depending on the dataset characteristics.
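The exchange step the abstract describes can be sketched as follows: tokens from one expert are projected into a low-dimensional bottleneck, attend to the other expert's tokens, and the result is added back residually, in both directions. This is only an illustrative NumPy sketch under assumed names and dimensions; the random placeholder weights and the `bottleneck_cross_attention` helper are not the authors' implementation, which lives in the repository's `models` directory.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax over the given axis.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def bottleneck_cross_attention(x_q, x_kv, d_bottleneck, rng):
    """One direction of cross-attention: tokens from one expert (x_q)
    attend to the other expert's tokens (x_kv) through a low-dimensional
    bottleneck. Projection weights here are random placeholders."""
    d = x_q.shape[-1]
    Wq = rng.standard_normal((d, d_bottleneck)) / np.sqrt(d)
    Wk = rng.standard_normal((d, d_bottleneck)) / np.sqrt(d)
    Wv = rng.standard_normal((d, d_bottleneck)) / np.sqrt(d)
    Wo = rng.standard_normal((d_bottleneck, d)) / np.sqrt(d_bottleneck)
    q, k, v = x_q @ Wq, x_kv @ Wk, x_kv @ Wv
    attn = softmax(q @ k.T / np.sqrt(d_bottleneck))
    return x_q + attn @ v @ Wo  # residual connection back into x_q

rng = np.random.default_rng(0)
spatial = rng.standard_normal((8, 64))    # 8 spatial tokens, dim 64 (assumed)
temporal = rng.standard_normal((16, 64))  # 16 temporal tokens, dim 64 (assumed)
# Bidirectional exchange between the two expert streams:
temporal_out = bottleneck_cross_attention(temporal, spatial, d_bottleneck=16, rng=rng)
spatial_out = bottleneck_cross_attention(spatial, temporal, d_bottleneck=16, rng=rng)
print(temporal_out.shape, spatial_out.shape)  # (16, 64) (8, 64)
```

The bottleneck (16 dims here, versus the experts' 64) keeps the exchange cheap relative to full cross-attention; the residual form lets each expert retain its own representation while mixing in the other's.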
