Old Age
CAST: Cross-Attention in Space and Time for Video Action Recognition
November 30, 2023 · Entered Twilight · Neural Information Processing Systems
Repo contents: LICENSE, README.md, annotations, dataset, engine_for_compomodel.py, engine_for_onemodel.py, figs, models, run_bidirection.py, run_bidirection_compo.py, scripts, util_tools
Authors
Dongho Lee, Jongseo Lee, Jinwoo Choi
arXiv ID
2311.18825
Category
cs.CV: Computer Vision
Citations
31
Venue
Neural Information Processing Systems
Repository
https://github.com/KHU-VLL/CAST
⭐ 55
Last Checked
1 month ago
Abstract
Recognizing human actions in videos requires spatial and temporal understanding. Most existing action recognition models lack a balanced spatio-temporal understanding of videos. In this work, we propose a novel two-stream architecture, called Cross-Attention in Space and Time (CAST), that achieves a balanced spatio-temporal understanding of videos using only RGB input. Our proposed bottleneck cross-attention mechanism enables the spatial and temporal expert models to exchange information and make synergistic predictions, leading to improved performance. We validate the proposed method with extensive experiments on public benchmarks with different characteristics: EPIC-KITCHENS-100, Something-Something-V2, and Kinetics-400. Our method consistently shows favorable performance across these datasets, while the performance of existing methods fluctuates depending on the dataset characteristics.
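The information exchange described in the abstract can be illustrated with a small sketch. This is not the authors' implementation (the repository linked above is the reference for that); it only shows, under stated assumptions, how one expert's tokens might attend to the other expert's tokens through a low-dimensional bottleneck. The function name `bottleneck_cross_attention`, the weight names, the shared key/value projection, the token counts, and the random weights are all hypothetical choices made for illustration.

```python
import math
import random


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def softmax(row):
    """Numerically stable softmax over one list of scores."""
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def random_matrix(rows, cols, rng):
    """Small random weights; stand-ins for learned parameters (assumption)."""
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]


def bottleneck_cross_attention(queries, context, w_down_q, w_down_kv, w_up):
    """Hypothetical bottleneck cross-attention with a residual connection.

    queries: tokens of the expert being updated, shape (n_q, d)
    context: tokens of the other expert, shape (n_c, d)
    """
    q = matmul(queries, w_down_q)    # project down to the bottleneck: (n_q, r)
    k = matmul(context, w_down_kv)   # project the other expert's tokens: (n_c, r)
    v = k                            # assumption: keys and values share one projection
    scale = 1.0 / math.sqrt(len(q[0]))
    scores = [[scale * sum(a * b for a, b in zip(qi, kj)) for kj in k] for qi in q]
    attn = [softmax(row) for row in scores]   # each row sums to 1 over context tokens
    mixed = matmul(attn, v)                   # (n_q, r)
    update = matmul(mixed, w_up)              # project back up: (n_q, d)
    out = [[x + u for x, u in zip(x_row, u_row)]
           for x_row, u_row in zip(queries, update)]
    return out, attn


rng = random.Random(0)
d, r = 8, 2                              # token width and bottleneck width (illustrative)
spatial = random_matrix(4, d, rng)       # 4 tokens from a spatial expert (made-up size)
temporal = random_matrix(6, d, rng)      # 6 tokens from a temporal expert (made-up size)


def new_weights():
    """One independent set of projection weights per direction."""
    return random_matrix(d, r, rng), random_matrix(d, r, rng), random_matrix(r, d, rng)


# Exchange information in both directions, one weight set per direction.
spatial_out, attn_s2t = bottleneck_cross_attention(spatial, temporal, *new_weights())
temporal_out, attn_t2s = bottleneck_cross_attention(temporal, spatial, *new_weights())
```

Each stream keeps its own token count and width after the exchange, so the updated tokens can be passed on to the next layer of the same expert. The bottleneck width `r` controls how much the cross-stream path adds in parameters and compute.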
Similar Papers
In the same crypt: Computer Vision
Old Age
Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks
R.I.P.
Ghosted
You Only Look Once: Unified, Real-Time Object Detection
Old Age
SSD: Single Shot MultiBox Detector
Old Age
Squeeze-and-Excitation Networks
R.I.P.
Ghosted