Listen to Look: Action Recognition by Previewing Audio

December 10, 2019 · Entered Twilight · 🏛 Computer Vision and Pattern Recognition

"No code URL or promise found in abstract"
"Code repo scraped from project page (backfill)"

Evidence collected by the PWNC Scanner

Repo contents: CODE_OF_CONDUCT.md, CONTRIBUTING.md, LICENSE, README.md, data.py, listen_to_look_single_modality, listen_to_look_teaser.png, main.py, models, opts.py, train.py, utils, validate.py

Authors: Ruohan Gao, Tae-Hyun Oh, Kristen Grauman, Lorenzo Torresani
arXiv ID: 1912.04487
Category: cs.CV (Computer Vision)
Cross-listed: cs.LG, cs.SD, eess.AS
Citations: 284
Venue: Computer Vision and Pattern Recognition
Repository: https://github.com/facebookresearch/Listen-to-Look (⭐ 130)
Last Checked: 6 days ago

Abstract

In the face of the video data deluge, today's expensive clip-level classifiers are increasingly impractical. We propose a framework for efficient action recognition in untrimmed video that uses audio as a preview mechanism to eliminate both short-term and long-term visual redundancies. First, we devise an ImgAud2Vid framework that hallucinates clip-level features by distilling from lighter modalities (a single frame and its accompanying audio), reducing short-term temporal redundancy for efficient clip-level recognition. Second, building on ImgAud2Vid, we further propose ImgAud-Skimming, an attention-based long short-term memory network that iteratively selects useful moments in untrimmed videos, reducing long-term temporal redundancy for efficient video-level recognition. Extensive experiments on four action recognition datasets demonstrate that our method achieves state-of-the-art results in both recognition accuracy and speed.
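
The iterative moment selection described in the abstract can be illustrated with a small, dependency-free sketch. It is not the authors' implementation: the function names (`softmax`, `dot`, `skim`), the dot-product scoring, and the hard arg-max selection are assumptions made for illustration, and a simple running average stands in for the LSTM state update. Each feature vector plays the role of a cheap per-moment feature, such as one produced from a single frame and its audio.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def skim(moment_features, query, steps):
    """Select `steps` moments by repeated attention over per-moment features.

    Hypothetical simplification: the state is a running average of the
    query and the chosen features, used here in place of an LSTM cell.
    Returns the selected indices and the mean of the selected features.
    """
    steps = min(steps, len(moment_features))
    selected = []
    state = list(query)
    for _ in range(steps):
        # Attention weights of the current state over every moment.
        weights = softmax([dot(state, f) for f in moment_features])
        # Hard selection: highest-weight moment not already chosen.
        remaining = [i for i in range(len(moment_features)) if i not in selected]
        best = max(remaining, key=lambda i: weights[i])
        selected.append(best)
        # Fold the chosen moment into the state (stand-in for the LSTM update).
        chosen = moment_features[best]
        state = [0.5 * s + 0.5 * c for s, c in zip(state, chosen)]
    dim = len(query)
    pooled = [
        sum(moment_features[i][d] for i in selected) / len(selected)
        for d in range(dim)
    ]
    return selected, pooled


# Toy example: four moments with 2-D features; the query favors dimension 0.
features = [[1.0, 0.0], [0.0, 1.0], [0.9, 0.1], [0.0, 0.0]]
selected, pooled = skim(features, query=[1.0, 0.0], steps=2)
print(selected)  # [0, 2]
```

Only the selected moments would then need to go through the expensive video-level classifier, which is where the efficiency gain described in the abstract would come from under these assumptions.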