Watch and Learn: Mapping Language and Noisy Real-world Videos with Self-supervision

November 19, 2020 ยท Entered Twilight ยท ๐Ÿ› arXiv.org

๐ŸŒ… TWILIGHT: Old Age
Predates the code-sharing era โ€” a pioneer of its time

"Last commit was 5.0 years ago (โ‰ฅ5 year threshold)"

Evidence collected by the PWNC Scanner

Repo contents: README.md, build_pairs.py, data, download_single.py, main_subtitle.py, main_video.py, process_subtitle.py, subtitle2vid.py

Authors Yujie Zhong, Linhai Xie, Sen Wang, Lucia Specia, Yishu Miao arXiv ID 2011.09634 Category cs.CV: Computer Vision Citations 0 Venue arXiv.org Repository https://github.com/zyj-13/WAL Last Checked 2 months ago
Abstract
In this paper, we teach machines to understand visuals and natural language by learning the mapping between sentences and noisy video snippets without explicit annotations. Firstly, we define a self-supervised learning framework that captures the cross-modal information. A novel adversarial learning module is then introduced to explicitly handle the noises in the natural videos, where the subtitle sentences are not guaranteed to be strongly corresponded to the video snippets. For training and evaluation, we contribute a new dataset `ApartmenTour' that contains a large number of online videos and subtitles. We carry out experiments on the bidirectional retrieval tasks between sentences and videos, and the results demonstrate that our proposed model achieves the state-of-the-art performance on both retrieval tasks and exceeds several strong baselines. The dataset can be downloaded at https://github.com/zyj-13/WAL.
Community shame:
Not yet rated
Community Contributions

Found the code? Know the venue? Think something is wrong? Let us know!

๐Ÿ“œ Similar Papers

In the same crypt โ€” Computer Vision