Old Age
Are All Combinations Equal? Combining Textual and Visual Features with Multiple Space Learning for Text-Based Video Retrieval
November 21, 2022 · Entered Twilight · ECCV Workshops
Repo contents: AVS_datasetload.py, README.md, TtimesV_V3C1_evaluation.py, TtimesV_evaluation.py, TtimesV_iacc3_evaluation.py.py, TtimesV_tester.py, TtimesV_trainer.py, TxV_models, basic, clip, data, loss.py, util
Authors
Damianos Galanopoulos, Vasileios Mezaris
arXiv ID
2211.11351
Category
cs.CV: Computer Vision
Citations
7
Venue
ECCV Workshops
Repository
https://github.com/bmezaris/TextToVideoRetrieval-TtimesV
⭐ 4
Last Checked
1 month ago
Abstract
In this paper we tackle the cross-modal video retrieval problem and, more specifically, we focus on text-to-video retrieval. We investigate how to optimally combine multiple diverse textual and visual features into feature pairs that lead to generating multiple joint feature spaces, which encode text-video pairs into comparable representations. To learn these representations our proposed network architecture is trained by following a multiple space learning procedure. Moreover, at the retrieval stage, we introduce additional softmax operations for revising the inferred query-video similarities. Extensive experiments in several setups based on three large-scale datasets (IACC.3, V3C1, and MSR-VTT) lead to conclusions on how to best combine text-visual features and document the performance of the proposed network. Source code is made publicly available at: https://github.com/bmezaris/TextToVideoRetrieval-TtimesV
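The two ideas in the abstract, pairing every textual feature with every visual feature in its own joint space and revising query-video similarities with a softmax at retrieval time, can be illustrated with a minimal sketch. Everything named here is an assumption made for illustration and is not taken from the repository: the function names, the dictionary-based feature layout, the plain linear projections, the use of summed cosine similarity, and the particular revision rule (a softmax across queries used as a weight, in the style often called dual softmax). The paper's actual network and softmax operations may differ.

```python
import math
from itertools import product


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0


def project(W, x):
    """Apply a linear map W (given as a list of rows) to vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def multi_space_similarity(text_feats, video_feats, text_proj, video_proj):
    """Sum cosine similarities over all (text feature, visual feature) pairs.

    text_proj[(t, v)] and video_proj[(t, v)] are hypothetical projection
    matrices into the joint space dedicated to the pair (t, v).
    """
    total = 0.0
    for t, v in product(text_feats, video_feats):
        q = project(text_proj[(t, v)], text_feats[t])
        d = project(video_proj[(t, v)], video_feats[v])
        total += cosine(q, d)
    return total


def softmax(xs, temperature=1.0):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    exps = [math.exp((x - m) / temperature) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]


def revise_similarities(sim, temperature=1.0):
    """Reweight sim[i][j] (query i, video j) by a softmax taken over queries.

    Assumed revision rule: each video's column of scores is normalised
    across queries, and the result multiplies the original score.
    """
    n_q, n_v = len(sim), len(sim[0])
    revised = [[0.0] * n_v for _ in range(n_q)]
    for j in range(n_v):
        weights = softmax([sim[i][j] for i in range(n_q)], temperature)
        for i in range(n_q):
            revised[i][j] = sim[i][j] * weights[i]
    return revised


# Toy example: two text features, one visual feature, identity projections.
identity = [[1.0, 0.0], [0.0, 1.0]]
text_feats = {"t1": [1.0, 0.0], "t2": [0.0, 1.0]}
video_feats = {"v1": [1.0, 0.0]}
text_proj = {(t, v): identity for t in text_feats for v in video_feats}
video_proj = {(t, v): identity for t in text_feats for v in video_feats}

score = multi_space_similarity(text_feats, video_feats, text_proj, video_proj)
revised = revise_similarities([[2.0, 0.0], [0.0, 2.0]])
```

With T textual and V visual features, the loop visits T × V joint spaces, which is the combination count the repository name "TtimesV" suggests. The revision step needs a batch of queries to normalise over, so it applies when several queries are scored together.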
Similar Papers
In the same crypt: Computer Vision
Old Age
Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks
R.I.P.
Ghosted
You Only Look Once: Unified, Real-Time Object Detection
Old Age
SSD: Single Shot MultiBox Detector
Old Age
Squeeze-and-Excitation Networks
R.I.P.
Ghosted