Predicting Visual Features from Text for Image and Video Caption Retrieval

September 05, 2017 · Entered Twilight · 🏛 IEEE Transactions on Multimedia

🌅 TWILIGHT: Old Age
Predates the code-sharing era, a pioneer of its time

"Last commit was 6.0 years ago (≥5 year threshold)"

Evidence collected by the PWNC Scanner

Repo contents: .gitignore, .ipynb_checkpoints, LICENSE, README.md, TEMPLATE_do_test.sh, __init__.py, basic, do_all.sh, do_all_own_data.sh, do_gene_vocab.sh, do_get_dataset.sh, get_word_vob.py, simpleknn, tensorboard_visual.jpg, util, w2vv.jpg, w2vv.py, w2vv_pred.py, w2vv_representation.ipynb, w2vv_tester.py, w2vv_trainer.py

Authors: Jianfeng Dong, Xirong Li, Cees G. M. Snoek
arXiv ID: 1709.01362
Category: cs.CV: Computer Vision
Citations: 238
Venue: IEEE Transactions on Multimedia
Repository: https://github.com/danieljf24/w2vv ⭐ 70
Last Checked: 1 month ago
Abstract
This paper strives to find amidst a set of sentences the one best describing the content of a given image or video. Different from existing works, which rely on a joint subspace for their image and video caption retrieval, we propose to do so in a visual space exclusively. Apart from this conceptual novelty, we contribute Word2VisualVec, a deep neural network architecture that learns to predict a visual feature representation from textual input. Example captions are encoded into a textual embedding based on multi-scale sentence vectorization and further transferred into a deep visual feature of choice via a simple multi-layer perceptron. We further generalize Word2VisualVec for video caption retrieval, by predicting from text both 3-D convolutional neural network features as well as a visual-audio representation. Experiments on Flickr8k, Flickr30k, the Microsoft Video Description dataset and the very recent NIST TrecVid challenge for video caption retrieval detail Word2VisualVec's properties, its benefit over textual embeddings, the potential for multimodal query composition and its state-of-the-art results.
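The pipeline the abstract outlines (sentence vectorization, a multi-layer perceptron that maps text into a visual feature space, then caption ranking in that space) can be illustrated with a small NumPy sketch. This is not the code from the repository: the toy vocabulary, the two sentence encodings used here (bag-of-words counts and a mean word vector), all dimensions, and the function names `vectorize`, `predict_visual` and `rank_captions` are assumptions made for illustration, and the weights are random rather than trained.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy vocabulary and sizes; none of these values come from the paper.
VOCAB = ["a", "dog", "runs", "on", "grass", "cat", "sleeps", "sofa"]
WORD_DIM, HIDDEN_DIM, VISUAL_DIM = 16, 32, 20
# Random stand-in for pretrained word vectors.
WORD_VECS = rng.normal(size=(len(VOCAB), WORD_DIM))

# Untrained MLP weights; a real model would learn them by regressing
# predicted features onto the visual features of paired images or videos.
W1 = rng.normal(scale=0.1, size=(len(VOCAB) + WORD_DIM, HIDDEN_DIM))
W2 = rng.normal(scale=0.1, size=(HIDDEN_DIM, VISUAL_DIM))


def vectorize(sentence):
    """Concatenate two sentence encodings: word counts and the mean word vector."""
    idx = np.asarray(
        [VOCAB.index(w) for w in sentence.lower().split() if w in VOCAB], dtype=int
    )
    bow = np.bincount(idx, minlength=len(VOCAB)).astype(float)
    mean_vec = WORD_VECS[idx].mean(axis=0) if idx.size else np.zeros(WORD_DIM)
    return np.concatenate([bow, mean_vec])


def predict_visual(sentence):
    """Map a sentence to a predicted visual feature with a one-hidden-layer MLP."""
    hidden = np.maximum(0.0, vectorize(sentence) @ W1)  # ReLU activation
    return hidden @ W2


def rank_captions(visual_feature, captions):
    """Sort captions by cosine similarity between predicted and given visual features."""
    preds = np.stack([predict_visual(c) for c in captions])
    preds = preds / (np.linalg.norm(preds, axis=1, keepdims=True) + 1e-12)
    query = visual_feature / (np.linalg.norm(visual_feature) + 1e-12)
    scores = preds @ query
    order = np.argsort(-scores)
    return [(captions[i], float(scores[i])) for i in order]


captions = ["a dog runs on grass", "a cat sleeps on sofa"]
# Stand-in "image feature": the prediction for the first caption.
image_feature = predict_visual(captions[0])
for caption, score in rank_captions(image_feature, captions):
    print(f"{score:.3f}  {caption}")
```

The point of the sketch is that matching happens entirely in the visual feature space: the image or video feature is used as given, and only the text side is transformed.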