Consensus-based Sequence Training for Video Captioning
December 27, 2017 · Entered Twilight · arXiv.org
"Last commit was 8.0 years ago (≥5 year threshold)"
Evidence collected by the PWNC Scanner
Repo contents: .gitignore, Makefile, README.md, build_vocab.py, compute_ciderdf.py, compute_scores.py, convert_datainfo2cocofmt.py, create_sequencelabel.py, dataloader.py, model.py, opts.py, preprocess_datainfo.py, standalize_format.py, test.py, train.py, utils.py
Authors
Sang Phan, Gustav Eje Henter, Yusuke Miyao, Shin'ichi Satoh
arXiv ID
1712.09532
Category
cs.CV: Computer Vision
Citations
22
Venue
arXiv.org
Repository
https://github.com/mynlp/cst_captioning
⭐ 60
Last Checked
1 month ago
Abstract
Captioning models are typically trained using the cross-entropy loss. However, their performance is evaluated on other metrics designed to better correlate with human assessments. Recently, it has been shown that reinforcement learning (RL) can directly optimize these metrics in tasks such as captioning. However, this is computationally costly and requires specifying a baseline reward at each step to make training converge. We propose a fast approach to optimize one's objective of interest through the REINFORCE algorithm. First we show that, by replacing model samples with ground-truth sentences, RL training can be seen as a form of weighted cross-entropy loss, giving a fast, RL-based pre-training algorithm. Second, we propose to use the consensus among ground-truth captions of the same video as the baseline reward. This can be computed very efficiently. We call the complete proposal Consensus-based Sequence Training (CST). Applied to the MSRVTT video captioning benchmark, our proposals train significantly faster than comparable methods and establish a new state-of-the-art on the task, improving the CIDEr score from 47.3 to 54.2.
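The two ideas in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the reward values, and the token log-probabilities are hypothetical, and taking the consensus baseline to be the mean reward of a video's ground-truth captions is an assumption made here for illustration. The reward stands in for whatever metric is being optimized (CIDEr in the paper).

```python
import math


def consensus_baseline(gt_rewards):
    """Mean reward over one video's ground-truth captions.

    Assumption: the "consensus" baseline is modeled as a plain average.
    """
    return sum(gt_rewards) / len(gt_rewards)


def weighted_xent_loss(token_log_probs, reward, baseline=0.0):
    """REINFORCE-style loss for a single caption.

    When the caption is a ground-truth sentence rather than a model sample,
    this is the cross-entropy of that sentence scaled by (reward - baseline).
    """
    return -(reward - baseline) * sum(token_log_probs)


# Hypothetical metric scores of three ground-truth captions for one video.
gt_rewards = [0.9, 0.5, 0.4]
baseline = consensus_baseline(gt_rewards)

# Hypothetical per-token log-probabilities of each caption under the model.
gt_log_probs = [
    [math.log(0.5), math.log(0.5)],
    [math.log(0.25), math.log(0.5)],
    [math.log(0.5), math.log(0.25)],
]

losses = [
    weighted_xent_loss(log_probs, reward, baseline)
    for log_probs, reward in zip(gt_log_probs, gt_rewards)
]
```

In this sketch, minimizing the loss raises the likelihood of captions that score above the baseline and lowers it for captions that score below. Because the baseline depends only on the ground-truth captions, it can be computed once per video before training.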
Similar Papers
In the same crypt: Computer Vision
Old Age
Old Age
Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks
R.I.P.
Ghosted
You Only Look Once: Unified, Real-Time Object Detection
Old Age
SSD: Single Shot MultiBox Detector
Old Age
Squeeze-and-Excitation Networks
R.I.P.
Ghosted