METEOR Guided Divergence for Video Captioning
December 20, 2022 · Entered Twilight · IEEE International Joint Conference on Neural Networks
Repo contents: .gitignore, .gitmodules, LICENSE, README.md, captioning_datasets, data, datasets, download_data.sh, epoch_loops, evaluation, loss, main.py, metrics, model.png, model, req.txt, req.yml, sample, scripts, submodules, utilities, videos
Authors
Daniel Lukas Rothenpieler, Shahin Amiriparian
arXiv ID
2212.10690
Category
cs.CV: Computer Vision
Cross-listed
cs.CL, cs.LG
Citations
3
Venue
IEEE International Joint Conference on Neural Networks
Repository
https://github.com/d-rothen/bmhrl
⭐ 3
Last Checked
1 month ago
Abstract
Automatic video captioning aims for a holistic visual scene understanding. It requires a mechanism for capturing temporal context in video frames and the ability to comprehend the actions and associations of objects in a given timeframe. Such a system should additionally learn to abstract video sequences into sensible representations as well as to generate natural written language. While the majority of captioning models focus solely on the visual inputs, little attention has been paid to the audiovisual modality. To tackle this issue, we propose a novel two-fold approach. First, we implement a reward-guided KL Divergence to train a video captioning model which is resilient towards token permutations. Second, we utilise a Bi-Modal Hierarchical Reinforcement Learning (BMHRL) Transformer architecture to capture long-term temporal dependencies of the input data as a foundation for our hierarchical captioning module. Using our BMHRL, we show the suitability of the HRL agent in the generation of content-complete and grammatically sound sentences by achieving $4.91$, $2.23$, and $10.80$ in BLEU3, BLEU4, and METEOR scores, respectively, on the ActivityNet Captions dataset. Finally, we make our BMHRL framework and trained models publicly available for users and developers at https://github.com/d-rothen/bmhrl.
Similar Papers
In the same crypt: Computer Vision
Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks
You Only Look Once: Unified, Real-Time Object Detection
SSD: Single Shot MultiBox Detector
Squeeze-and-Excitation Networks