Exploring Discrete Diffusion Models for Image Captioning
November 21, 2022 · Entered Twilight · arXiv.org
Repo contents: .idea, Images, LICENSE, README.md, captioneval, clip1, coco_test.txt, cog.yaml, data, dynamic_module_utils.py, environment.yml, fix_pre.py, install_req.sh, lr_scheduler.py, misc.py, notebooks, parse_coco.py, parse_conceptual.py, predict.py, requirements.txt, tf_adpt.py, tf_adpt_grad.py, train.py, train_tclip.py, utils.py
Authors
Zixin Zhu, Yixuan Wei, Jianfeng Wang, Zhe Gan, Zheng Zhang, Le Wang, Gang Hua, Lijuan Wang, Zicheng Liu, Han Hu
arXiv ID
2211.11694
Category
cs.CV: Computer Vision
Citations
32
Venue
arXiv.org
Repository
https://github.com/buxiangzhiren/DDCap
⭐ 85
Last Checked
1 month ago
Abstract
The image captioning task is typically realized by an auto-regressive method that decodes the text tokens one by one. We present a diffusion-based captioning model, dubbed DDCap, to allow more decoding flexibility. Unlike image generation, where the output is continuous and redundant with a fixed length, texts in image captions are categorical and short with varied lengths. Therefore, naively applying the discrete diffusion model to text decoding does not work well, as shown in our experiments. To address the performance gap, we propose several key techniques including best-first inference, concentrated attention mask, text length prediction, and image-free training. On COCO without additional caption pre-training, DDCap achieves a CIDEr score of 117.8, which is 5.0 higher than the auto-regressive baseline with the same architecture in the controlled setting. It also achieves a CIDEr score 26.8 higher than the auto-regressive baseline (230.3 vs. 203.5) on a caption infilling task. With 4M vision-language pre-training images and the base-sized model, we reach a CIDEr score of 125.1 on COCO, which is competitive with the best well-developed auto-regressive frameworks. The code is available at https://github.com/buxiangzhiren/DDCap.
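The abstract names best-first inference but does not describe how it works. As a rough illustration only, the sketch below shows a generic confidence-ordered unmasking loop, a common decoding pattern for mask-based discrete diffusion. It is not taken from the DDCap repository. The function `confidence_ordered_decode`, the `predict` callback, the even per-step schedule, and the toy predictor are all hypothetical, and the caption length is assumed to be given in advance (the paper predicts it separately).

```python
from typing import Callable, List, Optional, Tuple

MASK = None  # placeholder for a position that has not been decoded yet
Prediction = Tuple[str, float]  # (token, confidence); hypothetical interface


def confidence_ordered_decode(
    predict: Callable[[List[Optional[str]]], List[Prediction]],
    length: int,
    steps: int,
) -> List[Optional[str]]:
    """Fill `length` masked positions over `steps` rounds, most confident first."""
    if steps < 1:
        raise ValueError("steps must be at least 1")
    seq: List[Optional[str]] = [MASK] * length
    for step in range(steps):
        masked = [i for i, tok in enumerate(seq) if tok is MASK]
        if not masked:
            break
        preds = predict(seq)  # one (token, confidence) pair per position
        # Assumed schedule: spread the remaining positions evenly over the
        # remaining rounds (ceiling division), so the last round fills the rest.
        k = -(-len(masked) // (steps - step))
        # Commit only the k most confident predictions; the others stay masked
        # and are predicted again in the next round.
        best = sorted(masked, key=lambda i: preds[i][1], reverse=True)[:k]
        for i in best:
            seq[i] = preds[i][0]
    return seq


def toy_predict(seq: List[Optional[str]]) -> List[Prediction]:
    """Stand-in for a real denoising network: fixed tokens and confidences."""
    tokens = ["a", "dog", "on", "a", "beach"]
    confidences = [0.9, 0.6, 0.8, 0.7, 0.5]
    return list(zip(tokens, confidences))


caption = confidence_ordered_decode(toy_predict, length=5, steps=3)
print(" ".join(caption))
```

With this toy predictor, positions 0 and 2 are committed in the first round, positions 3 and 1 in the second, and position 4 in the last. A real model would recompute its predictions after each round, conditioned on the tokens already committed.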
Similar Papers
In the same crypt: Computer Vision
Old Age
Old Age
Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks
R.I.P.
Ghosted
You Only Look Once: Unified, Real-Time Object Detection
Old Age
SSD: Single Shot MultiBox Detector
Old Age
Squeeze-and-Excitation Networks
R.I.P.
Ghosted