VL-BERT: Pre-training of Generic Visual-Linguistic Representations

August 22, 2019 · Declared Dead · 🏛 International Conference on Learning Representations

Authors: Weijie Su, Xizhou Zhu, Yue Cao, Bin Li, Lewei Lu, Furu Wei, Jifeng Dai
arXiv ID: 1908.08530
Category: cs.CV (Computer Vision)
Cross-listed: cs.CL, cs.LG
Citations: 1.8K
Venue: International Conference on Learning Representations
Repository: https://github.com/jackroos/VL-BERT
Last Checked: 1 month ago

Abstract

We introduce a new pre-trainable generic representation for visual-linguistic tasks, called Visual-Linguistic BERT (VL-BERT for short). VL-BERT adopts the simple yet powerful Transformer model as the backbone and extends it to take both visual and linguistic embedded features as input. Each element of the input is either a word from the input sentence or a region-of-interest (RoI) from the input image. The model is designed to fit most visual-linguistic downstream tasks. To better exploit the generic representation, we pre-train VL-BERT on the massive-scale Conceptual Captions dataset, together with a text-only corpus. Extensive empirical analysis demonstrates that the pre-training procedure can better align the visual-linguistic clues and benefit downstream tasks such as visual commonsense reasoning, visual question answering, and referring expression comprehension. It is worth noting that VL-BERT achieved first place among single models on the leaderboard of the VCR benchmark. Code is released at https://github.com/jackroos/VL-BERT.
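The input format described in the abstract, a single sequence whose elements are either words or RoIs, can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function name `build_input_sequence`, the linear RoI projection `roi_proj`, the two-row `type_table`, and all dimensions are assumptions chosen for illustration, and the random arrays stand in for learned parameters and detector features.

```python
import numpy as np


def build_input_sequence(word_ids, roi_features, word_table, roi_proj, type_table):
    """Build one mixed sequence of word and RoI embeddings (illustrative only).

    word_ids:     (num_words,) integer token ids
    roi_features: (num_rois, roi_dim) per-region visual features
    word_table:   (vocab, hidden) word embedding table (hypothetical)
    roi_proj:     (roi_dim, hidden) projection into the shared space (hypothetical)
    type_table:   (2, hidden) row 0 marks words, row 1 marks RoIs (hypothetical)
    """
    # Look up an embedding for every word in the sentence.
    word_emb = word_table[word_ids] + type_table[0]
    # Project every RoI feature vector to the same hidden size as the words.
    roi_emb = roi_features @ roi_proj + type_table[1]
    # Concatenate so a Transformer backbone could attend over both modalities.
    return np.concatenate([word_emb, roi_emb], axis=0)


rng = np.random.default_rng(0)
hidden, vocab, roi_dim = 8, 100, 16

# Random stand-ins for parameters that would normally be learned.
word_table = rng.normal(size=(vocab, hidden))
roi_proj = rng.normal(size=(roi_dim, hidden))
type_table = rng.normal(size=(2, hidden))

word_ids = np.array([5, 17, 42])                  # a three-word sentence
roi_features = rng.normal(size=(4, roi_dim))      # four image regions

sequence = build_input_sequence(word_ids, roi_features, word_table, roi_proj, type_table)
print(sequence.shape)  # (7, 8): three word elements followed by four RoI elements
```

Projecting both modalities to one hidden size is what lets a single Transformer treat words and regions as elements of the same sequence; how the released code actually embeds each element may differ from this sketch.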


📜 Similar Papers

In the same crypt — Computer Vision

🌅 🌅 Old Age

Deep Residual Learning for Image Recognition

Kaiming He, Xiangyu Zhang, ... (+2 more)

cs.CV 🏛 CVPR 📚 220.4K cites 10 years ago

🌅 🌅 Old Age

Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks

Shaoqing Ren, Kaiming He, ... (+2 more)

cs.CV 🏛 IEEE TPAMI 📚 70.4K cites 10 years ago

R.I.P. 👻 Ghosted

You Only Look Once: Unified, Real-Time Object Detection

Joseph Redmon, Santosh Divvala, ... (+2 more)

cs.CV 🏛 CVPR 📚 43.4K cites 10 years ago

🌅 🌅 Old Age

SSD: Single Shot MultiBox Detector

Wei Liu, Dragomir Anguelov, ... (+5 more)

cs.CV 🏛 ECCV 📚 33.8K cites 10 years ago

🌅 🌅 Old Age

Squeeze-and-Excitation Networks

Jie Hu, Li Shen, ... (+3 more)

cs.CV 🏛 CVPR 📚 32.3K cites 8 years ago

R.I.P. 👻 Ghosted

Rethinking the Inception Architecture for Computer Vision

Christian Szegedy, Vincent Vanhoucke, ... (+3 more)

cs.CV 🏛 CVPR 📚 30.2K cites 10 years ago

Died the same way — 💀 404 Not Found

R.I.P. 💀 404 Not Found

Deep High-Resolution Representation Learning for Visual Recognition

Jingdong Wang, Ke Sun, ... (+10 more)

cs.CV 🏛 IEEE TPAMI 📚 4.4K cites 6 years ago

R.I.P. 💀 404 Not Found

HuggingFace's Transformers: State-of-the-art Natural Language Processing

Thomas Wolf, Lysandre Debut, ... (+20 more)

cs.CL 🏛 arXiv 📚 3.5K cites 6 years ago

R.I.P. 💀 404 Not Found

CCNet: Criss-Cross Attention for Semantic Segmentation

Zilong Huang, Xinggang Wang, ... (+5 more)

cs.CV 🏛 ICCV 📚 2.9K cites 7 years ago

R.I.P. 💀 404 Not Found

Unified Perceptual Parsing for Scene Understanding

Tete Xiao, Yingcheng Liu, ... (+3 more)

cs.CV 🏛 ECCV 📚 2.3K cites 7 years ago