Old Age
Masked Vision-Language Transformers for Scene Text Recognition
November 09, 2022 · Entered Twilight · British Machine Vision Conference
Repo contents: LICENSE, README.md, dataset.py, engine_mvlt_finetune.py, engine_mvlt_pretrain.py, main_mvlt_finetune.py, main_mvlt_pretrain.py, models_mvlt.py, models_mvlt_finetune.py, scripts, util
Authors
Jie Wu, Ying Peng, Shengming Zhang, Weigang Qi, Jian Zhang
arXiv ID
2211.04785
Category
cs.CV: Computer Vision
Citations
5
Venue
British Machine Vision Conference
Repository
https://github.com/onealwj/MVLT
⭐ 29
Last Checked
1 month ago
Abstract
Scene text recognition (STR) enables computers to recognize and read the text in various real-world scenes. Recent STR models benefit from taking linguistic information in addition to visual cues into consideration. We propose a novel Masked Vision-Language Transformers (MVLT) to capture both the explicit and the implicit linguistic information. Our encoder is a Vision Transformer, and our decoder is a multi-modal Transformer. MVLT is trained in two stages: in the first stage, we design a STR-tailored pretraining method based on a masking strategy; in the second stage, we fine-tune our model and adopt an iterative correction method to improve the performance. MVLT attains superior results compared to state-of-the-art STR models on several benchmarks. Our code and model are available at https://github.com/onealwj/MVLT.
Similar Papers
In the same crypt: Computer Vision
Old Age
Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks
R.I.P.
👻
Ghosted
You Only Look Once: Unified, Real-Time Object Detection
Old Age
SSD: Single Shot MultiBox Detector
Old Age
Squeeze-and-Excitation Networks
R.I.P.
👻
Ghosted