🌅
🌅
Old Age
Investigating the Scaling Effect of Instruction Templates for Training Multimodal Language Model
December 11, 2024 · Declared Dead · + Add venue
Authors
Shijian Wang, Linxin Song, Jieyu Zhang, Ryotaro Shimizu, Jiarui Jin, Ao Luo, Yuan Lu, Li Yao, Cunjian Chen, Julian McAuley, Wentao Zhang, Hanqian Wu
arXiv ID
2412.08307
Category
cs.CV: Computer Vision
Citations
1
Repository
https://github.com/shijian2001/TemplateScaling
Last Checked
1 month ago
Abstract
Current multimodal language model (MLM) training approaches overlook the influence of instruction templates. Previous research deals with this problem by leveraging hand-crafted or model-generated templates, failing to investigate the scaling effect of instruction templates on MLM training. In this work, we propose a programmatic instruction template generator capable of producing over 15K unique instruction templates by filling randomly sampled positional synonyms into weighted sampled meta templates, enabling us to comprehensively explore MLM's performance across various template scales in the training process. Our investigation into scaling instruction templates for MLM training demonstrates that MLM capabilities do not consistently improve with increasing template scale. Instead, optimal performance is achieved at a medium template scale. Models trained with data augmented at the optimal template scale achieve performance gains of up to 10% over those trained on the original data and achieve the best overall performance compared with the similar-scale MLMs tuned on at most 75 times the scale of our augmented dataset. The code will be publicly available at https://github.com/shijian2001/TemplateScaling.
Community Contributions
Found the code? Know the venue? Think something is wrong? Let us know!
📜 Similar Papers
In the same crypt — Computer Vision
🌅
🌅
Old Age
Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks
R.I.P.
👻
Ghosted
You Only Look Once: Unified, Real-Time Object Detection
🌅
🌅
Old Age
SSD: Single Shot MultiBox Detector
🌅
🌅
Old Age
Squeeze-and-Excitation Networks
R.I.P.
👻
Ghosted
Rethinking the Inception Architecture for Computer Vision
Died the same way — ⚰️ The Empty Tomb
R.I.P.
⚰️
The Empty Tomb
DSFD: Dual Shot Face Detector
R.I.P.
⚰️
The Empty Tomb
InstanceCut: from Edges to Instances with MultiCut
R.I.P.
⚰️
The Empty Tomb
FLNet: Landmark Driven Fetching and Learning Network for Faithful Talking Facial Animation Synthesis
R.I.P.
⚰️
The Empty Tomb