Length-Adaptive Transformer: Train Once with Length Drop, Use Anytime with Search

October 14, 2020 · Entered Twilight · 🏛 Annual Meeting of the Association for Computational Linguistics

"Last commit was 5.0 years ago (≥5 year threshold)"

Evidence collected by the PWNC Scanner

Repo contents: LICENSE, NOTICE, README.md, length_adaptive_transformer, run_glue.py, run_squad.py

Authors: Gyuwan Kim, Kyunghyun Cho
arXiv ID: 2010.07003
Category: cs.CL (Computation & Language)
Cross-listed: cs.LG
Citations: 107
Venue: Annual Meeting of the Association for Computational Linguistics
Repository: https://github.com/clovaai/length-adaptive-transformer (⭐ 102)
Last Checked: 1 month ago

Abstract

Despite transformers' impressive accuracy, their computational cost is often prohibitive when computational resources are limited. Most previous approaches to improving inference efficiency require a separate model for each possible computational budget. In this paper, we extend PoWER-BERT (Goyal et al., 2020) and propose the Length-Adaptive Transformer, which can be used for various inference scenarios after one-shot training. We train a transformer with LengthDrop, a structural variant of dropout, which stochastically determines a sequence length at each layer. We then conduct a multi-objective evolutionary search to find a length configuration that maximizes the accuracy and minimizes the efficiency metric under any given computational budget. Additionally, we significantly extend the applicability of PoWER-BERT beyond sequence-level classification into token-level classification with a Drop-and-Restore process that drops word vectors temporarily in intermediate layers and restores them at the last layer if necessary. We empirically verify the utility of the proposed approach by demonstrating a superior accuracy-efficiency trade-off under various setups, including span-based question answering and text classification. Code is available at https://github.com/clovaai/length-adaptive-transformer.
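
As a rough illustration of two mechanisms the abstract names, the Python sketch below samples a per-layer length configuration in the spirit of LengthDrop and mimics Drop-and-Restore on a toy sequence. It is not taken from the repository: the function names, the uniform sampling range between (1 - drop_prob) times the current length and the current length, the `importance` scores, and the use of a layer counter as a stand-in for a hidden vector are all assumptions made for illustration.

```python
import random


def sample_length_config(seq_len, num_layers, drop_prob, rng):
    """Sample a non-increasing sequence length for every layer.

    Assumption: each layer's length is drawn uniformly from
    [(1 - drop_prob) * current, current]; the abstract only says the
    length is determined stochastically at each layer.
    """
    lengths = []
    current = seq_len
    for _ in range(num_layers):
        low = max(1, int((1.0 - drop_prob) * current))
        current = rng.randint(low, current)  # inclusive on both ends
        lengths.append(current)
    return lengths


def drop_and_restore(importance, lengths):
    """Toy Drop-and-Restore over a sequence of len(importance) positions.

    `importance` is a hypothetical per-position score used to decide
    which positions to keep. The returned list has one entry per
    original position: the number of layers that position went through
    before it was dropped. Dropped positions keep their last value, so
    the output is full length again after the final layer.
    """
    n = len(importance)
    depth = [0] * n           # stand-in for each position's hidden vector
    active = list(range(n))   # positions still being processed
    for keep in lengths:
        for i in active:
            depth[i] += 1     # toy "layer": only active positions change
        ranked = sorted(active, key=lambda i: importance[i], reverse=True)
        active = sorted(ranked[:keep])  # keep the top-scoring positions
    return depth


if __name__ == "__main__":
    rng = random.Random(0)
    print(sample_length_config(seq_len=128, num_layers=12, drop_prob=0.2, rng=rng))
    print(drop_and_restore([0.9, 0.1, 0.5, 0.7], lengths=[3, 2, 2]))
```

In the toy run, the position with the lowest score is dropped after the first layer and the next lowest after the second, yet the result still has four entries, which is the property token-level tasks such as span-based question answering need.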