Large Product Key Memory for Pretrained Language Models

October 08, 2020 · Declared Dead · 🏛 Findings

Authors: Gyuwan Kim, Tae-Hwan Jung
arXiv ID: 2010.03881
Category: cs.CL: Computation & Language
Citations: 4
Venue: Findings
Repository: https://github.com/clovaai/pkm-transformers ⭐ 10
Last Checked: 1 month ago

Abstract

Product key memory (PKM), proposed by Lample et al. (2019), improves prediction accuracy by efficiently increasing model capacity with insignificant computational overhead. However, its empirical application has been limited to causal language modeling. Motivated by the recent success of pretrained language models (PLMs), we investigate how to incorporate large PKM into PLMs that can be finetuned for a wide variety of downstream NLP tasks. We define a new memory usage metric, and careful observation using this metric reveals that most memory slots remain outdated during the training of PKM-augmented models. To train better PLMs by tackling this issue, we propose two simple but effective solutions: (1) initializing from model weights pretrained without memory and (2) augmenting the model with PKM by addition rather than by replacing a feed-forward network. We verify that both are crucial for the pretraining of PKM-augmented PLMs, enhancing memory utilization and downstream performance. Code and pretrained weights are available at https://github.com/clovaai/pkm-transformers.
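As a rough illustration of the ideas in the abstract, the sketch below shows a toy product-key lookup and the difference between replacing a feed-forward sublayer with the memory and adding the memory alongside it. It is not taken from the authors' repository: all names (`pkm_lookup`, `block_replace`, `block_add`, `ffn`) are hypothetical, the sizes are tiny, `ffn` is a stand-in elementwise map, and the final slot-coverage count is only a simple proxy, not the memory usage metric defined in the paper. Initialization from weights pretrained without memory is not shown.

```python
import math
import random

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def top_k(scores, k):
    """Return (index, score) pairs for the k highest scores."""
    return sorted(enumerate(scores), key=lambda p: p[1], reverse=True)[:k]

def pkm_lookup(query, sub_keys_1, sub_keys_2, values, k):
    """Product-key lookup: search two small sub-key sets instead of all product keys."""
    half = len(query) // 2
    q1, q2 = query[:half], query[half:]
    best_1 = top_k([dot(q1, key) for key in sub_keys_1], k)
    best_2 = top_k([dot(q2, key) for key in sub_keys_2], k)
    n2 = len(sub_keys_2)
    # The score of product key (i, j) is the sum of its two sub-key scores.
    candidates = [(i * n2 + j, s1 + s2) for i, s1 in best_1 for j, s2 in best_2]
    selected = sorted(candidates, key=lambda p: p[1], reverse=True)[:k]
    # Softmax over the selected scores, then a weighted sum of the value slots.
    peak = max(s for _, s in selected)
    weights = [math.exp(s - peak) for _, s in selected]
    total = sum(weights)
    out = [0.0] * len(values[0])
    for (slot, _), w in zip(selected, weights):
        for d in range(len(out)):
            out[d] += (w / total) * values[slot][d]
    return out, [slot for slot, _ in selected]

def ffn(x):
    """Stand-in for a feed-forward sublayer (assumption: fixed elementwise map)."""
    return [max(0.0, v) * 0.5 for v in x]

def block_replace(x, memory):
    """Residual block where the memory replaces the feed-forward sublayer."""
    m, slots = pkm_lookup(x, *memory)
    return [a + b for a, b in zip(x, m)], slots

def block_add(x, memory):
    """Residual block where the memory output is added alongside the feed-forward output."""
    m, slots = pkm_lookup(x, *memory)
    return [a + b + c for a, b, c in zip(x, ffn(x), m)], slots

rng = random.Random(0)
dim, n_sub, k = 4, 8, 3
sub_keys_1 = [[rng.gauss(0, 1) for _ in range(dim // 2)] for _ in range(n_sub)]
sub_keys_2 = [[rng.gauss(0, 1) for _ in range(dim // 2)] for _ in range(n_sub)]
values = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(n_sub * n_sub)]
memory = (sub_keys_1, sub_keys_2, values, k)

# Crude proxy for memory utilization: fraction of slots touched by a batch of queries.
used = set()
for _ in range(50):
    x = [rng.gauss(0, 1) for _ in range(dim)]
    _, slots = block_add(x, memory)
    used.update(slots)
coverage = len(used) / len(values)
print(f"slots touched: {len(used)} of {len(values)}")
```

In this toy setup each query scores only `2 * n_sub` sub-keys and `k * k` candidates, yet selects from `n_sub * n_sub` value slots, which is the source of the capacity-versus-compute trade-off the abstract describes.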



📜 Similar Papers

In the same crypt — Computation & Language

🌅 🌅 Old Age

Attention Is All You Need

Ashish Vaswani, Noam Shazeer, ... (+6 more)

cs.CL 🏛 NeurIPS 📚 166.0K cites 8 years ago

🌅 🌅 Old Age

BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding

Jacob Devlin, Ming-Wei Chang, ... (+2 more)

cs.CL 🏛 NAACL 📚 110.2K cites 7 years ago

R.I.P. 👻 Ghosted

Language Models are Few-Shot Learners

Tom B. Brown, Benjamin Mann, ... (+29 more)

cs.CL 🏛 NeurIPS 📚 54.2K cites 5 years ago

R.I.P. 👻 Ghosted

RoBERTa: A Robustly Optimized BERT Pretraining Approach

Yinhan Liu, Myle Ott, ... (+8 more)

cs.CL 🏛 arXiv 📚 28.4K cites 6 years ago

R.I.P. 👻 Ghosted

BART: Denoising Sequence-to-Sequence Pre-training for Natural Language Generation, Translation, and Comprehension

Mike Lewis, Yinhan Liu, ... (+6 more)

cs.CL 🏛 ACL 📚 12.3K cites 6 years ago

R.I.P. 👻 Ghosted

Deep contextualized word representations

Matthew E. Peters, Mark Neumann, ... (+5 more)

cs.CL 🏛 NAACL 📚 12.0K cites 8 years ago

Died the same way — ⚰️ The Empty Tomb

R.I.P. ⚰️ The Empty Tomb

DSFD: Dual Shot Face Detector

Jian Li, Yabiao Wang, ... (+7 more)

cs.CV 🏛 CVPR 📚 462 cites 7 years ago

R.I.P. ⚰️ The Empty Tomb

InstanceCut: from Edges to Instances with MultiCut

Alexander Kirillov, Evgeny Levinkov, ... (+3 more)

cs.CV 🏛 CVPR 📚 261 cites 9 years ago

R.I.P. ⚰️ The Empty Tomb

FLNet: Landmark Driven Fetching and Learning Network for Faithful Talking Facial Animation Synthesis

Kuangxiao Gu, Yuqian Zhou, Thomas Huang

cs.CV 🏛 AAAI 📚 62 cites 6 years ago

R.I.P. ⚰️ The Empty Tomb

Personalized Showcases: Generating Multi-Modal Explanations for Recommendations

An Yan, Zhankui He, ... (+3 more)

cs.IR 🏛 SIGIR 📚 58 cites 3 years ago