Multi-Head Attention: Collaborate Instead of Concatenate

June 29, 2020 ยท Declared Dead ยท ๐Ÿ› arXiv.org

๐Ÿ‘ป CAUSE OF DEATH: Ghosted
No code link whatsoever

"No code URL or promise found in abstract"

Evidence collected by the PWNC Scanner

Authors Jean-Baptiste Cordonnier, Andreas Loukas, Martin Jaggi arXiv ID 2006.16362 Category cs.LG: Machine Learning Cross-listed cs.CL, stat.ML Citations 156 Venue arXiv.org Last Checked 4 months ago
Abstract
Attention layers are widely used in natural language processing (NLP) and are beginning to influence computer vision architectures. Training very large transformer models allowed significant improvement in both fields, but once trained, these networks show symptoms of over-parameterization. For instance, it is known that many attention heads can be pruned without impacting accuracy. This work aims to enhance current understanding on how multiple heads interact. Motivated by the observation that attention heads learn redundant key/query projections, we propose a collaborative multi-head attention layer that enables heads to learn shared projections. Our scheme decreases the number of parameters in an attention layer and can be used as a drop-in replacement in any transformer architecture. Our experiments confirm that sharing key/query dimensions can be exploited in language understanding, machine translation and vision. We also show that it is possible to re-parametrize a pre-trained multi-head attention layer into our collaborative attention layer. Collaborative multi-head attention reduces the size of the key and query projections by 4 for same accuracy and speed. Our code is public.
Community shame:
Not yet rated
Community Contributions

Found the code? Know the venue? Think something is wrong? Let us know!

๐Ÿ“œ Similar Papers

In the same crypt โ€” Machine Learning

Died the same way โ€” ๐Ÿ‘ป Ghosted