Why Attention? Analyze BiLSTM Deficiency and Its Remedies in the Case of NER
August 29, 2019 · Entered Twilight · arXiv.org
"Last commit was 6.0 years ago (≥5 year threshold)"
Evidence collected by the PWNC Scanner
Repo contents: README.md, analyze_prediction.py, conll2003.py, conll2003, evaluate.py, model.py, ontonotes.py, ontonotes, wnut2016.py, wnut2016, wnut2017.py, wnut2017
Authors
Peng-Hsuan Li, Tsu-Jui Fu, Wei-Yun Ma
arXiv ID
1908.11046
Category
cs.CL: Computation & Language
Citations
6
Venue
arXiv.org
Repository
https://github.com/jacobvsdanniel/cross-ner
⭐ 9
Last Checked
2 months ago
Abstract
BiLSTM has been prevalently used as a core module for NER in a sequence-labeling setup. State-of-the-art approaches use BiLSTM with additional resources such as gazetteers, language-modeling, or multi-task supervision to further improve NER. This paper instead takes a step back and focuses on analyzing problems of BiLSTM itself and how exactly self-attention can bring improvements. We formally show the limitation of (CRF-)BiLSTM in modeling cross-context patterns for each word -- the XOR limitation. Then, we show that two types of simple cross-structures -- self-attention and Cross-BiLSTM -- can effectively remedy the problem. We test the practical impacts of the deficiency on real-world NER datasets, OntoNotes 5.0 and WNUT 2017, with clear and consistent improvements over the baseline, up to 8.7% on some of the multi-token entity mentions. We give in-depth analyses of the improvements across several aspects of NER, especially the identification of multi-token mentions. This study should lay a sound foundation for future improvements on sequence-labeling NER. (Source codes: https://github.com/jacobvsdanniel/cross-ner)
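The abstract contrasts plain (CRF-)BiLSTM tagging with two simple cross-structures, self-attention and Cross-BiLSTM. The sketch below is a minimal, hypothetical PyTorch illustration of the self-attention variant, not a reproduction of the repository's model.py; the class name, layer sizes, and the use of a single nn.MultiheadAttention layer are assumptions chosen only to show where cross-context mixing enters a sequence-labeling tagger.

```python
# Minimal sketch (not the authors' exact model): a BiLSTM tagger for
# sequence-labeling NER, optionally followed by one self-attention layer
# so each position can condition on both contexts jointly -- the kind of
# "cross-structure" the paper argues remedies the XOR limitation.
# All hyperparameters and names here are illustrative assumptions.
import torch
import torch.nn as nn


class BiLSTMTagger(nn.Module):
    def __init__(self, vocab_size, num_tags, emb_dim=100, hidden_dim=200,
                 use_self_attention=True):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        # Bidirectional LSTM: forward and backward states are only
        # concatenated per position, not jointly transformed.
        self.bilstm = nn.LSTM(emb_dim, hidden_dim, batch_first=True,
                              bidirectional=True)
        self.use_self_attention = use_self_attention
        if use_self_attention:
            # One multi-head self-attention layer over the BiLSTM states
            # lets every position mix information across the whole sentence.
            self.attn = nn.MultiheadAttention(2 * hidden_dim, num_heads=4,
                                              batch_first=True)
        self.out = nn.Linear(2 * hidden_dim, num_tags)

    def forward(self, token_ids):
        x = self.embed(token_ids)            # (batch, seq, emb)
        h, _ = self.bilstm(x)                # (batch, seq, 2*hidden)
        if self.use_self_attention:
            a, _ = self.attn(h, h, h)        # cross-context mixing
            h = h + a                        # residual connection
        return self.out(h)                   # per-token tag scores


if __name__ == "__main__":
    model = BiLSTMTagger(vocab_size=1000, num_tags=9)
    tokens = torch.randint(0, 1000, (2, 12))  # toy batch of 2 sentences
    print(model(tokens).shape)                # torch.Size([2, 12, 9])
```

Toggling use_self_attention=False reduces the model to the plain BiLSTM baseline, which is the comparison the paper's analysis is built around.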
Similar Papers
In the same crypt · Computation & Language
BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding · R.I.P. · Ghosted
Language Models are Few-Shot Learners · R.I.P. · Ghosted
RoBERTa: A Robustly Optimized BERT Pretraining Approach · R.I.P. · Ghosted
BART: Denoising Sequence-to-Sequence Pre-training for Natural Language Generation, Translation, and Comprehension · R.I.P. · Ghosted