SciREX: A Challenge Dataset for Document-Level Information Extraction
May 01, 2020 ยท Entered Twilight ยท ๐ Annual Meeting of the Association for Computational Linguistics
"Last commit was 5.0 years ago (โฅ5 year threshold)"
Evidence collected by the PWNC Scanner
Repo contents: .gitignore, Annotation Guidelines.pdf, LICENSE, README.md, Statistics.md, docs, dygiepp, notebooks, requirements.txt, scirex, scirex_dataset, scirex_utilities, test
Authors
Sarthak Jain, Madeleine van Zuylen, Hannaneh Hajishirzi, Iz Beltagy
arXiv ID
2005.00512
Category
cs.CL: Computation & Language
Cross-listed
cs.IR,
cs.LG
Citations
186
Venue
Annual Meeting of the Association for Computational Linguistics
Repository
https://github.com/allenai/SciREX
โญ 138
Last Checked
1 month ago
Abstract
Extracting information from full documents is an important problem in many domains, but most previous work focus on identifying relationships within a sentence or a paragraph. It is challenging to create a large-scale information extraction (IE) dataset at the document level since it requires an understanding of the whole document to annotate entities and their document-level relationships that usually span beyond sentences or even sections. In this paper, we introduce SciREX, a document level IE dataset that encompasses multiple IE tasks, including salient entity identification and document level $N$-ary relation identification from scientific articles. We annotate our dataset by integrating automatic and human annotations, leveraging existing scientific knowledge resources. We develop a neural model as a strong baseline that extends previous state-of-the-art IE models to document-level IE. Analyzing the model performance shows a significant gap between human performance and current baselines, inviting the community to use our dataset as a challenge to develop document-level IE models. Our data and code are publicly available at https://github.com/allenai/SciREX
Community Contributions
Found the code? Know the venue? Think something is wrong? Let us know!
๐ Similar Papers
In the same crypt โ Computation & Language
๐
๐
Old Age
๐
๐
Old Age
BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
R.I.P.
๐ป
Ghosted
Language Models are Few-Shot Learners
R.I.P.
๐ป
Ghosted
RoBERTa: A Robustly Optimized BERT Pretraining Approach
R.I.P.
๐ป
Ghosted
BART: Denoising Sequence-to-Sequence Pre-training for Natural Language Generation, Translation, and Comprehension
R.I.P.
๐ป
Ghosted