SART - Similarity, Analogies, and Relatedness for Tatar Language: New Benchmark Datasets for Word Embeddings Evaluation
March 31, 2019 · Entered Twilight · Conference on Intelligent Text Processing and Computational Linguistics
"Last commit was 6.0 years ago (β₯5 year threshold)"
Evidence collected by the PWNC Scanner
Repo contents: LICENSE.txt, README.md, annotation_instructions, datasets, evaluate.py, utils.py
Authors
Albina Khusainova, Adil Khan, Adín Ramírez Rivera
arXiv ID
1904.00365
Category
cs.CL: Computation & Language
Citations
9
Venue
Conference on Intelligent Text Processing and Computational Linguistics
Repository
https://github.com/tat-nlp/SART
⭐ 3
Last Checked
1 month ago
Abstract
There is a huge imbalance between the languages currently spoken and the resources available to study them. Most of the attention naturally goes to the "big" languages: those with the largest presence in terms of media and number of speakers. Other, less represented languages sometimes do not even have a good-quality corpus to study them. In this paper, we tackle this imbalance by presenting a new set of evaluation resources for Tatar, a language of the Turkic language family spoken mainly in the Republic of Tatarstan, Russia. We present three datasets: Similarity and Relatedness datasets that consist of human-scored word pairs and can be used to evaluate semantic models; and an Analogies dataset that comprises analogy questions and allows one to explore semantic, syntactic, and morphological aspects of language modeling. All three datasets build upon existing datasets for the English language and follow the same structure. However, they are not mere translations: they take into account specifics of the Tatar language and expand beyond the original datasets. We evaluate state-of-the-art word embedding models for both languages using our proposed datasets for Tatar and the original datasets for English, and report our findings on the performance comparison.
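The evaluation recipe such datasets support is standard: compare a model's cosine similarities against the human scores with Spearman's rank correlation, and answer analogy questions with vector arithmetic. Below is a minimal Python sketch of that recipe. The tab-separated `word1 word2 score` file format, the function names, and the 3CosAdd analogy method are illustrative assumptions, not the interface of the repo's actual evaluate.py.

```python
# Sketch of the standard word-embedding evaluation recipe.
# File format, names, and the 3CosAdd method are assumptions for
# illustration; see evaluate.py in the repo for the authors' code.
import numpy as np
from scipy.stats import spearmanr

def cosine(u, v):
    """Cosine similarity between two vectors."""
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

def evaluate_similarity(embeddings, dataset_path):
    """Correlate model similarities with human scores (Spearman's rho).

    `embeddings` maps a word to its vector; `dataset_path` points to a
    tab-separated file of `word1  word2  human_score` rows (assumed
    format). Pairs with out-of-vocabulary words are skipped.
    """
    model_scores, human_scores = [], []
    with open(dataset_path, encoding="utf-8") as f:
        for line in f:
            w1, w2, score = line.rstrip("\n").split("\t")
            if w1 in embeddings and w2 in embeddings:
                model_scores.append(cosine(embeddings[w1], embeddings[w2]))
                human_scores.append(float(score))
    rho, _ = spearmanr(model_scores, human_scores)
    return rho

def solve_analogy(embeddings, a, b, c):
    """Answer 'a : b :: c : ?' via 3CosAdd: the vocabulary word
    closest to (b - a + c), excluding the question words themselves."""
    target = embeddings[b] - embeddings[a] + embeddings[c]
    best_word, best_sim = None, -np.inf
    for word, vec in embeddings.items():
        if word in (a, b, c):
            continue
        sim = cosine(target, vec)
        if sim > best_sim:
            best_word, best_sim = word, sim
    return best_word
```

An analogy question counts as correct when the returned word matches the expected answer; accuracy over all questions is the reported score.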
Community Contributions
Found the code? Know the venue? Think something is wrong? Let us know!
Similar Papers
In the same crypt → Computation & Language
BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
R.I.P.
👻
Ghosted
Language Models are Few-Shot Learners
R.I.P.
👻
Ghosted
RoBERTa: A Robustly Optimized BERT Pretraining Approach
R.I.P.
👻
Ghosted
BART: Denoising Sequence-to-Sequence Pre-training for Natural Language Generation, Translation, and Comprehension
R.I.P.
👻
Ghosted