Old Age
Margin-based Parallel Corpus Mining with Multilingual Sentence Embeddings
November 03, 2018 · 🏛 Annual Meeting of the Association for Computational Linguistics
"No code URL or promise found in abstract"
"HuggingFace models found (backfill)"
Evidence collected by the PWNC Scanner
Authors
Mikel Artetxe, Holger Schwenk
arXiv ID
1811.01136
Category
cs.CL: Computation & Language
Cross-listed
cs.AI, cs.LG
Citations
217
Venue
Annual Meeting of the Association for Computational Linguistics
Repository
https://huggingface.co/datasets/ngoan/WikiMatrix.en-vi
Last Checked
9 days ago
Abstract
Machine translation is highly sensitive to the size and quality of the training data, which has led to an increasing interest in collecting and filtering large parallel corpora. In this paper, we propose a new method for this task based on multilingual sentence embeddings. In contrast to previous approaches, which rely on nearest neighbor retrieval with a hard threshold over cosine similarity, our proposed method accounts for the scale inconsistencies of this measure, considering the margin between a given sentence pair and its closest candidates instead. Our experiments show large improvements over existing methods. We outperform the best published results on the BUCC mining task and the UN reconstruction task by more than 10 F1 and 30 precision points, respectively. Filtering the English-German ParaCrawl corpus with our approach, we obtain 31.2 BLEU points on newstest2014, an improvement of more than one point over the best official filtered version.
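The margin idea in the abstract can be illustrated with a small sketch. The abstract does not give the exact scoring formula, so the code below assumes one plausible ratio formulation: the cosine similarity of a candidate pair divided by the average similarity of each sentence to its k nearest candidates in the other language. The function names (`margin_scores`, `mine_pairs`), the value of `k`, the threshold, and the toy two-dimensional vectors are all hypothetical and stand in for real multilingual sentence embeddings. The sketch also assumes the neighbor averages are positive.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def mean_top_k(sims, k):
    """Average of the k largest similarities (fewer if the list is shorter)."""
    top = sorted(sims, reverse=True)[:k]
    return sum(top) / len(top)


def margin_scores(src, tgt, k=2):
    """Ratio-margin score for every (source, target) pair.

    Assumed formula: cos(x, y) divided by the mean of the two
    neighborhood averages, one taken around x and one around y.
    """
    sim = [[cosine(x, y) for y in tgt] for x in src]
    src_avg = [mean_top_k(row, k) for row in sim]
    tgt_avg = [
        mean_top_k([sim[i][j] for i in range(len(src))], k)
        for j in range(len(tgt))
    ]
    return [
        [sim[i][j] / ((src_avg[i] + tgt_avg[j]) / 2) for j in range(len(tgt))]
        for i in range(len(src))
    ]


def mine_pairs(src, tgt, k=2, threshold=1.0):
    """Keep each source sentence's best target if its margin clears the threshold."""
    scores = margin_scores(src, tgt, k)
    pairs = []
    for i, row in enumerate(scores):
        j = max(range(len(row)), key=row.__getitem__)
        if row[j] >= threshold:
            pairs.append((i, j, row[j]))
    return pairs


# Toy embeddings: two source sentences, three target candidates.
src = [[1.0, 0.0], [0.0, 1.0]]
tgt = [[0.9, 0.1], [0.1, 0.9], [0.7, 0.7]]
pairs = mine_pairs(src, tgt, k=2, threshold=1.0)
print(pairs)
```

Dividing by the neighborhood average is what distinguishes this from a hard cosine threshold: a pair is kept only if it stands out from the other candidates around it, so sentences whose similarities are uniformly high or uniformly low are judged on the same relative scale.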
📜 Similar Papers
In the same crypt — Computation & Language
Old Age
BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
Old Age
XLNet: Generalized Autoregressive Pretraining for Language Understanding
The Ethereal
Effective Approaches to Attention-based Neural Machine Translation
Old Age
A large annotated corpus for learning natural language inference