MiLMo: Minority Multilingual Pre-trained Language Model
December 04, 2022 · Entered Twilight · IEEE International Conference on Systems, Man and Cybernetics
"No code URL or promise found in abstract"
"Code repo scraped from project page (backfill)"
Evidence collected by the PWNC Scanner
Repo contents: Document Classes.xlsx, README.md, Tibetan News Classification Corpus.zip, classifier_v_0, title classes.xlsx
Authors
Junjie Deng, Hanru Shi, Xinhe Yu, Wugedele Bao, Yuan Sun, Xiaobing Zhao
arXiv ID
2212.01779
Category
cs.CL: Computation & Language
Citations
2
Venue
IEEE International Conference on Systems, Man and Cybernetics
Repository
https://github.com/FudanNLP/Tibetan-Classification
⭐ 18
Last Checked
1 month ago
Abstract
Pre-trained language models are trained on large-scale unsupervised data and can then be fine-tuned on only small-scale labeled datasets while still achieving good results. Multilingual pre-trained language models can be trained on multiple languages, so a single model can understand multiple languages at the same time. At present, research on pre-trained models mainly focuses on rich-resource languages, while there is relatively little research on low-resource languages such as minority languages, and public multilingual pre-trained language models do not work well for minority languages. Therefore, this paper constructs a multilingual pre-trained model named MiLMo that performs better on minority language tasks, covering Mongolian, Tibetan, Uyghur, Kazakh and Korean. To address the scarcity of minority-language datasets and verify the effectiveness of the MiLMo model, this paper constructs a minority multilingual text classification dataset named MiTC, and trains a word2vec model for each language. By comparing the word2vec model and the pre-trained model on the text classification task, this paper provides an optimal scheme for downstream task research on minority languages. The final experimental results show that the performance of the pre-trained model is better than that of the word2vec model, and it achieves the best results in minority multilingual text classification. The multilingual pre-trained model MiLMo, multilingual word2vec models and multilingual text classification dataset MiTC are published at http://milmo.cmli-nlp.com/.
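The comparison the abstract describes (fine-tuning a pre-trained encoder on a small labeled classification set, versus a word2vec baseline) can be sketched roughly as below. This is a minimal illustration, not the released MiLMo code: the `bert-base-multilingual-cased` checkpoint, the 12-class label count, and the toy inputs are placeholder assumptions standing in for the actual MiLMo weights and MiTC data published at http://milmo.cmli-nlp.com/.

```python
# Minimal fine-tuning sketch (assumptions: checkpoint name, label count, toy data).
import torch
from torch.utils.data import DataLoader, Dataset
from transformers import AutoTokenizer, AutoModelForSequenceClassification

class NewsDataset(Dataset):
    """Wraps (text, label) pairs, e.g. minority-language news titles with class ids."""
    def __init__(self, texts, labels, tokenizer, max_len=128):
        self.enc = tokenizer(texts, truncation=True, padding="max_length",
                             max_length=max_len, return_tensors="pt")
        self.labels = torch.tensor(labels)

    def __len__(self):
        return len(self.labels)

    def __getitem__(self, i):
        item = {k: v[i] for k, v in self.enc.items()}
        item["labels"] = self.labels[i]
        return item

# Placeholder checkpoint: substitute the MiLMo weights from milmo.cmli-nlp.com.
checkpoint = "bert-base-multilingual-cased"   # assumption, not the MiLMo identifier
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=12)

train_texts = ["<Tibetan news title>", "<Uyghur news title>"]   # toy examples
train_labels = [0, 1]
loader = DataLoader(NewsDataset(train_texts, train_labels, tokenizer),
                    batch_size=16, shuffle=True)

optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)
model.train()
for epoch in range(3):                          # small labeled set, few epochs
    for batch in loader:
        optimizer.zero_grad()
        loss = model(**batch).loss              # cross-entropy over class logits
        loss.backward()
        optimizer.step()
```

The word2vec baseline the paper compares against would typically use static per-language embeddings fed to a separate classifier rather than fine-tuning the encoder end to end; the reported result is that the pre-trained model outperforms that scheme on MiTC.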
Similar Papers
In the same crypt · Computation & Language
BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding · R.I.P. · 👻 Ghosted
Language Models are Few-Shot Learners · R.I.P. · 👻 Ghosted
RoBERTa: A Robustly Optimized BERT Pretraining Approach · R.I.P. · 👻 Ghosted
BART: Denoising Sequence-to-Sequence Pre-training for Natural Language Generation, Translation, and Comprehension · R.I.P. · 👻 Ghosted