MM-Eval: A Hierarchical Benchmark for Modern Mongolian Evaluation in LLMs
November 14, 2024 · Declared Dead · arXiv.org
Repo contents: knowledge_eval.json, paper.pdf, reasoning_eval.json, semantics_eval.json, syntax_eval.json
Authors
Mengyuan Zhang, Ruihui Wang, Bo Xia, Yuan Sun, Xiaobing Zhao
arXiv ID
2411.09492
Category
cs.CL: Computation & Language
Cross-listed
cs.AI
Citations
1
Venue
arXiv.org
Repository
https://github.com/joenahm/MM-Eval
⭐ 7
Last Checked
1 month ago
Abstract
Large language models (LLMs) excel in high-resource languages but face notable challenges in low-resource languages like Mongolian. This paper addresses these challenges by categorizing capabilities into language abilities (syntax and semantics) and cognitive abilities (knowledge and reasoning). To systematically evaluate these areas, we developed MM-Eval, a specialized dataset based on Modern Mongolian Language Textbook I and enriched with WebQSP and MGSM datasets. Preliminary experiments on models including Qwen2-7B-Instruct, GLM4-9b-chat, Llama3.1-8B-Instruct, GPT-4, and DeepseekV2.5 revealed that: 1) all models performed better on syntactic tasks than semantic tasks, highlighting a gap in deeper language understanding; and 2) knowledge tasks showed a moderate decline, suggesting that models can transfer general knowledge from high-resource to low-resource contexts. The release of MM-Eval, comprising 569 syntax, 677 semantics, 344 knowledge, and 250 reasoning tasks, offers valuable insights for advancing NLP and LLMs in low-resource languages like Mongolian. The dataset is available at https://github.com/joenahm/MM-Eval.
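The category breakdown in the abstract can be tallied with a short sketch. The counts are taken from the abstract; the names `MM_EVAL_TASKS` and `summarize` are hypothetical and are not part of the MM-Eval repository, whose JSON schema is not described here.

```python
# Task counts per category, as reported in the abstract.
# MM_EVAL_TASKS and summarize are illustrative names, not part of the MM-Eval repo.
MM_EVAL_TASKS = {
    "language": {"syntax": 569, "semantics": 677},
    "cognitive": {"knowledge": 344, "reasoning": 250},
}


def summarize(tasks):
    """Return the grand total, per-ability totals, and per-category shares."""
    ability_totals = {ability: sum(cats.values()) for ability, cats in tasks.items()}
    total = sum(ability_totals.values())
    shares = {
        name: count / total
        for cats in tasks.values()
        for name, count in cats.items()
    }
    return total, ability_totals, shares


total, ability_totals, shares = summarize(MM_EVAL_TASKS)
print(f"total tasks: {total}")
for ability, count in ability_totals.items():
    print(f"{ability}: {count}")
for name, share in shares.items():
    print(f"{name}: {share:.1%}")
```

Under these counts, language-ability tasks (syntax and semantics) make up roughly two thirds of the benchmark, which is worth keeping in mind when comparing aggregate scores across the two ability groups.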
Similar Papers
In the same crypt: Computation & Language
Old Age
Old Age
BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
R.I.P.
👻
Ghosted
Language Models are Few-Shot Learners
R.I.P.
👻
Ghosted
RoBERTa: A Robustly Optimized BERT Pretraining Approach
R.I.P.
👻
Ghosted
BART: Denoising Sequence-to-Sequence Pre-training for Natural Language Generation, Translation, and Comprehension
R.I.P.
👻
Ghosted
Deep contextualized word representations
Died the same way: 🦴 Skeleton Repo
R.I.P.
🦴
Skeleton Repo
EuroSAT: A Novel Dataset and Deep Learning Benchmark for Land Use and Land Cover Classification
R.I.P.
🦴
Skeleton Repo
Deep Learning for 3D Point Clouds: A Survey
R.I.P.
🦴
Skeleton Repo
Adversarial Examples: Attacks and Defenses for Deep Learning
R.I.P.
🦴
Skeleton Repo