๐
๐
Old Age
Exploring the Reliability of Large Language Models as Customized Evaluators for Diverse NLP Tasks
October 30, 2023 ยท Declared Dead ยท ๐ International Conference on Computational Linguistics
Authors
Qintong Li, Leyang Cui, Lingpeng Kong, Wei Bi
arXiv ID
2310.19740
Category
cs.CL: Computation & Language
Citations
19
Venue
International Conference on Computational Linguistics
Repository
https://github.com/qtli/CoEval}
Last Checked
1 month ago
Abstract
Previous work adopts large language models (LLMs) as evaluators to evaluate natural language process (NLP) tasks. However, certain shortcomings, e.g., fairness, scope, and accuracy, persist for current LLM evaluators. To analyze whether LLMs can serve as reliable alternatives to humans, we examine the fine-grained alignment between LLM evaluators and human annotators, particularly in understanding the target evaluation tasks and conducting evaluations that meet diverse criteria. This paper explores both conventional tasks (e.g., story generation) and alignment tasks (e.g., math reasoning), each with different evaluation criteria. Our analysis shows that 1) LLM evaluators can generate unnecessary criteria or omit crucial criteria, resulting in a slight deviation from the experts. 2) LLM evaluators excel in general criteria, such as fluency, but face challenges with complex criteria, such as numerical reasoning. We also find that LLM-pre-drafting before human evaluation can help reduce the impact of human subjectivity and minimize annotation outliers in pure human evaluation, leading to more objective evaluation.
Community Contributions
Found the code? Know the venue? Think something is wrong? Let us know!
๐ Similar Papers
In the same crypt โ Computation & Language
๐
๐
Old Age
BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
R.I.P.
๐ป
Ghosted
Language Models are Few-Shot Learners
R.I.P.
๐ป
Ghosted
RoBERTa: A Robustly Optimized BERT Pretraining Approach
R.I.P.
๐ป
Ghosted
BART: Denoising Sequence-to-Sequence Pre-training for Natural Language Generation, Translation, and Comprehension
R.I.P.
๐ป
Ghosted
Deep contextualized word representations
Died the same way โ ๐ 404 Not Found
R.I.P.
๐
404 Not Found
Deep High-Resolution Representation Learning for Visual Recognition
R.I.P.
๐
404 Not Found
HuggingFace's Transformers: State-of-the-art Natural Language Processing
R.I.P.
๐
404 Not Found
CCNet: Criss-Cross Attention for Semantic Segmentation
R.I.P.
๐
404 Not Found