Old Age
A Closer Look into Automatic Evaluation Using Large Language Models
October 09, 2023 · Entered Twilight · arXiv.org
Repo contents: .gitignore, README.md, all_eval.py, data, gpt4_eval_summeval.py, gpt4_eval_topical_chat.py, meta_eval_summeval.py, paper.pdf, prompts, requirements.txt, results, run.sh, significance.py
Authors
Cheng-Han Chiang, Hung-yi Lee
arXiv ID
2310.05657
Category
cs.CL: Computation & Language
Citations
18
Venue
arXiv.org
Repository
https://github.com/d223302/A-Closer-Look-To-LLM-Evaluation/
⭐ 19
Last Checked
3 months ago
Abstract
Using large language models (LLMs) to evaluate text quality has recently gained popularity. Some prior works explore the idea of using LLMs for evaluation, while they differ in some details of the evaluation process. In this paper, we analyze LLM evaluation (Chiang and Lee, 2023) and G-Eval (Liu et al., 2023), and we discuss how those details in the evaluation process change how well the ratings given by LLMs correlate with human ratings. We find that the auto Chain-of-Thought (CoT) used in G-Eval does not always make G-Eval more aligned with human ratings. We also show that forcing the LLM to output only a numeric rating, as in G-Eval, is suboptimal. Last, we reveal that asking the LLM to explain its own ratings consistently improves the correlation between the ChatGPT and human ratings and pushes state-of-the-art (SoTA) correlations on two meta-evaluation datasets.
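The meta-evaluation the abstract describes comes down to correlating LLM-assigned ratings with human ratings for the same texts. The following is a minimal sketch of that step, not the repository's code: the function names, the sample ratings, and the choice of Pearson's r and Kendall's tau-b are all assumptions made for illustration, since the abstract does not say which coefficients are reported.

```python
from itertools import combinations
from math import sqrt


def pearson(xs, ys):
    """Pearson's r between two equal-length rating lists.

    Assumes neither list is constant (otherwise the denominator is zero).
    """
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def kendall_tau_b(xs, ys):
    """Kendall's tau-b, which adjusts for tied ratings.

    Ties are common when ratings are integers on a small scale.
    Assumes neither list is constant.
    """
    concordant = discordant = ties_x_only = ties_y_only = 0
    for (x1, y1), (x2, y2) in combinations(zip(xs, ys), 2):
        dx, dy = x1 - x2, y1 - y2
        if dx == 0 and dy == 0:
            continue  # tied on both ratings: counted in neither factor
        if dx == 0:
            ties_x_only += 1
        elif dy == 0:
            ties_y_only += 1
        elif dx * dy > 0:
            concordant += 1
        else:
            discordant += 1
    untied = concordant + discordant
    denom = sqrt((untied + ties_x_only) * (untied + ties_y_only))
    return (concordant - discordant) / denom


# Hypothetical ratings for five texts (illustrative only, not from the paper).
human_ratings = [4.0, 2.5, 3.0, 5.0, 1.5]
llm_ratings = [4, 3, 3, 5, 2]

print("Pearson r:", round(pearson(human_ratings, llm_ratings), 3))
print("Kendall tau-b:", round(kendall_tau_b(human_ratings, llm_ratings), 3))
```

A higher coefficient means the LLM's ratings track the human ratings more closely, which is the sense in which the abstract speaks of one prompting choice being "more aligned" than another.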
Similar Papers
In the same crypt: Computation & Language
Old Age
BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
R.I.P.
Ghosted
Language Models are Few-Shot Learners
R.I.P.
Ghosted
RoBERTa: A Robustly Optimized BERT Pretraining Approach
R.I.P.
Ghosted
BART: Denoising Sequence-to-Sequence Pre-training for Natural Language Generation, Translation, and Comprehension
R.I.P.
Ghosted