Adding Error Bars to Evals: A Statistical Approach to Language Model Evaluations
November 01, 2024 · Declared Dead · arXiv.org
"No code URL or promise found in abstract"
Evidence collected by the PWNC Scanner
Authors
Evan Miller
arXiv ID
2411.00640
Category
stat.AP
Cross-listed
cs.CL
Citations
71
Venue
arXiv.org
Last Checked
1 month ago
Abstract
Evaluations are critical for understanding the capabilities of large language models (LLMs). Fundamentally, evaluations are experiments; but the literature on evaluations has largely ignored the literature from other sciences on experiment analysis and planning. This article shows researchers with some training in statistics how to think about and analyze data from language model evaluations. Conceptualizing evaluation questions as having been drawn from an unseen super-population, we present formulas for analyzing evaluation data, measuring differences between two models, and planning an evaluation experiment. We make a number of specific recommendations for running language model evaluations and reporting experiment results in a way that minimizes statistical noise and maximizes informativeness.
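No code accompanies the paper, but the kind of analysis the abstract describes can be sketched directly. The snippet below is an illustrative example (not the author's implementation): treating each eval question's score as a draw from an unseen super-population, it computes a mean with its standard error, and a paired-difference confidence interval for comparing two models on the same questions. The helper names and toy scores are made up for illustration.

```python
import math

def mean_and_se(scores):
    """Sample mean and standard error under the super-population view:
    each question's score is an i.i.d. draw from an unseen question pool."""
    n = len(scores)
    mean = sum(scores) / n
    var = sum((s - mean) ** 2 for s in scores) / (n - 1)  # sample variance
    return mean, math.sqrt(var / n)

def paired_difference_ci(scores_a, scores_b, z=1.96):
    """Approximate 95% CI for the mean score difference between two models
    evaluated on the SAME questions; pairing removes the shared
    per-question variance and tightens the interval."""
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    mean, se = mean_and_se(diffs)
    return mean - z * se, mean + z * se

# Toy data: 1 = correct, 0 = incorrect, one entry per eval question.
a = [1, 1, 0, 1, 1, 0, 1, 1, 1, 0]   # model A
b = [1, 0, 0, 1, 0, 0, 1, 1, 0, 0]   # model B

m, se = mean_and_se(a)
lo, hi = paired_difference_ci(a, b)
print(f"model A accuracy: {m:.2f} +/- {1.96 * se:.2f}")
print(f"A - B difference, 95% CI: [{lo:.2f}, {hi:.2f}]")
```

With real evals, one would report the interval rather than the point estimate alone; if the CI for the difference excludes zero, the gap between models is unlikely to be statistical noise.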
Community Contributions
Found the code? Know the venue? Think something is wrong? Let us know!
Similar Papers
In the same crypt – stat.AP (all 👻 Ghosted):
Sequence-to-point learning with neural networks for nonintrusive load monitoring
Predictive Business Process Monitoring with LSTM Neural Networks
Forecasting: theory and practice
Accurate estimation of influenza epidemics using Google search data via ARGO
Survey of resampling techniques for improving classification performance in unbalanced datasets
Died the same way – 👻 Ghosted:
Language Models are Few-Shot Learners
PyTorch: An Imperative Style, High-Performance Deep Learning Library
XGBoost: A Scalable Tree Boosting System