Adding Error Bars to Evals: A Statistical Approach to Language Model Evaluations
November 01, 2024 · Declared Dead · arXiv.org
"No code URL or promise found in abstract"
Evidence collected by the PWNC Scanner
Authors
Evan Miller
arXiv ID
2411.00640
Category
stat.AP
Cross-listed
cs.CL
Citations
71
Venue
arXiv.org
Last Checked
1 month ago
Abstract
Evaluations are critical for understanding the capabilities of large language models (LLMs). Fundamentally, evaluations are experiments; but the literature on evaluations has largely ignored the literature from other sciences on experiment analysis and planning. This article shows researchers with some training in statistics how to think about and analyze data from language model evaluations. Conceptualizing evaluation questions as having been drawn from an unseen super-population, we present formulas for analyzing evaluation data, measuring differences between two models, and planning an evaluation experiment. We make a number of specific recommendations for running language model evaluations and reporting experiment results in a way that minimizes statistical noise and maximizes informativeness.
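No code accompanies the paper, but the kind of analysis the abstract describes can be sketched directly. The snippet below is an illustrative example (not the author's implementation): treating each eval question's score as a draw from an unseen super-population, it computes a mean with its standard error, and a paired-difference confidence interval for comparing two models on the same questions. The helper names and toy scores are made up for illustration.

```python
import math

def mean_and_se(scores):
    """Sample mean and standard error under the super-population view:
    each question's score is an i.i.d. draw from an unseen question pool."""
    n = len(scores)
    mean = sum(scores) / n
    var = sum((s - mean) ** 2 for s in scores) / (n - 1)  # sample variance
    return mean, math.sqrt(var / n)

def paired_difference_ci(scores_a, scores_b, z=1.96):
    """Approximate 95% CI for the mean score difference between two models
    evaluated on the SAME questions; pairing removes the shared
    per-question variance and tightens the interval."""
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    mean, se = mean_and_se(diffs)
    return mean - z * se, mean + z * se

# Toy data: 1 = correct, 0 = incorrect, one entry per eval question.
a = [1, 1, 0, 1, 1, 0, 1, 1, 1, 0]   # model A
b = [1, 0, 0, 1, 0, 0, 1, 1, 0, 0]   # model B

m, se = mean_and_se(a)
lo, hi = paired_difference_ci(a, b)
print(f"model A accuracy: {m:.2f} +/- {1.96 * se:.2f}")
print(f"A - B difference, 95% CI: [{lo:.2f}, {hi:.2f}]")
```

With real evals, one would report the interval rather than the point estimate alone; if the CI for the difference excludes zero, the gap between models is unlikely to be statistical noise.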
Community Contributions
Found the code? Know the venue? Think something is wrong? Let us know!
Similar Papers
In the same crypt – stat.AP (all 👻 Ghosted):
Sequence-to-point learning with neural networks for nonintrusive load monitoring
Predictive Business Process Monitoring with LSTM Neural Networks
Forecasting: theory and practice
Accurate estimation of influenza epidemics using Google search data via ARGO
Survey of resampling techniques for improving classification performance in unbalanced datasets
Died the same way – 👻 Ghosted:
Language Models are Few-Shot Learners
PyTorch: An Imperative Style, High-Performance Deep Learning Library
XGBoost: A Scalable Tree Boosting System