Adding Error Bars to Evals: A Statistical Approach to Language Model Evaluations

November 01, 2024 · Declared Dead · 🏛 arXiv.org

👻 CAUSE OF DEATH: Ghosted
No code link whatsoever

"No code URL or promise found in abstract"

Evidence collected by the PWNC Scanner

Authors: Evan Miller
arXiv ID: 2411.00640
Category: stat.AP
Cross-listed: cs.CL
Citations: 71
Venue: arXiv.org
Last checked: 1 month ago
Abstract
Evaluations are critical for understanding the capabilities of large language models (LLMs). Fundamentally, evaluations are experiments; but the literature on evaluations has largely ignored the literature from other sciences on experiment analysis and planning. This article shows researchers with some training in statistics how to think about and analyze data from language model evaluations. Conceptualizing evaluation questions as having been drawn from an unseen super-population, we present formulas for analyzing evaluation data, measuring differences between two models, and planning an evaluation experiment. We make a number of specific recommendations for running language model evaluations and reporting experiment results in a way that minimizes statistical noise and maximizes informativeness.
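Since this crypt entry carries no code, here is the abstract's central idea in runnable form. Treating each eval question as a draw from an unseen super-population means ordinary CLT standard errors apply to a model's mean score. The sketch below is illustrative only, not code from the paper: the helper names and the simulated scores are invented here. It computes a CLT confidence interval for one model's accuracy, plus a paired-difference interval for comparing two models graded on the same questions.

```python
import numpy as np
from scipy import stats

def mean_and_ci(scores, alpha=0.05):
    """CLT-based mean, standard error, and (1 - alpha) CI for per-question scores."""
    scores = np.asarray(scores, dtype=float)
    mean = scores.mean()
    sem = scores.std(ddof=1) / np.sqrt(scores.size)  # standard error of the mean
    z = stats.norm.ppf(1 - alpha / 2)
    return mean, sem, (mean - z * sem, mean + z * sem)

def paired_difference(scores_a, scores_b, alpha=0.05):
    """Compare two models on the same questions via per-question differences.

    Pairing removes shared question difficulty from the variance, so the
    error bar on the difference is narrower than for two independent means.
    """
    diffs = np.asarray(scores_a, dtype=float) - np.asarray(scores_b, dtype=float)
    return mean_and_ci(diffs, alpha)

# Made-up data: two models graded 0/1 on the same 1,000 questions,
# with a latent per-question difficulty shared by both models.
rng = np.random.default_rng(42)
base = rng.random(1000)
model_a = (base < 0.72).astype(float)                                 # A solves the easier 72%
model_b = ((base < 0.72) & (rng.random(1000) < 0.95)).astype(float)   # B misses a few of those

mean_a, sem_a, ci_a = mean_and_ci(model_a)
diff, sem_d, ci_d = paired_difference(model_a, model_b)
print(f"Model A accuracy: {mean_a:.3f} (95% CI {ci_a[0]:.3f}..{ci_a[1]:.3f})")
print(f"A minus B:        {diff:+.3f} (95% CI {ci_d[0]:.3f}..{ci_d[1]:.3f})")
```

The i.i.d. standard error above is the simplest case of the super-population framing; when questions arrive in groups (say, several questions drawn from one source document), the same framing would call for clustered standard errors instead.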
Community shame:
Not yet rated

📜 Similar Papers

In the same crypt – stat.AP

R.I.P. 👻 Ghosted

Forecasting: theory and practice

Fotios Petropoulos, Daniele Apiletti, ... (+78 more)

stat.AP ๐Ÿ› International Journal of Forecasting ๐Ÿ“š 481 cites 5 years ago

Died the same way – 👻 Ghosted