The Missing Science of AI Evaluation
AI evaluations inform critical decisions, from the valuations of trillion-dollar companies to policies on regulating AI. Yet evaluation methods have failed to keep pace with deployment, creating an evaluation crisis in which performance in the lab fails to predict real-world utility.
In this talk, I will discuss the evaluation crisis in a high-stakes domain: AI-based science. Across dozens of fields, from medicine to political science, I find that flawed evaluation practices have led to overoptimistic claims about AI’s accuracy, affecting hundreds of published papers. To address these evaluation failures, I present a consensus-based checklist that identifies common pitfalls and consolidates best practices for researchers adopting AI, and a benchmark to foster the development of AI agents that can verify scientific reproducibility.
These evaluation failures are not confined to science. I examine how AI agent benchmarks miss many failure modes and present systems that identify these errors. I also discuss how better AI evaluation can inform policymaking, drawing on my work on open foundation models and my engagement with state and federal agencies.
Why does the evaluation crisis persist? The AI community has poured enormous resources into building evaluations for models, but not into investigating how models impact the world. To address the crisis, we need to build a systematic science of AI evaluation to bridge the gap between benchmark performance and real-world impact.
This lecture will also be livestreamed via Zoom. You are welcome to join us either in South Hall or online.
For online participants
Online participants must have a Zoom account and be logged in. Sign up for your free account here. If this is your first time using Zoom, please allow a few extra minutes to download and install the desktop client or mobile app.
Speaker
Sayash Kapoor
Sayash Kapoor is a Jacobus Fellow and a computer science Ph.D. candidate at Princeton University. He is a coauthor of AI Snake Oil, one of Nature’s 10 best books of 2024. His newsletter, AI as Normal Technology, is read by over 70,000 AI enthusiasts, researchers, and policymakers. His work has been published in leading scientific journals such as Science and Nature Human Behaviour and at conferences including ICLR, ICML, and NeurIPS. He has written for mainstream outlets including The Wall Street Journal and WIRED, and his work has been featured in The New York Times, The Atlantic, The Washington Post, Bloomberg, and many others. Kapoor has been recognized with various awards, including a best paper award at ACM FAccT, an impact recognition award at ACM CSCW, a Privacy Papers for Policymakers award, and inclusion in TIME’s inaugural list of the 100 most influential people in AI.
