The Inadequacy of Offline LLM Evaluations: A Need to Account for Personalization in Model Behavior

Angelina Wang | Daniel E. Ho | Sanmi Koyejo
arXiv, September 2025

Standard offline evaluations for language models, which consist of a series of independent, stateless inferences, fail to capture how language models actually behave in practice, where personalization fundamentally alters model behavior. For instance, the same benchmark question posed to the same language model can produce markedly different responses depending on whether it is submitted to a stateless system, in one user’s chat session, or in a different user’s chat session. In this work, we provide empirical evidence of this phenomenon by comparing offline evaluations to field evaluations, in which 800 real users of ChatGPT and Gemini posed benchmark and other provided questions to their chat interfaces.