The promise of generative AI (GenAI) for specialized fields like law has created a pressing need for domain-specific benchmarks to evaluate performance, reliability, and safety. This chapter explores the construction of such benchmarks through the lens of legal applications. It first introduces the concept of benchmarking and its central role in assessing AI systems. It then examines the challenges unique to benchmarking in the legal domain, including evaluating unstructured text, cost constraints, training data leakage, and subjective labeling. The chapter concludes by highlighting how benchmark development can serve as a catalyst for interdisciplinary collaboration between legal experts and AI researchers. As GenAI becomes increasingly embedded in high-stakes domains, robust benchmarking will be essential to ensure accountability, enable informed governance, and steer technical progress toward socially beneficial ends.