Key Moments
Stanford CS336 Language Modeling from Scratch | Spring 2026 | Lecture 12: Evaluation
Language models now outperform humans on many benchmarks, but traditional evaluations like perplexity are insufficient. New methods focus on open-ended tasks and agentic capabilities, yet the challenges of "train contamination" and of defining safety remain.
Key Insights
Perplexity, a traditional metric measuring how well a language model assigns probability mass to a dataset, was the standard for evaluating language models for many years, with progress measured by its reduction.
The MMLU benchmark, introduced around the time of GPT-3, tests knowledge and reasoning across 57 subjects. Large models initially scored significantly above chance, and the benchmark has since become saturated.
Chat benchmarks like Chatbot Arena use human pairwise comparisons to rank models, offering a more realistic evaluation of open-ended responses, but are susceptible to biases and conflate style with correctness.
Agentic benchmarks like SWE-bench evaluate an agent's ability to perform tasks within a codebase, with evaluation often tied to passing unit tests, showing rapid improvement from around 16% to 93% success.
Humanity's Last Exam, a recent multimodal benchmark, shows current frontier models performing poorly, with scores in the single digits, indicating a need for continued development in complex reasoning.
Despite numerous benchmarks, defining and measuring "safety" in language models is complex, with challenges including "jailbreaking," context-dependency, and the dual-use nature of AI capabilities.
The evolution of language model evaluation: from perplexity to real-world tasks
The lecture begins by noting that, after covering how language models are trained, the crucial missing piece is evaluation: how to determine whether a trained model is 'good.' At first glance, evaluation seems mechanical: take prompts, collect responses, compute accuracy. However, it is a deep and critical topic that shapes AI development by setting 'north stars.' The core challenge lies in translating abstract desires (e.g., good conversation, reasoning) into concrete metrics and prompts. Early metrics like perplexity were a natural fit because language models are probability distributions over token sequences. Perplexity was used for decades, with progress measured by its reduction on datasets like Penn Treebank and WikiText-103. The paradigm shifted with models like GPT-2, which were evaluated out of distribution on existing benchmarks rather than on held-out data from their own training distribution, signaling a new era of evaluation beyond simple in-distribution perplexity.
Perplexity: the foundational metric and its limitations
Perplexity measures how much probability a language model assigns to a held-out test dataset: it is the exponential of the average negative log-likelihood per token, so lower is better. Minimizing perplexity is equivalent to training a model that closely matches the true underlying data distribution. This metric was central to early language modeling research, driving progress through perplexity reduction and informing scaling laws. However, perplexity has limitations. It weights every token equally, so a low-stakes prediction (like the first word of a sentence) counts the same as crucial factual information (like a founding date). Conditional perplexity can focus on the relevant tokens, and fill-in-the-blank tasks like LAMBADA and HellaSwag are essentially perplexity in disguise, but the metric still might not fully capture nuanced understanding or real-world utility. A critical issue with perplexity is the potential for "trust" problems in leaderboards: it is difficult to verify whether submitted models are genuinely outputting valid probability distributions or are exploiting the system.
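As a rough illustration of the definition above, here is a minimal Python sketch that computes perplexity from per-token log-probabilities, plus a conditional variant that scores only selected tokens (for example, the tokens of an answer). The function names and the boolean mask are assumptions made for this example; they do not come from the lecture.

```python
import math
from typing import Sequence


def perplexity(token_log_probs: Sequence[float]) -> float:
    """Exponential of the average negative log-likelihood per token.

    token_log_probs holds natural-log probabilities that a model assigned
    to each token of the test text (illustrative input, not a real API).
    """
    if len(token_log_probs) == 0:
        raise ValueError("need at least one token")
    avg_nll = -sum(token_log_probs) / len(token_log_probs)
    return math.exp(avg_nll)


def conditional_perplexity(
    token_log_probs: Sequence[float], is_target: Sequence[bool]
) -> float:
    """Perplexity restricted to the tokens flagged in is_target."""
    selected = [lp for lp, keep in zip(token_log_probs, is_target) if keep]
    return perplexity(selected)
```

A model that assigns probability 0.25 to every token has perplexity 4, which matches the intuition of being as uncertain as a uniform choice among four options. Note that the sketch trusts its inputs: nothing here checks that the log-probabilities came from a distribution that sums to one, which is the leaderboard "trust" problem mentioned above.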
Exam-based benchmarks: testing knowledge and reasoning
A significant shift in evaluation involved adopting human exams as a model for benchmarking. This approach offers controlled difficulty, unambiguous answers, and ease of grading. Influential benchmarks like MMLU (Massive Multitask Language Understanding) were created to test knowledge and reasoning across diverse subjects, using techniques like few-shot prompting with GPT-3. Initially, these benchmarks showed clear progress, but they rapidly became saturated as models improved, leading to the development of harder variants and new benchmarks like GPQA (Graduate-level Google-Proof QA). These benchmarks employ rigorous creation processes, involving human experts and curated questions designed to be difficult even for humans with internet access. Despite this, models continue to saturate even these challenging tests, highlighting the continuous evolution needed in benchmark design. A key concern remains "train contamination," where benchmark questions may inadvertently appear in training data, affecting score validity.
Evaluating open-ended responses: chat benchmarks and human preference
Moving beyond multiple-choice exams, chat benchmarks address the evaluation of open-ended responses. Chatbot Arena (now Arena AI) pioneered a system where humans compare outputs from two anonymized models, using these pairwise comparisons to generate Elo rankings. This method draws on real-world prompts from users, who are incentivized by free model access. However, it faces challenges: the user population is unknown and potentially biased, binary preferences can conflate stylistic choices with correctness, and humans may not be ideal judges for nuanced answers. To mitigate bias and improve reliability, methods like AlpacaEval and WildBench use language models (LMs) as judges. AlpacaEval, for instance, uses a model like GPT-4 as the judge when comparing a model's responses against a baseline, though early versions showed a bias toward longer responses. WildBench incorporates checklists or rubrics to make the evaluation task more defined, enhancing reliability whether the judges are humans or LMs. Despite progress, evaluating open-ended responses remains an ill-defined problem that requires careful attention to biases and structured evaluation criteria.
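To make the ranking step concrete, the following is a minimal sketch of the classic online Elo update applied to pairwise votes. It is an illustration under stated assumptions, not the arena's actual pipeline: the K-factor of 32, the starting rating of 1000, and the `(model_a, model_b, winner)` tuple format are all hypothetical choices for this example.

```python
from typing import Dict, Iterable, Optional, Tuple

Battle = Tuple[str, str, Optional[str]]  # (model_a, model_b, winner or None for a tie)


def expected_score(rating_a: float, rating_b: float) -> float:
    """Predicted probability that A beats B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def update_elo(
    ratings: Dict[str, float],
    battles: Iterable[Battle],
    k: float = 32.0,        # assumed step size
    initial: float = 1000.0,  # assumed starting rating
) -> Dict[str, float]:
    """Apply one Elo update per human vote, in the order given."""
    for model_a, model_b, winner in battles:
        ra = ratings.setdefault(model_a, initial)
        rb = ratings.setdefault(model_b, initial)
        ea = expected_score(ra, rb)
        if winner == model_a:
            sa = 1.0
        elif winner == model_b:
            sa = 0.0
        else:
            sa = 0.5  # tie
        ratings[model_a] = ra + k * (sa - ea)
        ratings[model_b] = rb + k * ((1.0 - sa) - (1.0 - ea))
    return ratings
```

Because each vote only says which response was preferred, the rating absorbs whatever drove that preference, including length or style. This simple online form also depends on the order in which votes arrive.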
Agentic benchmarks: assessing models in action
Agentic benchmarks evaluate what LMs *do*, not just what they say, by assessing their capabilities within specific environments. SWE-bench is a prominent example, evaluating agents on coding tasks by assessing if their proposed pull requests pass unit tests. This benchmark has seen dramatic improvements, rising from around 16% success to over 90% for leading models. Other benchmarks like Terminal-Bench test general-purpose command-line interactions, while cybersecurity benchmarks evaluate agents' ability to perform ethical hacking tasks. ML engineering benchmarks assess agents' capabilities in data analysis and model training. A critical aspect in evaluating agents is the 'scaffold' – the logic, tools, and environment surrounding the LM. Sophisticated scaffolds now incorporate explicit planning, hierarchical delegation, and advanced memory management to improve performance. Evaluating agents thus involves assessing both the language model and the scaffold, understanding how they interact to achieve complex tasks.
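The test-based scoring described above can be sketched as follows. This is a simplified illustration, not the SWE-bench harness: the `TaskResult` type and its field names are hypothetical, and the rule that a task with no recorded tests counts as unresolved is an assumption of this example.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class TaskResult:
    """Outcome of running a task's unit tests after applying the agent's patch."""
    task_id: str
    test_outcomes: Dict[str, bool]  # test name -> did it pass?


def is_resolved(result: TaskResult) -> bool:
    """A task counts as solved only if it has tests and every one passes."""
    return bool(result.test_outcomes) and all(result.test_outcomes.values())


def resolved_rate(results: List[TaskResult]) -> float:
    """Fraction of tasks whose tests all pass (the headline success rate)."""
    if not results:
        raise ValueError("no results to score")
    return sum(is_resolved(r) for r in results) / len(results)
```

The score reflects the language model and its scaffold together, since both determine which patch ends up being tested. It is also only as trustworthy as the tests: a patch that passes an incomplete test suite is counted as a success.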
Pure reasoning and safety: untangling intelligence and ethical considerations
The lecture then shifts to evaluating 'pure reasoning' by creating tasks that require fluid intelligence rather than factual knowledge. ARC-AGI tasks (built on the Abstraction and Reasoning Corpus), designed to be 100% human-solvable but challenging for AI, initially saw models perform poorly. However, recent advances, particularly in models from OpenAI, have led to significant progress, and ARC-AGI-1 is now largely solved. This area seeks to disentangle reasoning from linguistic and world knowledge, though complete decoupling remains elusive. Separately, AI safety is a critical, albeit less well-defined, area of evaluation. HarmBench tests whether models refuse harmful prompts, while AIR-Bench attempts a holistic approach by mapping regulatory frameworks to potential harms. A major challenge is 'jailbreaking,' where the safeguards of safety-trained models are bypassed. Moreover, 'safety' is context-dependent, shaped by politics, law, and social norms, and it covers diverse risks such as hallucinations, bias, and the enabling of criminal activity. The dual-use nature of AI capabilities, such as cybersecurity agents, further complicates safety assessments.
Broader evaluative considerations: realism, validity, and ongoing challenges
Beyond specific benchmarks, broader considerations are crucial for effective evaluation. Ecological validity, or how well evaluations capture real-world use, is paramount. Traditional exams are often detached from practical application, prompting benchmarks like OpenAI's GDPval (which covers tasks from different sectors of the economy) and medical benchmarks sourced from clinicians. These aim to assess models in realistic use-case scenarios. Data quality and auditability are also key: benchmarks can suffer from flawed questions, incomplete test cases (especially for agents), or trivial solutions, necessitating careful inspection (e.g., using tools like Docent). Perhaps the most significant ongoing challenge is "train contamination" or data overlap, where models might have been trained on evaluation data, invalidating results. Strategies to combat this include checking whether a model prefers a benchmark's published example ordering, encouraging transparency in reporting train-test overlap, defining fresh evaluations using post-cutoff data, and employing private, internal datasets. Ultimately, the purpose of evaluation dictates the chosen methods; whether the goal is a purchasing decision, research, policy, or model improvement, clarity on that goal is essential.
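One way to read the ordering check mentioned above is as a permutation test: if a model has memorized a benchmark file, it may assign higher probability to the examples in their published order than in shuffled orders. The sketch below illustrates that idea under loud assumptions. The `log_prob` callable is hypothetical and stands in for whatever returns a model's log-probability of the examples concatenated in the given order; the number of shuffles and the seed are arbitrary.

```python
import random
from typing import Callable, List, Sequence


def ordering_p_value(
    examples: Sequence[str],
    log_prob: Callable[[List[str]], float],  # hypothetical model-scoring function
    num_shuffles: int = 200,
    seed: int = 0,
) -> float:
    """Permutation-test p-value for a preference toward the published order.

    A small value means few shuffled orderings scored as high as the
    canonical one, which is consistent with (but does not prove) the model
    having seen the benchmark in its original order.
    """
    rng = random.Random(seed)
    canonical = log_prob(list(examples))
    at_least_as_high = 0
    for _ in range(num_shuffles):
        shuffled = list(examples)
        rng.shuffle(shuffled)
        if log_prob(shuffled) >= canonical:
            at_least_as_high += 1
    # Add-one correction keeps the p-value strictly positive.
    return (1 + at_least_as_high) / (1 + num_shuffles)
```

A check like this needs only the model's probabilities, not access to its training data, which is why it complements transparency reports and fresh post-cutoff evaluations rather than replacing them.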
Common Questions
The main goal of language model evaluation is to determine how 'good' a trained model is. This involves translating abstract desired behaviors like conversation or reasoning into concrete, measurable metrics.
Topics
Mentioned in this video
A platform for evaluating language models through pairwise comparisons rated by humans, formerly known as Chatbot Arena. It uses Elo rankings to determine model performance.
The current name for Chatbot Arena, a platform used for publicly ranking language models based on human preference through pairwise comparisons.
A tool that uses language models to inspect agent traces and detect problems, offering a qualitative approach to benchmarking.
Mentioned as a benchmark for increasing model performance over time, with scores around 39% on GPQA initially and later reaching 94%.
A benchmark that uses LLMs as judges with a checklist or rubric to evaluate prompts and responses, aiming for more defined evaluations.
Mentioned in the context of benchmark saturation, showing high scores on MMLU around early 2024.
Mentioned in the context of analyzing user data to understand what people use it for. Not otherwise substantively discussed.
A benchmark that uses LLMs as judges to evaluate model responses, initially favoring longer outputs but later debiased with regression methods.
A benchmark that uses a computer terminal as the environment for agents to perform general-purpose tasks, with tasks crowdsourced globally.
A type of fresh evaluation that scrapes new web pages or GitHub repositories to create evaluations past the training cutoff date of language models.
A benchmark for evaluating agentic capabilities, specifically for coding tasks, where agents submit PRs to fix GitHub issues and are evaluated by passing unit tests.
A benchmark designed to isolate reasoning from knowledge, presenting visual tasks that are challenging for AI but solvable by humans.
A benchmark focused on safety, testing if models refuse to generate harmful content when prompted.
A platform where PhD contractors were hired to create questions for the GPQA benchmark.
Mentioned multiple times as the developer of models such as GPT-2 and GPT-3, which appear in results on benchmarks such as MMLU and GPQA. Also mentioned for creating the GDPval benchmark.
A website that ranks language models based on a measure of intelligence, often used as a standard for evaluating model capabilities.