What is perplexity and how is it used in language model evaluation?

Perplexity is a core metric for evaluating language models, measuring how well a model assigns probability mass to a given dataset. Lower is better: minimizing perplexity pushes the model's distribution toward the true data distribution.

How did the evaluation paradigm for language models change after GPT-2?

After GPT-2, the evaluation paradigm shifted from 'in-distribution' evaluation (training and testing on splits of the same dataset) to out-of-distribution evaluation, where models trained on large web text were evaluated on standard benchmarks.

What are some of the prominent exam-based benchmarks for LLMs?

Prominent exam-based benchmarks include MMLU (Massive Multitask Language Understanding), GPQA (Graduate-Level Google-Proof Q&A), and Humanity's Last Exam, which test knowledge, reasoning, and problem-solving across various subjects.

What are the limitations of exam-based benchmarks for LLMs?

Exam-based benchmarks often fail to capture real-world usage because most people don't ask multiple-choice questions. They also don't reflect the open-ended, sometimes ambiguous queries users typically make.

How does Chatbot Arena evaluate language models?

Chatbot Arena uses pairwise comparisons where humans rate responses from two anonymized models. These comparisons are used to compute Elo rankings, providing a user-preference-based evaluation.

What are agentic benchmarks, and what do they measure?

Agentic benchmarks evaluate language models' ability to act as agents, performing tasks in an environment using tools. Examples include SWE-bench (coding), Terminal-Bench (terminal commands), and cybersecurity capture-the-flag tasks.

What is the goal of 'pure reasoning' benchmarks like ARC AGI?

The goal of pure reasoning benchmarks is to isolate and measure a model's fluid intelligence and reasoning capabilities, independent of linguistic and world knowledge, by presenting tasks that are novel and require pattern recognition.

What are the primary concerns in AI safety evaluation?

AI safety evaluation addresses issues like preventing models from generating harmful content (HarmBench), considering holistic risks across regulatory frameworks (AIR-Bench), and mitigating 'jailbreaking' attempts.

What is train contamination in LLM evaluation, and why is it a problem?

Train contamination occurs when benchmark test data is present in the model's training set, leading to artificially inflated performance scores. This is a significant problem for models trained on vast internet data, making evaluation less reliable.

What are some strategies to address train contamination in LLM evaluation?

Strategies include inferring if a model has seen the test data (e.g., analyzing question order), encouraging reporting of train-test overlap, defining fresh evaluations on new data, and using private, internal datasets.

Why is dataset quality crucial for accurate LLM evaluation?

Poor dataset quality, including broken questions, ambiguous tasks, or incomplete test cases, can lead to misleading evaluation results. Auditing benchmarks and inspecting model outputs are essential for trustworthy measurement.

Key Moments

Stanford CS336 Language Modeling from Scratch | Spring 2026 | Lecture 12: Evaluation

Stanford Online

Education | 6 min read | 79 min video

May 19, 2026

TL;DR

Language models now outperform humans on many benchmarks, but traditional evaluations like perplexity are insufficient. New methods focus on open-ended tasks and agentic capabilities, yet the challenges of "train contamination" and defining true safety remain.

Key Insights

Perplexity, a traditional metric measuring how well a language model assigns probability mass to a dataset, was the standard for evaluating language models for many years, with progress measured by its reduction.

The MMLU benchmark, introduced around the time of GPT-3, tests knowledge and reasoning across 57 subjects. Large models initially scored significantly above chance on it, but the benchmark has since become saturated.

Chat benchmarks like Chatbot Arena use human pairwise comparisons to rank models, offering a more realistic evaluation of open-ended responses, but are susceptible to biases and conflate style with correctness.

Agentic benchmarks like SWE-bench evaluate an agent's ability to perform tasks within a codebase, with evaluation often tied to passing unit tests, showing rapid improvement from around 16% to 93% success.

Humanity's Last Exam, a recent multimodal benchmark, shows current frontier models performing poorly, with scores in the single digits, indicating a need for continued development in complex reasoning.

Despite numerous benchmarks, defining and measuring "safety" in language models is complex, with challenges including "jailbreaking," context-dependency, and the dual-use nature of AI capabilities.

The evolution of language model evaluation: from perplexity to real-world tasks

The lecture begins by highlighting that after covering the training aspects of language models, the crucial missing piece is evaluation: how to determine if a trained model is 'good.' Initially, evaluation seemed mechanical – prompts, responses, accuracy. However, evaluation is a deep and critical topic, shaping AI development by setting 'north stars.' The core challenge lies in translating abstract desires (e.g., good conversation, reasoning) into concrete metrics and prompts. Early metrics like perplexity were natural for language models as probability distributions over tokens. These were used for decades, with progress measured by perplexity reduction on datasets like Penn Treebank and WikiText-103. The paradigm shifted with models like GPT-2, which evaluated out-of-distribution on existing benchmarks, signaling a new era of model evaluation beyond simple in-distribution perplexity.

Perplexity: the foundational metric and its limitations

Perplexity measures the probability a language model assigns to a test dataset. Minimizing perplexity is equivalent to training a model that closely matches the true underlying data distribution. This metric was central to early language modeling research, driving progress through perplexity reduction and informing scaling laws. However, perplexity has limitations. It penalizes all token prediction errors equally, potentially treating common occurrences (like the first word of a sentence) the same as crucial factual information (like a founding date). While conditional perplexity can focus on relevant tokens, and fill-in-the-blank tasks like LAMBADA and HellaSwag are essentially perplexity in disguise, the metric might not fully capture nuanced understanding or real-world utility. A critical issue with perplexity is the potential for "trust" problems in leaderboards: it's difficult to verify if submitted models are genuinely outputting valid probability distributions or if they're exploiting the system.
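As a minimal sketch of the metric described above: perplexity is the exponential of the average negative log-probability per token. The function name and the toy probabilities are illustrative assumptions, not taken from the lecture.

```python
import math


def perplexity(token_log_probs):
    """Perplexity = exp(-(1/N) * sum of natural-log token probabilities)."""
    if not token_log_probs:
        raise ValueError("need at least one token log-probability")
    avg_nll = -sum(token_log_probs) / len(token_log_probs)
    return math.exp(avg_nll)


# Toy example (assumed data): the model gives probability 0.25 to each of 4 tokens.
uniform = [math.log(0.25)] * 4
print(perplexity(uniform))  # approximately 4.0

# A model that is more confident on the same tokens gets a lower perplexity.
confident = [math.log(0.5)] * 4
print(perplexity(confident))  # approximately 2.0
```

Conditional perplexity can be sketched with the same function by passing only the log-probabilities of the tokens of interest, for example the answer tokens that follow a prompt.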

Exam-based benchmarks: testing knowledge and reasoning

A significant shift in evaluation involved adopting human exams as a model for benchmarking. This approach offers controlled difficulty, unambiguous answers, and ease of grading. Influential benchmarks like MMLU (Massive Multitask Language Understanding) were created to test knowledge and reasoning across diverse subjects, using techniques like few-shot prompting with GPT-3. Initially, these benchmarks showed clear progress, but they rapidly became saturated as models improved, leading to the development of harder variants and new benchmarks like GPQA (Graduate-level Google-Proof QA). These benchmarks employ rigorous creation processes, involving human experts and curated questions designed to be difficult even for humans with internet access. Despite this, models continue to saturate even these challenging tests, highlighting the continuous evolution needed in benchmark design. A key concern remains "train contamination," where benchmark questions may inadvertently appear in training data, affecting score validity.
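To make the exam-style setup concrete, here is a rough sketch of building a few-shot multiple-choice prompt and scoring predicted letters. The prompt layout, the helper names, and the toy questions are assumptions for illustration; they are not the exact format used by MMLU or GPQA.

```python
CHOICES = "ABCD"


def format_question(question, options, answer=None):
    """Render one question; include the answer letter only for solved examples."""
    lines = [question]
    lines += [f"{letter}. {option}" for letter, option in zip(CHOICES, options)]
    lines.append("Answer:" + (f" {answer}" if answer else ""))
    return "\n".join(lines)


def few_shot_prompt(examples, question, options):
    """Concatenate solved examples, then the unsolved target question."""
    shots = [format_question(q, o, a) for q, o, a in examples]
    return "\n\n".join(shots + [format_question(question, options)])


def accuracy(predictions, gold):
    """Fraction of predicted letters that match the gold letters."""
    if not gold:
        raise ValueError("need at least one gold answer")
    return sum(p == g for p, g in zip(predictions, gold)) / len(gold)


# Hypothetical toy data, not real benchmark items.
shots = [("What is 2 + 2?", ["3", "4", "5", "6"], "B")]
print(few_shot_prompt(shots, "What is 3 + 3?", ["5", "6", "7", "8"]))
print(accuracy(["A", "B"], ["A", "C"]))  # 0.5
```

With four options per question, random guessing gives about 25% accuracy, which is the chance baseline that exam-style results are compared against.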

Evaluating open-ended responses: chat benchmarks and human preference

Moving beyond multiple-choice exams, chat benchmarks address the evaluation of open-ended responses. Chatbot Arena (now Arena AI) pioneered a system where humans compare outputs from two anonymized models, using these pairwise comparisons to generate Elo rankings. This method draws on real-world prompts from users, who are incentivized by free model access. However, it faces challenges: the makeup of the user pool is unknown and potentially biased, binary preferences can conflate stylistic choices with correctness, and humans may not be ideal judges for nuanced answers. To mitigate bias and improve reliability, methods like AlpacaEval and WildBench use language models (LMs) as judges. AlpacaEval, for instance, uses a model such as GPT-4 as the judge of other models' responses, though early versions showed a bias towards longer responses. WildBench incorporates checklists or rubrics to make the evaluation task more defined, improving reliability whether the judges are humans or LMs. Despite progress, evaluating open-ended responses remains an ill-defined problem that calls for careful attention to biases and for structured evaluation criteria.
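The way pairwise votes turn into a ranking can be sketched with the classic online Elo update. This is an illustration under assumptions: the K-factor, the initial rating, and the battle format are hypothetical, and the platform's actual rating procedure may differ.

```python
def expected_score(rating_a, rating_b):
    """Probability that A beats B under the Elo model (400-point scale)."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def elo_ratings(battles, k=32.0, initial=1000.0):
    """battles: iterable of (model_a, model_b, winner), winner in {"a", "b", "tie"}."""
    outcome_for_a = {"a": 1.0, "b": 0.0, "tie": 0.5}
    ratings = {}
    for model_a, model_b, winner in battles:
        r_a = ratings.setdefault(model_a, initial)
        r_b = ratings.setdefault(model_b, initial)
        e_a = expected_score(r_a, r_b)
        s_a = outcome_for_a[winner]
        # Zero-sum update: whatever A gains, B loses.
        ratings[model_a] = r_a + k * (s_a - e_a)
        ratings[model_b] = r_b - k * (s_a - e_a)
    return ratings


# Hypothetical votes between two anonymized models.
print(elo_ratings([("model-x", "model-y", "a")]))
# {'model-x': 1016.0, 'model-y': 984.0}
```

Because the update only sees which response was preferred, it cannot tell whether a vote reflected correctness or style, which is the conflation noted above.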

Agentic benchmarks: assessing models in action

Agentic benchmarks evaluate what LMs *do*, not just what they say, by assessing their capabilities within specific environments. SWE-bench is a prominent example, evaluating agents on coding tasks by assessing if their proposed pull requests pass unit tests. This benchmark has seen dramatic improvements, rising from around 16% success to over 90% for leading models. Other benchmarks like Terminal-Bench test general-purpose command-line interactions, while cybersecurity benchmarks evaluate agents' ability to perform ethical hacking tasks. ML engineering benchmarks assess agents' capabilities in data analysis and model training. A critical aspect in evaluating agents is the 'scaffold' – the logic, tools, and environment surrounding the LM. Sophisticated scaffolds now incorporate explicit planning, hierarchical delegation, and advanced memory management to improve performance. Evaluating agents thus involves assessing both the language model and the scaffold, understanding how they interact to achieve complex tasks.
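The test-based scoring described above can be sketched as a "resolved rate": a task counts as solved only if every associated unit test passes. The data layout and names below are assumptions for illustration and do not reproduce the SWE-bench harness.

```python
def resolved_rate(results):
    """results maps a task id to a list of booleans, one per unit test.

    A task is resolved only if it has at least one test and all of them pass.
    """
    if not results:
        return 0.0
    resolved = sum(1 for tests in results.values() if tests and all(tests))
    return resolved / len(results)


# Hypothetical outcomes for four tasks.
toy_results = {
    "task-1": [True, True],   # all tests pass: resolved
    "task-2": [True, False],  # one failing test: not resolved
    "task-3": [True],         # resolved
    "task-4": [],             # no tests recorded: treated as not resolved
}
print(resolved_rate(toy_results))  # 0.5
```

A score like this is only as good as the tests behind it: a patch can pass an incomplete test suite without actually fixing the issue.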

Pure reasoning and safety: untangling intelligence and ethical considerations

The lecture then shifts to evaluating 'pure reasoning' by creating tasks that require fluid intelligence rather than factual knowledge. ARC AGI (Abstraction and Reasoning Corpus) tasks, designed to be 100% human-solvable but challenging for AI, initially saw models perform poorly. However, recent advancements, particularly with models like those from OpenAI, have led to significant progress, with ARC AGI 1 now largely solved. This area seeks to disentangle reasoning from linguistic and world knowledge, though complete decoupling remains elusive. Separately, AI safety is a critical, albeit less well-defined, area of evaluation. HarmBench focuses on whether models refuse harmful prompts, while AIR-Bench attempts a holistic approach by mapping regulatory frameworks to potential harms. A major challenge is 'jailbreaking,' where the safeguards of models trained for safety can be bypassed. Moreover, what counts as 'safe' is context-dependent, influenced by politics, law, and social norms, and involves diverse risks like hallucinations, bias, and enabling criminal activity. The dual-use nature of AI capabilities, such as cybersecurity agents, further complicates safety assessments.

Broader evaluative considerations: realism, validity, and ongoing challenges

Beyond specific benchmarks, broader considerations are crucial for effective evaluation. Ecological validity, or how well evaluations capture real-world use, is paramount. Traditional exams are often detached from practical application, prompting benchmarks like OpenAI's GDPval (which covers tasks from different economic sectors) and medical benchmarks sourced from clinicians. These aim to assess models in realistic use-case scenarios. Data quality and auditability are also key: benchmarks can suffer from flawed questions, incomplete test cases (especially for agents), or trivial solutions, necessitating careful inspection (e.g., using tools like "Docent"). Perhaps the most significant ongoing challenge is "train contamination" or data overlap, where models might have been trained on evaluation data, invalidating results. Strategies to combat this include checking whether a model prefers the benchmark's original example ordering, encouraging transparency in reporting train-test overlap, defining fresh evaluations using post-cutoff data, and employing private, internal datasets. Ultimately, the purpose of evaluation dictates the chosen methods; whether for purchasing decisions, research, policy, or model improvement, clarity on goals is essential.
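The ordering idea can be sketched as a permutation test: if a model scores the benchmark's published order much higher than shuffled orders, it may have seen the file during training. This is a rough illustration, not the lecture's procedure; `score_fn` is a hypothetical stand-in for a model's log-probability of the examples concatenated in a given order, and the toy scorers are invented for the demonstration.

```python
import random


def ordering_p_value(examples, score_fn, num_shuffles=200, seed=0):
    """Fraction of orders (canonical included) scoring at least as high as canonical.

    A small value suggests the model prefers the published ordering.
    """
    rng = random.Random(seed)
    canonical_score = score_fn(examples)
    at_least_as_high = 0
    for _ in range(num_shuffles):
        shuffled = list(examples)
        rng.shuffle(shuffled)
        if score_fn(shuffled) >= canonical_score:
            at_least_as_high += 1
    return (1 + at_least_as_high) / (1 + num_shuffles)


def memorizing_scorer(sequence):
    """Toy 'contaminated' model: rewards pairs that are adjacent in the original order."""
    return sum(1 for a, b in zip(sequence, sequence[1:]) if b == a + 1)


def order_blind_scorer(sequence):
    """Toy 'clean' model: the score does not depend on ordering."""
    return 0.0


examples = list(range(8))  # stand-ins for eight benchmark examples in published order
print(ordering_p_value(examples, memorizing_scorer))   # small: ordering is preferred
print(ordering_p_value(examples, order_blind_scorer))  # 1.0: no ordering preference
```

A check like this needs access to model log-probabilities and only detects contamination that preserved the benchmark's original ordering.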

Common Questions

The main goal of language model evaluation is to determine how 'good' a trained model is. This involves translating abstract desired behaviors like conversation or reasoning into concrete, measurable metrics.

Topics

AI Safety, AI & Machine Learning, Technology & Innovation, LLM Evaluation, Agentic AI, Benchmark Design, Task-based Evaluation, Model Assessment

Mentioned in this video

Software & Apps

Chatbot Arena

A platform for evaluating language models through pairwise comparisons rated by humans, now known as Arena AI. It uses Elo rankings to determine model performance.

Arena AI

The current name for Chatbot Arena, a platform used for publicly ranking language models based on human preference through pairwise comparisons.

Docent

A tool that uses language models to inspect agent traces and detect problems, offering a qualitative approach to benchmarking.

GPT-4

Mentioned as a reference point for how model performance has risen over time: scores on GPQA started at around 39% and later reached 94%.

WildBench

A benchmark that uses LLMs as judges with a checklist or rubric to evaluate prompts and responses, aiming for more defined evaluations.

GPT-3.5 Turbo

Mentioned in the context of benchmark saturation, showing high scores on MMLU around early 2024.

Claude

Mentioned in the context of analyzing user data to understand what people use it for. Not otherwise substantively discussed.

AlpacaEval

A benchmark that uses LLMs as judges to evaluate model responses, initially favoring longer outputs but later debiased with regression methods.

Terminal-Bench

A benchmark that uses a computer terminal as the environment for agents to perform general-purpose tasks, with tasks crowdsourced globally.

LiveCodeBench

A type of fresh evaluation built from newly published coding problems, so that the evaluation data postdates the training cutoff date of language models.

SWE-bench

A benchmark for evaluating agentic capabilities, specifically for coding tasks, where agents submit PRs to fix GitHub issues and are evaluated by passing unit tests.

ARC AGI

A benchmark designed to isolate reasoning from knowledge, presenting visual tasks that are challenging for AI but solvable by humans.

Uncheatable Eval

A type of fresh evaluation that scrapes new web pages or GitHub repositories to create evaluations past the training cutoff date of language models.

HarmBench

A benchmark focused on safety, testing if models refuse to generate harmful content when prompted.

Concepts

Humanity's Last Exam

A benchmark created to challenge models with multimodal, multi-subject questions, aiming to be extremely difficult and using a private held-out set to mitigate training contamination.

Companies

Upwork

A platform where PhD contractors were hired to create questions for the GPQA benchmark.

OpenAI

Mentioned multiple times in relation to its models, such as GPT-2 and GPT-3, and their results on benchmarks such as MMLU and GPQA. Also mentioned for creating the GDPval benchmark.

Artificial Analysis

A website that ranks language models based on a measure of intelligence, often used as a standard for evaluating model capabilities.

Organizations

LMSys

Organization behind Chatbot Arena, mentioned for their work on evaluating LLMs through pairwise comparisons.

People

Hendrycks et al.

Associated with the MMLU benchmark (Massive Multitask Language Understanding) from 2020, which tests knowledge and reasoning across many subjects.
