Measuring Exponential Trends Rising (in AI) — Joel Becker, METR
Key Moments
METR maps AI capabilities to threat models; time horizons, tasks, and compute drive progress.
Key Insights
METR breaks down into model evaluation (capabilities and propensities) and threat research, linking what AI can do to how dangerous it might be.
The time horizon concept shows task difficulty as a function of human effort over time, highlighting a surprisingly steady progression rather than abrupt leaps.
Task selection aims for economically valuable, scalable tests; tasks are curated for measurability and real-world relevance, not just novelty.
Opus 4.5 caused a major benchmark jump that challenged prior trend lines and shifted expectations about how fast capabilities can grow.
Developer productivity studies reveal complex, nuanced speedups; measuring uplift is confounded by concurrency, task selection, and changing workflows.
Compute and algorithmic progress are tightly intertwined; potential slowdown in compute growth could slow progress, but autonomous R&D raises existential questions that warrant independent threat research.
WHAT IS METR AND WHY IT MATTERS
METR stands for Model Evaluation and Threat Research, a two-part framework designed to grapple with AI risk in a structured way. The first half, model evaluation, considers what AI models can do today and what they might achieve tomorrow, as well as their propensities, meaning how they might behave in real-world deployment given their capabilities. The second half, threat research, describes the effort to connect these capabilities and propensities to specific threat models so researchers can assess whether AI models pose serious, even catastrophic, risks to society. The conversation emphasizes that METR is not merely a performance metric; it is an analytic approach meant to surface both what models can do and how they might misbehave, and then put those insights into threat-specific contexts. This framing helps separate capability from risk, enabling more disciplined debate about when, where, and how large a threat a given model might pose. The host and guest also stress the importance of maintaining an independent perspective, especially in a field where incentives can color risk narratives. In short, METR seeks to translate raw benchmark scores into actionable risk assessment by tying capabilities to concrete threat models, with the aim of informing civil society and policy discussions while guiding responsible development.
FROM CAPABILITIES TO THREATS: THREAT MODELS AND RISK
A central theme in the discussion is how to translate observed capabilities into meaningful threat assessments. The team has evolved their threat models to reflect what might be dangerous on realistic timelines, distinguishing between capabilities that are technically possible and the conditions under which they could cause societal harm. For example, they discuss the shift away from focusing solely on autonomous replication as the dominant worry toward acceleration risks inside labs (AI R&D acceleration) and the possibility of a rapid capabilities explosion under certain conditions. They acknowledge that threat research is still maturing and that its conclusions depend on the balance between capabilities evidence and the protective measures in place. The conversation also touches on classic thought experiments (e.g., the paperclip maximizer) to illuminate how even small knobs in incentive structures or deployment environments could drive large, unintended consequences. Overall, threat models are treated as dynamic tools that adapt as capabilities grow, rather than as fixed thresholds, with the aim of calibrating risk without overhyping imminent catastrophes.
THE TIME HORIZON STORY: ORIGINS, TASKS, AND INTERPRETATION
Time horizon is the centerpiece of METR’s empirical approach. Its origin lies in a 2023 internal METR PowerPoint that plotted autonomous capabilities against an axis of resources and time, producing a scatter that gradually evolved into a surprisingly straight line once a more robust measure was adopted: task difficulty expressed as the length of time a human would need to complete the task, with a model’s time horizon defined as the task length it can complete with 50% reliability. The panel discusses the interpretation of this trend as a reflection of underlying fundamental progress rather than a simple speedup in clock time. Tasks are carefully selected through a combination of internal curation and external bounties to ensure they are economically valuable for autonomy and threat modeling, while remaining automatically gradable where possible. The team emphasizes that tasks should be realistically solvable by models given adequate information, avoiding open-ended, messy real-world tasks that would confound measurement. They also highlight that their task set is not a random sample and that certain domains (e.g., vision-heavy tasks) may be underrepresented due to current evaluation constraints. The roughly 170-task catalog, spanning SWAA, HCAST, RE-Bench, and more, provides a structured ladder from simple atomic actions to complex, autonomous workflows, helping translate model progress into interpretable milestones for governance and planning.
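To make the time horizon measure concrete, the sketch below (illustrative only, not METR's actual pipeline; the task data are invented) fits a logistic curve of model success against the log of human completion time and reads off the task length at which success probability crosses 50%.

```python
import numpy as np
from scipy.optimize import curve_fit

# Hypothetical task results (illustrative only): how long a human needs for
# each task, in minutes, and whether the model completed it autonomously.
human_minutes   = np.array([1, 2, 4, 8, 15, 30, 60, 120, 240, 480])
model_succeeded = np.array([1, 1, 1, 1, 1,  0,  1,   0,   0,   0])

def success_curve(log_minutes, log_h50, slope):
    """Logistic success probability as a function of log task length."""
    return 1.0 / (1.0 + np.exp(slope * (log_minutes - log_h50)))

log_t = np.log(human_minutes)
params, _ = curve_fit(success_curve, log_t, model_succeeded,
                      p0=[np.log(30.0), 1.0])
log_h50, slope = params

# The 50% time horizon: the human task length at which the fitted success
# probability crosses 0.5.
print(f"Estimated 50% time horizon: {np.exp(log_h50):.1f} human-minutes")
```

Plotting this fitted horizon for successive model releases on a log scale against release date is what produces the surprisingly straight line described above.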
THE OPUS 4.5 JUMP: IMPLICATIONS FOR BENCHMARKS AND BELIEFS
One of the most striking moments in the conversation is the discussion of Opus 4.5—a release that produced a noticeable jump in benchmark performance and triggered a re-evaluation of prior trend lines. The guests reflect on how 4.5’s leap aligned with or disrupted existing expectations about the pace of progress, and how such a discontinuity can complicate predictions that are otherwise grounded in long-term trends. They caution that a single data point does not erase the value of a trend, but it can re-center beliefs about the velocity of improvement, especially when it translates into productivity shifts (e.g., developers coding faster or moving toward more autonomous coding workflows). The conversation also notes the broader implications for benchmarking: a sudden improvement can expose blind spots in older evaluation methods and accelerate the adoption of new programming paradigms (such as agentic coding) among engineers. They underscore the importance of revalidating uplift studies with newer models while being mindful of practice changes, such as concurrent task handling and evolving workflows that complicate experimental design.
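As a rough illustration of why one outsized release matters for trend extrapolation (all numbers here are invented, not figures from the episode), the snippet below fits an exponential trend to a time-horizon series and shows how the implied doubling time shifts when a single large jump is appended.

```python
import numpy as np

# Invented series (not the episode's figures): release date in fractional
# years and the measured 50% time horizon in minutes for frontier models.
years   = np.array([2023.0, 2023.5, 2024.0, 2024.5, 2025.0])
horizon = np.array([6.0,    12.0,   22.0,   45.0,   90.0])  # minutes

def doubling_time(years, horizon):
    """Fit log2(horizon) ~ a + b*year; the doubling time is 1/b years."""
    slope, _intercept = np.polyfit(years, np.log2(horizon), 1)
    return 1.0 / slope

print(f"Doubling time before the jump: {doubling_time(years, horizon):.2f} years")

# Append a hypothetical outsized release (an 'Opus 4.5'-style point) and see
# how much a single observation re-centers the fitted trend.
years_new   = np.append(years, 2025.5)
horizon_new = np.append(horizon, 300.0)
print(f"Doubling time including the jump: {doubling_time(years_new, horizon_new):.2f} years")
```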
DEVELOPER PRODUCTIVITY: EVIDENCE, LIMITS, AND DESIGN CHOICES
A substantial portion of the dialogue centers on developer productivity studies, particularly the tension between observed speedups and the difficulty of measuring them robustly. The conversation grapples with the idea that AI can dramatically accelerate coding, yet real-world gains are tempered by factors like concurrent work, task selection biases, and the evolving nature of developer workflows. The participants discuss the challenge of re-running uplift studies with newer models: AI-enabled productivity is harder to quantify when developers juggle multiple tasks, switch contexts, or opt into AI-enabled workflows only for the most lucrative problems. They also address concerns about selection effects, such as tasks being chosen because they are clearly bottlenecks, or because developers might push back against AI-disallowed tasks in experimental setups. Anecdotal evidence and interviews with engineers are acknowledged as useful, but they stress the necessity of careful experimental design to avoid cherry-picking and to build credible, repeatable measurements over time. The dialogue also reflects on the broader implication: even when speedups exist, translating them into equivalent business value is nontrivial due to organizational constraints and product pressures.
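A hedged sketch of the kind of uplift analysis being described (toy data with a made-up 20% effect; not the actual study design) is a within-developer randomized comparison: regress log completion time on an AI-allowed indicator with developer fixed effects, so that which developers got which tasks does not masquerade as an AI effect.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Toy uplift-study data: each row is one issue, randomly assigned to be done
# with or without AI assistance. All values are invented for illustration.
rng = np.random.default_rng(0)
n = 200
df = pd.DataFrame({
    "developer":  rng.integers(0, 10, n),   # 10 hypothetical developers
    "ai_allowed": rng.integers(0, 2, n),    # randomized treatment indicator
})
base_hours = rng.lognormal(mean=1.0, sigma=0.6, size=n)
# Assume, purely for illustration, a 20% speedup when AI is allowed.
df["hours"] = base_hours * np.where(df["ai_allowed"] == 1, 0.8, 1.0)

# Within-developer comparison: log completion time on the treatment indicator
# plus developer fixed effects.
model = smf.ols("np.log(hours) ~ ai_allowed + C(developer)", data=df).fit()
speedup = 1.0 - np.exp(model.params["ai_allowed"])
print(f"Estimated speedup from AI assistance: {speedup:.1%}")
```

Even a clean estimate like this leaves open the selection and workflow questions raised above, which is why the participants stress repeated, carefully designed measurements rather than one-off results.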
COMPUTE, PROGRESS, AND THE FEAR OF TAKEOFF
A core theme connects compute growth to algorithmic progress and the potential for takeoff. The speakers outline a view in which progress is, to a surprising degree, tied to compute and the capacity to run large-scale experiments that discover new architectures, learning schedules, and training paradigms. They discuss both industry-wide and lab-level perspectives, drawing on OpenAI data and projections to illustrate how dollars spent on compute translate into capability gains. The conversation acknowledges that a slower growth rate in compute could slow down advances, but also considers the possibility of capability explosions if research and development loops become more automated. They distinguish between software-only progress and scenarios requiring hardware innovations, like chip design and production, and emphasize that closing feedback loops (fully automated R&D) could be destabilizing if not carefully managed. The dialogue also touches on the reality that benchmarks and time horizons are only approximate guides; the actual landscape may include abrupt discontinuities or hidden capabilities emerging in ways that are difficult to anticipate. Finally, they advocate for independent threat research and diversified measurement to avoid lab-driven mythmaking about imminent risks while maintaining vigilance against plausible worst-case scenarios.
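To put the compute discussion in arithmetic terms (the growth rates here are assumptions for illustration, not figures cited in the episode), a simple effective-compute calculation multiplies physical compute growth by algorithmic efficiency gains and shows how a compute slowdown compounds into slower overall progress.

```python
# Back-of-the-envelope arithmetic (assumed rates, not figures from the
# episode): effective training compute grows as physical compute growth
# times algorithmic efficiency gains, so a compute slowdown compounds.
def effective_compute_growth(compute_growth_per_year: float,
                             algorithmic_gain_per_year: float) -> float:
    """Multiplicative yearly growth in 'effective' compute."""
    return compute_growth_per_year * algorithmic_gain_per_year

# Illustrative assumptions: 4x/yr physical compute, 3x/yr algorithmic gains.
fast = effective_compute_growth(4.0, 3.0)
# If compute growth halves to 2x/yr while algorithms keep improving at 3x/yr:
slow = effective_compute_growth(2.0, 3.0)

print(f"Effective compute growth at the current pace: {fast:.0f}x per year")
print(f"Effective compute growth with slower compute: {slow:.0f}x per year")
```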
Common Questions
What does METR stand for?
METR stands for Model Evaluation and Threat Research. It’s the framework METR uses to organize how it evaluates AI models’ capabilities, what they can do now versus tomorrow, and how those capabilities map to potential threats. The explanation appears at the start of the interview as METR is introduced.
Mentioned in this video
A list of atomic tasks used in METR’s evaluations, including small file-handling and password-related tasks.
Benchmark track for reproducing or producing papers as part of broader ML self-improvement benchmarks.
A tier of tasks spanning from small challenges to 20–30 hour autonomous, sequential tasks.
Live band karaoke events hosted by Joel; discussed as a personal/cultural activity.
A benchmark suite consisting of small, atomic software tasks used by METR to evaluate model behavior.
METR references GPT-5 as a benchmark reference point for capabilities.
Quentin Anthony, participant in METR’s developer productivity study discussions.
A major model update (Opus 4.5) that caused a notable jump in capabilities and challenged prior trend lines.
Dylan Patel from SemiAnalysis, referenced in the discussion about time horizon and compute progress.
Compute spend reference point used in discussing OpenAI's and others’ compute expenditures and future projections.
An ML self-improvement benchmark/workstream referenced in comparison to METR’s tasks.
METR’s internal set of tasks (private) used alongside SWAA; part of their evaluation suite.
Atomic task benchmarks and more challenging autonomous tasks used in METR’s evaluations.
Noam Brown referenced in conversation about open-endedness and multi-agent cooperation.