Terminal-Bench: Pushing Claude Code, OpenAI Codex, Factory Droid, et al. to the limits

Latent Space Podcast
Science & Technology · 3 min read · 36 min video
Oct 18, 2025 · 3,554 views

Key Moments

TL;DR

Terminal-Bench: a coding agent benchmark focused on terminal interaction.

Key Insights

1. Terminal-based interaction is favored over GUI for AI agents due to text's native compatibility with LLMs and its efficiency.

2. Terminal-Bench began before advanced coding agents like Claude Code existed, but its flexible design accommodated them.

3. The benchmark is designed for future agent capabilities, not just current uses, encompassing non-coding tasks.

4. Terminal-Bench aims to be a framework for creating, hosting, and distributing benchmarks, potentially for internal company use.

5. Evaluating agents requires multi-dimensional metrics beyond accuracy, including cost, latency, and economic value.

6. Separating agent performance from model performance is crucial for understanding true AI advancement.

THE ORIGINS AND VISION OF TERMINAL-BENCH

Alex Shaw and Mike Merrill, creators of Terminal-Bench, discuss its genesis, from a $1 million prize on SWE-bench to an industry standard for coding-agent benchmarks. Initially driven by a desire for more autonomy and risk-taking, the project evolved. The core idea, influenced by Andy Konwinski, is that the terminal is the ultimate tool for arbitrary computer tasks. This led to the development of Terminal-Bench as a broader abstraction of SWE-bench, focusing on any task completable via code in a terminal, not just GitHub repositories and pull requests.

THE STRATEGIC ADVANTAGE OF TERMINAL INTERACTION

The decision to focus on terminal-based interaction over GUI was strategic. Text is the native modality for large language models, making terminal interfaces the most performant way for models to reason and execute tasks. Unlike GUIs, which are often designed for simplicity and can involve many clicks for complex operations, a single terminal command can achieve the same result. This efficiency and directness make the terminal an ideal environment for AI agents to interact with computer systems.

DESIGNING FOR THE FUTURE OF AGENTS

Terminal-Bench tasks are defined by an instruction, a container environment, and a test script, allowing for a wide range of operations beyond traditional coding. While coding tasks form a majority, the benchmark includes games, mathematics, and even non-coding applications like automating email drafts or journal entries. The design anticipates the future capabilities of agents, aiming to evaluate them on tasks requiring complex, multi-step reasoning and execution, rather than just present-day usage patterns.
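The three-part task anatomy described above can be sketched as a small data structure (a hypothetical illustration; the field names and example values are assumptions, not Terminal-Bench's actual schema):

```python
from dataclasses import dataclass

@dataclass
class TerminalBenchTask:
    """One task: what to do, where to do it, and how to grade it."""
    instruction: str   # natural-language description handed to the agent
    environment: str   # container image defining the agent's sandbox
    test_script: str   # shell command that checks the final container state

# A non-coding example in the spirit of the benchmark: drafting a file.
task = TerminalBenchTask(
    instruction="Write a short status update to draft.txt",
    environment="ubuntu:24.04",
    test_script="test -s /workspace/draft.txt",
)
```

Because success is judged by a test script run against the container's final state, the same harness can grade coding tasks, games, and plain file-manipulation tasks alike.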

EVOLVING BENCHMARKS AND EVALUATION METHODOLOGIES

Terminal-Bench has adapted existing benchmarks and aims to be a comprehensive evaluation harness. The creators emphasize the difficulty of creating high-quality, unique tasks, which cannot simply be scraped from online sources or easily generated by language models. The future of evaluation, they suggest, lies in observing real-world job tasks and translating them into benchmark formats, requiring more specialization and acknowledging that current benchmarks might be addressing the 'low-hanging fruit'.

AGENT VS. MODEL PERFORMANCE DISTINCTION

To truly assess underlying AI capabilities, it's crucial to distinguish between the performance of the model itself and the agent framework used. Terminus, Terminal-Bench's own research preview agent, is designed to be simple and unopinionated, using only a headless terminal for interaction. This minimal setup helps isolate the model's capabilities by avoiding extensive tool development or context management optimizations that might be present in lab-specific agents like Claude Code or Codex CLI.
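A loop in the spirit of Terminus might look like the sketch below (a simplification under stated assumptions: `model` stands in for any LLM call, and the real agent also handles sessions, timeouts, and output truncation):

```python
import subprocess

def run_agent(model, instruction: str, max_steps: int = 10) -> str:
    """Minimal headless-terminal loop: ask the model for a shell command,
    run it, and feed the output back until the model says it is done."""
    transcript = f"Task: {instruction}\n"
    for _ in range(max_steps):
        command = model(transcript)          # model emits the next command
        if command.strip() == "DONE":
            break
        result = subprocess.run(
            command, shell=True, capture_output=True, text=True, timeout=60
        )
        transcript += f"$ {command}\n{result.stdout}{result.stderr}"
    return transcript

# Stub "model" that emits one command and then stops:
steps = iter(["echo hello", "DONE"])
log = run_agent(lambda t: next(steps), "say hello")
```

Because the harness adds nothing beyond command execution, score differences between models run through it are easier to attribute to the models themselves rather than to agent-side tooling.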

THE MULTI-DIMENSIONAL FUTURE OF EVALUATION

Beyond simple accuracy, future evaluations will increasingly incorporate multiple dimensions, such as cost, latency, and economic value. The ultimate eval will likely measure the actual money an agent makes or saves when deployed for real-world tasks, like operating as a software engineer or managing investments. This shift acknowledges that simply getting tasks right isn't enough; agents must be efficient and deliver tangible economic benefits, moving away from one-dimensional charts to more comprehensive, profit-and-loss assessments.
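A profit-and-loss style scorecard of the kind described might be sketched as follows (the metric names and the net-value formula are illustrative assumptions, not a published methodology):

```python
from dataclasses import dataclass

@dataclass
class RunMetrics:
    passed: bool      # did the agent complete the task correctly?
    cost_usd: float   # API / compute spend for the run
    latency_s: float  # wall-clock time to finish
    value_usd: float  # estimated economic value of the completed task

def net_value(m: RunMetrics) -> float:
    """P&L view: value delivered minus cost; a failed run is pure cost."""
    return m.value_usd - m.cost_usd if m.passed else -m.cost_usd

runs = [
    RunMetrics(passed=True, cost_usd=2.50, latency_s=310.0, value_usd=40.0),
    RunMetrics(passed=False, cost_usd=4.00, latency_s=900.0, value_usd=0.0),
]
total = sum(net_value(r) for r in runs)  # aggregate P&L across a batch of runs
```

Latency is carried along here but not priced in; a fuller model might discount value by time-to-completion, which is what moves evaluation away from one-dimensional accuracy charts.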

TERMINAL-BENCH AS A FRAMEWORK AND ROADMAP

Terminal-Bench is evolving beyond just a benchmark into a robust framework for interactive agent benchmarking and reinforcement learning post-training. The roadmap includes making it easier to create tasks, synthesize tasks from data, run evaluations at scale in the cloud, and version/distribute benchmarks. They aim to provide a superior developer experience, allowing users to bring their agents and have a one-stop shop for evaluation, and even envision its use for private, internal evaluation sets and RL training.

Common Questions

What is Terminal-Bench?

Terminal-Bench is a benchmark designed to evaluate AI agents' capabilities at performing tasks within a terminal environment. It was created by Alex Shaw and Mike Merrill to provide a more robust, terminal-focused evaluation framework than existing benchmarks, inspired by the potential of text-based AI interaction.

Topics

Mentioned in this video

TCLM (software)

A model designed by Jeffrey Li for text classification, likely related to the 'DataComp for language models' (DCLM) paper.

AIE Code (event)

An event or conference where the guests are invited to speak further.

Terminal-Bench (software)

A benchmark for evaluating AI agents' ability to perform tasks in a terminal environment.

Andy Konwinski (person)

Co-founder of Databricks and Perplexity, founder of the Laude Institute and Laude Ventures.

Jeffrey Li (person)

A contributor to Terminal-Bench and a PhD student of Ludwig Schmidt.

OpenHands (software)

An agent mentioned in the context of specialized agents.

RL post-training (concept)

A direction Terminal-Bench is evolving toward, allowing models to be post-trained using reinforcement learning.

Mike Merrill (person)

A postdoc at Stanford working with Ludwig Schmidt, and a creator of Terminal-Bench.

EC2 instance (product)

An example used to illustrate the inefficiency of GUI-based systems compared to terminal commands.

Factory Droids (software)

An example of agent labs using various models to build their own agents.

Nicholas Carlini (person)

A researcher at Anthropic who alerted Alex and Mike to Terminal-Bench's mention in the Claude 4 model card, and who contributed tasks.

SWE-bench Verified (study)

A benchmark that has been adapted for Terminal-Bench.

K Prize (concept)

A $1 million prize on SWE-bench that Alex initially worked on.

Codex CLI (software)

An agent from OpenAI, mentioned as an example of specialized agents.

Laude Institute (organization)

Where Alex works; it has a culture of shipping research into usable products.

Claude 4 (software)

The AI model whose model card listed Terminal-Bench.

DNA sequences (concept)

A biological problem addressed by a Terminal-Bench task, involving assembly of sequences.

Terminus (software)

Terminal-Bench's own research preview agent, designed to be simple and unopinionated.

Factory AI (company)

Company whose t-shirt Alessio is wearing, claiming top performance on Terminal-Bench.

Laude Ventures (organization)

An organization co-founded by Andy Konwinski.

Swans (study)

A benchmark mentioned as not yet adapted for Terminal-Bench.

BYU (organization)

University where Alex studied math and computer science.

DataComp for language models (study)

A previous paper by Jeffrey Li and Ludwig Schmidt in which the need to design a fast text classifier arose.

RL (concept)

Reinforcement learning, mentioned in relation to how Anthropic might optimize agents like Claude Code.

Ludwig Schmidt (person)

Mike Merrill's collaborator and advisor at Stanford.

Lynn Shei (person)

A contributor leading the adapter effort for Terminal-Bench.

Claude Sonnet 4.5 (tool)

Freelancer (platform)

SWE-bench Pro (tool)
