Terminal-Bench: Pushing Claude Code, OpenAI Codex, Factory Droid, et al to the limits
Key Moments
Terminal-Bench: a coding agent benchmark focused on terminal interaction.
Key Insights
Terminal-based interaction is favored over GUIs for AI agents because text is the native modality of LLMs and single commands are more efficient than sequences of clicks.
Terminal-Bench began before advanced coding agents like Claude Code, but its flexible design accommodated them.
The benchmark is designed for future agent capabilities, not just current uses, encompassing non-coding tasks.
Terminal-Bench aims to be a framework for creating, hosting, and distributing benchmarks, potentially for internal company use.
Evaluating agents requires multi-dimensional metrics beyond accuracy, including cost, latency, and economic value.
Separating agent performance from model performance is crucial for understanding true AI advancement.
THE ORIGINS AND VISION OF TERMINAL-BENCH
Alex Shaw and Mike Merrill, creators of Terminal-Bench, discuss its genesis, from a $1 million prize on SWE-bench to an industry standard for coding-agent benchmarks. Initially driven by a desire for more autonomy and risk-taking, the project evolved. The core idea, influenced by Andy Konwinski, is that the terminal is the ultimate tool for arbitrary computer tasks. This led to the development of Terminal-Bench as a broader abstraction of SWE-bench, focusing on any task completable via code in a terminal rather than only GitHub repositories and pull requests.
THE STRATEGIC ADVANTAGE OF TERMINAL INTERACTION
The decision to focus on terminal-based interaction rather than GUIs was strategic. Text is the native modality of large language models, making terminal interfaces the most performant way for models to reason and execute tasks. Whereas GUIs are often designed for human simplicity and can require many clicks for a complex operation, a single terminal command can achieve the same result. This efficiency and directness make the terminal an ideal environment for AI agents to interact with computer systems.
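To make the efficiency point concrete, here is a purely illustrative sketch (not part of Terminal-Bench): a cleanup job that would take many clicks in a file-manager GUI collapses into one command string an agent can emit as plain text, and a thin harness only needs to run that text and capture the output. The path and command are invented for the example.

```python
import subprocess

# A single command string an agent can emit as text: find every .log file
# larger than 100 MB under /var/log and gzip it in place. Doing the same
# through a file-manager GUI would mean sorting, selecting, and clicking
# through each file by hand.
command = "find /var/log -name '*.log' -size +100M -exec gzip {} +"

# The harness just executes the text it was given and inspects the result.
result = subprocess.run(command, shell=True, capture_output=True, text=True)
print(result.returncode, result.stderr)
```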
DESIGNING FOR THE FUTURE OF AGENTS
Terminal-Bench tasks are defined by an instruction, a container environment, and a test script, allowing for a wide range of operations beyond traditional coding. While coding tasks form a majority, the benchmark includes games, mathematics, and even non-coding applications like automating email drafts or journal entries. The design anticipates the future capabilities of agents, aiming to evaluate them on tasks requiring complex, multi-step reasoning and execution, rather than just present-day usage patterns.
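As a rough illustration of that three-part structure, a task might be modeled as below. The field names, image name, and checker path are assumptions made for the sketch, not the benchmark's actual schema; the only claim taken from the summary is that a task pairs an instruction with a container environment and a test script.

```python
from dataclasses import dataclass

@dataclass
class Task:
    """Illustrative sketch of a Terminal-Bench-style task (field names are assumptions)."""
    instruction: str      # natural-language prompt given to the agent
    container_image: str  # environment the agent's terminal runs inside
    test_script: str      # command run after the agent finishes; exit code 0 = pass

example = Task(
    instruction="Assemble the reads in /data/reads.fastq and write contigs to /data/out.fasta",
    container_image="benchmarks/bioinformatics:latest",  # hypothetical image name
    test_script="python /tests/check_assembly.py",       # hypothetical checker script
)
```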
EVOLVING BENCHMARKS AND EVALUATION METHODOLOGIES
Terminal-Bench has adapted existing benchmarks and aims to be a comprehensive evaluation harness. The creators emphasize the difficulty of creating high-quality, unique tasks, which cannot simply be scraped from online sources or easily generated by language models. The future of evaluation, they suggest, lies in observing real-world job tasks and translating them into benchmark formats, requiring more specialization and acknowledging that current benchmarks might be addressing the 'low-hanging fruit'.
AGENT VS. MODEL PERFORMANCE DISTINCTION
To truly assess underlying AI capabilities, it's crucial to distinguish between the performance of the model itself and the agent framework used. Terminus, Terminal-Bench's own research preview agent, is designed to be simple and unopinionated, using only a headless terminal for interaction. This minimal setup helps isolate the model's capabilities by avoiding extensive tool development or context management optimizations that might be present in lab-specific agents like Claude Code or Codex CLI.
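The summary does not give Terminus's implementation, but the "simple and unopinionated" loop it describes could be sketched as: show the model the current terminal screen, ask for the next keystrokes, send them, and repeat. The function and the `screen()`/`send()` methods below are placeholders for illustration, not Terminus's real API.

```python
def run_minimal_agent(model, terminal, instruction, max_turns=50):
    """A deliberately unopinionated agent loop: no extra tools, no context tricks.

    `model` is any callable mapping a prompt to text; `terminal` is assumed to
    expose `screen()` (current headless-terminal contents) and `send(keys)`
    (type keystrokes) -- placeholder methods, not a real library API.
    """
    for _ in range(max_turns):
        prompt = (
            f"Task: {instruction}\n"
            f"Terminal screen:\n{terminal.screen()}\n"
            "Reply with the exact keystrokes to send next, or DONE if finished."
        )
        reply = model(prompt)
        if reply.strip() == "DONE":
            break
        terminal.send(reply)
```

Keeping the loop this thin is what lets a score reflect the model rather than the scaffolding around it.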
THE MULTI-DIMENSIONAL FUTURE OF EVALUATION
Beyond simple accuracy, future evaluations will increasingly incorporate multiple dimensions, such as cost, latency, and economic value. The ultimate eval will likely measure the actual money an agent makes or saves when deployed for real-world tasks, like operating as a software engineer or managing investments. This shift acknowledges that simply getting tasks right isn't enough; agents must be efficient and deliver tangible economic benefits, moving away from one-dimensional charts to more comprehensive, profit-and-loss assessments.
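One way to picture such a multi-dimensional report card (the record fields below are invented for illustration) is to log accuracy alongside cost and latency per run and report them side by side rather than collapsing everything into a single leaderboard number.

```python
from dataclasses import dataclass

@dataclass
class RunMetrics:
    """Hypothetical per-task record combining the dimensions discussed above."""
    task_id: str
    passed: bool
    cost_usd: float   # API spend for the run
    latency_s: float  # wall-clock time to completion

def summarize(runs: list[RunMetrics]) -> dict:
    """Report accuracy, cost, and latency together instead of one-dimensional accuracy."""
    n = len(runs)
    return {
        "accuracy": sum(r.passed for r in runs) / n,
        "avg_cost_usd": sum(r.cost_usd for r in runs) / n,
        "avg_latency_s": sum(r.latency_s for r in runs) / n,
    }
```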
TERMINAL-BENCH AS A FRAMEWORK AND ROADMAP
Terminal-Bench is evolving beyond just a benchmark into a robust framework for interactive agent benchmarking and reinforcement learning post-training. The roadmap includes making it easier to create tasks, synthesize tasks from data, run evaluations at scale in the cloud, and version/distribute benchmarks. They aim to provide a superior developer experience, allowing users to bring their agents and have a one-stop shop for evaluation, and even envision its use for private, internal evaluation sets and RL training.
Common Questions
What is Terminal-Bench?
Terminal-Bench is a benchmark designed to evaluate AI agents' capabilities at performing tasks within a terminal environment. It was created by Alex Shaw and Mike Merrill to provide a more robust, terminal-focused evaluation framework than existing benchmarks, inspired by the potential of text-based AI interaction.
Topics
Mentioned in this video
A model designed by Jeffrey Li for text classification, likely related to the 'DataComp for Language Models' paper.
An event or conference where the guests are invited to speak further.
A benchmark for evaluating AI agents' ability to perform tasks in a terminal environment.
Co-founder of Databricks and Perplexity, involved in founding the Laude Institute and Laude Ventures.
A contributor to Terminal-Bench and a PhD student of Ludwig Schmidt.
An agent mentioned in the context of specialized agents.
A framework Terminal-Bench is evolving into, allowing models to be post-trained using reinforcement learning.
A postdoc at Stanford working with Ludwig Schmidt, and a creator of Terminal Bench.
An example used to illustrate the inefficiency of GUI-based systems compared to terminal commands.
An example of agent labs using various models to build their own agents.
An individual at Anthropic who alerted Alex and Mike to their mention in the Claude 4 model card and contributed tasks.
A benchmark that has been adapted for Terminal Bench.
A $1 million prize on SWE-bench that Alex initially worked on.
An agent from OpenAI, mentioned as an example of specialized agents.
Laude Institute, where Alex works, with a culture of shipping research into usable products.
The organization Alex and Mike are from, creators of Terminal Bench.
An organization where Alex works and Andy Konwinski is involved.
An AI model mentioned via the Claude 4 model card, which listed Terminal-Bench.
A biological problem addressed by a Terminal Bench task, involving assembly of sequences.
Terminal-Bench's own research preview agent, designed to be simple and unopinionated.
Company whose t-shirt Alessio is wearing, claiming top performance on Terminal-Bench.
An organization co-founded by Andy Konwinski.
A benchmark mentioned as not yet adapted for Terminal Bench.
University where Alex studied math and computer science.
A previous paper by Jeffrey Li and Ludwig Schmidt in which the need for a fast text classifier arose.
Reinforcement learning, mentioned in relation to how Anthropic might optimize agents like Claude Code.
Mike Merrill's collaborator and PhD advisor at Stanford.
A contributor leading the adapter effort for Terminal Bench.