Terminal-Bench: Pushing Claude Code, OpenAI Codex, Factory Droid, et al to the limits
Key Moments
Terminal-Bench: a coding agent benchmark focused on terminal interaction.
Key Insights
Terminal-based interaction is favored over GUIs for AI agents because text is the native modality of LLMs and single commands are more efficient than sequences of clicks.
Terminal-Bench began before advanced coding agents like Claude Code, but its flexible design accommodated them.
The benchmark is designed for future agent capabilities, not just current uses, encompassing non-coding tasks.
Terminal-Bench aims to be a framework for creating, hosting, and distributing benchmarks, potentially for internal company use.
Evaluating agents requires multi-dimensional metrics beyond accuracy, including cost, latency, and economic value.
Separating agent performance from model performance is crucial for understanding true AI advancement.
THE ORIGINS AND VISION OF TERMINAL-BENCH
Alex Shaw and Mike Merrill, creators of Terminal-Bench, discuss its genesis, from a $1 million prize on SWE-bench to an industry standard for coding-agent benchmarks. Initially driven by a desire for more autonomy and risk-taking, the project evolved. The core idea, influenced by Andy Konwinski, is that the terminal is the ultimate tool for arbitrary computer tasks. This led to the development of Terminal-Bench as a broader abstraction of SWE-bench, focusing on any task completable via code in a terminal rather than only GitHub repositories and pull requests.
THE STRATEGIC ADVANTAGE OF TERMINAL INTERACTION
The decision to focus on terminal-based interaction rather than GUIs was strategic. Text is the native modality of large language models, making terminal interfaces the most performant way for models to reason and execute tasks. Whereas GUIs are often designed for human simplicity and can require many clicks for a complex operation, a single terminal command can achieve the same result. This efficiency and directness make the terminal an ideal environment for AI agents to interact with computer systems.
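To make the efficiency point concrete, here is a purely illustrative sketch (not part of Terminal-Bench): a cleanup job that would take many clicks in a file-manager GUI collapses into one command string an agent can emit as plain text, and a thin harness only needs to run that text and capture the output. The path and command are invented for the example.

```python
import subprocess

# A single command string an agent can emit as text: find every .log file
# larger than 100 MB under /var/log and gzip it in place. Doing the same
# through a file-manager GUI would mean sorting, selecting, and clicking
# through each file by hand.
command = "find /var/log -name '*.log' -size +100M -exec gzip {} +"

# The harness just executes the text it was given and inspects the result.
result = subprocess.run(command, shell=True, capture_output=True, text=True)
print(result.returncode, result.stderr)
```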
DESIGNING FOR THE FUTURE OF AGENTS
Terminal-Bench tasks are defined by an instruction, a container environment, and a test script, allowing for a wide range of operations beyond traditional coding. While coding tasks form a majority, the benchmark includes games, mathematics, and even non-coding applications like automating email drafts or journal entries. The design anticipates the future capabilities of agents, aiming to evaluate them on tasks requiring complex, multi-step reasoning and execution, rather than just present-day usage patterns.
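As a rough illustration of that three-part structure, a task might be modeled as below. The field names, image name, and checker path are assumptions made for the sketch, not the benchmark's actual schema; the only claim taken from the summary is that a task pairs an instruction with a container environment and a test script.

```python
from dataclasses import dataclass

@dataclass
class Task:
    """Illustrative sketch of a Terminal-Bench-style task (field names are assumptions)."""
    instruction: str      # natural-language prompt given to the agent
    container_image: str  # environment the agent's terminal runs inside
    test_script: str      # command run after the agent finishes; exit code 0 = pass

example = Task(
    instruction="Assemble the reads in /data/reads.fastq and write contigs to /data/out.fasta",
    container_image="benchmarks/bioinformatics:latest",  # hypothetical image name
    test_script="python /tests/check_assembly.py",       # hypothetical checker script
)
```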
EVOLVING BENCHMARKS AND EVALUATION METHODOLOGIES
Terminal-Bench has adapted existing benchmarks and aims to be a comprehensive evaluation harness. The creators emphasize the difficulty of creating high-quality, unique tasks, which cannot simply be scraped from online sources or easily generated by language models. The future of evaluation, they suggest, lies in observing real-world job tasks and translating them into benchmark formats, requiring more specialization and acknowledging that current benchmarks might be addressing the 'low-hanging fruit'.
AGENT VS. MODEL PERFORMANCE DISTINCTION
To truly assess underlying AI capabilities, it's crucial to distinguish between the performance of the model itself and the agent framework used. Terminus, Terminal-Bench's own research preview agent, is designed to be simple and unopinionated, using only a headless terminal for interaction. This minimal setup helps isolate the model's capabilities by avoiding extensive tool development or context management optimizations that might be present in lab-specific agents like Claude Code or Codex CLI.
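The summary does not give Terminus's implementation, but the "simple and unopinionated" loop it describes could be sketched as: show the model the current terminal screen, ask for the next keystrokes, send them, and repeat. The function and the `screen()`/`send()` methods below are placeholders for illustration, not Terminus's real API.

```python
def run_minimal_agent(model, terminal, instruction, max_turns=50):
    """A deliberately unopinionated agent loop: no extra tools, no context tricks.

    `model` is any callable mapping a prompt to text; `terminal` is assumed to
    expose `screen()` (current headless-terminal contents) and `send(keys)`
    (type keystrokes) -- placeholder methods, not a real library API.
    """
    for _ in range(max_turns):
        prompt = (
            f"Task: {instruction}\n"
            f"Terminal screen:\n{terminal.screen()}\n"
            "Reply with the exact keystrokes to send next, or DONE if finished."
        )
        reply = model(prompt)
        if reply.strip() == "DONE":
            break
        terminal.send(reply)
```

Keeping the loop this thin is what lets a score reflect the model rather than the scaffolding around it.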
THE MULTI-DIMENSIONAL FUTURE OF EVALUATION
Beyond simple accuracy, future evaluations will increasingly incorporate multiple dimensions, such as cost, latency, and economic value. The ultimate eval will likely measure the actual money an agent makes or saves when deployed for real-world tasks, like operating as a software engineer or managing investments. This shift acknowledges that simply getting tasks right isn't enough; agents must be efficient and deliver tangible economic benefits, moving away from one-dimensional charts to more comprehensive, profit-and-loss assessments.
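One way to picture such a multi-dimensional report card (the record fields below are invented for illustration) is to log accuracy alongside cost and latency per run and report them side by side rather than collapsing everything into a single leaderboard number.

```python
from dataclasses import dataclass

@dataclass
class RunMetrics:
    """Hypothetical per-task record combining the dimensions discussed above."""
    task_id: str
    passed: bool
    cost_usd: float   # API spend for the run
    latency_s: float  # wall-clock time to completion

def summarize(runs: list[RunMetrics]) -> dict:
    """Report accuracy, cost, and latency together instead of one-dimensional accuracy."""
    n = len(runs)
    return {
        "accuracy": sum(r.passed for r in runs) / n,
        "avg_cost_usd": sum(r.cost_usd for r in runs) / n,
        "avg_latency_s": sum(r.latency_s for r in runs) / n,
    }
```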
TERMINAL-BENCH AS A FRAMEWORK AND ROADMAP
Terminal-Bench is evolving beyond just a benchmark into a robust framework for interactive agent benchmarking and reinforcement learning post-training. The roadmap includes making it easier to create tasks, synthesize tasks from data, run evaluations at scale in the cloud, and version/distribute benchmarks. They aim to provide a superior developer experience, allowing users to bring their agents and have a one-stop shop for evaluation, and even envision its use for private, internal evaluation sets and RL training.
Common Questions
What is Terminal-Bench?
Terminal-Bench is a benchmark designed to evaluate AI agents' capabilities at performing tasks within a terminal environment. It was created by Alex Shaw and Mike Merrill to provide a more robust, terminal-focused evaluation framework than existing benchmarks, inspired by the potential of text-based AI interaction.
Topics
Mentioned in this video
A model designed by Jeffrey Li for text classification, likely related to the 'DataComp for Language Models' paper.
An event or conference where the guests are invited to speak further.
A benchmark for evaluating AI agents' ability to perform tasks in a terminal environment.
Co-founder of Databricks and Perplexity, involved in founding the Laude Institute and Laude Ventures.
A contributor to Terminal-Bench and a PhD student of Ludwig Schmidt.
An agent mentioned in the context of specialized agents.
A framework Terminal-Bench is evolving into, allowing models to be post-trained using reinforcement learning.
A postdoc at Stanford working with Ludwig Schmidt, and a creator of Terminal Bench.
An example used to illustrate the inefficiency of GUI-based systems compared to terminal commands.
An example of agent labs using various models to build their own agents.
An individual at Anthropic who alerted Alex and Mike to their mention in the Claude 4 model card and contributed tasks.
A benchmark that has been adapted for Terminal Bench.
A $1 million prize on SWE-bench that Alex initially worked on.
An agent from OpenAI, mentioned as an example of specialized agents.
Laude Institute, where Alex works, with a culture of shipping research into usable products.
The organization Alex and Mike are from, creators of Terminal Bench.
An organization where Alex works and Andy Konwinski is involved.
An AI model mentioned via the Claude 4 model card, which listed Terminal-Bench.
A biological problem addressed by a Terminal Bench task, involving assembly of sequences.
Terminal-Bench's own research preview agent, designed to be simple and unopinionated.
Company whose t-shirt Alessio is wearing, claiming top performance on Terminal-Bench.
An organization co-founded by Andy Konwinski.
A benchmark mentioned as not yet adapted for Terminal Bench.
University where Alex studied math and computer science.
A previous paper by Jeffrey Li and Ludwig Schmidt in which the need for a fast text classifier arose.
Reinforcement learning, mentioned in relation to how Anthropic might optimize agents like Claude Code.
Mike Merrill's collaborator and PhD advisor at Stanford.
A contributor leading the adapter effort for Terminal Bench.