Terminal Bench

Software / App

A benchmark for evaluating AI agents' ability to perform tasks in a terminal environment.

Mentioned in 2 videos

Videos Mentioning Terminal Bench

Terminal-Bench: Pushing Claude Code, OpenAI Codex, Factory Droid, et al to the limits

Latent Space

A benchmark for evaluating AI agents' ability to perform tasks in a terminal environment.

Stanford CS336 Language Modeling from Scratch | Spring 2026 | Lecture 12: Evaluation

Stanford Online

A benchmark that uses a computer terminal as the environment for agents to perform general-purpose tasks, with tasks crowdsourced globally.