Terminal Bench
Software / App
A benchmark for evaluating AI agents' ability to perform tasks in a terminal environment.
Mentioned in 2 videos
Videos Mentioning Terminal Bench

Terminal-Bench: Pushing Claude Code, OpenAI Codex, Factory Droid, et al to the limits
Latent Space

Stanford CS336 Language Modeling from Scratch | Spring 2026 | Lecture 12: Evaluation
Stanford Online
A benchmark that uses a computer terminal as the environment for agents to perform general-purpose tasks, with tasks crowdsourced globally.