A benchmark for evaluating AI agents' ability to perform tasks in a terminal environment.
Latent Space