SWE-Bench
Software / App · Mentioned in 4 videos
The original academic coding benchmark, created by a lab at Princeton, of which SWE-Bench Verified is a cleaned-up version.
Videos Mentioning SWE-Bench

The End of SWE-Bench Verified — Mia Glaese & Olivia Watkins, OpenAI Frontier Evals
Latent Space

GPT 4.1: The New OpenAI Workhorse
Latent Space
An evaluation benchmark for AI models' ability to complete software engineering tasks, on which GPT-4.1 showed significant improvements.

The #1 SWE-Bench Verified Agent
Latent Space
A benchmark used by Augment Code to evaluate and refine their agent capabilities, particularly for achieving verified agent status.
Solve coding, solve AGI [Reflection.ai launch w/ CEO Misha Laskin]
Latent Space
A benchmark used for evaluating autonomous coding capabilities, described as useful but not fully representative of real-world customer settings.