SWE-Bench
The original academic coding benchmark from a Princeton lab; SWE-Bench Verified is a cleaned-up subset of it.
Videos Mentioning SWE-Bench

Beating OpenAI and Anthropic by Looking At Data: the new #1 on SWE-Bench w/ W&B CTO Shawn Lewis
Latent Space
A benchmark for evaluating the performance of AI coding agents, where Lewis's agent achieved the #1 ranking.

The End of SWE-Bench Verified — Mia Glaese & Olivia Watkins, OpenAI Frontier Evals
Latent Space
The original academic coding benchmark from a lab at Princeton, which SWE-Bench Verified was a cleaned-up version of.

Best of 2024 in Agents (from #1 on SWE-Bench Full, Prof. Graham Neubig of OpenHands/AllHands)
Latent Space
A coding benchmark based on real-world GitHub pull requests, used for evaluating agent performance on fixing issues.

Why Compound AI + Open Source will beat Closed AI — with Lin Qiao, CEO of Fireworks AI
Latent Space
A benchmark for evaluating AI models on coding tasks whose leaderboard requires submitting reasoning traces, which is one reason some models do not appear on it.
![\[Paper Club\] SWE-Bench \[OpenAI Verified/Multimodal\] + MLE-Bench with Jesse Hu](https://i.ytimg.com/vi/ULcwHlxfSkQ/maxresdefault.jpg)
[Paper Club] SWE-Bench [OpenAI Verified/Multimodal] + MLE-Bench with Jesse Hu
Latent Space
A benchmark designed to evaluate real-world coding tasks that require understanding codebases, editing multiple files, and running tests, aiming to be much harder than previous benchmarks.

Is finetuning GPT4o worth it?
Latent Space
A benchmark used to evaluate the performance of AI models on software engineering tasks, on which Cosine's Genie has achieved high scores.

GPT 4.1: The New OpenAI Workhorse
Latent Space
An evaluation benchmark for AI models' ability to complete software engineering tasks, where GPT-4.1 showed significant improvements.

The #1 SWE-Bench Verified Agent
Latent Space
A benchmark used by Augment Code to evaluate and refine their agent capabilities, where their agent reached the #1 spot on SWE-Bench Verified.
![Solve coding, solve AGI \[Reflection.ai launch w/ CEO Misha Laskin\]](https://i.ytimg.com/vi/DIu7xA898go/maxresdefault.jpg)
Solve coding, solve AGI [Reflection.ai launch w/ CEO Misha Laskin]
Latent Space
A benchmark used for evaluating autonomous coding capabilities, noted as useful but potentially not fully representative of real-world customer settings.