SWE-Bench Verified
Software / App
A coding benchmark that has become saturated and contaminated, so it no longer gives a reliable measure of progress.
Mentioned in 2 videos
Videos Mentioning SWE-Bench Verified

The End of SWE-Bench Verified — Mia Glaese & Olivia Watkins, OpenAI Frontier Evals
Latent Space
A coding benchmark that has become saturated and contaminated, so it no longer gives a reliable measure of progress.

Is finetuning GPT4o worth it?
Latent Space
A smaller, more cost-effective version of SWE-Bench, used by Cosine for faster iteration and evaluation of Genie.