SWE-Bench Verified
Software / App
Mentioned in 2 videos
A coding benchmark that has become saturated and contaminated, so it no longer tracks progress reliably.
Videos Mentioning SWE-Bench Verified

The End of SWE-Bench Verified — Mia Glaese & Olivia Watkins, OpenAI Frontier Evals
Latent Space
A coding benchmark that has become saturated and contaminated, so it no longer tracks progress reliably.

Is finetuning GPT4o worth it?
Latent Space
A smaller, more cost-effective subset of SWE-Bench, used by Cosine for faster iteration and evaluation of Genie.