SWE-Bench

Software / App

The original academic coding benchmark from a Princeton lab; SWE-Bench Verified is a cleaned-up, human-validated subset of it.
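For readers who want to look at the underlying tasks, here is a minimal sketch of loading the data, assuming the public Hugging Face mirrors (`princeton-nlp/SWE-bench` and `princeton-nlp/SWE-bench_Verified`) and their `repo` / `problem_statement` fields:

```python
# Minimal sketch: load the full benchmark and the human-validated Verified subset
# from their Hugging Face mirrors, then inspect one task instance.
from datasets import load_dataset

# Full benchmark: real GitHub issues paired with the pull requests that resolved them.
swe_bench = load_dataset("princeton-nlp/SWE-bench", split="test")

# Verified subset: tasks screened by human annotators.
swe_bench_verified = load_dataset("princeton-nlp/SWE-bench_Verified", split="test")

task = swe_bench_verified[0]
print(task["repo"])                     # repository the issue comes from
print(task["problem_statement"][:300])  # issue text the agent is asked to resolve
```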

Mentioned in 9 videos

Videos Mentioning SWE-Bench

Beating OpenAI and Anthropic by Looking At Data: the new #1 on SWE-Bench w/ W&B CTO Shawn Lewis

Latent Space

A benchmark for evaluating the performance of AI coding agents, on which Lewis's agent achieved the #1 ranking.

The End of SWE-Bench Verified — Mia Glaese & Olivia Watkins, OpenAI Frontier Evals

Latent Space

The original academic coding benchmark from a Princeton lab; SWE-Bench Verified is a cleaned-up, human-validated subset of it.

Best of 2024 in Agents (from #1 on SWE-Bench Full, Prof. Graham Neubig of OpenHands/AllHands)

Latent Space

A coding benchmark built from real-world GitHub issues and the pull requests that resolved them, used to evaluate how well agents fix issues.

Why Compound AI + Open Source will beat Closed AI — with Lin Qiao, CEO of Fireworks AI

Latent Space

A benchmark for evaluating AI models on coding tasks; its leaderboard requires submission of reasoning traces, which is one reason some models are not listed.

[Paper Club] SWE-Bench [OpenAI Verified/Multimodal] + MLE-Bench with Jesse Hu

Latent Space

A benchmark of real-world coding tasks that require understanding a codebase, editing multiple files, and running tests, designed to be much harder than earlier benchmarks.

Is finetuning GPT4o worth it?

Latent Space

A benchmark used to evaluate AI models on software engineering tasks, on which Cosine's Genie has achieved high scores.

GPT 4.1: The New OpenAI Workhorse

Latent Space

An evaluation benchmark for AI models' ability to complete software engineering tasks, where GPT-4.1 showed significant improvements.

The #1 SWE-Bench Verified Agent

Latent Space

A benchmark Augment Code used to evaluate and refine its agent capabilities, particularly in its push to the top of the SWE-Bench Verified leaderboard.

Solve coding, solve AGI [Reflection.ai launch w/ CEO Misha Laskin]

Latent Space

A benchmark used for evaluating autonomous coding capabilities, noted as useful but potentially not fully representative of real-world customer settings.