SWE-Bench

Software / App

The original academic coding benchmark from a Princeton lab; SWE-Bench Verified is a cleaned-up, human-validated subset of it.
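For readers who want to look at the underlying tasks, here is a minimal sketch of loading the data, assuming the public Hugging Face mirrors (`princeton-nlp/SWE-bench` and `princeton-nlp/SWE-bench_Verified`) and their `repo` / `problem_statement` fields:

```python
# Minimal sketch: load the full benchmark and the human-validated Verified subset
# from their Hugging Face mirrors, then inspect one task instance.
from datasets import load_dataset

# Full benchmark: real GitHub issues paired with the pull requests that resolved them.
swe_bench = load_dataset("princeton-nlp/SWE-bench", split="test")

# Verified subset: tasks screened by human annotators.
swe_bench_verified = load_dataset("princeton-nlp/SWE-bench_Verified", split="test")

task = swe_bench_verified[0]
print(task["repo"])                     # repository the issue comes from
print(task["problem_statement"][:300])  # issue text the agent is asked to resolve
```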

Mentioned in 9 videos

Videos Mentioning SWE-Bench

Beating OpenAI and Anthropic by Looking At Data: the new #1 on SWE-Bench w/ W&B CTO Shawn Lewis

Latent Space

A benchmark for evaluating the performance of AI coding agents, on which Lewis's agent achieved the #1 ranking.

The End of SWE-Bench Verified — Mia Glaese & Olivia Watkins, OpenAI Frontier Evals

Latent Space

The original academic coding benchmark from a Princeton lab; SWE-Bench Verified is a cleaned-up, human-validated subset of it.

Best of 2024 in Agents (from #1 on SWE-Bench Full, Prof. Graham Neubig of OpenHands/AllHands)

Latent Space

A coding benchmark built from real-world GitHub issues and the pull requests that resolved them, used to evaluate how well agents fix issues.

Why Compound AI + Open Source will beat Closed AI — with Lin Qiao, CEO of Fireworks AI

Latent Space

A benchmark for evaluating AI models on coding tasks; its leaderboard requires submission of reasoning traces, which is one reason some models are not listed.

[Paper Club] SWE-Bench [OpenAI Verified/Multimodal] + MLE-Bench with Jesse Hu

Latent Space

A benchmark of real-world coding tasks that require understanding a codebase, editing multiple files, and running tests, designed to be much harder than earlier benchmarks.

Is finetuning GPT4o worth it?

Latent Space

A benchmark used to evaluate AI models on software engineering tasks, on which Cosine's Genie has achieved high scores.

GPT 4.1: The New OpenAI Workhorse

Latent Space

An evaluation benchmark for AI models' ability to complete software engineering tasks, where GPT-4.1 showed significant improvements.

The #1 SWE-Bench Verified Agent

Latent Space

A benchmark Augment Code used to evaluate and refine its agent capabilities, particularly in its push to the top of the SWE-Bench Verified leaderboard.

Solve coding, solve AGI [Reflection.ai launch w/ CEO Misha Laskin]

Latent Space

A benchmark used for evaluating autonomous coding capabilities, noted as useful but potentially not fully representative of real-world customer settings.