SWE-bench (transcribed as "Sweetbench")
A coding benchmark on which Claude models are noted to be ahead.
Videos Mentioning Sweetbench

State of the Art: Training 70B LLMs on 10,000 H100 clusters
Latent Space
A difficult, new coding benchmark focused on bug fixing, recognized for its realism but also for the challenge it poses for evaluation.

The Death of Data Gatekeeping: AI Makes Everyone An Analyst | Hex Cofounder
a16z Deep Dives
A coding benchmark on which Claude models are noted to be ahead.

2024 Year in Review: The Big Scaling Debate, the Four Wars of AI, Top Themes and the Rise of Agents
Latent Space
A leading benchmark for AI models, with specific focus on the 'SWE-bench Verified' and 'SWE-bench Multimodal' variants, indicating evolving metrics for frontier models.

India’s Fastest Growing AI Startup
Y Combinator
A benchmark used to measure the performance of coding agents. Emergent aimed to become number one on this benchmark.

The AI Coding Factory
Latent Space
A benchmark used for evaluating LLMs, which Factory AI no longer competes on due to its irrelevance to enterprise use cases.

⚡️Warp 2.0: the Agentic Development Environment - Zach Lloyd and Ben Holmes
Latent Space
A benchmark on which Warp aims to achieve state-of-the-art scores with its coding agent.