Sweet Bench
Concept
Mentioned in 2 videos
A benchmark for evaluating the reasoning capabilities of language models; fine-tuning on reasoning data produced a model that outperformed OpenAI o1 on it.
Videos Mentioning Sweet Bench

The Unreasonable Effectiveness of Reasoning Distillation: using DeepSeek R1 to beat OpenAI o1
Latent Space
A benchmark for evaluating the reasoning capabilities of language models; fine-tuning on reasoning data produced a model that outperformed OpenAI o1 on it.

The new Claude 3.5 Sonnet, Computer Use, and Building SOTA Agents — with Erik Schluntz, Anthropic
Latent Space
A benchmark developed to evaluate the performance of coding agents, focusing on real-world engineering tasks within existing code repositories.