Sweet Bench
Concept
A benchmark for evaluating the reasoning capabilities of language models, on which models fine-tuned with reasoning data outperformed OpenAI o1.
Mentioned in 2 videos
Videos Mentioning Sweet Bench

The Unreasonable Effectiveness of Reasoning Distillation: using DeepSeek R1 to beat OpenAI o1
Latent Space
A benchmark for evaluating the reasoning capabilities of language models, on which models fine-tuned with reasoning data outperformed OpenAI o1.

The new Claude 3.5 Sonnet, Computer Use, and Building SOTA Agents — with Erik Schluntz, Anthropic
Latent Space
A benchmark developed to evaluate the performance of coding agents, focusing on real-world engineering tasks within existing code repositories.