GSM8K
Concept · Mentioned in 2 videos
A benchmark of grade-school math word problems used to measure mathematical reasoning. Along with others such as HumanEval, it is considered 'contaminated' or saturated, meaning high scores no longer reliably indicate breakthrough performance.
Videos Mentioning GSM8K

The 10,000x Yolo Researcher Metagame — with Yi Tay of Reka
Latent Space
A mathematical reasoning benchmark that, along with others such as HumanEval, is considered 'contaminated' or saturated, meaning high scores no longer reliably indicate breakthrough performance.

State of the Art: Training 70B LLMs on 10,000 H100 clusters
Latent Space
A math reasoning benchmark noted for some 'weird' qualities, so its results require careful interpretation.