GSM8K
Concept
A mathematical reasoning benchmark that, like HumanEval, is considered 'contaminated' or saturated: high scores no longer reliably indicate breakthrough performance.
Mentioned in 2 videos
Videos Mentioning GSM8K

The 10,000x Yolo Researcher Metagame — with Yi Tay of Reka
Latent Space
A mathematical reasoning benchmark that, like HumanEval, is considered 'contaminated' or saturated: high scores no longer reliably indicate breakthrough performance.

State of the Art: Training 70B LLMs on 10,000 H100 clusters
Latent Space
A math reasoning benchmark noted for some 'weird' qualities, so its scores require careful interpretation.