MBPP
Study / Research
A benchmark widely used to evaluate LLMs on coding tasks, focusing on functional correctness: whether the generated code passes the test cases supplied with each problem.
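To illustrate what "functional correctness" means here, the following is a minimal sketch of a pass/fail check, assuming each problem comes with a few assert statements. The helper name `passes_all_tests` and the sample problem are hypothetical and are not taken from the benchmark or from any official harness.

```python
def passes_all_tests(candidate_source: str, test_asserts: list[str]) -> bool:
    """Return True if the candidate code runs and every assert statement holds."""
    namespace: dict = {}
    try:
        # Define the candidate function(s) in an isolated namespace.
        exec(candidate_source, namespace)
        # Run each test; a failing assert raises AssertionError.
        for test in test_asserts:
            exec(test, namespace)
    except Exception:
        # Any error (syntax, runtime, failed assertion) counts as a failure.
        return False
    return True


# Hypothetical problem, for illustration only.
candidate = "def add(a, b):\n    return a + b\n"
tests = ["assert add(1, 2) == 3", "assert add(-1, 1) == 0"]
print(passes_all_tests(candidate, tests))
```

This sketch runs code with `exec` in the current process, which is unsafe for untrusted model output. A real evaluation would likely add sandboxing and timeouts.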
Mentioned in 2 videos
Videos Mentioning MBPP

State of the Art: Training 70B LLMs on 10,000 H100 clusters
Latent Space
A coding benchmark for which Imbue created its own internal, cleaned version to remove ambiguity and verify examples.

AI Dev 25 x NYC | Manish Kapur: Assessing the Quality of AI Generated Code
DeepLearningAI
A benchmark used in the industry to evaluate LLMs for coding tasks, focusing on functional correctness.