HELM
Concept
A suite of benchmarks whose results are difficult to compare because the individual benchmarks accept different answer formats.
Mentioned in 2 videos
Videos Mentioning HELM

FlashAttention-2: Making Transformers 800% faster AND exact
Latent Space
Holistic Evaluation of Language Models (HELM), a holistic benchmark for evaluating language models, developed by the Stanford Center for Research on Foundation Models (CRFM).

SmartGPT: Major Benchmark Broken - 89.0% on MMLU + Exam's Many Errors
AI Explained
A suite of benchmarks whose results are difficult to compare because the individual benchmarks accept different answer formats.