HumanEval
A benchmark of hand-written Python programming problems, introduced by OpenAI (Chen et al., 2021) to evaluate the code-generation capabilities of language models, typically scored with the pass@k metric.
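HumanEval results are usually reported with the unbiased pass@k estimator defined in the Codex paper (Chen et al., 2021). The sketch below is a minimal Python version of that estimator, assuming n completions are sampled per problem and c of them pass the unit tests; the example numbers are illustrative, not from any real run.

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator from Chen et al. (2021).

    n: total completions sampled for a problem
    c: completions that passed the problem's unit tests
    k: sample budget being scored
    """
    if n - c < k:
        return 1.0  # every size-k subset contains a passing completion
    return 1.0 - comb(n - c, k) / comb(n, k)

# Illustrative numbers: 200 samples per problem, 30 passing, scored at k=10.
print(pass_at_k(n=200, c=30, k=10))  # ≈ 0.8111
```

Averaging this estimate over all 164 problems gives the benchmark score; computing it analytically avoids the high variance of literally drawing k-sample subsets.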
Videos Mentioning HumanEval
![[LLM Paper Club] Llama 3.1 Paper: The Llama Family of Models](https://i.ytimg.com/vi/TgLSYIBoX5U/maxresdefault.jpg)
[LLM Paper Club] Llama 3.1 Paper: The Llama Family of Models
Latent Space
A benchmark dataset used to evaluate the coding capabilities of language models.

State of the Art: Training 70B LLMs on 10,000 H100 clusters
Latent Space
A coding benchmark referenced in a bet about DBRX's performance; the results initially looked quite bad but ultimately exceeded expectations.

The 10,000x Yolo Researcher Metagame — with Yi Tay of Reka
Latent Space
A coding benchmark described as saturated and contaminated, much like GSM8K, and therefore less effective for genuinely evaluating new model capabilities.

The End of SWE-Bench Verified — Mia Glaese & Olivia Watkins, OpenAI Frontier Evals
Latent Space
An earlier evaluation discussed alongside SWE-Bench Verified, which was initially created as part of OpenAI's effort to track model autonomy.
![[Paper Club] SWE-Bench [OpenAI Verified/Multimodal] + MLE-Bench with Jesse Hu](https://i.ytimg.com/vi/ULcwHlxfSkQ/maxresdefault.jpg)
[Paper Club] SWE-Bench [OpenAI Verified/Multimodal] + MLE-Bench with Jesse Hu
Latent Space
An earlier benchmark for evaluating coding models, criticized for being too easy and for letting models simply output answers without complex problem-solving.