HumanEval
A benchmark of hand-written Python programming problems, introduced by OpenAI (Chen et al., 2021) to evaluate the code-generation capabilities of language models, typically scored with the pass@k metric.
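HumanEval results are usually reported with the unbiased pass@k estimator defined in the Codex paper (Chen et al., 2021). The sketch below is a minimal Python version of that estimator, assuming n completions are sampled per problem and c of them pass the unit tests; the example numbers are illustrative, not from any real run.

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator from Chen et al. (2021).

    n: total completions sampled for a problem
    c: completions that passed the problem's unit tests
    k: sample budget being scored
    """
    if n - c < k:
        return 1.0  # every size-k subset contains a passing completion
    return 1.0 - comb(n - c, k) / comb(n, k)

# Illustrative numbers: 200 samples per problem, 30 passing, scored at k=10.
print(pass_at_k(n=200, c=30, k=10))  # ≈ 0.8111
```

Averaging this estimate over all 164 problems gives the benchmark score; computing it analytically avoids the high variance of literally drawing k-sample subsets.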
Videos Mentioning HumanEval
![[LLM Paper Club] Llama 3.1 Paper: The Llama Family of Models](https://i.ytimg.com/vi/TgLSYIBoX5U/maxresdefault.jpg)
[LLM Paper Club] Llama 3.1 Paper: The Llama Family of Models
Latent Space
A benchmark dataset used to evaluate the coding capabilities of language models.

State of the Art: Training 70B LLMs on 10,000 H100 clusters
Latent Space
A coding benchmark referenced in a bet about DBRX's performance; the results initially looked quite bad but ultimately exceeded expectations.

The 10,000x Yolo Researcher Metagame — with Yi Tay of Reka
Latent Space
A coding benchmark described as saturated and contaminated, much like GSM8K, and therefore less effective for genuinely evaluating new model capabilities.

The End of SWE-Bench Verified — Mia Glaese & Olivia Watkins, OpenAI Frontier Evals
Latent Space
An earlier evaluation discussed alongside SWE-Bench Verified, which was initially created as part of OpenAI's effort to track model autonomy.
![[Paper Club] SWE-Bench [OpenAI Verified/Multimodal] + MLE-Bench with Jesse Hu](https://i.ytimg.com/vi/ULcwHlxfSkQ/maxresdefault.jpg)
[Paper Club] SWE-Bench [OpenAI Verified/Multimodal] + MLE-Bench with Jesse Hu
Latent Space
An earlier benchmark for evaluating coding models, criticized for being too easy and for letting models simply output answers without complex problem-solving.