Stanford HELM

Software / App

Stanford's Holistic Evaluation of Language Models (HELM), a benchmarking project for evaluating language models, mentioned in the context of collecting benchmark numbers.

Mentioned in 2 videos

Videos Mentioning Stanford HELM

The Agent Reasoning Interface: Claude, ChatGPT Canvas, Tasks, Operator — with Karina Nguyen, OpenAI

Latent Space

A benchmark evaluation where Claude reportedly performed poorly due to incorrect prompting techniques, illustrating the challenges in consistent model evaluation.

Artificial Analysis: The Independent LLM Analysis House — with George Cameron and Micah Hill-Smith

Latent Space

Mentioned as a model evaluation project in the context of collecting benchmark numbers.