Stanford HELM
Software / App
Mentioned in 2 videos
Holistic Evaluation of Language Models (HELM), a language-model evaluation framework from Stanford's Center for Research on Foundation Models, mentioned in the context of collecting benchmark numbers.
Videos Mentioning Stanford HELM

The Agent Reasoning Interface: Claude, ChatGPT Canvas, Tasks, Operator — with Karina Nguyen, OpenAI
Latent Space
A benchmark evaluation where Claude reportedly performed poorly due to incorrect prompting techniques, illustrating the challenges in consistent model evaluation.

Artificial Analysis: The Independent LLM Analysis House — with George Cameron and Micah Hill-Smith
Latent Space
A language-model evaluation project, mentioned in the context of collecting benchmark numbers.