Stanford HELM
Software / App
Holistic Evaluation of Language Models (HELM), a Stanford project for evaluating language models, mentioned in the context of collecting benchmark numbers.
Mentioned in 2 videos
Videos Mentioning Stanford HELM

The Agent Reasoning Interface: Claude, ChatGPT Canvas, Tasks, Operator — with Karina Nguyen, OpenAI
Latent Space
A benchmark evaluation where Claude reportedly performed poorly due to incorrect prompting techniques, illustrating the challenges in consistent model evaluation.

Artificial Analysis: The Independent LLM Analysis House — with George Cameron and Micah Hill-Smith
Latent Space
A model evaluation project, mentioned in the context of collecting benchmark numbers.