LLM Evaluation

11 video summaries

Build a research pod on LLM Evaluation.

11 videos curated. Save them to your own pod, ask any question across the body of expert opinion, and connect it to Claude or ChatGPT.

Videos About LLM Evaluation

Stanford CS547 HCI Seminar | Winter 2026 | Creation, Evolution, and Formalization of Notations

Stanford Online

Beating OpenAI and Anthropic by Looking At Data: the new #1 on SWE-Bench w/ W&B CTO Shawn Lewis

Latent Space

[Paper Club] Who Validates the Validators? Aligning LLM-Judges with Humans (w/ Eugene Yan)

Latent Space

Production AI Engineering starts with Evals

Latent Space

[Paper Club] DocETL: Agentic Query Rewriting + Eval for Complex Document Processing w Shreya Shankar

Latent Space

How to train a Million Context LLM — with Mark Huang of Gradient.ai

Latent Space

Your Ground Truth Is Wrong: Evaluating STT with truth files & semantic WER | AssemblyAI Workshop

AssemblyAI

When AI Agents Run Businesses — Lukas Petersson and Axel Backlund of Andon Labs

Latent Space

RAG is a hack - with Jerry Liu of LlamaIndex

Latent Space

Stanford CS336 Language Modeling from Scratch | Spring 2026 | Lecture 12: Evaluation

Stanford Online

Personal benchmarks vs HumanEval - with Nicholas Carlini of DeepMind

Latent Space