AlpacaEval

Software / App

A benchmark that uses LLMs as judges to evaluate model responses. The judges initially favored longer outputs; the scores were later debiased for length with regression methods.
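To make the idea of regression-based length debiasing concrete, here is a minimal sketch. It is a simplified illustration, not AlpacaEval's actual implementation: the function name `length_controlled_win_rate`, the single-feature logistic model, and the toy data are all assumptions. It fits a logistic regression of the judge's verdict on the length difference between the two responses, then reports the predicted win rate when that difference is set to zero.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def length_controlled_win_rate(wins, len_diffs, lr=0.5, steps=5000):
    """Fit P(win) = sigmoid(theta + phi * x) by gradient descent.

    wins:      1 if the judge preferred the model, else 0 (hypothetical data)
    len_diffs: model length minus baseline length, per example
    x is the length difference, scaled by its RMS and squashed with tanh.
    Returns (win rate predicted at zero length difference, phi).
    """
    n = len(wins)
    scale = (sum(d * d for d in len_diffs) / n) ** 0.5 or 1.0
    xs = [math.tanh(d / scale) for d in len_diffs]
    theta = phi = 0.0
    for _ in range(steps):
        g_theta = g_phi = 0.0
        for y, x in zip(wins, xs):
            err = sigmoid(theta + phi * x) - y  # gradient of the log loss
            g_theta += err
            g_phi += err * x
        theta -= lr * g_theta / n
        phi -= lr * g_phi / n
    # With x = 0 the length term drops out, leaving only theta.
    return sigmoid(theta), phi


# Toy data (assumed): the model wins 5 of 6 times when longer, 1 of 4 when shorter.
wins = [1, 1, 1, 1, 1, 0, 1, 0, 0, 0]
len_diffs = [200] * 6 + [-200] * 4

raw = sum(wins) / len(wins)
lc, phi = length_controlled_win_rate(wins, len_diffs)
print(f"raw win rate: {raw:.3f}, length-controlled: {lc:.3f}, phi: {phi:.3f}")
```

On this toy data the fitted length coefficient `phi` is positive, so part of the raw win rate is attributed to length and the length-controlled figure comes out lower than the raw one.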
