AlpacaEval
Software / App
A benchmark that uses an LLM as a judge to evaluate model responses. Its scores initially favored longer outputs; a later version corrects for this length bias with a regression-based adjustment.