Ruler Suite

Software / App

A benchmark suite for evaluating long-context models. It goes beyond single-needle retrieval to cover multi-needle retrieval, variable tracking, and summary statistics.
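
To make "multi-needle retrieval" concrete, here is a minimal toy sketch, not the suite's actual implementation. It hides several key-value facts in filler text and scores a response by how many values it recovers. The helper names (`build_haystack`, `recall_score`), the filler sentence, and the needle wording are all hypothetical, chosen only for illustration.

```python
import random

def build_haystack(needles, n_filler=200, seed=0):
    """Scatter 'needle' facts among filler sentences and return one context string.

    Hypothetical helper: the filler text and needle phrasing are illustrative
    assumptions, not taken from the benchmark itself.
    """
    rng = random.Random(seed)  # seeded so the layout is reproducible
    lines = ["The grass is green and the sky is blue."] * n_filler
    for key, value in needles.items():
        pos = rng.randrange(len(lines) + 1)  # any slot, including the very end
        lines.insert(pos, f"The special magic number for {key} is {value}.")
    return " ".join(lines)

def recall_score(response, needles):
    """Fraction of needle values that appear verbatim in the response."""
    found = sum(1 for value in needles.values() if value in response)
    return found / len(needles)

needles = {"apple": "4821", "river": "9034", "lantern": "1576"}
context = build_haystack(needles)

# Stand-in for a model's answer that recovered two of the three values.
mock_response = "The numbers are 4821 and 1576."
score = recall_score(mock_response, needles)
```

In a real evaluation, `context` plus a question would be sent to the model and its answer would take the place of `mock_response`. Substring matching is the simplest possible scoring rule and is assumed here only to keep the sketch short.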

Mentioned in 1 video