Web of Lies

Study / Research

A reasoning benchmark on which models can score 100% when trained on that specific task, suggesting that strong results may reflect brittle, task-specific training rather than general reasoning.

Mentioned in 1 video