W

Web of Lies

Study / Research · Mentioned in 1 video

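A reasoning benchmark on which models can score 100% once they are trained on that specific task. This highlights potential brittleness: a perfect score may reflect task-specific training rather than general reasoning ability.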