SWAG

Software / App

Another benchmark that appears to suffer from the 'first character equals final answer' issue, similar to the problems found in MMLU.

Mentioned in 1 video