A suite of benchmarks whose results are hard to compare because the benchmarks accept different answer formats.
AI Explained