How Intelligent Is AI, Really?
Key Moments
Current AI benchmarks are easily gamed, but the ARC-AGI test evaluates true learning ability, revealing current models' significant limitations in generalization.
Key Insights
Intelligence is defined by the ability to learn new things efficiently, not just by scoring high on difficult tests.
Humans can score 100% on the ARC-AGI benchmark, while early GPT-4 base models scored only 4-5%.
ARC-AGI 2, released in March 2025, is a deeper version of the original static benchmark.
ARC-AGI 3, launching next year, will be an interactive benchmark using 150 video game environments with no instructions, testing adaptivity.
Future AGI will likely be declared using interactive benchmarks, reflecting real-world action-reaction dynamics.
ARC-AGI aims to measure efficiency not just by accuracy, but by the number of actions (AI vs. human) and energy consumed.
Intelligence redefined: learning over memorization
The traditional view of intelligence often equates it with scoring high on standardized tests or mastering complex tasks. However, François Chollet's proposed definition, which forms the basis of the ARC Prize Foundation's work, redefines intelligence as the "ability to learn new things." This perspective shifts the focus from sheer knowledge acquisition or computational power to the capacity for efficient learning and generalization. While AI excels at specific, well-defined tasks like chess or Go, mastering entirely new skills remains a significant challenge. The ARC benchmark was therefore designed not to test how hard a problem AI can solve, but how quickly and effectively it can learn to solve new, unseen problems, mirroring human learning. This has profound implications for measuring AI progress, moving beyond "PhD++" problems that merely push the boundaries of existing capabilities and toward assessments that reveal true adaptability.
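One informal way to express this definition, loosely adapted from Chollet's "On the Measure of Intelligence" rather than quoted from the episode, is as a ratio of skill gained to the priors and experience spent acquiring it:

```latex
\text{intelligence} \;\propto\; \frac{\text{skill acquired on previously unseen tasks}}{\text{priors} + \text{experience (data, actions, energy)}}
```

On this view, two systems that reach the same skill level are not equally intelligent: the one that got there with less built-in prior knowledge and less experience is the more intelligent of the two.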
ARC-AGI: A benchmark for generalization
The ARC benchmark, initially proposed in 2019, serves as a crucial tool for evaluating an AI's ability to generalize. Unlike many benchmarks that focus on increasingly difficult problems solvable with brute-force computation or massive datasets, ARC problems are designed to be solvable by average humans. This ensures that performance on ARC reflects genuine reasoning and learning rather than a high degree of specialized training. Early results were stark: while humans could achieve near-perfect scores, the GPT-4 base model, which lacked dedicated reasoning capabilities, scored as low as 4-5%. This significant gap highlighted the limitations of pre-training alone and the emerging importance of reasoning paradigms. The subsequent jump to 21% with models like o1-preview demonstrated the impact of architectural advances and reinforced ARC's value in identifying such breakthroughs.
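For concreteness, ARC-AGI 1 tasks are distributed as small JSON files of colored grids. The sketch below mirrors that public format, with a toy transformation (a horizontal mirror) invented purely for illustration:

```python
# A toy task in the ARC-AGI 1 JSON layout: a few "train" demonstration pairs
# plus a held-out "test" pair. Each grid is a 2-D list of color codes 0-9.
# The hidden rule here (mirror each row) is invented for this example.
toy_task = {
    "train": [
        {"input": [[1, 0], [2, 0]], "output": [[0, 1], [0, 2]]},
        {"input": [[3, 0, 0], [0, 4, 0]], "output": [[0, 0, 3], [0, 4, 0]]},
    ],
    "test": [
        {"input": [[5, 0, 0]], "output": [[0, 0, 5]]},
    ],
}

def solve(grid):
    """A candidate program; a solver must induce this from the train pairs alone."""
    return [list(reversed(row)) for row in grid]

# The train pairs are the only specification of the rule a test-taker gets.
assert all(solve(p["input"]) == p["output"] for p in toy_task["train"])
assert solve(toy_task["test"][0]["input"]) == toy_task["test"][0]["output"]
```

What makes the tasks hard for AI is not any single rule but that each task hides a different one, so memorizing transformations seen during training does not transfer.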
Evolution of the ARC benchmark
The ARC benchmark has evolved through several iterations to better capture the nuances of general intelligence. ARC-AGI 1, released in 2019 by François Chollet, featured 800 manually created tasks designed to test learnability. In March 2025, ARC-AGI 2 was introduced as an upgraded, deeper version of the static benchmark, offering more complex challenges within the same framework. The most significant upcoming development is ARC-AGI 3, slated for release next year. This next iteration will introduce interactivity, moving beyond static problem sets to dynamic, game-like environments. This shift is driven by the belief that true intelligence is demonstrated through continuous interaction with an environment, involving taking actions and learning from feedback, mirroring real-world scenarios. This interactive nature is seen as a crucial step towards declaring AGI.
ARC-AGI 3: Interactive and instruction-less environments
ARC-AGI 3 represents a major leap forward by introducing interactive elements and removing explicit instructions. Instead of relying on textual or symbolic guidance, test-takers will be presented with approximately 150 video game-like environments where they must infer the goal through experimentation, taking actions and observing how the system responds. Crucially, as with previous versions, every ARC-AGI 3 game will be tested on a panel of general public members to establish a minimum solvability threshold, and games failing to meet it will be excluded. This keeps the benchmark accessible to ordinary humans while remaining challenging for AI, so that success on ARC-AGI 3 will definitively indicate a level of intelligence that current AI systems lack.
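The episode does not spell out ARC-AGI 3's programmatic interface, so the loop below is a minimal sketch under an assumed, Gym-style reset/step API; the real benchmark may expose something quite different:

```python
# Hypothetical interface for an instruction-less interactive game.
# The Game protocol and the play() loop are assumptions for illustration,
# not the actual ARC-AGI 3 API.
from typing import Any, Protocol

class Game(Protocol):
    def reset(self) -> Any: ...                           # initial observation, no goal text
    def step(self, action: int) -> tuple[Any, bool]: ...  # (next observation, solved flag)

def play(game: Game, agent: Any, max_actions: int = 1000) -> int | None:
    """Run one episode; return the number of actions used, or None if unsolved."""
    obs = game.reset()
    for n in range(1, max_actions + 1):
        action = agent.act(obs)         # the agent experiments: no instructions given
        obs, solved = game.step(action)
        agent.observe(obs)              # learning must come from observed feedback alone
        if solved:
            return n                    # the action count feeds the efficiency score
    return None
```

The key property this sketch tries to capture is that the agent receives no reward shaping or goal description up front; figuring out what counts as "solved" is itself part of the test.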
Measuring efficiency beyond accuracy
While accuracy is one component, ARC-AGI is increasingly focused on measuring the efficiency of intelligence. This encompasses not only the amount of training data required but also the computational resources and energy consumed. Unlike wall-clock time, which can be manipulated by throwing more compute at a problem, data-point counts and energy expenditure are more fundamental metrics. For ARC-AGI 3, efficiency will be measured by comparing the number of actions an AI takes to solve a game against the average number of actions humans take. This directly contrasts with older approaches, such as the deep reinforcement learning agents of the Atari era, which relied on brute-force training over millions of frames and actions. ARC-AGI 3 will normalize AI performance against human averages, penalizing inefficient brute-force strategies and rewarding generalized problem-solving.
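The episode describes the comparison (AI action count versus the human average) but not an exact scoring formula, so the normalization below is an assumption chosen for illustration:

```python
# Assumed normalization: ratio of the human average action count to the
# AI's action count. The real ARC-AGI 3 scoring rule may differ.
def action_efficiency(ai_actions: int, human_avg_actions: float) -> float:
    """Scores above 1.0 mean the solver used fewer actions than the average human."""
    return human_avg_actions / ai_actions

print(action_efficiency(ai_actions=400, human_avg_actions=80.0))  # 0.2   brute force is penalized
print(action_efficiency(ai_actions=60, human_avg_actions=80.0))   # ~1.33 beats the human average
```

Under any normalization of this shape, an Atari-style agent that needs millions of exploratory actions scores near zero even if it eventually solves the game, which is exactly the behavior the benchmark is designed to penalize.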
The path to AGI: necessary but not sufficient
Achieving a perfect score on ARC-AGI benchmarks, even ARC-AGI 3, is considered a necessary but not sufficient condition for AGI. If a model were to achieve 100% on the ARC-AGI benchmarks tomorrow, it would provide the most authoritative evidence to date of a system's generalization capabilities. However, the ARC Prize Foundation emphasizes that AGI involves potentially more complex and multifaceted abilities. The focus remains on understanding the failure points even in high-scoring systems and on developing benchmarks that can robustly identify true AGI when it emerges. Ultimately, the goal of the ARC Prize Foundation is to be equipped to reliably identify and declare AGI, ensuring that progress is accurately measured and genuinely transformative advancements are recognized.
Comparison of ARC-AGI Benchmark Versions
Data extracted from this episode

| Version | Release | Type | Key Feature |
|---|---|---|---|
| ARC-AGI 1 | 2019 | Static | 800 tasks created by François Chollet |
| ARC-AGI 2 | March 2025 | Static | Deeper, upgraded version of ARC-AGI 1 |
| ARC-AGI 3 | 2026 (planned) | Interactive | 150 gamified environments, no instructions |
Common Questions
How does the ARC Prize Foundation define intelligence?
The ARC Prize Foundation defines intelligence as the ability to learn new things efficiently. This is a shift from traditional benchmarks that focus on knowledge recall or problem-solving difficulty.
Mentioned in this video
The base model of GPT-4, tested on the ARC benchmark in 2024, initially scored very low (4-5%), highlighting the challenge LLMs face in learning new tasks they were not specifically pre-trained on.
The initial version of the ARC-AGI benchmark, released in 2019, proposed by François Chollet and comprising 800 tasks he created by hand.
A model released by Anthropic, measured using the ARC-AGI benchmark as part of the trend of major AI labs adopting the benchmark for performance reporting.
An upgraded version of the ARC-AGI benchmark, released in March 2025, which is a deeper, but still static, benchmark compared to ARC-AGI 1.
A benchmark proposed by François Chollet to test an AI's ability to learn new things, designed such that normal humans can solve the tasks, unlike more complex PhD++ problems.
A Google Gemini model measured using the ARC-AGI benchmark, indicating the benchmark's widespread use in reporting AI performance.
A model release from xAI that is being measured using the ARC-AGI benchmark, as part of the increasing adoption of the benchmark by major AI labs.
The upcoming interactive version of the ARC-AGI benchmark, set to be released next year, featuring 150 gamified environments with no instructions provided.
A benchmark mentioned as an example of 'PhD++ problems' in AI, which current models are surpassing, in contrast to the ARC benchmarks that normal people can solve.
Mentioned as a traditional example of how intelligence is often measured (e.g., by scoring high on a test), contrasting with the ARC benchmark's focus on learning ability.
Environments for RL are discussed as a common approach in AI development, but the speaker cautions against them as a sole measure of progress, likening it to 'whack-a-mole' and emphasizing the need for generalization without predefined environments.
Defined as the ability to learn new things more efficiently, this is the core concept behind the ARC benchmark, contrasting with traditional measures of knowledge recall.
A leading AI research lab that uses the ARC-AGI benchmark for reporting model performance, signifying its adoption as a standard for evaluating AI capabilities.
Associated with Gemini 3 Pro in the context of AI model releases being measured by the ARC-AGI benchmark.
A company that released Opus 4.5, which is being evaluated using the ARC-AGI benchmark, showcasing the benchmark's increasing importance in the AI field.
A prominent AI lab that, along with OpenAI and others, now uses the ARC-AGI benchmark for reporting model performance, further evidence of its adoption as a standard for evaluating AI capabilities.