How Intelligent Is AI, Really?

Y Combinator
Dec 17, 2025
TL;DR

Current AI benchmarks are easily gamed, but the ARC-AGI test evaluates true learning ability, revealing current models' significant limitations in generalization.

Key Insights

1. Intelligence is defined by the ability to learn new things efficiently, not just by scoring high on difficult tests.
2. Humans can score 100% on the ARC-AGI benchmark, while early GPT-4 base models scored only 4-5%.
3. ARC-AGI 2, released in March 2025, is a deeper version of the original static benchmark.
4. ARC-AGI 3, launching next year, will be an interactive benchmark using 150 video game environments with no instructions, testing adaptivity.
5. Future AGI will likely be declared using interactive benchmarks, reflecting real-world action-reaction dynamics.
6. ARC-AGI aims to measure efficiency not just by accuracy, but by the number of actions (AI vs. human) and energy consumed.

Intelligence redefined: learning over memorization

The traditional view of intelligence often equates it with scoring high on standardized tests or mastering complex tasks. However, François Chollet's proposed definition, forming the basis of the ARC Prize Foundation's work, redefines intelligence as the "ability to learn new things." This perspective shifts the focus from sheer knowledge acquisition or computational power to the capacity for efficient learning and generalization. While AI excels at specific, well-defined tasks like chess or Go, mastering entirely new skills remains a significant challenge. The ARC benchmark, therefore, was designed not to test how hard a problem AI can solve, but how quickly and effectively it can learn to solve new, unseen problems, mirroring human learning. This has profound implications for how we measure AI progress, moving beyond "PhD++" problems that merely push the boundaries of existing AI capabilities and towards assessments that reveal true adaptability.

ARC-AGI: A benchmark for generalization

The ARC benchmark, initially proposed in 2019, serves as a crucial tool for evaluating an AI's ability to generalize. Unlike many other benchmarks that focus on increasingly difficult problems solvable with brute-force computation or massive datasets, ARC problems are designed to be solvable by average humans. This ensures that performance on ARC is indicative of genuine reasoning and learning, rather than a high degree of specialized training. Early results were stark: while humans could achieve near-perfect scores, the GPT-4 base model, which lacked dedicated reasoning capabilities, scored as low as 4-5%. This significant gap highlighted the limitations of pre-training alone and the emerging importance of reasoning paradigms. The subsequent jump in performance to 21% with models like o1-preview demonstrated the impact of architectural advancements and reinforced ARC's value in identifying such breakthroughs.
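To make the "learn from a few examples" framing concrete, the sketch below uses the publicly documented ARC task format: a JSON-like object with a few "train" input/output grid pairs (integers 0-9 encode colors) and "test" inputs whose outputs must be predicted. The specific grids and the `mirror_horizontally` rule are invented for illustration, not taken from the actual benchmark.

```python
# A toy task in the public ARC format: train pairs demonstrate a
# transformation, and the solver must apply it to unseen test inputs.
task = {
    "train": [
        {"input": [[1, 0], [0, 0]], "output": [[0, 1], [0, 0]]},
        {"input": [[0, 0], [2, 0]], "output": [[0, 0], [0, 2]]},
    ],
    "test": [{"input": [[3, 0], [0, 0]]}],
}

def mirror_horizontally(grid):
    """Candidate rule inferred from the train pairs: flip each row."""
    return [list(reversed(row)) for row in grid]

# Verify the candidate rule explains every training pair before
# applying it to the test input -- the core few-shot learning loop.
assert all(mirror_horizontally(p["input"]) == p["output"]
           for p in task["train"])
prediction = mirror_horizontally(task["test"][0]["input"])
print(prediction)  # [[0, 3], [0, 0]]
```

The point of the format is that each task demands inferring a new rule from two or three demonstrations, which is why memorization-heavy models score poorly.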

Evolution of the ARC benchmark

The ARC benchmark has evolved through several iterations to better capture the nuances of general intelligence. ARC-AGI 1, released in 2019 by François Chollet, featured 800 manually created tasks designed to test learnability. In March 2025, ARC-AGI 2 was introduced as an upgraded, deeper version of the static benchmark, offering more complex challenges within the same framework. The most significant upcoming development is ARC-AGI 3, slated for release next year. This next iteration will introduce interactivity, moving beyond static problem sets to dynamic, game-like environments. This shift is driven by the belief that true intelligence is demonstrated through continuous interaction with an environment, involving taking actions and learning from feedback, mirroring real-world scenarios. This interactive nature is seen as a crucial step towards declaring AGI.

ARC-AGI 3: Interactive and instruction-less environments

ARC-AGI 3 represents a major leap forward by introducing interactive elements and removing explicit instructions. Instead of relying on textual or symbolic guidance, test-takers will be presented with approximately 150 video game-like environments where they must infer the goal through experimentation. By taking actions and observing the system's response, both humans and AI must learn the objective. Crucially, similar to previous versions, every ARC-AGI 3 game will be tested on a panel of general public members to establish a minimum solvability threshold. Games failing to meet this threshold will be excluded, ensuring that the benchmark remains accessible to ordinary humans while challenging AI. This ensures that success on ARC-AGI 3 will definitively indicate a level of intelligence that current AI systems lack.
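The action-and-observation loop described above can be sketched as a toy environment. Everything here is an invented stand-in (the `GridGame` class, the opaque actions "A"/"B", and the reach-the-target goal), not one of the actual ARC-AGI 3 games; the sketch only illustrates inferring a goal from feedback with no instructions given.

```python
import random

class GridGame:
    """Hypothetical instruction-less environment: the agent sees only
    an observation and a done flag, and must discover both what the
    actions do and what the goal is by experimenting."""

    def __init__(self, size=5):
        self.size = size
        self.pos = 0               # hidden state: position on a line
        self.target = size - 1     # hidden goal: reach the last cell

    def step(self, action):
        # Action symbols are opaque; their meaning must be discovered.
        if action == "A":
            self.pos = min(self.pos + 1, self.size - 1)
        elif action == "B":
            self.pos = max(self.pos - 1, 0)
        done = self.pos == self.target
        return self.pos, done      # observation, goal-reached flag

env = GridGame()
actions_taken = 0
done = False
while not done:
    obs, done = env.step(random.choice("AB"))  # purely exploratory agent
    actions_taken += 1
print("solved in", actions_taken, "actions")
```

A random explorer eventually stumbles onto the goal; an intelligent agent would notice the effect of each action after a few tries and head for the target directly, which is exactly the adaptivity the benchmark is meant to expose.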

Measuring efficiency beyond accuracy

While accuracy is a component, ARC-AGI is increasingly focused on measuring the 'efficiency' of intelligence. This encompasses not only the amount of training data required but also the computational resources and energy consumed. Unlike wall-clock time, which can be manipulated by throwing more compute at a problem, the number of data points and energy expenditure offer more fundamental metrics. For ARC-AGI 3, efficiency will be measured by comparing the number of actions an AI takes to solve a game against the average number of actions taken by humans. This approach directly contrasts with older methods, such as those used in the Atari days, which relied on brute-force solutions requiring millions of frames and actions. ARC-AGI 3 will normalize AI performance against human averages, penalizing inefficient, brute-force strategies and rewarding generalized problem-solving.
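The episode does not give an exact formula, but a human-normalized action-efficiency score could look like the following sketch; the function name and the convention that 1.0 means human-level efficiency are assumptions.

```python
def efficiency_score(ai_actions, human_actions_avg):
    """Ratio of the human-average action count to the AI's action
    count: 1.0 means human-level efficiency, and values well below
    1.0 penalize brute-force strategies."""
    return human_actions_avg / ai_actions

# A brute-force agent needing 10x the human action count scores 0.1,
# while a generalizing agent matching the human average scores 1.0.
print(efficiency_score(ai_actions=500, human_actions_avg=50))  # 0.1
print(efficiency_score(ai_actions=50, human_actions_avg=50))   # 1.0
```

Normalizing by the human average is what separates this from raw accuracy: two agents that both solve a game are distinguished by how many interactions each needed.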

The path to AGI: necessary but not sufficient

Achieving a perfect score on ARC-AGI benchmarks, even ARC-AGI 3, is considered a necessary but not sufficient condition for AGI. If a model were to achieve 100% on the ARC-AGI benchmarks tomorrow, it would provide the most authoritative evidence to date of a system's generalization capabilities. However, the ARC Prize Foundation emphasizes that AGI involves potentially more complex and multifaceted abilities. The focus remains on understanding the failure points even in high-scoring systems and on developing benchmarks that can robustly identify true AGI when it emerges. Ultimately, the goal of the ARC Prize Foundation is to be equipped to reliably identify and declare AGI, ensuring that progress is accurately measured and genuinely transformative advancements are recognized.

Comparison of ARC-AGI Benchmark Versions

Data extracted from this episode

| Version   | Release Year     | Type        | Key Feature                                  |
|-----------|------------------|-------------|----------------------------------------------|
| ARC-AGI 1 | 2019             | Static      | 800 tasks created by François Chollet        |
| ARC-AGI 2 | 2025             | Static      | Upgraded version of ARC-AGI 1                |
| ARC-AGI 3 | Next year (2026) | Interactive | 150 gamified environments, no instructions   |

Common Questions

How does the ARC Prize Foundation define intelligence?

The ARC Prize Foundation defines intelligence as the ability to learn new things efficiently. This is a shift from traditional benchmarks that focus on knowledge recall or problem-solving difficulty.

Topics

Mentioned in this video

Software & Apps
GPT-4

The base model of GPT-4, tested on the ARC benchmark in 2024, initially scored very low (4-5%), highlighting how much LLMs struggle to learn tasks they were not trained on.

ARC-AGI 1

The initial version of the ARC-AGI benchmark, released in 2019, proposed by François Chollet and comprising 800 tasks he created.

Opus 4.5

A model released by Anthropic, measured using the ARC-AGI benchmark as part of the trend of major AI labs adopting the benchmark for performance reporting.

ARC-AGI 2

An upgraded version of the ARC-AGI benchmark, released in March 2025, which is a deeper, but still static, benchmark compared to ARC-AGI 1.

ARC-AGI benchmark

A benchmark proposed by François Chollet to test an AI's ability to learn new things, designed such that normal humans can solve the tasks, unlike more complex PhD++ problems.

Gemini 3 Pro

A Google model measured using the ARC-AGI benchmark, indicating the benchmark's widespread use in reporting AI performance.

Grok 4

A model release from xAI that is being measured using the ARC-AGI benchmark, as part of the increasing adoption of the benchmark by major AI labs.

ARC-AGI 3

The upcoming interactive version of the ARC-AGI benchmark, set to be released next year, featuring 150 gamified environments with no instructions provided.

MMLU

A benchmark mentioned as an example of 'PhD++ problems' in AI, which current models are surpassing, in contrast to the ARC benchmarks that normal people can solve.

SAT test

Mentioned as a traditional example of how intelligence is often measured (e.g., scoring high), contrasting with the ARC benchmark's focus on learning ability.
