How Intelligent Is AI, Really?
Key Moments
Current AI benchmarks are easily gamed, but the ARC-AGI test evaluates true learning ability, revealing current models' significant limitations in generalization.
Key Insights
Intelligence is defined by the ability to learn new things efficiently, not just by scoring high on difficult tests.
Humans can score 100% on the ARC-AGI benchmark, while early GPT-4 base models scored only 4-5%.
ARC-AGI 2, released in March 2025, is a deeper version of the original static benchmark.
ARC-AGI 3, launching next year, will be an interactive benchmark using 150 video game environments with no instructions, testing adaptivity.
Future AGI will likely be declared using interactive benchmarks, reflecting real-world action-reaction dynamics.
ARC-AGI aims to measure efficiency not just by accuracy, but by the number of actions (AI vs. human) and energy consumed.
Intelligence redefined: learning over memorization
The traditional view of intelligence often equates it with scoring high on standardized tests or mastering complex tasks. However, François Chollet's proposed definition, which forms the basis of the ARC Prize Foundation's work, redefines intelligence as the "ability to learn new things." This perspective shifts the focus from sheer knowledge acquisition or computational power to the capacity for efficient learning and generalization. While AI excels at specific, well-defined tasks like chess or Go, mastering entirely new skills remains a significant challenge. The ARC benchmark was therefore designed not to test how hard a problem AI can solve, but how quickly and effectively it can learn to solve new, unseen problems, mirroring human learning. This has profound implications for measuring AI progress, moving beyond "PhD++" problems that merely push the boundaries of existing capabilities and toward assessments that reveal true adaptability.
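One informal way to express this definition, loosely adapted from Chollet's "On the Measure of Intelligence" rather than quoted from the episode, is as a ratio of skill gained to the priors and experience spent acquiring it:

```latex
\text{intelligence} \;\propto\; \frac{\text{skill acquired on previously unseen tasks}}{\text{priors} + \text{experience (data, actions, energy)}}
```

On this view, two systems that reach the same skill level are not equally intelligent: the one that got there with less built-in prior knowledge and less experience is the more intelligent of the two.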
ARC-AGI: A benchmark for generalization
The ARC benchmark, initially proposed in 2019, serves as a crucial tool for evaluating an AI's ability to generalize. Unlike many benchmarks that focus on increasingly difficult problems solvable with brute-force computation or massive datasets, ARC problems are designed to be solvable by average humans. This ensures that performance on ARC reflects genuine reasoning and learning rather than a high degree of specialized training. Early results were stark: while humans could achieve near-perfect scores, the GPT-4 base model, which lacked dedicated reasoning capabilities, scored as low as 4-5%. This significant gap highlighted the limitations of pre-training alone and the emerging importance of reasoning paradigms. The subsequent jump to 21% with models like o1-preview demonstrated the impact of architectural advances and reinforced ARC's value in identifying such breakthroughs.
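For concreteness, ARC-AGI 1 tasks are distributed as small JSON files of colored grids. The sketch below mirrors that public format, with a toy transformation (a horizontal mirror) invented purely for illustration:

```python
# A toy task in the ARC-AGI 1 JSON layout: a few "train" demonstration pairs
# plus a held-out "test" pair. Each grid is a 2-D list of color codes 0-9.
# The hidden rule here (mirror each row) is invented for this example.
toy_task = {
    "train": [
        {"input": [[1, 0], [2, 0]], "output": [[0, 1], [0, 2]]},
        {"input": [[3, 0, 0], [0, 4, 0]], "output": [[0, 0, 3], [0, 4, 0]]},
    ],
    "test": [
        {"input": [[5, 0, 0]], "output": [[0, 0, 5]]},
    ],
}

def solve(grid):
    """A candidate program; a solver must induce this from the train pairs alone."""
    return [list(reversed(row)) for row in grid]

# The train pairs are the only specification of the rule a test-taker gets.
assert all(solve(p["input"]) == p["output"] for p in toy_task["train"])
assert solve(toy_task["test"][0]["input"]) == toy_task["test"][0]["output"]
```

What makes the tasks hard for AI is not any single rule but that each task hides a different one, so memorizing transformations seen during training does not transfer.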
Evolution of the ARC benchmark
The ARC benchmark has evolved through several iterations to better capture the nuances of general intelligence. ARC-AGI 1, released in 2019 by François Chollet, featured 800 manually created tasks designed to test learnability. In March 2025, ARC-AGI 2 was introduced as an upgraded, deeper version of the static benchmark, offering more complex challenges within the same framework. The most significant upcoming development is ARC-AGI 3, slated for release next year. This next iteration will introduce interactivity, moving beyond static problem sets to dynamic, game-like environments. This shift is driven by the belief that true intelligence is demonstrated through continuous interaction with an environment, involving taking actions and learning from feedback, mirroring real-world scenarios. This interactive nature is seen as a crucial step towards declaring AGI.
ARC-AGI 3: Interactive and instruction-less environments
ARC-AGI 3 represents a major leap forward by introducing interactive elements and removing explicit instructions. Instead of relying on textual or symbolic guidance, test-takers will be presented with approximately 150 video game-like environments where they must infer the goal through experimentation, taking actions and observing how the system responds. Crucially, as with previous versions, every ARC-AGI 3 game will be tested on a panel of general public members to establish a minimum solvability threshold, and games failing to meet it will be excluded. This keeps the benchmark accessible to ordinary humans while remaining challenging for AI, so that success on ARC-AGI 3 will definitively indicate a level of intelligence that current AI systems lack.
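The episode does not spell out ARC-AGI 3's programmatic interface, so the loop below is a minimal sketch under an assumed, Gym-style reset/step API; the real benchmark may expose something quite different:

```python
# Hypothetical interface for an instruction-less interactive game.
# The Game protocol and the play() loop are assumptions for illustration,
# not the actual ARC-AGI 3 API.
from typing import Any, Protocol

class Game(Protocol):
    def reset(self) -> Any: ...                           # initial observation, no goal text
    def step(self, action: int) -> tuple[Any, bool]: ...  # (next observation, solved flag)

def play(game: Game, agent: Any, max_actions: int = 1000) -> int | None:
    """Run one episode; return the number of actions used, or None if unsolved."""
    obs = game.reset()
    for n in range(1, max_actions + 1):
        action = agent.act(obs)         # the agent experiments: no instructions given
        obs, solved = game.step(action)
        agent.observe(obs)              # learning must come from observed feedback alone
        if solved:
            return n                    # the action count feeds the efficiency score
    return None
```

The key property this sketch tries to capture is that the agent receives no reward shaping or goal description up front; figuring out what counts as "solved" is itself part of the test.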
Measuring efficiency beyond accuracy
While accuracy is one component, ARC-AGI is increasingly focused on measuring the efficiency of intelligence. This encompasses not only the amount of training data required but also the computational resources and energy consumed. Unlike wall-clock time, which can be manipulated by throwing more compute at a problem, data-point counts and energy expenditure are more fundamental metrics. For ARC-AGI 3, efficiency will be measured by comparing the number of actions an AI takes to solve a game against the average number of actions humans take. This directly contrasts with older approaches, such as the deep reinforcement learning agents of the Atari era, which relied on brute-force training over millions of frames and actions. ARC-AGI 3 will normalize AI performance against human averages, penalizing inefficient brute-force strategies and rewarding generalized problem-solving.
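The episode describes the comparison (AI action count versus the human average) but not an exact scoring formula, so the normalization below is an assumption chosen for illustration:

```python
# Assumed normalization: ratio of the human average action count to the
# AI's action count. The real ARC-AGI 3 scoring rule may differ.
def action_efficiency(ai_actions: int, human_avg_actions: float) -> float:
    """Scores above 1.0 mean the solver used fewer actions than the average human."""
    return human_avg_actions / ai_actions

print(action_efficiency(ai_actions=400, human_avg_actions=80.0))  # 0.2   brute force is penalized
print(action_efficiency(ai_actions=60, human_avg_actions=80.0))   # ~1.33 beats the human average
```

Under any normalization of this shape, an Atari-style agent that needs millions of exploratory actions scores near zero even if it eventually solves the game, which is exactly the behavior the benchmark is designed to penalize.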
The path to AGI: necessary but not sufficient
Achieving a perfect score on ARC-AGI benchmarks, even ARC-AGI 3, is considered a necessary but not sufficient condition for AGI. If a model were to achieve 100% on the ARC-AGI benchmarks tomorrow, it would provide the most authoritative evidence to date of a system's generalization capabilities. However, the ARC Prize Foundation emphasizes that AGI involves potentially more complex and multifaceted abilities. The focus remains on understanding the failure points even in high-scoring systems and on developing benchmarks that can robustly identify true AGI when it emerges. Ultimately, the goal of the ARC Prize Foundation is to be equipped to reliably identify and declare AGI, ensuring that progress is accurately measured and genuinely transformative advancements are recognized.
Comparison of ARC-AGI Benchmark Versions
Data extracted from this episode

| Version | Release | Type | Key Feature |
|---|---|---|---|
| ARC-AGI 1 | 2019 | Static | 800 tasks created by François Chollet |
| ARC-AGI 2 | March 2025 | Static | Deeper, upgraded version of ARC-AGI 1 |
| ARC-AGI 3 | 2026 (planned) | Interactive | 150 gamified environments, no instructions |
Common Questions
How does the ARC Prize Foundation define intelligence?
The ARC Prize Foundation defines intelligence as the ability to learn new things efficiently. This is a shift from traditional benchmarks that focus on knowledge recall or problem-solving difficulty.
Mentioned in this video
The base model of GPT-4, tested on the ARC benchmark in 2024, initially scored very low (4-5%), highlighting the challenge LLMs face in learning new tasks they were not specifically pre-trained on.
The initial version of the ARC-AGI benchmark, released in 2019, proposed by François Chollet and comprising 800 tasks he created by hand.
A model released by Anthropic, measured using the ARC-AGI benchmark as part of the trend of major AI labs adopting the benchmark for performance reporting.
An upgraded version of the ARC-AGI benchmark, released in March 2025, which is a deeper, but still static, benchmark compared to ARC-AGI 1.
A benchmark proposed by François Chollet to test an AI's ability to learn new things, designed such that normal humans can solve the tasks, unlike more complex PhD++ problems.
A Google Gemini model measured using the ARC-AGI benchmark, indicating the benchmark's widespread use in reporting AI performance.
A model release from xAI that is being measured using the ARC-AGI benchmark, as part of the increasing adoption of the benchmark by major AI labs.
The upcoming interactive version of the ARC-AGI benchmark, set to be released next year, featuring 150 gamified environments with no instructions provided.
A benchmark mentioned as an example of 'PhD++ problems' in AI, which current models are surpassing, in contrast to the ARC benchmarks that normal people can solve.
Mentioned as a traditional example of how intelligence is often measured (e.g., by scoring high on a test), contrasting with the ARC benchmark's focus on learning ability.
Environments for RL are discussed as a common approach in AI development, but the speaker cautions against them as a sole measure of progress, likening it to 'whack-a-mole' and emphasizing the need for generalization without predefined environments.
Defined as the ability to learn new things more efficiently, this is the core concept behind the ARC benchmark, contrasting with traditional measures of knowledge recall.
A leading AI research lab that uses the ARC-AGI benchmark for reporting model performance, signifying its adoption as a standard for evaluating AI capabilities.
Associated with Gemini 3 Pro in the context of AI model releases being measured by the ARC-AGI benchmark.
A company that released Opus 4.5, which is being evaluated using the ARC-AGI benchmark, showcasing the benchmark's increasing importance in the AI field.
A prominent AI lab that, along with OpenAI and others, now uses the ARC-AGI benchmark for reporting model performance, further evidence of its adoption as a standard for evaluating AI capabilities.