⚡️ARC-AGI-3: The Interactive Reasoning Benchmark

Latent Space Podcast · Science & Technology · 3 min read · 40 min video
Jul 18, 2025
TL;DR

ARC-AGI-3 benchmark to feature 100 interactive games testing AI skill acquisition efficiency and generalization.

Key Insights

1. ARC Prize Foundation aims to be the "north star" for AGI research through benchmarks.

2. Intelligence is defined as skill acquisition efficiency, measured by energy and data input relative to skill output.

3. ARC-AGI-3 shifts from static benchmarks to interactive, game-based environments to test real-time planning and generalization.

4. The new benchmark will feature 100 novel, simple 2D games designed to be easy for humans but challenging for AI.

5. ARC-AGI-3 introduces 'action efficiency' as a new metric, measuring the number of actions needed to achieve goals.

6. The foundation is running a $10,000 agent competition for ARC-AGI-3 to encourage community participation and innovation.

THE MISSION OF ARC PRIZE FOUNDATION

The ARC Prize Foundation is a nonprofit dedicated to guiding Artificial General Intelligence (AGI) research towards a clear goal. Its primary method for incentivizing progress is developing and deploying sophisticated benchmarks, which act as tangible targets directing the efforts of the AI research community. The foundation builds on the work of François Chollet, who in 2019 conceptualized intelligence not merely as skill mastery but as the efficiency of skill acquisition. This philosophy forms the bedrock of its approach to measuring AI's true potential.

REDEFINING INTELLIGENCE: SKILL ACQUISITION EFFICIENCY

François Chollet's definition of intelligence centers on an agent's ability to learn new things, particularly on unseen tasks, rather than on excelling at pre-defined ones like chess or Go. He terms this 'skill acquisition efficiency.' The metric accounts for the resources required to learn: the energy consumed and the amount of training data needed. By using humans as the benchmark for general intelligence, ARC Prize emphasizes that true intelligence is measured against our own biological efficiency; humans require far less data and energy than current AI models, highlighting a crucial gap.
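This definition can be made concrete with a toy calculation. The function below is an illustrative sketch, not Chollet's formal measure (his 2019 paper defines intelligence information-theoretically over priors and experience); it simply treats efficiency as skill achieved per unit of energy and data consumed, with made-up weights:

```python
def skill_acquisition_efficiency(skill_score: float,
                                 energy_joules: float,
                                 training_samples: int,
                                 energy_weight: float = 1.0,
                                 data_weight: float = 1.0) -> float:
    """Toy efficiency metric: skill gained per unit of resources consumed.

    Illustrative only -- not Chollet's formal definition. It divides
    achieved skill by a weighted sum of energy and data costs.
    """
    cost = energy_weight * energy_joules + data_weight * training_samples
    return skill_score / cost if cost > 0 else float("inf")

# A learner reaching the same skill with less data scores higher:
frugal = skill_acquisition_efficiency(0.9, energy_joules=100, training_samples=10)
greedy = skill_acquisition_efficiency(0.9, energy_joules=100, training_samples=10_000)
assert frugal > greedy
```

The point of the comparison: current AI models sit on the `greedy` end of this ratio, while humans sit on the `frugal` end.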

TRANSITIONING TO INTERACTIVE BENCHMARKS: ARC-AGI-3

ARC-AGI-3 marks a significant evolution from static benchmarks to interactive environments. The core of the new benchmark is a collection of 100 novel, relatively simple 2D games, intentionally designed to be intuitive for humans but substantially challenging for AI, requiring exploration, planning, and an understanding of dynamic rules. The hypothesis is that AGI will first be demonstrated on an interactive benchmark, because such environments demand long-horizon planning and intuition about the environment that static tests cannot capture.

GAME DESIGN AND MECHANICS FOR ARC-AGI-3

The games in ARC-AGI-3 are not arbitrary; each level is engineered to introduce a new game mechanic, testing an AI's on-the-fly learning. A prime example, 'Locksmith,' demonstrates the need for exploration, resource management (such as 'life'), and multi-step problem-solving involving matching, rotation, and color-swapping. The AI receives a grid of numbers representing the game state and must output discrete actions. The developers are agnostic about how an agent processes this data, whether visually or through other modalities, focusing solely on the efficiency and success of the learned behavior.
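The interface described here (a grid of integers in, a discrete action out) can be sketched as a minimal agent loop. The environment API, action count, and return values below are assumptions for illustration; the real ARC-AGI-3 interface may differ:

```python
import random
from typing import Protocol

Grid = list[list[int]]  # a game state: 2D grid of small integers

class Agent(Protocol):
    def act(self, state: Grid, score: float) -> int:
        """Return one of a small set of discrete action ids."""
        ...

class RandomAgent:
    """Baseline that ignores the state entirely; a real entry must beat this."""
    def __init__(self, n_actions: int = 6, seed: int = 0):
        self.n_actions = n_actions
        self.rng = random.Random(seed)

    def act(self, state: Grid, score: float) -> int:
        return self.rng.randrange(self.n_actions)

def run_episode(env, agent, max_steps: int = 1000) -> tuple[float, int]:
    """Generic loop: observe the grid, emit an action, repeat until done.

    Returns the final score and the number of actions taken; the latter
    is what an action-efficiency metric would consume.
    """
    state = env.reset()
    score, done, steps = 0.0, False, 0
    while not done and steps < max_steps:
        action = agent.act(state, score)
        state, score, done = env.step(action)
        steps += 1
    return score, steps
```

The `Protocol` keeps the loop agnostic to how an agent interprets the grid (as pixels, text, or raw numbers), mirroring the developers' stated neutrality about modality.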

NEW METRICS AND COMMUNITY INVOLVEMENT

Beyond traditional metrics like cost and training data, ARC-AGI-3 introduces 'action efficiency' as a crucial new measure: how many actions an agent takes to achieve a goal, differentiating efficient learners from brute-force or random approaches. To foster innovation and gather diverse solutions, ARC Prize is launching a $10,000 agent competition. Participants may build agents using any method (RL, LLMs, etc.). Performance will be evaluated on generalization, with a higher weighting on private test sets to prevent overfitting, and top-performing agents will be highlighted across the foundation's social channels.

BROADER IMPLICATIONS AND FUTURE ROADMAP

The ARC Prize benchmark is seen as a tool to accelerate research, not merely something to be 'beaten.' Insights gained from agents that perform well on ARC-AGI can be applied to other domains. While ARC-AGI-3 focuses on single-agent interactions and scoped environments, future iterations (ARC-AGI-4 and beyond) aim to incorporate more complex elements, potentially including cooperative tasks and dimensions beyond 2D grids, moving closer to simulating reality. The ultimate goal is to identify when artificial machines can match human learning efficiency and generalization, marking the arrival of AGI.

ARC-AGI-3: Key Design Principles

Practical takeaways from this episode

Do This

Focus on human-doable yet AI-challenging problems.
Incorporate interactive benchmarks for AGI assessment.
Design games and environments that test generalization and skill acquisition efficiency.
Measure learning efficiency using metrics like actions per level completed.
Encourage exploration, long-term planning, and understanding of environment rules.
Consider incorporating cooperation and alignment mechanics in games.
Incentivize agent development through competitions and research.
Emphasize the role of humans as the benchmark for general intelligence.
Keep benchmarks relevant for several years (e.g., ARC-AGI-3's estimated durability is about 3 years).

Avoid This

Don't create benchmarks that are PhD++ level (unnecessarily complex or niche).
Don't rely solely on static benchmarks for AGI declaration.
Don't inject human intelligence directly into AI training environments.
Don't assume AI will benefit from visual input if it hasn't been proven in specific contexts.
Don't over-incentivize AI to reverse-engineer the game creation process itself.
Don't define AGI solely by profit or financial metrics.
Don't underestimate the importance of interaction and exploration in AGI testing.

Common Questions

The ARC Prize Foundation aims to be a "north star" for Artificial General Intelligence (AGI) research. It builds benchmarks, like ARC-AGI-3, that measure intelligence by focusing on skill acquisition efficiency, energy input, and the training data needed, using humans as the primary benchmark.
