Did you miss these 2 AI stories? A *Real* LLM-crafted Breakthrough + Continual Learning Blocked?
Key Moments
Biology-driven LLMs spark breakthroughs; AGI debate and continual-learning limits loom.
Key Insights
A biology-focused LLM (C2S-Scale) generated a novel, testable cancer-drug hypothesis that was validated in vitro, illustrating that AI can meaningfully contribute to scientific discovery.
Frontier models compete on benchmarks: Gemini 3 is anticipated soon; Gemini 2.5 DeepThink leads on FrontierMath; Claude Code and Codex are strong in coding tasks, with real-world caveats such as occasional errors.
Memory and continual learning remain fundamental limits: models forget between sessions, forcing costly context workarounds; online RL approaches carry safety risks without robust safeguards.
A formal AGI definition grounded in the Cattell-Horn-Carroll (CHC) framework of cognitive capacity is proposed, breaking cognition into ten equally weighted factors, but it is not yet a universal or conclusive benchmark.
Sora 2 demonstrates cross-modal capabilities by answering benchmark questions as video outputs, highlighting progress in video-generation models that reason on the fly, though still not at the level of specialized models.
Industry dynamics and funding pressures persist: compute is often diverted toward monetizable features even as researchers hope for deeper frontier gains; sponsorships such as AssemblyAI's are noted along the way.
BIOLOGICAL LANGUAGE MODEL BREAKTHROUGH
A novel biology-focused language model, C2S-Scale, demonstrates that LLMs can learn to read biology like text and generate testable hypotheses. Built on the Gemma lineage (the Gemma 2 architecture, with Gemma 3 released and Gemma 4 in the pipeline), it uses reinforcement-learning rewards to predict how cells will respond to interferon and other drugs. By converting each cell's gene activity into a short sentence, the model reads and reasons about biology in a way that led to a new drug candidate not previously documented in the literature. Importantly, the candidate, silmitasertib, showed in vitro activity on human cells, marking a notable, testable AI-driven step toward drug discovery. The authors emphasize that this is a blueprint for a new discovery paradigm, not a completed clinical path. While promising, the work remains far from human trials, and the broader implication is a future where AI accelerates biology alongside traditional experimentation.
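The "cell as a sentence" idea above can be sketched in a few lines: rank a cell's genes from most to least expressed and emit the top names as text that a language model can read like any other sentence. This is a minimal illustration of the concept, not the paper's pipeline; the gene names and expression counts below are illustrative placeholders.

```python
# Hypothetical sketch of the cell-to-sentence conversion described above:
# a cell's gene-expression profile becomes a short text "sentence" by
# ranking genes from most to least expressed, so an LLM can treat
# biology as ordinary token sequences.

def cell_to_sentence(expression: dict[str, float], top_k: int = 5) -> str:
    """Rank genes by expression level and join the top-k names into a sentence."""
    ranked = sorted(expression, key=expression.get, reverse=True)
    return " ".join(ranked[:top_k])

# Illustrative expression counts for one cell (placeholder values).
cell = {"CD74": 812.0, "B2M": 640.5, "HLA-A": 455.0, "ACTB": 390.2,
        "GAPDH": 120.7, "IFI6": 75.3}
print(cell_to_sentence(cell))  # most-expressed genes appear first
```

Once every cell is a sentence, downstream questions like "how does this cell change under interferon?" become text-in, text-out prediction tasks, which is what makes reinforcement-learning rewards applicable.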
BENCHMARKS, GEMINI, AND CODEX WARS
The video shifts to frontier-performance dynamics, noting Gemini 3 from Google DeepMind is expected within two months, with Gemini 2.5 DeepThink pushing FrontierMath benchmarks. The host reports on personal testing, including SimpleBench and Codex comparisons, and observes GPT-5 Pro performing competitively against Gemini 2.5 Pro. Claude Code is discussed, including anecdotes about occasional missteps (e.g., accidental deletion of code) that underscore the ongoing need for robust evaluation. The narrative frames progress as a competition among major labs, with both software and hardware constraints shaping outcomes, and highlights that code-focused models like Codex remain a strong pillar, while API access limits restrict some comparisons (e.g., DeepThink usage).
CONTINUAL LEARNING: MEMORY LIMITS AND COSTS
A central theme is the tension between context awareness and continual learning. Models today can remember within a conversation but lack true long-term memory across sessions, forcing costly retraining or context expansion. OpenAI researcher Jerry Tworek discusses online reinforcement learning in principle, noting the risks of training models through user interactions without safeguards. He emphasizes that naive online learning could embed harmful behaviors or misalignment, hence the need for strong safeguards before any large-scale online adaptation. The section culminates with a nod to Sora 2's capabilities, foreshadowing how multimodal models might someday address memory and learning more effectively.
A NEW AGI DEFINITION: COGNITIVE CAPACITY FRAMEWORK
The video surveys a paper that advocates a formal AGI definition grounded in the Cattell-Horn-Carroll (CHC) framework, described as the most empirically validated model of human cognition. The authors distill cognition into ten factors, each weighted at 10% toward an overall 100-point AGI score. Areas include general knowledge, reading, mathematics, and on-the-spot reasoning, while long-term memory storage and retrieval receive particular emphasis because current models struggle with continual learning. The proposed scores (GPT-4 around 27%, GPT-5 around 58%) are presented to illustrate progress, not to declare AGI achieved. Physical dexterity is explicitly excluded, underscoring the theoretical nature of the measure.
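The equal-weighting scheme above reduces to simple arithmetic: with ten factors at 10% each, the overall score is just the mean of the ten subscores. The sketch below uses paraphrased factor names and illustrative per-factor values, not the paper's actual measurements; it shows how a model strong on text but lacking long-term memory is capped well below 100.

```python
# Back-of-envelope sketch of the scoring scheme described above: ten
# cognitive factors, each weighted equally at 10%, averaged into a
# 0-100 score. Factor names are paraphrased; values are placeholders.

FACTORS = ["general knowledge", "reading", "writing", "mathematics",
           "on-the-spot reasoning", "working memory",
           "long-term memory storage", "long-term memory retrieval",
           "visual processing", "auditory processing"]

def agi_score(per_factor: dict[str, float]) -> float:
    """Equal 10% weighting means the overall score is the mean of 10 factors."""
    assert set(per_factor) == set(FACTORS), "all ten factors must be scored"
    return sum(per_factor.values()) / len(FACTORS)

# A model scoring 80 on every text-centric factor but 0 on long-term memory:
example = {f: 80.0 for f in FACTORS}
example["long-term memory storage"] = 0.0
example["long-term memory retrieval"] = 0.0
print(agi_score(example))  # prints: 64.0
```

This is why the paper's emphasis falls on memory: two zeroed factors alone subtract 20 points no matter how strong the rest of the profile is.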
SORA 2: VIDEO-BASED BENCHMARK CAPABILITIES
A striking note is the claim that Sora 2 can answer benchmark-style questions in a video format, effectively turning reasoning into a generated video that scores highly on certain tasks. While not yet superior to specialized, focused models, this demonstrates evolving cross-modal capabilities where reasoning is expressed visually. The host uses this as evidence of the physics-like calculations these video generators perform on the fly. The broader implication is that multimodal models might begin to approximate reasoning in ways that extend beyond text, offering new tools for education, demonstration, and assessment.
INVESTMENTS, SPONSORSHIPS, AND THE PATH AHEAD
Toward the end, the narrative returns to the practicalities of AI development: compute budgets are often redirected toward monetizable features like browsers and video content, even as real frontier progress continues in the background. A long-standing sponsor, AssemblyAI, is highlighted for its Universal-Streaming speech-to-text tool, with rapid improvements in transcription and recognition cited as a proxy for real-world utility. The speaker references Google DeepMind's quantum-related Nature publication as part of the broader trajectory toward drug discovery and future applications. The closing sentiment is aspirational: a ramp back toward frontier intelligence and a future shaped by ongoing innovation.
Common Questions
What is C2S-Scale and what did it achieve?
C2S-Scale is a language-model system that translates each cell's gene activity into text and can predict how cells will respond to a drug. It produced a novel, testable drug hypothesis for cancer treatment that wasn't previously described in the literature, with in vitro validation reported in the study.
Topics
Mentioned in this video
Google's Gemma 2 architecture, the open-weight model that C2S-Scale builds upon.
Gemma 3 has been released since Gemma 2; part of the same family of models.
Gemma 4 is due any time, continuing the Gemma lineage.
A competing AI model evaluated in SimpleBench comparisons.
A DeepThink variant of Gemini 2.5 highlighted for FrontierMath performance.
Video-based AI that can answer benchmark-style questions and present responses as video.
More from AI Explained
22 min · What the New ChatGPT 5.4 Means for the World
14 min · Deadline Day for Autonomous AI Weapons & Mass Surveillance
19 min · Gemini 3.1 Pro and the Downfall of Benchmarks: Welcome to the Vibe Era of AI
20 min · The Two Best AI Models/Enemies Just Got Released Simultaneously