AI News: The Biggest Leap We've Seen This Year!

Matt Wolfe
Science & Technology | 7 min read | 43 min video
Apr 24, 2026|106,877 views|3,518|249
TL;DR

OpenAI's GPT-5.5 scores 82.7% on Terminal Bench, narrowly beating the 82% of Anthropic's unreleased 'too scary' Mythos model, but its API costs have doubled, and mainstream users may not notice a difference.

Key Insights

1. GPT-5.5 scores 82.7% on Terminal Bench, surpassing Mythos (82%) and GPT-5.4 (75%), making it better at running terminal commands than the model Anthropic refused to release.

2. GPT-5.5's API pricing has doubled to $5 per 1 million input tokens and $30 per 1 million output tokens, compared to GPT-5.4's $2.50 and $15 respectively.

3. GPT Image 2.0 has become the top-ranked image model on LM Arena with a score of 1500, a significant jump from Nano Banana's 1271, demonstrating improved performance in blind taste tests.

4. Claude Design can now create animations. Examples shown include basic animations of Las Vegas highlights, convention center scenes, and bar graphs, which used to take hours in After Effects but can now be generated with a few prompts.

5. Google DeepMind's Deep Research Max model is presented as the state of the art for autonomous research tasks, outperforming existing models on research-specific benchmarks.

6. Four different robots completed a half marathon in China in under an hour, with one robot's speed recorded as faster than any human marathon runner.

GPT-5.5 demonstrates significant gains in coding and reasoning

OpenAI has released GPT-5.5, a new model accessible to premium ChatGPT and Codex users, which excels at understanding user intent with less context and performs tasks like coding, research, and data analysis more efficiently. A key improvement is its token efficiency, using significantly fewer tokens for the same tasks, though this is juxtaposed with a doubling of API pricing: $5 per 1 million input tokens and $30 per 1 million output tokens, compared to GPT-5.4's $2.50 and $15. Benchmarks show GPT-5.5 scoring 82.7% on Terminal Bench, outperforming GPT-5.4 (75%) and even Anthropic's unreleased 'Mythos' model (82%). It also scored 78.7% in operating system tasks and performed well in math and science. While many everyday users might not notice a drastic change in conversational AI, its enhanced ability to handle vague prompts and infer user needs is a significant development. For instance, when asked for a 'plan to be healthier' with minimal context, GPT-5.5 provided a highly personalized plan based on past interactions, unlike the generic response from GPT-5.4. This improved context awareness extends to coding tasks, where GPT-5.5 generated a more robust and interactive website describing its capabilities compared to GPT-5.4's less polished output. The model's improved 'doing more with less' capability means it can deliver better results with simpler prompts, and even more impressive results with detailed ones. This leap in capability, especially in understanding and executing complex tasks from minimal input, signals a shift towards more intuitive and powerful AI assistants.
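The trade-off between fewer tokens and higher per-token prices can be checked with simple arithmetic. The sketch below uses only the prices quoted above; the workload size (200,000 input tokens and 50,000 output tokens) and the function name `request_cost` are hypothetical, chosen purely for illustration, and the break-even figure assumes input and output usage shrink by the same factor.

```python
# Prices in USD per 1 million tokens, as quoted in the summary above.
PRICES_PER_MILLION = {
    "gpt-5.4": {"input": 2.50, "output": 15.00},
    "gpt-5.5": {"input": 5.00, "output": 30.00},
}


def request_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Return the USD cost of one workload at the quoted per-million rates."""
    rates = PRICES_PER_MILLION[model]
    total = input_tokens * rates["input"] + output_tokens * rates["output"]
    return total / 1_000_000


# Hypothetical workload sizes, chosen only for illustration.
INPUT_TOKENS = 200_000
OUTPUT_TOKENS = 50_000

old_cost = request_cost("gpt-5.4", INPUT_TOKENS, OUTPUT_TOKENS)
same_tokens_cost = request_cost("gpt-5.5", INPUT_TOKENS, OUTPUT_TOKENS)

# Fraction of the old token count at which GPT-5.5 costs the same as GPT-5.4,
# assuming input and output usage shrink by the same factor.
break_even_fraction = old_cost / same_tokens_cost

print(f"GPT-5.4 cost:              ${old_cost:.2f}")
print(f"GPT-5.5 cost, same tokens: ${same_tokens_cost:.2f}")
print(f"Break-even token fraction: {break_even_fraction:.0%}")
```

Because both the input and output prices doubled, the bill stays flat only if GPT-5.5 finishes the same job with about half the tokens; any smaller saving means a higher cost per task.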

GPT Image 2.0 redefines AI image generation

OpenAI's new image model, GPT Image 2.0, is making waves, with LM Arena rankings showing it far surpassing previous leaders like Nano Banana (Gemini 3.1 Flash Image). GPT Image 2.0 achieved a score of 1500, a substantial leap from Nano Banana's 1271, indicating superior performance in blind taste tests. This new model boasts enhanced capabilities, including the accurate rendering of dense text within images, a significant improvement over prior iterations. It feels less 'AI-generated' and shows accuracy across languages, utilizing world knowledge to fill gaps and even search the web for real-time information to inform image creation. Examples highlight its ability to create complex collages, generate realistic magazine pages with dense text, and produce highly detailed infographics. Demonstrations include a 360-degree equirectangular image featuring prominent tech figures, a magazine page for 'Echoes' with realistic imagery, and an impressive comic book page. A particularly notable feat showcased by Riley Brown on X involved generating book covers ('Good to Great,' 'The Intelligent Investor') with scannable barcodes that accurately linked to the respective books, even when the numbers were obscured, proving the model's advanced understanding of real-world elements. While some comparisons suggest Nano Banana Pro might still edge out GPT Image 2.0 in certain aspects of realism, the overall advancements in text rendering, detail, and context-aware generation mark a significant step forward.
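To give the 1500 versus 1271 gap a rough intuition, the sketch below converts a score difference into an expected head-to-head preference rate. It assumes the arena scores behave like Elo ratings on the conventional 400-point, base-10 scale; that scale is an assumption here, not something stated in the video, and the function name `expected_win_rate` is hypothetical.

```python
def expected_win_rate(score_a: float, score_b: float, scale: float = 400.0) -> float:
    """Expected share of blind votes won by model A over model B.

    Uses the standard Elo logistic formula; treating arena scores as
    Elo-style ratings is an assumption made for illustration.
    """
    return 1.0 / (1.0 + 10 ** ((score_b - score_a) / scale))


# Scores quoted in the summary above.
GPT_IMAGE_2 = 1500
NANO_BANANA = 1271

rate = expected_win_rate(GPT_IMAGE_2, NANO_BANANA)
print(f"Expected preference rate for GPT Image 2.0: {rate:.1%}")
```

Under that assumption, a 229-point lead corresponds to GPT Image 2.0 being preferred in roughly four out of five blind comparisons.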

Claude Design offers new visual collaboration tools

Anthropic has launched Claude Design, a feature enabling users to collaborate with Claude to produce visual content such as designs, prototypes, and presentations. Available to Claude Pro, Max, Team, and Enterprise users, it leverages the Opus 4.7 vision model and integrates directly into the Claude interface. While examples include realistic prototypes, wireframes, and pitch decks, the platform shows particular promise in generating animations, a feature not heavily emphasized in its initial announcement. The speaker showcased how Claude Design could reimagine the 'Future Tools' website, creating an animated and interactive redesign. The aesthetic is consistent across uses and some designs feel slightly busy, but the ability to generate After Effects-style animations from simple prompts is a major highlight. Examples include animated Las Vegas maps, convention center scenes with event titles, and dynamic bar graphs showing yearly AI mentions at NAB 2026. These animations, which previously might have taken hours in After Effects, can now be generated in minutes, offering a powerful tool for content creators. Another Anthropic release, 'live artifacts' in Cowork, allows for the creation of dynamic dashboards and trackers connected to apps and files, promising to refresh with current data upon opening, though this feature requires more extensive testing.

New AI models and developer tools emerge

This week saw the release of several new large language models and developer tools. Google DeepMind introduced Deep Research Max, an autonomous research agent positioned as state of the art for research tasks. Alibaba launched Qwen 3.6 Max Preview, a proprietary model with enhanced agentic coding and instruction following, alongside the open-source Qwen 3.6 27B, which claims outstanding agentic coding capabilities and strong reasoning. Kimi K2.6, another open-source coding model, supports agent swarms and proactive agents, demonstrating competitive performance against models like Opus 4.6 and GPT-5.4 in certain benchmarks. OpenAI also released an open-weight model, OpenAI Privacy Filter, designed for masking personally identifiable information (PII) locally and efficiently, and ChatGPT for Clinicians, a free tool for verified US clinicians to assist with documentation and research. Anthropic expanded Claude's connectivity with new integrations for everyday apps like Instacart and Audible, and Microsoft continued to enhance Copilot's multi-step action capabilities in Word, Excel, and PowerPoint. X introduced custom timelines powered by Grok for personalized content feeds, while HeyGen's HyperFrames feature allows the creation of MP4 animations using Claude Code. Ideogram introduced custom model training, enabling users to create models in their specific art style.

Warp enhances its terminal-based development environment

Warp, a terminal emulator, has introduced significant updates aimed at developers using AI agents. The platform now boasts universal agent support, allowing users to run various agents like Claude Code and Codex within a single environment without altering their workflow. Warp transforms the terminal into an agentic development hub, enabling side-by-side monitoring of multiple agents—for example, one writing code while another debugs. New features include a code review loop directly within the terminal, where agents can instantly update code based on inline comments, and a unified notification system that alerts users only when their attention is required, reducing the need for constant monitoring. These updates are designed to streamline the development process and make managing AI agents more efficient.

Controversy and insights surrounding Anthropic's Mythos model

The highly anticipated, yet unreleased, AI model Mythos from Anthropic has been at the center of controversy. Despite Anthropic's decision not to release it due to its perceived power and potential risks, unauthorized users reportedly gained access. This situation has drawn commentary, including from Sam Altman, who likened Anthropic's marketing of Mythos to selling "bomb shelters" for millions, suggesting the 'too scary to release' narrative could be a marketing tactic. While Anthropic claims no evidence suggests the unauthorized access has impacted its systems, the incident highlights the challenges of controlling powerful AI models and the public's intense curiosity regarding advanced AI capabilities.

Robotics advance with marathon completion

In a display of robotic prowess, four different robots successfully completed a half marathon in China in under an hour. One robot was recorded running faster than any human marathoner. Videos showcasing these robots revealed a variety of designs, including a bipedal robot and one resembling a plush toy. While some robots navigated the course smoothly, others encountered issues, such as one failing to clear an obstacle and another moving in the wrong direction. The event highlights the growing capabilities of humanoid and other advanced robots in complex physical tasks.

Common Questions

GPT-5.5 offers improved understanding with less context, handles more work independently, excels in tasks like coding and research, and is more efficient due to using fewer tokens. However, its pricing has doubled compared to GPT-5.4.

Mentioned in this video

Software & Apps
GPT-5.5

A new AI model from OpenAI that understands prompts with less context, excels at various tasks, and is more efficient and capable than its predecessor.

ChatGPT

Platform where GPT-5.5 is available for Plus, Pro, Business, and Enterprise users, and where the speaker tested its personalized health plan capabilities.

Codex

A platform where GPT-5.5 is available and its features for coding tasks are highlighted.

GPT-5.4

The previous generation model, used as a comparison point for GPT-5.5's pricing, performance, and website generation capabilities.

Claude Opus

Competitor AI model mentioned in benchmarks, with Claude Opus 4.7 scoring lower than GPT-5.5 on Terminal Bench and SWE-Bench Pro.

Mythos

An AI model by Anthropic that was deemed too scary to release, but GPT-5.5 reportedly performs better than it on Terminal Bench. Later mentioned as having been accessed by unauthorized users.

Excel

Mentioned as a tool where GPT-5.5 can perform financial modeling and as a Microsoft application where Copilot has gained agentic capabilities.

Gemini

A Google model that was part of a three-way tie for the leader on the Artificial Analysis Intelligence Index before GPT-5.5.

Claude 4.7 Opus

Mentioned as a competitor model that previously tied for leadership on the Artificial Analysis Intelligence Index and scored lower than GPT-5.5 on benchmarks.

Warp

A company that released new features for its terminal, including universal agent support and a code review loop, aimed at improving developer workflows.

Claude Code

Mentioned as an agent that can be run within Warp's environment and used in conjunction with HeyGen's HyperFrames for animation creation.

Open Code

An agent that can be run within Warp's environment, alongside Claude Code and Codex.

ChatGPT Images 2.0

OpenAI's latest image generation model, which is a significant improvement over previous versions and reportedly better than Nano Banana.

Nano Banana

A model (also known as Gemini 3.1 Flash Image) previously dominating image generation rankings, now surpassed by ChatGPT Images 2.0.

Gemini 3.1 Flash Image

Another name for the Nano Banana model, which was previously a top-ranked image generation model.

LM Arena

A platform used for comparing image models through blind taste tests, which informs the rankings of models like Nano Banana and ChatGPT Images 2.0.

Slack

Mentioned as having a logo that appeared with incorrect coloration in an AI-generated image of a Mac OS X desktop.

Claude

Anthropic's AI model, discussed for its design capabilities (Claude Design), connectors, and integration with Microsoft Word.

Claude Design

A feature by Anthropic that allows collaboration with Claude to create visual work, including animations.

Opus 4.7

Mentioned as the vision model used by Claude Design, and also as a competitor to newer models.

Future Tools

The speaker's website, which Claude Design was used to redesign, showcasing its capabilities in creating interactive websites and animations.

Qwen 3.5 397B A17B

An older model from Alibaba that is surpassed by the new Qwen 3.6 27B model in reasoning and coding tasks.

Microsoft Word

A platform where Claude is now available for Pro or Max plan users, and where Microsoft Copilot has enhanced agentic capabilities.

Qwen 3.6 Plus

The previous version of Alibaba's Qwen model, used as a comparison point for Qwen 3.6 Max Preview's improved capabilities.

Google Drive

An example of a service that could be connected to Anthropic's 'live artifacts' feature to create dynamic dashboards.

Microsoft Copilot

Enhanced with agentic capabilities in Word, Excel, and PowerPoint, allowing multi-step app-native actions.

PowerPoint

Microsoft application where Copilot can perform multi-step actions and generate new content.

Grok

An AI technology powering X's new custom timelines feature, which personalizes content based on user interests.

HeyGen

A company that released the HyperFrames feature, enabling animation creation using Claude Code.

AllTrails

One of the everyday life connectors available for Claude, allowing interaction with the app through Claude.

Intuit TurboTax

One of the everyday life connectors available for Claude, allowing interaction with the app through Claude.

HyperFrames

A feature from HeyGen that uses Claude Code to create animations, offering a simpler alternative to After Effects for basic animations.

Google Calendar

An example of a service that could be connected to Anthropic's 'live artifacts' feature to create dynamic dashboards.

Ideogram

An image generation tool that now allows users to train custom models on their own images to guide the art direction of new generations.

OpenAI Privacy Filter

A state-of-the-art, open-weight model for masking personally identifiable information (PII) that can be run locally.

Qwen 3.6 Max Preview

A proprietary model from Alibaba with enhanced agentic coding, world knowledge, and reliability.

Opus 4.6

A previous version of Anthropic's model, benchmarked against Kimi K2.6, showing that the open-source model can outperform it.

Deep Research Max

A new autonomous research agent model from Google DeepMind, described as state-of-the-art for research tasks.

Gmail

An example of a service that could be connected to Anthropic's 'live artifacts' feature to create dynamic dashboards.

Qwen 3.6 27B

An open-source model from Alibaba that excels at agentic coding and reasoning, surpassing older models.

Kimi K2.6

An open-source coding model that performs well in long-horizon coding and agent swarms, and even outperforms some state-of-the-art models on benchmarks.

GPT-5.4 extra high

A previous version of OpenAI's GPT model, benchmarked against Kimi K2.6, showing that the open-source model can outperform it.

ChatGPT for clinicians

A free version of ChatGPT offered by OpenAI for verified clinicians in the US to assist with clinical tasks.
