AI Dev 26 x SF | Paige Bailey: What's New and What's Next in AI

DeepLearning.AI
Education | 6 min read | 43 min video
May 20, 2026|255 views|11

TL;DR

Google launched a suite of powerful AI models, including multimodal Gemini and open-source Gemma, enabling complex tasks like video analysis and app generation with surprising cost-effectiveness.

Key Insights

1. Gemini 3.1 Flash-Lite can analyze 5 minutes of YouTube video (frames and audio) for approximately $0.015.

2. The Gemma 4 open model family includes a 2-billion-parameter version small enough to run on mobile devices.

3. Building a brand identity generator app by voice in AI Studio, including logo, brand voice, and design, takes minutes.

4. Veo 3.1 Lite can generate realistic video with audio at a 'fraction of the cost' of previous models.

5. A DIY commercial recreating a Chick-fil-A ad using Veo 3.1 cost less than a dollar and took a couple of minutes.

6. Developers on the free tier get over 10,000 requests per day to the Gemma models via API.

Rapid release of multimodal and generative AI models

Google DeepMind has released a steady stream of new AI models and features over the past year, with the pace picking up in the last few months. Notable releases include Gemini 3.1 Flash Live for real-time interaction, Gemini 3.1 Pro (their largest model available via API), and Flash-Lite (their smallest API model, but still powerful). Other releases cover image generation and editing with Nano Banana 2, a unified embedding space for video, audio, images, and text, music generation with Lyria 3, world model building with Genie 3, a full-stack runtime for AI Studio named 'genie,' and open models with Gemma 4. On the generative media side, Veo 3.1 Lite offers cost-effective realistic video generation, and Gemini 3.1 text-to-speech produces hyper-realistic audio across various languages and styles. Together these span large language models, multimodal capabilities, and generative media.

Gemini's native multimodality and AI Studio's capabilities

Gemini models are natively multimodal, meaning they can process and understand various data types simultaneously: video, images, audio, text, and code. Crucially, they can also *output* multiple modalities, including text, code, audio tokens, and frame-by-frame images that can be stitched into video. Gemini supports function calling and native tools like code execution and grounding with Google Search. Google AI Studio offers a free platform to experiment with these models using a personal Gmail account. Users can enable features like code execution, function calling, Google Search grounding, Google Maps grounding, and URL context for retrieval.

Analyzing video content with Gemini for pennies

One demonstration analyzed a 5-minute segment of a YouTube video about dinosaurs. Given the video URL and a specific time range, Gemini 3.1 Flash-Lite, with search grounding enabled and the thinking level set to minimal, processed the video's frames and audio. It returned metadata that included a table of dinosaurs with timestamps and fun facts. The segment was sampled at one frame per second with the associated audio tokens; the analysis consumed approximately 39,900 tokens and cost around $0.015. This shows that the model can extract detailed information from video quickly and at very low cost. The 'get code' feature in AI Studio generates the Python or TypeScript code needed to replicate such tasks, which simplifies integration into projects.
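The figures quoted above can be turned into per-second and per-hour rates with a few lines of arithmetic. The sketch below uses only the numbers reported in the demo; the helper names are hypothetical, and the linear extrapolation to longer videos is an assumption, since actual billing depends on the model's published prices, the split between input and output tokens, and any grounding charges.

```python
# Figures reported for the 5-minute dinosaur clip in the demo.
TOKENS_USED = 39_900
COST_USD = 0.015
CLIP_SECONDS = 5 * 60


def per_second_rates(tokens, cost_usd, seconds):
    """Return (tokens per second, dollars per second) for a processed clip."""
    return tokens / seconds, cost_usd / seconds


def estimate_cost(duration_seconds, usd_per_second):
    """Extrapolate cost linearly to another duration (an assumption)."""
    return duration_seconds * usd_per_second


tokens_per_second, usd_per_second = per_second_rates(
    TOKENS_USED, COST_USD, CLIP_SECONDS
)
# Blended rate: total cost over total tokens, not an official price.
usd_per_million_tokens = COST_USD / TOKENS_USED * 1_000_000
one_hour_usd = estimate_cost(60 * 60, usd_per_second)

print(f"{tokens_per_second:.0f} tokens per second of video")
print(f"${usd_per_million_tokens:.3f} per million tokens (blended)")
print(f"${one_hour_usd:.2f} for one hour at the same rate")
```

At these figures the clip works out to about 133 tokens per second of video, and an hour of similar footage would cost roughly $0.18 if the same rate held.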

Comparing AI models and cost-effectiveness

AI Studio's 'compare mode' allows direct head-to-head comparisons of different models. In a test that asked for bounding boxes around green Lego bricks, Gemini 3.1 Flash-Lite quickly returned accurate results, and the cost of the task was a tiny fraction of a penny. The talk highlights that analyzing a full video, including its frames, audio, and context, can be done for mere pennies. This low cost, combined with the ability to perform complex tasks such as segmentation masks or detailed object detection (e.g., specific car models), makes these models attractive to developers and businesses that need to process large amounts of multimodal data.
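The summary does not show what the bounding-box output looks like. As a minimal sketch, the code below assumes the convention described in the Gemini image-understanding documentation, where each detection carries a `box_2d` of `[ymin, xmin, ymax, xmax]` normalized to a 0-1000 scale, and converts it to pixel coordinates. The sample detection, the label, and the function name are hypothetical.

```python
def to_pixel_box(box_2d, image_width, image_height):
    """Convert [ymin, xmin, ymax, xmax] on a 0-1000 scale to pixels.

    Returns (left, top, right, bottom), the order most drawing
    libraries expect. The 0-1000 normalization is an assumption
    about the model's output format.
    """
    ymin, xmin, ymax, xmax = box_2d
    left = round(xmin / 1000 * image_width)
    top = round(ymin / 1000 * image_height)
    right = round(xmax / 1000 * image_width)
    bottom = round(ymax / 1000 * image_height)
    return left, top, right, bottom


# Hypothetical detection for a 1280x720 image of Lego bricks.
detections = [{"label": "green brick", "box_2d": [100, 250, 400, 500]}]

for detection in detections:
    box = to_pixel_box(detection["box_2d"], 1280, 720)
    print(detection["label"], box)
```

Note that the y coordinate comes first in the assumed format, so swapping the axes is the most likely mistake when drawing the boxes.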

Gemini 3.1 Flash Live for real-time interaction

Gemini 3.1 Flash Live enables real-time, conversational interactions. It supports sharing a desktop or a video feed and integrates tools such as reasoning, Google Search grounding, and function calling. Demonstrations included live screen sharing, where the model identified and described Python code snippets and translated its responses into Spanish and Hindi. It also handled a video feed, correctly counting the fingers held up and describing the background. System instructions can be modified to enforce specific response behaviors, such as responding only in German, as shown with a query about the AI Dev conference and DeepLearning.AI.

Gemma 4: Open-source models for diverse applications

Gemma 4 is Google's latest family of open models, available in four sizes: 2B, 4B, and two larger 'moderate-sized' versions (a 26B mixture-of-experts model and a 31B dense model). The 2B parameter version is small enough to run on mobile devices, while the 4B fits on a laptop. The models are Apache 2.0 licensed, which makes them straightforward to use in commercial applications. Performance is strong, with the 26B mixture-of-experts and 31B dense models showing competitive results in chatbot arenas. Like Gemini, they support multimodal inputs (audio, video, images). Developers on the free tier receive over 10,000 requests per day to Gemma models via API, a cost-effective way to integrate these capabilities. Additionally, AI Studio offers a 'build' feature to create apps via voice, with options for starting from scratch, using a 'feeling lucky' function, or generating tab completions.
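To put the free-tier figure in perspective, the sketch below works out how quickly requests can be sent without exceeding a daily allowance, and how many days a larger job would take. It uses 10,000 requests per day as a conservative stand-in for "over 10,000"; the function names are hypothetical, and real quotas may also impose per-minute limits that are not covered here.

```python
import math

SECONDS_PER_DAY = 24 * 60 * 60
FREE_TIER_REQUESTS_PER_DAY = 10_000  # conservative reading of "over 10,000"


def min_interval_seconds(requests_per_day):
    """Seconds to wait between requests to spread a daily quota evenly."""
    return SECONDS_PER_DAY / requests_per_day


def days_to_process(total_requests, requests_per_day):
    """Whole days needed to send total_requests within the daily quota."""
    return math.ceil(total_requests / requests_per_day)


interval = min_interval_seconds(FREE_TIER_REQUESTS_PER_DAY)
print(f"One request every {interval:.2f} s uses the full daily quota")
print(f"About {60 / interval:.1f} requests per minute, sustained")
print(f"25,000 requests need {days_to_process(25_000, FREE_TIER_REQUESTS_PER_DAY)} days")
```

Spread evenly, the allowance comes to roughly one request every 8.6 seconds around the clock, which is enough for prototypes and small batch jobs.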

Streamlining app development with AI Studio 'Build'

The 'build' feature in AI Studio simplifies app creation by letting users describe the app they want by voice. It can generate example apps or suggest features. The process involves defining the app's purpose, theme, and desired aesthetics. Users can select models such as 'Gemini 3.1 Pro Preview' or 'Gemini 3 Flash' and configure settings, microphone sources, and system instructions. Secrets management, version control, and GitHub integration are supported. The generated code is visible as the app is built, and apps can be published directly or shared via URL. The gallery showcases diverse app examples, including ones that use Lyria 3 for music generation, Nano Banana 2 for design, and Veo models for video. The brand identity generator example demonstrates quick creation of logos, brand voices ('energetic, playful, fiercely magical'), and content generators inspired by user inputs and mood boards.

Advanced applications in robotics, AR, and video generation

Gemini APIs are applicable to robotics, enabling real-time interaction with robotic actions, whether run on-device or triggered remotely. For augmented reality (AR), Gemini Live can dynamically explain environments, provide directions, and perform real-time translation. The Veo 3.1 series, particularly Veo 3.1 Lite and Flash, offers cost-effective realistic video generation. Capabilities include reference-powered video, animating images with guidance, natural language camera control (moving, rotating, zooming), outpainting and inpainting scenes, and character control. Recreating a commercial with these models has become far more efficient: a process that took about 20 minutes with older tools, and cost nearly $1 with Veo 3, now takes a couple of minutes and costs under a dollar with Veo 3.1. Hyper-personalized, rapidly generated video content is becoming increasingly accessible.

Common Questions

Google DeepMind has recently released several advanced AI models, including Gemini 3.1 Flash Live for real-time interaction, Gemini 3.1 Pro, and generative models such as Nano Banana 2 for images and Lyria 3 for music. They also offer open models like Gemma and video generation models in the Veo 3.1 series.

Topics

Mentioned in this video

Software & Apps
Lyria 3

An AI model for music generation.

Flash-Lite

A smaller but very powerful AI model from Google, available via API.

Nano Banana 2

An AI model for creating and editing images, with interleaved text capabilities.

Embeddings model

A new model allowing the use of video, audio, images, and text in a single embedding space.

AI Studio

A runtime environment for generating AI applications, incorporating OAuth and database support.

Gemini models

Natively multimodal AI models from Google that can understand and output various data types like video, images, audio, text, and code.

Python

A programming language used for code execution within Gemini's sandbox environment in AI Studio.

TypeScript

A programming language that can be used to export and utilize code generated from AI Studio.

Gmail

A personal email service from Google that can be used to access AI Studio for free.

Google Maps

A mapping service that can be used for grounding AI model outputs within AI Studio.

Java

A programming language that can be used to export and utilize code generated from AI Studio.

Gemma

An open-source AI model family from Google, available in four sizes and licensed under Apache 2.0.

Lyria

A music generation model used for creating music clips in various styles and languages.

Lovable

A tool mentioned as being used by people to build revenue-generating companies, alongside AI Studio.

Google Search

A search engine that can be integrated with AI models for grounding and providing cited sources.

Build

A feature within AI Studio that allows users to create applications using voice commands or automated generation.

Genie 3

A platform for world model building, allowing users to describe and experience virtual environments.

Cloud Run

A Google Cloud service where applications built with AI Studio Build can be published.

Camtasia

A video editing software used in earlier methods for creating commercials with AI-generated components.

MoviePy

A Python video editing library mentioned alongside Camtasia for stitching together AI-generated video elements.

Gemini 2.5

An AI model used to generate detailed descriptions from videos, which then inform the Veo 3 and Veo 3.1 models for video generation.
