How can I access and experiment with Google's AI models?

Google AI Studio provides access to many of these models, often for free using a personal Gmail account. You can enable various tools like code execution and Google Search grounding within the studio to test the models' capabilities.

What is Gemini 3.1 Flash Live and how does it differ?

Gemini 3.1 Flash Live is designed for real-time conversations and interactions. It can process inputs like screen sharing or video feeds and incorporate tools such as reasoning and Google Search grounding, offering a dynamic conversational experience.

What are the advantages of the Gemma open model family?

Gemma models are Google's open-source offering, available under the permissive Apache 2 license, making them suitable for commercial use. They come in various sizes, including versions small enough for mobile devices, and support multimodal tasks.

Can AI Studio be used to build complete applications?

Yes, AI Studio features a 'Build' mode that allows users to create applications, even using voice commands. It integrates with various AI models and services, and offers features like version control and GitHub integration for development.

How has video generation technology evolved with Google's models?

Google's VO3.1 models represent a significant advancement in video generation, allowing for complex scenes, character control, and even replicating commercials with hyper-personalization at a fraction of the cost and time compared to older methods.

Is now a good time for aspiring founders in the AI space?

The presenter strongly advocates that it has never been a better time to be a founder, especially in AI. Small teams can now achieve remarkable feats with AI tools, enabling rapid product development and scalable, hyper-personalized marketing.

Key Moments

AI Dev 26 x SF | Paige Bailey: What's New and What's Next in AI

DeepLearning.AI

Education6 min read43 min video

May 20, 2026|255 views|11

Save to Pod

Want to know something specific about what's covered?

We've already dissected every moment. Ask and we will deliver (with timestamps).

Key Moments

TL;DR

Google launched a suite of powerful AI models, including multimodal Gemini and open-source Gemma, enabling complex tasks like video analysis and app generation with surprising cost-effectiveness.

Key Insights

Gemini 3.1 Flashlight can analyze 5 minutes of YouTube video (including frames and audio) for approximately $0.015.

The Gemma 4 open model family includes a 2 billion parameter version small enough to run on mobile devices.

Building a brand identity generator app via voice in AI Studio, including logo, brand voice, and design, takes minutes.

VO3.1 Light can generate realistic video with audio at a 'fraction of the cost' of previous models.

A DIY commercial recreating a Chick-fil-A ad using VO3.1 took less than a dollar and a couple of minutes.

Developers on the free tier get over 10,000 requests per day to the Gemma models via API.

Rapid release of multimodal and generative AI models

Google DeepMind has been exceptionally busy, releasing a constant stream of new AI models and features over the past year, especially in the last few months. This includes notable releases like Gemini 3.1 Flash Live for real-time interaction, Gemini 3.1 Pro (their largest model available via API), and Flashlight (their smallest but powerful API model). Other advancements cover image generation and editing with Nano Banana 2, a unified embedding space for video, audio, images, and text, music generation with Liria 3, world model building with Genie 3, a full-stack runtime for AI Studio named 'genie,' open models with Gemma 4, and cost-effective realistic video generation with VO3.1 Light and Gemini 3.1 text-to-speech for hyper-realistic audio outputs across various languages and styles. This broad spectrum covers large language models, multimodal capabilities, and generative media, showcasing a significant pace of innovation.

Gemini's native multimodality and AI Studio's capabilities

Gemini models are natively multimodal, meaning they can process and understand various data types simultaneously: video, images, audio, text, and code. Crucially, they can also *output* multiple modalities, including text, code, audio tokens, and frame-by-frame images that can be stitched into video. Gemini supports function calling and native tools like code execution and grounding with Google Search. Google AI Studio offers a free platform to experiment with these models using a personal Gmail account. Users can enable features like code execution, function calling, Google Search grounding, Google Maps grounding, and URL context for retrieval.

Analyzing video content with Gemini for pennies

A compelling demonstration involved analyzing a 5-minute segment of a YouTube video about dinosaurs. By grounding the model with the video URL and setting a specific time frame, Gemini 3.1 Flashlight, with search grounding enabled and minimal thinking level, processed the video's frames and audio. The resulting metadata, including a table of dinosaurs with timestamps and fun facts, was generated efficiently. The analysis of this 5-minute video segment, sampled at one frame per second with associated audio tokens, consumed approximately 39,900 tokens and cost around $0.015. This illustrates the model's ability to extract detailed information from video content at a remarkably low cost and high speed. The 'get code' feature in AI Studio generates the necessary Python or TypeScript code to replicate such tasks, streamlining integration into projects.

Comparing AI models and cost-effectiveness

AI Studio's 'compare mode' allows direct head-to-head comparisons of different models. In a test bounding boxes around green Lego bricks, Gemini 3.1 Flashlight quickly provided accurate results. Notably, the cost for this task was a tiny fraction of a penny. The transcript highlights that analyzing a full video with understanding of frames, audio, and context can be done for mere pennies. This cost-effectiveness, combined with the ability to perform complex tasks like segmentation masks or detailed object detection (e.g., specific car models), underscores the value of these advanced models for developers and businesses looking to process large amounts of multimodal data efficiently.

Gemini 3.1 Flash Live for real-time interaction

Gemini 3.1 Flash Live enables real-time, conversational interactions. It supports sharing desktops or video feeds and integrates tools like reasoning, Google Search grounding, and function calling. Demonstrations included live screen sharing where the model could identify and describe Python code snippets, and even translate its responses into Spanish and Hindi. It also handled a video feed, accurately identifying fingers held up and describing the background. System instructions can be modified to enforce specific response behaviors, such as responding only in German, as shown with a query about an AI dev conference and Deep Learning.ai.

Gemma 4: Open-source models for diverse applications

Gemma 4 is Google's latest family of open models, available in four sizes: 2B, 4B, and two larger 'moderate-sized' versions. The 2B parameter version is small enough to run on mobile devices, while the 4B fits on a laptop. These models are Apache 2.0 licensed, making them easily usable for commercial applications. Performance is strong, with the 26B mixture of experts and 31B dense models showing competitive results in chatbot arenas. They support multimodal tasks (audio, video, images) similar to Gemini. Developers on the free tier receive over 10,000 requests per day to Gemma models via API, offering a cost-effective way to integrate these capabilities. Additionally, AI Studio offers a 'build' feature to create apps via voice, with options for starting from scratch, using a 'feeling lucky' function, or generating tab completions.

Streamlining app development with AI Studio 'Build'

The 'build' feature in AI Studio simplifies app creation, allowing users to describe their desired app via voice. It can generate example apps or assist with feature suggestions. The process involves defining the app's purpose, theme, and desired aesthetics. Users can select models like 'Gemini 3.1 Pro Preview' or 'Gemini 3 Flash' and configure settings, microphone sources, and system instructions. Secrets management, version control, and GitHub integration are supported. The generated code is visible as the app is built, and apps can be published directly or shared via URL. The gallery showcases diverse app examples, including those using LIA 3 for music generation, Nano Banana 2 for design, and VO models for video. The brand identity generator example demonstrates quick creation of logos, brand voices ('energetic, playful, fiercely magical'), and content generators inspired by user inputs and mood boards.

Advanced applications in robotics, AR, and video generation

Gemini APIs are applicable to robotics, enabling real-time interaction with robotic actions, whether on-device or triggered remotely. For augmented reality (AR), Gemini Live can dynamically explain environments, provide directions, and perform real-time translation. The VO3.1 series, particularly VO3.1 Light and Flash, offers cost-effective realistic video generation. This includes capabilities like reference-powered video, animated images with guidance, natural language camera control (moving, rotating, zooming), outpainting/inpainting scenes, and character control. Recreating a commercial using these models has become vastly more efficient: a process that took 20 minutes with older tools and nearly $1 with VO3, now takes minutes and costs under a dollar with VO3.1. This hyper-personalization and rapid generation of video content is becoming increasingly accessible.

Mentioned in This Episode

●Products

●Software & Apps

●Companies

●Organizations

●Books

●People Referenced

Common Questions

Google DeepMind has recently released several advanced AI models including Gemini 3.1 Flash Live for real-time interaction, Gemini 3.1 Pro, and various generative models like Nano Banana 2 for images and Liria 3 for music. They also offer open models like Gemma and video generation models in the VO3.1 series.

Topics

AI & Machine Learning Technology & Innovation Open-source AI Generative AI Large Language Models Multimodal AI AI Video Generation AI Development Tools AI Application Building

Mentioned in this video

Software & Apps

Liria 3

An AI model for music generation.

Flashlight

A smaller but very powerful AI model from Google, available via API.

Nano Banana 2

An AI model for creating and editing images, with interleaved text capabilities.

Embeddings model

A new model allowing the use of video, audio, images, and text in a single embedding space.

AI Studio

A runtime environment for generating AI applications, incorporating OAUTH and database support.

Gemini models

Natively multimodal AI models from Google that can understand and output various data types like video, images, audio, text, and code.

Python

A programming language used for code execution within Gemini's sandbox environment in AI Studio.

Typescript

A programming language that can be used to export and utilize code generated from AI Studio.

Gmail

A personal email service from Google that can be used to access AI Studio for free.

Google Maps

A mapping service that can be used for grounding AI model outputs within AI Studio.

Java

A programming language that can be used to export and utilize code generated from AI Studio.

Gemma

An open-source AI model family from Google, available in four sizes and licensed under Apache 2.

LIA

A music generation model used for creating music clips in various styles and languages.

Lovable

A tool mentioned as being used by people to build revenue-generating companies, alongside AI Studio.

Google Search

A search engine that can be integrated with AI models for grounding and providing cited sources.

Build

A feature within AI Studio that allows users to create applications using voice commands or automated generation.

Genie 3

A platform for world model building, allowing users to describe and experience virtual environments.

Cloud Run

A Google Cloud service where applications built with AI Studio Build can be published.

Camtasia

A video editing software used in earlier methods for creating commercials with AI-generated components.

Movie Pie

A video editing software mentioned alongside Camtasia for stitching together AI-generated video elements.

Gemini 2.5

An AI model used to generate detailed descriptions from videos, which then inform VO3 and VO3.1 models for video generation.

People

Paige Bailey

Leads engineering for developer relations at Google DeepMind, based in the Bay Area.

Andre Carpathy

His quote on video being a powerful medium is included in the presentation.

Organizations

Google DeepMind

An organization involved in AI research and development, responsible for releasing numerous AI models and features.

Companies

Hugging Face

A platform where users can download AI models, including Google's Gemma open model family.

YouTube

A video-sharing platform used as an input source in AI Studio for video analysis.

Intel

A company whose products (CPUs) are indirectly referenced in the context of needing powerful hardware for large AI models.

GitHub

A platform with integration into AI Studio Build, enabling features like OAUTH for Google sign-in and version control.

Golden State Warriors

A basketball team whose court is used as a setting in a dynamic world generation example for Genie 3.

Chick-fil-A

A fast-food chain whose chicken sandwich commercial is used as a baseline for demonstrating the evolution of video generation models.

Products

Lego bricks

Objects used in a demonstration within AI Studio to test image analysis capabilities of AI models.

Ford Mustangs

A specific car model mentioned as an example for object detection bounding box generation.

Nissan Ultimas

A specific car model mentioned as an example for object detection bounding box generation.

Books

Trans Metropolitan

A comic book mentioned as an inspiration for brand identity generation.

Locations

San Francisco

A city mentioned in a demonstration of Gemini's ability to provide weather information in different languages.

Mordor

A fictional location from The Lord of the Rings, used as an example scene for VO3.1 video generation.

Legislation & Policy

Apache 2

The license under which Google's Gemma open model family is released, allowing for commercial use.

Ask anything from this episode.

Save it, chat with it, and connect it to Claude or ChatGPT. Get cited answers from the actual content — and build your own knowledge base of every podcast and video you care about.

Get Started Free