AI Dev 26 x SF | Paige Bailey: What's New and What's Next in AI
Key Moments
Google has launched a broad set of AI models, including the natively multimodal Gemini family and the open Gemma models, that handle complex tasks such as video analysis and app generation at low cost.
Key Insights
Gemini 3.1 Flash-Lite can analyze 5 minutes of YouTube video (including frames and audio) for approximately $0.015.
The Gemma 4 open model family includes a 2 billion parameter version small enough to run on mobile devices.
Building a brand identity generator app via voice in AI Studio, including logo, brand voice, and design, takes minutes.
Veo 3.1 Lite can generate realistic video with audio at a 'fraction of the cost' of previous models.
A DIY commercial recreating a Chick-fil-A ad using Veo 3.1 took less than a dollar and a couple of minutes.
Developers on the free tier get over 10,000 requests per day to the Gemma models via API.
Rapid release of multimodal and generative AI models
Google DeepMind has released a steady stream of new AI models and features over the past year, with the pace picking up in the last few months. Notable releases include Gemini 3.1 Flash Live for real-time interaction, Gemini 3.1 Pro (the largest model available via API), and Gemini 3.1 Flash-Lite (the smallest API model, but still powerful). Other releases include Nano Banana 2 for image generation and editing, a unified embedding space for video, audio, images, and text, Lyria 3 for music generation, Genie 3 for world model building, a full-stack runtime for AI Studio named 'genie,' the Gemma 4 open models, Veo 3.1 Lite for cost-effective realistic video generation, and Gemini 3.1 text-to-speech for hyper-realistic audio output across many languages and styles. Together these span large language models, multimodal capabilities, and generative media.
Gemini's native multimodality and AI Studio's capabilities
Gemini models are natively multimodal, meaning they can process and understand various data types simultaneously: video, images, audio, text, and code. Crucially, they can also *output* multiple modalities, including text, code, audio tokens, and frame-by-frame images that can be stitched into video. Gemini supports function calling and native tools like code execution and grounding with Google Search. Google AI Studio offers a free platform to experiment with these models using a personal Gmail account. Users can enable features like code execution, function calling, Google Search grounding, Google Maps grounding, and URL context for retrieval.
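The function calling mentioned above generally works as a loop: the application declares a tool, the model replies with a structured request naming that tool and its arguments, and the application runs the matching local function. The sketch below illustrates only the application side of that loop, with no network call. The tool `minutes_to_seconds`, the `dispatch` helper, and the shape of `simulated_call` are hypothetical and chosen for illustration; they are not taken from the talk.

```python
from typing import Any, Callable, Dict


def minutes_to_seconds(minutes: float) -> Dict[str, Any]:
    """Hypothetical local tool: convert a duration in minutes to seconds."""
    return {"seconds": minutes * 60}


# Declaration in the JSON-schema style that function-calling APIs commonly
# accept. Field names here are an assumption, not copied from the talk.
MINUTES_TO_SECONDS_DECLARATION = {
    "name": "minutes_to_seconds",
    "description": "Convert a duration in minutes to seconds.",
    "parameters": {
        "type": "object",
        "properties": {
            "minutes": {"type": "number", "description": "Duration in minutes"}
        },
        "required": ["minutes"],
    },
}

# Registry mapping tool names to the local functions that implement them.
TOOLS: Dict[str, Callable[..., Dict[str, Any]]] = {
    "minutes_to_seconds": minutes_to_seconds,
}


def dispatch(function_call: Dict[str, Any]) -> Dict[str, Any]:
    """Run the local function that a model-issued function call names."""
    name = function_call["name"]
    if name not in TOOLS:
        raise KeyError(f"Model requested an unknown tool: {name}")
    return TOOLS[name](**function_call.get("args", {}))


# Simulated model output; a real response would come from the API.
simulated_call = {"name": "minutes_to_seconds", "args": {"minutes": 5}}
result = dispatch(simulated_call)
print(result)
```

In a real integration the result would be sent back to the model so it can compose its final answer. Keeping an explicit registry means the model can only trigger functions the application has chosen to expose.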
Analyzing video content with Gemini for pennies
One demonstration analyzed a 5-minute segment of a YouTube video about dinosaurs. Given the video URL and a specific time range, Gemini 3.1 Flash-Lite, with search grounding enabled and the thinking level set to minimal, processed the video's frames and audio. It produced metadata that included a table of dinosaurs with timestamps and fun facts. The analysis of this 5-minute segment, sampled at one frame per second with associated audio tokens, consumed approximately 39,900 tokens and cost around $0.015, showing that detailed information can be extracted from video quickly and at very low cost. The 'get code' feature in AI Studio generates the Python or TypeScript code needed to replicate such tasks, which simplifies integration into projects.
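The figures above imply a token rate and a blended price that can be worked out directly. The sketch below uses only the approximate numbers quoted in the summary (5 minutes, 39,900 tokens, $0.015). The linear extrapolation in `estimate_cost_usd` is an assumption for illustration, not a published pricing formula, and the blended rate mixes whatever input and output tokens the demo consumed.

```python
SEGMENT_SECONDS = 5 * 60   # 5-minute segment, from the summary
TOTAL_TOKENS = 39_900      # approximate tokens consumed, from the summary
TOTAL_COST_USD = 0.015     # approximate cost, from the summary

# Average tokens consumed per second of video (frames plus audio plus text).
tokens_per_second = TOTAL_TOKENS / SEGMENT_SECONDS

# Implied blended price per one million tokens for this particular run.
usd_per_million_tokens = TOTAL_COST_USD / TOTAL_TOKENS * 1_000_000


def estimate_cost_usd(video_seconds: float) -> float:
    """Extrapolate cost linearly, assuming the same token rate and price."""
    tokens = video_seconds * tokens_per_second
    return tokens * usd_per_million_tokens / 1_000_000


print(f"{tokens_per_second:.0f} tokens per second of video")
print(f"${usd_per_million_tokens:.3f} per million tokens (blended)")
print(f"${estimate_cost_usd(3600):.2f} estimated for one hour of video")
```

Under these assumptions the demo works out to about 133 tokens per second of video, and an hour of similar footage would cost roughly 18 cents.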
Comparing AI models and cost-effectiveness
AI Studio's 'compare mode' allows direct head-to-head comparisons of different models. In a test that asked for bounding boxes around green Lego bricks, Gemini 3.1 Flash-Lite quickly returned accurate results, at a cost of a tiny fraction of a penny. The talk notes that analyzing a full video, with understanding of frames, audio, and context, can be done for mere pennies. This low cost, combined with the ability to perform more complex tasks such as segmentation masks or detailed object detection (e.g., specific car models), makes these models attractive to developers and businesses that need to process large amounts of multimodal data.
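To draw model-returned bounding boxes on an image, the coordinates usually need converting to pixels. The sketch below assumes boxes arrive as `[ymin, xmin, ymax, xmax]` on a 0 to 1000 scale, which is the convention described in Gemini's image understanding documentation. The summary does not state the format used in the demo, so treat it as an assumption. The function name and sample values are hypothetical.

```python
from typing import Sequence, Tuple


def to_pixel_box(
    box_2d: Sequence[float], image_width: int, image_height: int
) -> Tuple[int, int, int, int]:
    """Convert a normalized [ymin, xmin, ymax, xmax] box (0-1000 scale)
    into pixel coordinates ordered as (left, top, right, bottom)."""
    ymin, xmin, ymax, xmax = box_2d
    left = round(xmin * image_width / 1000)
    top = round(ymin * image_height / 1000)
    right = round(xmax * image_width / 1000)
    bottom = round(ymax * image_height / 1000)
    return (left, top, right, bottom)


# Illustrative box on a 640x480 image; the values are made up for the example.
print(to_pixel_box([250, 100, 750, 500], image_width=640, image_height=480))
# -> (64, 120, 320, 360)
```

The (left, top, right, bottom) ordering matches what common drawing libraries expect for rectangles, so the result can be passed straight to a drawing call.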
Gemini 3.1 Flash Live for real-time interaction
Gemini 3.1 Flash Live enables real-time, conversational interactions. It supports sharing desktops or video feeds and integrates tools like reasoning, Google Search grounding, and function calling. Demonstrations included live screen sharing in which the model identified and described Python code snippets and translated its responses into Spanish and Hindi. It also handled a video feed, correctly counting the fingers held up and describing the background. System instructions can be modified to enforce specific response behaviors, such as responding only in German, as shown with a query about the AI Dev conference and DeepLearning.AI.
Gemma 4: Open-source models for diverse applications
Gemma 4 is Google's latest family of open models, available in four sizes: 2B, 4B, and two larger 'moderate-sized' versions. The 2B parameter version is small enough to run on mobile devices, while the 4B fits on a laptop. These models are Apache 2.0 licensed, making them easily usable for commercial applications. Performance is strong, with the 26B mixture of experts and 31B dense models showing competitive results in chatbot arenas. They support multimodal tasks (audio, video, images) similar to Gemini. Developers on the free tier receive over 10,000 requests per day to Gemma models via API, offering a cost-effective way to integrate these capabilities. Additionally, AI Studio offers a 'build' feature to create apps via voice, with options for starting from scratch, using a 'feeling lucky' function, or generating tab completions.
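A rough way to see why the smaller sizes fit on phones and laptops is to estimate the memory needed just to hold the weights. The sketch below is back-of-the-envelope arithmetic, not a measurement. It takes the nominal parameter counts from the size labels above, and the 16-bit and 4-bit precisions are assumptions chosen for illustration. Actual parameter counts and runtime memory (activations, caches, and framework overhead) will differ.

```python
def weight_memory_gib(num_parameters: float, bits_per_parameter: int) -> float:
    """Approximate memory needed to store the model weights, in GiB."""
    total_bytes = num_parameters * bits_per_parameter / 8
    return total_bytes / 2**30


# Nominal sizes taken from the labels in the text (2B, 4B, 26B, 31B).
NOMINAL_SIZES = {"2B": 2e9, "4B": 4e9, "26B": 26e9, "31B": 31e9}

for label, params in NOMINAL_SIZES.items():
    fp16 = weight_memory_gib(params, 16)  # assumed 16-bit weights
    int4 = weight_memory_gib(params, 4)   # assumed 4-bit quantized weights
    print(f"{label}: ~{fp16:.1f} GiB at 16-bit, ~{int4:.1f} GiB at 4-bit")
```

By this estimate, a nominal 2B model needs under 1 GiB for weights at 4-bit precision, which is consistent with the claim that it can run on mobile devices.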
Streamlining app development with AI Studio 'Build'
The 'build' feature in AI Studio simplifies app creation, allowing users to describe their desired app via voice. It can generate example apps or assist with feature suggestions. The process involves defining the app's purpose, theme, and desired aesthetics. Users can select models like 'Gemini 3.1 Pro Preview' or 'Gemini 3 Flash' and configure settings, microphone sources, and system instructions. Secrets management, version control, and GitHub integration are supported. The generated code is visible as the app is built, and apps can be published directly or shared via URL. The gallery showcases diverse app examples, including those using Lyria 3 for music generation, Nano Banana 2 for design, and Veo models for video. The brand identity generator example demonstrates quick creation of logos, brand voices ('energetic, playful, fiercely magical'), and content generators inspired by user inputs and mood boards.
Advanced applications in robotics, AR, and video generation
Gemini APIs are applicable to robotics, enabling real-time interaction with robotic actions, whether on-device or triggered remotely. For augmented reality (AR), Gemini Live can dynamically explain environments, provide directions, and perform real-time translation. The Veo 3.1 series, particularly Veo 3.1 Lite and Flash, offers cost-effective realistic video generation. Capabilities include reference-powered video, animated images with guidance, natural language camera control (moving, rotating, zooming), outpainting and inpainting of scenes, and character control. Recreating a commercial with these models has become far more efficient: a process that took 20 minutes with older tools, and cost nearly $1 with Veo 3, now takes a couple of minutes and costs under a dollar with Veo 3.1. Hyper-personalized, rapidly generated video content is becoming increasingly accessible.
Common Questions
Google DeepMind has recently released several advanced AI models, including Gemini 3.1 Flash Live for real-time interaction, Gemini 3.1 Pro, and generative models such as Nano Banana 2 for images and Lyria 3 for music. It also offers open models in the Gemma family and video generation models in the Veo 3.1 series.