⚡️ Google's Open AI Strategy — Omar Sanseviero, Google DeepMind
Key Moments
Google's Gemma 4 model uses a novel 'effective parameter' technique that loads only a fraction of its parameters (e.g., 2B of 5B) onto the GPU for fast on-device inference, and its capabilities reportedly match state-of-the-art closed-source models from about 1.5 years ago.
Key Insights
Gemma 4, Google's latest open model, relies on 'effective parameters': only a portion of the weights (e.g., 2B of 5B) is loaded into the GPU for inference, which significantly speeds it up.
Gemini Nano, integrated into high-end Pixel and Samsung phones, is built upon Gemma architectures (like Gemma 3N) for optimized on-device AI experiences without requiring an internet connection.
While Gemma 4 reportedly matches state-of-the-art closed-source capabilities from about 1.5 years ago, Gemini still leads in knowledge and factuality because of its larger model size and training.
Gemma 4 models can understand audio, short videos (30-60 seconds), and images, with capabilities like speech recognition, object detection, and captioning, optimized for on-device use.
Google's Med-Gemma 1.5, based on Gemma 3, received additional training on medical datasets, demonstrating the effectiveness of fine-tuning for specialized domains.
The trend in fine-tuning is shifting, with models like Gemma 4 performing exceptionally well out-of-the-box, reducing the need for extensive customization for many use cases.
Gemma 4: The evolution of efficient open models
Google has released Gemma 4, its most capable open model to date, emphasizing more intelligence per parameter and built-in multimodal capabilities. A key innovation is the distinction between 'effective parameters' and 'active parameters'. Traditional transformer architectures require every parameter to be loaded; Gemma 4 instead adds a 'per-layer embedding' inside its transformer blocks. Because this embedding works as a lookup table, only a fraction of the total parameters (e.g., 2 billion out of 5 billion for a 5B model) needs to be loaded into the GPU for inference. The remaining parameters can reside on the CPU or on disk and be looked up as needed, which keeps inference fast. The design is optimized for on-device targets such as smartphones and Raspberry Pis, where compute and memory are constrained. It may not scale as efficiently to larger, more complex tasks as dense architectures or Mixture of Experts (MoE) models do, but it is a significant step toward running capable models directly on user devices, with the privacy and latency benefits that brings.
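The split between resident and offloaded parameters can be illustrated with a toy sketch. This is not Gemma 4's implementation: the class name `ToyModel`, the field names, and all sizes are hypothetical, and plain Python lists stand in for CPU RAM or disk. It only shows the lookup-table idea described above, where a per-layer table stays in slow storage and just the rows for the current tokens are copied over.

```python
from dataclasses import dataclass
import random


@dataclass
class ToyModel:
    vocab_size: int
    num_layers: int
    ple_dim: int       # width of one per-layer embedding row (hypothetical)
    core_params: int   # weights assumed to stay resident on the accelerator

    def __post_init__(self):
        rng = random.Random(0)
        # "Slow" storage: one lookup table per layer, standing in for CPU or disk.
        self.slow_tables = [
            [[rng.random() for _ in range(self.ple_dim)]
             for _ in range(self.vocab_size)]
            for _ in range(self.num_layers)
        ]

    @property
    def offloaded_params(self):
        # Every entry of every per-layer table lives outside the accelerator.
        return self.num_layers * self.vocab_size * self.ple_dim

    @property
    def total_params(self):
        return self.core_params + self.offloaded_params

    def fetch_rows(self, layer, token_ids):
        """Copy only the rows needed for these tokens out of slow storage."""
        table = self.slow_tables[layer]
        return {t: table[t] for t in set(token_ids)}


model = ToyModel(vocab_size=1000, num_layers=4, ple_dim=8, core_params=20_000)
rows = model.fetch_rows(layer=0, token_ids=[5, 17, 5, 42])
fetched = sum(len(r) for r in rows.values())

print(f"total parameters:     {model.total_params}")
print(f"resident parameters:  {model.core_params}")
print(f"values fetched for this step: {fetched}")
```

In this toy setup most of the parameter count sits in the lookup tables, yet a single step touches only a few rows of them, which is the intuition behind quoting a smaller 'effective' size than the total.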
On-device AI: Gemini Nano and the future of mobile intelligence
Google's commitment to on-device AI is exemplified by Gemini Nano, which is pre-installed on high-end Pixel and Samsung phones, baked directly into the operating system. Gemini Nano is developed on top of Gemma architectures, specifically leveraging models like Gemma 3N, which were designed with phone use cases in mind. These models undergo additional training and adaptations to ensure optimal performance on mobile hardware. The 'parameter offloading' or 'download on demand' concept, similar to what is seen in Gemma 4, is also employed in these smaller models. This allows users to experience advanced AI features directly on their phones without needing an internet connection, offering benefits like enhanced privacy and immediate responsiveness. The integration of 'off-the-shelf' AI capabilities into the core mobile experience signifies a future where personal devices are increasingly intelligent and self-sufficient in handling complex AI tasks.
Bridging the gap: Gemma 4 vs. Gemini
The release of Gemma 4 marks a significant milestone, with its capabilities reportedly matching state-of-the-art closed-source models from about 1.5 years ago. This suggests a rapid advancement in open-source AI, closing the performance gap with proprietary systems. However, a key distinction remains in knowledge acquisition and general world understanding. Larger models like Google's flagship Gemini series still hold a considerable advantage in terms of accessing and processing vast amounts of information, facts, and real-world data. While Gemma 4 excels in areas like agentic capabilities, function calling, and handling conversational tasks on local hardware, Gemini is superior for tasks requiring deep factual knowledge. Google envisions a future, potentially within 1-2 years, where models as powerful as Gemini 3 Pro could run directly on smartphones, blurring the lines further between on-device and cloud-based AI capabilities.
Multimodal capabilities in Gemma 4 and on-device models
Gemma 4 inherits many of the multimodal advancements from Gemini 3, allowing smaller models to process audio, images, and short videos (up to 30-60 seconds). For audio, capabilities include speech recognition, speech-to-translated text, and basic speech understanding, enabling users to query audio files. On the vision side, improvements include object detection and captioning, though image segmentation is still a sought-after feature that is not yet supported. A current limitation is the inability to process both video and audio simultaneously within the same prompt; these modalities can be handled separately. Future enhancements through additional fine-tuning are expected to address these limitations, further expanding the on-device multimodal experience.
The multilingual strength of Gemma's tokenizer
Gemma models support 140 languages, largely thanks to their multilingual tokenizer, which is based on the Gemini tokenizer. This tokenizer is highly effective at capturing the nuances of different languages. For example, among models of its generation, a fine-tuned Gemma 3 outperformed competitors in specific languages such as Vietnamese, even though those competitors' base models were generally considered stronger. This suggests that the tokenizer gives Gemma a solid foundation for multilingual applications, including fine-tuning on language-specific datasets.
Diffusion models for text generation: Speed and potential
Google is also exploring diffusion models for text generation, a departure from the dominant auto-regressive approach. The primary advantage identified is speed, particularly for tasks like code generation. Diffusion models offer potential advantages in 'fill-in-the-middle' scenarios, where specific code structures or completions are desired. While this technology is still in its early stages, with model quality sometimes lagging behind auto-regressive models, it opens up new avenues for AI architecture. The concept of a 'System 1' (fast diffusion model) and 'System 2' (slower, more deliberate auto-regressive model) in AI agents is being considered, suggesting a potential future where diffusion models handle rapid, specific tasks while auto-regressive models manage more complex reasoning.
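The speed and fill-in-the-middle arguments above can be made concrete with a toy step count. This is a sketch under loud assumptions: it is not a diffusion model, the `propose` callback is a hypothetical stand-in for a real denoiser (here it simply knows the answer), and slots are filled left to right only for simplicity. It shows just two things: several masked positions can be filled in one step, and each proposal can see both the prefix and the suffix.

```python
def autoregressive_steps(num_tokens):
    # One forward pass per generated token, each depending on the previous one.
    return num_tokens


def fill_in_the_middle(prefix, suffix, middle_len, propose, per_step):
    """Fill `middle_len` masked slots, up to `per_step` slots per step."""
    middle = [None] * middle_len
    steps = 0
    while None in middle:
        open_slots = [i for i, tok in enumerate(middle) if tok is None][:per_step]
        # All selected slots are filled in the same step, with both sides visible.
        proposals = propose(prefix, middle, suffix, open_slots)
        for i in open_slots:
            middle[i] = proposals[i]
        steps += 1
    return prefix + middle + suffix, steps


# Hypothetical oracle standing in for a trained model.
target = ["return", "a", "+", "b"]


def toy_propose(prefix, middle, suffix, open_slots):
    return {i: target[i] for i in open_slots}


prefix = ["def", "add(a,", "b):"]
suffix = ["#", "end"]
tokens, parallel_steps = fill_in_the_middle(prefix, suffix, 4, toy_propose, per_step=2)

print(" ".join(tokens))
print("sequential steps:", autoregressive_steps(len(target)))
print("parallel steps:  ", parallel_steps)
```

A 'System 1' and 'System 2' agent along the lines discussed could route short, well-scoped edits like this one to the fast path and keep open-ended reasoning on the auto-regressive path, though the episode presents that only as an idea under consideration.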
The evolving landscape of fine-tuning
The emphasis on fine-tuning is shifting as open models like Gemma 4 become increasingly capable out of the box. Historically, the fine-tuning community, active on platforms like GitHub and Reddit, drove significant innovation. However, many partners working with Gemma 4 found that the untuned model already performed well enough that extensive fine-tuning was unnecessary for their use cases. Fine-tuning remains valuable for highly specialized domains like healthcare or finance, or for adding very specific behaviors, but general conversational adjustments can often be achieved through prompting alone. The trend is toward relying on the strong general capabilities of pre-trained models rather than extensive customization, which makes these models accessible to a broader range of users and developers.
Challenges and future of on-device AI development
Developing AI experiences for on-device applications presents unique challenges. The idea of using multiple LoRAs (Low-Rank Adaptations) for different tasks on a single device, while conceptually flexible, could lead to substantial overhead. If each app has its own LoRA, updating the base model would require updating numerous LoRAs, which is impractical given mobile operating system update cycles. This presents an industry-wide challenge to streamline how on-device ML experiences are built and managed, balancing efficiency with user experience and maintainability.
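The maintenance problem described above can be sketched in a few lines. The `Adapter` record, the app names, and the version strings are hypothetical; the only fact relied on is that a LoRA for a weight matrix of shape `d_out x d_in` adds two low-rank factors with `rank * (d_in + d_out)` parameters in total, trained against one specific set of base weights.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Adapter:
    app: str           # app that ships this adapter (hypothetical)
    base_version: str  # base model version the adapter was trained against
    rank: int


def lora_params(d_in, d_out, rank):
    # Factors A (rank x d_in) and B (d_out x rank) for one adapted matrix.
    return rank * (d_in + d_out)


def stale_adapters(adapters, new_base_version):
    """Adapters that no longer match the base model after an update."""
    return [a for a in adapters if a.base_version != new_base_version]


installed = [
    Adapter(app="camera", base_version="v1", rank=8),
    Adapter(app="notes", base_version="v1", rank=16),
    Adapter(app="keyboard", base_version="v2", rank=8),
]

# The operating system ships base model "v2": every "v1" adapter must be redone.
to_update = stale_adapters(installed, "v2")

print("adapters to retrain:", [a.app for a in to_update])
print("parameters per adapted 2048x2048 matrix at rank 8:", lora_params(2048, 2048, 8))
```

Each adapter is small, but the list of stale adapters grows with the number of apps, and every entry needs retraining and redistribution on the base model's release schedule.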
Common Questions
Gemma 4 is Google's latest open model, emphasizing intelligence per parameter and multimodal capabilities. A key innovation is its use of 'effective parameters,' allowing for fast inference by offloading some parameters to CPU or disk, optimizing it for on-device applications.
Mentioned in this video
Google's flagship AI model, contrasted with Gemma, offering superior knowledge and factuality. It is envisioned that future versions will be powerful enough to run directly on phones.
An open-source partner that collaborates with Google to enable running Gemma models, particularly for offline use cases and integration with tools like Android Studio.
Google's latest capable open model, designed to compact intelligence per parameter and include multimodal capabilities. It utilizes a technique with effective parameters, allowing for fast inference by offloading some parameters to CPU or disk.
An internal Google team collaborating on the Gemma model launch, suggesting integration with cloud services.
An earlier generation of Google's open models whose architecture and tokenizer were foundational for Gemma 4 and which performed strongly on multilingual tasks.
A tool released in December that allows analysis of activations across different layers of models based on tokens. It involved storing petabytes of data for Gemma 3 models.
An on-device AI model integrated into Pixel phones and high-end Samsung devices, built on top of Gemma architectures, enabling local AI capabilities.
The research basis for Gemma 4's multimodal capabilities, contributing improvements in understanding audio, images, and short videos.
Mentioned in the context of mobile integrations, indicating its role in previous on-device experiences.
An open-source partner working with Google to support Gemma model deployment.
An open-source partner collaborating with Google to enable the use of Gemma models, including integration with Android Studio.
An IDE integrated with Gemma 4, featuring an agent mode to assist developers in writing code for Android applications, supporting offline model usage through Llama.cpp or vLLM.
A fine-tuned version of Gemma 3, specifically trained on medical datasets, indicating specialized model development for domain-specific applications.
A library mentioned in the context of community-driven model fine-tuning and experimentation, highlighting open-source contributions to AI development.
An open-source partner collaborating with Google on the Gemma model launch, indicating support for hardware acceleration.
A key open-source partner for Google's Gemma models, facilitating their integration and use within a broader AI community. They also offer 'skills' for prompting agents.
An open-source partner collaborating with Google on the Gemma model launch, suggesting hardware support considerations.
A leading AI research lab within Google, responsible for models like Gemma and Gemini. The discussion touches on its global expansion, integration with Kaggle, and its DevRel team.