What are the benefits of using on-device models like Gemma?

On-device models offer advantages such as offline functionality and enhanced privacy, as data processing occurs locally without needing to send information to external APIs. This makes them ideal for scenarios where connectivity is limited or data sensitivity is high.

How does Google collaborate with external partners for model releases?

Google works with numerous partners, including Llama.cpp, Ollama, MLX, Hugging Face, vLLM, NVIDIA, and AMD. These collaborations are crucial for ensuring models like Gemma 4 are accessible and performant across various platforms and hardware.

What are the key differences between Gemini and Gemma models?

Gemini models excel in broad world knowledge and factuality, suited for complex, long-running tasks. Gemma models, while rapidly improving, are optimized for on-device and agentic capabilities, with smaller sizes offering faster inference for specific functions.

What advancements have been made in Gemma 4's multimodal capabilities?

Building on Gemini 3 research, Gemma 4 can understand audio, images, and short videos. It supports speech recognition, speech-to-text translation, and basic speech understanding, along with object detection and captioning for vision tasks.

Why is fine-tuning less common now for general conversational models?

Models are becoming highly capable out-of-the-box, reducing the need for extensive fine-tuning. Prompting techniques are often sufficient for customizing behavior, though fine-tuning remains valuable for highly specialized domains like finance or healthcare.

What are the trade-offs between dense models and Mixture-of-Experts (MoE) models?

Dense models offer raw intelligence and are easier to fine-tune, and even the larger sizes fit on consumer GPUs. MoE models provide extremely fast inference for a given compute budget, but they can be challenging to fine-tune for instruction following because of routing complexities.
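The routing step behind that trade-off can be illustrated with a minimal sketch of generic top-k expert routing. This is not Gemma's actual architecture: the experts here are hypothetical scalar functions, and the router logits are made-up numbers. It only shows why inference is cheap (most experts never run) and why fine-tuning is delicate (the discrete expert choice depends on router scores that training also changes).

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def route_top_k(router_logits, k):
    """Pick the k highest-scoring experts and renormalize their weights to sum to 1."""
    probs = softmax(router_logits)
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    chosen = ranked[:k]
    norm = sum(probs[i] for i in chosen)
    return [(i, probs[i] / norm) for i in chosen]


def moe_forward(x, experts, router_logits, k=2):
    """Weighted sum over the selected experts only; unselected experts are never called."""
    return sum(weight * experts[i](x) for i, weight in route_top_k(router_logits, k))


# Four toy "experts" (hypothetical): each just scales its input by a different factor.
experts = [lambda x, s=s: s * x for s in (1.0, 2.0, 3.0, 4.0)]
router_logits = [0.0, 2.0, 1.0, -1.0]  # made-up router scores for one token

selected = route_top_k(router_logits, k=2)
output = moe_forward(1.0, experts, router_logits, k=2)
print(selected, output)
```

With k=2 of 4 experts, only half of the expert weights take part in each token's computation, which is the source of the speed advantage.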

How is Google DeepMind's Developer Relations (DevRel) team evolving in the AI era?

The DevRel team is redefining its role in an AI-centric research organization. They focus on engaging the community, building tools, and gathering feedback to ensure Google's AI offerings meet developer needs, with global expansion into locations like Singapore and India.

Key Moments

⚡️ Google's Open AI Strategy — Omar Sanseviero, Google DeepMind

Latent Space Podcast

Science & Technology | 6 min read | 30 min video

May 24, 2026 | 202 views

TL;DR

Google's Gemma 4 model uses a novel 'effective parameter' technique to load only a fraction of its 5B parameters for rapid, on-device inference, shrinking the gap with older closed-source models.

Key Insights

Gemma 4, Google's latest open model, uses 'effective parameters' by loading only a portion (e.g., 2B of 5B) into the GPU for inference, significantly speeding up operations.

Gemini Nano, integrated into high-end Pixel and Samsung phones, is built upon Gemma architectures (like Gemma 3n) for optimized on-device AI experiences without requiring an internet connection.

While Gemma 4 matches state-of-the-art capabilities from 1.5 years ago, Gemini excels in knowledge and factuality due to its larger model size and training.

Gemma 4 models can understand audio, short videos (30-60 seconds), and images, with capabilities like speech recognition, object detection, and captioning, optimized for on-device use.

Google's MedGemma 1.5, based on Gemma 3, received additional training on medical datasets, demonstrating the effectiveness of fine-tuning for specialized domains.

The trend in fine-tuning is shifting, with models like Gemma 4 performing exceptionally well out-of-the-box, reducing the need for extensive customization for many use cases.

Gemma 4: The evolution of efficient open models

Google has released Gemma 4, its most capable open model to date, emphasizing more intelligence per parameter and built-in multimodal capabilities. A key innovation is the distinction between 'effective parameters' and 'active parameters'. Unlike traditional transformer architectures, which require loading all parameters, Gemma 4 introduces a 'per-layer embedding' within its transformer blocks. Because these embeddings work as a lookup table, only a fraction of the total parameters (e.g., 2 billion of a 5B model's 5 billion) need to be loaded into the GPU for inference. The remaining parameters can reside on the CPU or disk, which keeps inference fast.

This design is optimized for on-device targets such as smartphones and Raspberry Pis, where compute and memory are constrained. The method might not scale as efficiently to larger, more complex tasks as dense or Mixture-of-Experts (MoE) architectures do, but it makes capable models practical on edge devices, which in turn improves privacy and reduces latency.
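The memory arithmetic and the lookup-table idea described above can be sketched as follows. This is an illustration under stated assumptions, not Gemma 4's implementation: `ParamBudget` and `PerLayerEmbeddingStore` are hypothetical names, the 5B/2B split is the example figure from the text, and the bytes-per-parameter values are assumed precisions.

```python
from dataclasses import dataclass


@dataclass
class ParamBudget:
    """Hypothetical split of a model's weights, in billions of parameters."""
    total_b: float        # all weights, including per-layer embedding tables
    accelerator_b: float  # weights that must sit in GPU/accelerator memory

    @property
    def offloaded_b(self) -> float:
        """Weights that can stay in CPU memory or on disk."""
        return self.total_b - self.accelerator_b

    def accelerator_gib(self, bytes_per_param: float) -> float:
        """Accelerator memory needed for the resident weights, in GiB."""
        return self.accelerator_b * 1e9 * bytes_per_param / 2**30


class PerLayerEmbeddingStore:
    """Toy stand-in for embedding tables kept off the accelerator."""

    def __init__(self, num_layers: int, vocab_size: int, dim: int):
        # tables[layer][token_id] -> vector; values are placeholders, not trained weights.
        self.tables = [
            [[float(layer * 1000 + tok)] * dim for tok in range(vocab_size)]
            for layer in range(num_layers)
        ]
        self.rows_fetched = 0

    def lookup(self, layer: int, token_ids: list) -> list:
        """Fetch only the rows for the tokens in the current input."""
        self.rows_fetched += len(token_ids)
        return [self.tables[layer][t] for t in token_ids]


budget = ParamBudget(total_b=5.0, accelerator_b=2.0)
print(f"offloaded: {budget.offloaded_b}B parameters")
print(f"resident at 2 bytes/param: {budget.accelerator_gib(2.0):.2f} GiB")

store = PerLayerEmbeddingStore(num_layers=2, vocab_size=10, dim=4)
rows = store.lookup(layer=1, token_ids=[3, 7])
print(f"rows fetched for 2 tokens: {store.rows_fetched}")
```

The point of the sketch is that a lookup touches one row per input token per layer, so the full tables never need to be resident on the accelerator.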

On-device AI: Gemini Nano and the future of mobile intelligence

Google's commitment to on-device AI is exemplified by Gemini Nano, which comes pre-installed on high-end Pixel and Samsung phones as part of the operating system. Gemini Nano is built on top of Gemma architectures, specifically models like Gemma 3n, which were designed with phone use cases in mind. These models receive additional training and adaptation for mobile hardware. The 'parameter offloading' or 'download on demand' concept, similar to what is seen in Gemma 4, is also employed in these smaller models. Users can therefore run advanced AI features directly on their phones without an internet connection, with benefits such as enhanced privacy and immediate responsiveness. Building these capabilities into the core mobile experience points to a future where personal devices handle complex AI tasks largely on their own.

Bridging the gap: Gemma 4 vs. Gemini

The release of Gemma 4 marks a significant milestone, with its capabilities reportedly matching state-of-the-art closed-source models from about 1.5 years ago. This suggests a rapid advancement in open-source AI, closing the performance gap with proprietary systems. However, a key distinction remains in knowledge acquisition and general world understanding. Larger models like Google's flagship Gemini series still hold a considerable advantage in terms of accessing and processing vast amounts of information, facts, and real-world data. While Gemma 4 excels in areas like agentic capabilities, function calling, and handling conversational tasks on local hardware, Gemini is superior for tasks requiring deep factual knowledge. Google envisions a future, potentially within 1-2 years, where models as powerful as Gemini 3 Pro could run directly on smartphones, blurring the lines further between on-device and cloud-based AI capabilities.

Multimodal capabilities in Gemma 4 and on-device models

Gemma 4 inherits many of the multimodal advancements from Gemini 3, allowing smaller models to process audio, images, and short videos (up to 30-60 seconds). For audio, capabilities include speech recognition, speech-to-text translation, and basic speech understanding, so users can query audio files. On the vision side, improvements include object detection and captioning; image segmentation is a sought-after feature that is not yet supported. A current limitation is that video and audio cannot be processed together in the same prompt, although each modality can be handled separately. Future enhancements through additional fine-tuning are expected to address these limitations and expand the on-device multimodal experience.

The multilingual strength of Gemma's tokenizer

Gemma models support 140 languages, largely thanks to their multilingual tokenizer, which is based on the Gemini tokenizer. Within its generation, Gemma 3 outperformed other models in specific languages, such as Vietnamese, once fine-tuned, even where competitors' base models were generally considered stronger. This suggests that the tokenizer gives fine-tuning on language-specific datasets a solid foundation.

Diffusion models for text generation: Speed and potential

Google is also exploring diffusion models for text generation, a departure from the dominant auto-regressive approach. The primary advantage identified is speed, particularly for tasks like code generation. Diffusion models offer potential advantages in 'fill-in-the-middle' scenarios, where specific code structures or completions are desired. While this technology is still in its early stages, with model quality sometimes lagging behind auto-regressive models, it opens up new avenues for AI architecture. The concept of a 'System 1' (fast diffusion model) and 'System 2' (slower, more deliberate auto-regressive model) in AI agents is being considered, suggesting a potential future where diffusion models handle rapid, specific tasks while auto-regressive models manage more complex reasoning.
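The speed argument and the fill-in-the-middle case can be made concrete with a toy sketch. It is not a diffusion model: `predict` is a hypothetical stand-in that looks up the answer from a fixed target, and positions are filled left to right for simplicity, whereas a real model would typically commit its most confident predictions first. What the sketch does show is that each step fills several masked positions from the same snapshot, conditioned on both the prefix and the suffix, so the step count is roughly the number of tokens divided by the tokens filled per step.

```python
import math

MASK = "<mask>"


def autoregressive_steps(num_tokens: int) -> int:
    """One sequential model call per generated token."""
    return num_tokens


def parallel_unmask_steps(num_tokens: int, tokens_per_step: int) -> int:
    """Sequential model calls when several tokens are filled per step."""
    return math.ceil(num_tokens / tokens_per_step)


def fill_in_the_middle(prefix, suffix, middle_len, predict, tokens_per_step):
    """Iteratively replace masked middle tokens, several per step."""
    seq = list(prefix) + [MASK] * middle_len + list(suffix)
    steps = 0
    while MASK in seq:
        snapshot = list(seq)  # every prediction in this step sees the same state
        masked = [i for i, tok in enumerate(snapshot) if tok == MASK]
        for i in masked[:tokens_per_step]:
            seq[i] = predict(snapshot, i)
        steps += 1
    return seq, steps


# Hypothetical target and predictor, used only to make the example runnable.
target = ["def", "add", "(", "a", ",", "b", ")", ":", "return", "a", "+", "b"]


def predict(snapshot, position):
    return target[position]


completed, steps_used = fill_in_the_middle(
    prefix=target[:2], suffix=target[-3:], middle_len=7,
    predict=predict, tokens_per_step=4,
)
print(steps_used, autoregressive_steps(7), " ".join(completed))
```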

The evolving landscape of fine-tuning

The emphasis on fine-tuning is shifting as open models like Gemma 4 become increasingly capable out-of-the-box. Historically, the fine-tuning community, active on platforms like GitHub and Reddit, drove significant innovation. However, many partners working with Gemma 4 found that the model performed so well 'out of the box' that extensive fine-tuning became unnecessary for their use cases. While fine-tuning remains valuable for highly specialized domains like healthcare or finance, or for adding very specific behaviors, general conversational adjustments can often be achieved through prompting. This trend suggests a move towards leveraging pre-trained models' strong general capabilities rather than extensive customization, making AI more accessible to a broader range of users and developers.

Challenges and future of on-device AI development

Developing AI experiences for on-device applications presents unique challenges. The idea of using multiple LoRAs (Low-Rank Adaptations) for different tasks on a single device, while conceptually flexible, could lead to substantial overhead. If each app has its own LoRA, updating the base model would require updating numerous LoRAs, which is impractical given mobile operating system update cycles. This presents an industry-wide challenge to streamline how on-device ML experiences are built and managed, balancing efficiency with user experience and maintainability.
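The maintenance problem can be sketched with a small bookkeeping example. A LoRA adapter adds two low-rank matrices (A of shape rank x d_in, B of shape d_out x rank) on top of a frozen weight matrix, and it is trained against one specific base checkpoint. The app names, version strings, and layer dimensions below are hypothetical, chosen only to illustrate the arithmetic.

```python
from dataclasses import dataclass


def lora_params(d_in: int, d_out: int, rank: int) -> int:
    """Extra parameters for one adapted weight matrix: A (rank x d_in) plus B (d_out x rank)."""
    return rank * (d_in + d_out)


@dataclass(frozen=True)
class Adapter:
    app: str                 # hypothetical app that ships this adapter
    base_model_version: str  # base checkpoint the adapter was trained against
    num_params: int


def stale_adapters(adapters, current_base_version):
    """Adapters trained against a different base model than the one now installed."""
    return [a for a in adapters if a.base_model_version != current_base_version]


# Assumed setup: each app adapts 4 matrices of size 2048 x 2048 in each of 30 layers at rank 8.
per_adapter = 30 * 4 * lora_params(d_in=2048, d_out=2048, rank=8)

installed = [
    Adapter("notes-app", "base-v1", per_adapter),
    Adapter("camera-app", "base-v1", per_adapter),
    Adapter("mail-app", "base-v2", per_adapter),
]

# After an OS update ships base-v2, every adapter still tied to base-v1 needs retraining.
needs_update = stale_adapters(installed, current_base_version="base-v2")
print(per_adapter, [a.app for a in needs_update])
```

Each adapter is small relative to the base model, but the number that must be retrained and redistributed grows with the number of apps every time the base model changes.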

Common Questions

What is Gemma 4?

Gemma 4 is Google's latest open model, emphasizing intelligence per parameter and multimodal capabilities. A key innovation is its use of 'effective parameters,' which allows fast inference by offloading some parameters to CPU or disk and suits the model to on-device applications.

Topics

AI & Machine Learning, Technology & Innovation, Model Architecture, Open-source Models, Multimodal AI, AI Research, Developer Experience, On-device AI, LLM Deployment

Mentioned in this video

Software & Apps

Gemini

Google's flagship AI model, contrasted with Gemma, offering superior knowledge and factuality. It is envisioned that future versions will be powerful enough to run directly on phones.

Llama.cpp

An open-source partner that collaborates with Google to enable running Gemma models, particularly for offline use cases and integration with tools like Android Studio.

Gemma 4

Google's most capable open model to date, designed to pack more intelligence into each parameter and to include multimodal capabilities. It relies on 'effective parameters', allowing fast inference by offloading some parameters to CPU or disk.

Google Cloud

An internal Google team collaborating on the Gemma model launch, suggesting integration with cloud services.

Gemma 3

An earlier generation of Google's open models. Its architecture and tokenizer were foundational for Gemma 4 and provided strong performance in multilingual tasks.

GemmaScope

A tool released in December that allows analysis of activations across different layers of models based on tokens. It involved storing petabytes of data for Gemma 3 models.

Gemini Nano

An on-device AI model integrated into Pixel phones and high-end Samsung devices, built on top of Gemma architectures, enabling local AI capabilities.

Gemini 3

The research basis for Gemma 4's multimodal capabilities, contributing improvements in understanding audio, images, and short videos.

LaMDA

Mentioned in the context of mobile integrations, indicating its role in previous on-device experiences.

MLX

An open-source partner working with Google to support Gemma model deployment.

vLLM

An open-source partner collaborating with Google to enable the use of Gemma models, including integration with Android Studio.

Android Studio

An IDE integrated with Gemma 4, featuring an agent mode to assist developers in writing code for Android applications, supporting offline model usage through Llama.cpp or vLLM.

MedGemma

A fine-tuned version of Gemma 3, specifically trained on medical datasets, indicating specialized model development for domain-specific applications.

Axolotl

A library mentioned in the context of community-driven model fine-tuning and experimentation, highlighting open-source contributions to AI development.

Companies

NVIDIA

An open-source partner collaborating with Google on the Gemma model launch, indicating support for hardware acceleration.

Hugging Face

A key open-source partner for Google's Gemma models, facilitating their integration and use within a broader AI community. They also offer 'skills' for prompting agents.

AMD

An open-source partner collaborating with Google on the Gemma model launch, suggesting hardware support considerations.

DeepMind

A leading AI research lab within Google, responsible for models like Gemma and Gemini. The discussion touches on its global expansion, integration with Kaggle, and its DevRel team.

Organizations

Kaggle

A platform for data science competitions and ML, which has joined DeepMind. It's contributing to agent evaluation benchmarks and community hackathons, enhancing model development and feedback loops.

Concepts

Erdős problems

Complex mathematical problems, some of which are being solved by engineers using coding agents, demonstrating advanced problem-solving capabilities beyond traditional research.
