[AIEWF Preview] Gemini in 2025 and Realtime Voice AI
Key Moments
Google IO previews Gemini in 2025: real-time voice, thinking budgets, thought summaries, and generative UIs.
Key Insights
Gemini 2.5 Pro pricing gets more granular with thinking budgets and the option to disable reasoning for cost savings.
New features like thought summaries and native audio output enhance developer control and user experience.
The Live API is evolving with extended session lengths, improved tool calling, and more flexible system instruction changes for complex workflows.
Real-time voice AI development requires specialized infrastructure, balancing latency, cost, and output quality.
Future Gemini development focuses on integrating diverse capabilities into a single, powerful model, moving towards more natural and versatile AI interactions.
Generative UI powered by models like Gemini Diffusion presents a promising future for dynamic and on-the-fly interface creation.
ENHANCEMENTS TO GEMINI MODELS AND PRICING
Google IO introduced significant updates to Gemini, focusing on developer control and cost efficiency. A key highlight is the upcoming availability of thinking budgets for Gemini 2.5 Pro, allowing developers to manage computational resources more effectively. Additionally, the option to disable the reasoning component of 2.5 Pro will be offered, catering to use cases that require a raw, non-reasoning model and further optimizing costs. These features, along with thought summaries, aim to provide developers with greater flexibility and precision in utilizing Gemini's capabilities.
ADVANCEMENTS IN AUDIO AND MULTIMODAL CAPABILITIES
The event showcased substantial progress in Gemini's audio and multimodal functionalities. Native audio output, a significant personal highlight, enables seamless voice generation with impressive naturalness and language switching abilities, even supporting non-official languages like Klingon. Another notable release is the URL context tool, designed to retrieve in-depth information from web pages while respecting publisher ecosystems, thereby enabling use cases like building custom research agents. These features underscore a move towards more versatile and integrated AI experiences.
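A custom research agent of the kind mentioned above would enable the URL context tool on its requests. The sketch below builds such a payload with the tool entry spelled `url_context`, as in the Gemini API's REST examples; treat the field name as an assumption to check against current docs, and note that nothing is sent over the network here.

```python
def research_request(question: str, urls: list[str]) -> dict:
    """Build a generateContent payload that asks the model to ground
    its answer in the given pages via the URL context tool."""
    prompt = question + "\n\nSources:\n" + "\n".join(urls)
    return {
        "contents": [{"role": "user", "parts": [{"text": prompt}]}],
        # Enabling the tool lets the model fetch and read the pages
        # listed in the prompt (field name assumed from REST examples).
        "tools": [{"url_context": {}}],
    }

req = research_request(
    "Summarize the main pricing changes announced at Google IO.",
    ["https://blog.google/technology/developers/"],
)
```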
IMPROVEMENTS TO THE GEMINI LIVE API
The Live API, crucial for real-time applications, received several key upgrades. Challenges related to session length have been addressed through new developer controls, allowing for extended audio and video interactions. Improvements in tool calling and function performance enhance the API's robustness for complex tasks. Developers are also gaining more options for managing system instructions dynamically, which is vital for multi-state agents and sophisticated workflows such as customer support or gaming applications.
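The dynamic system-instruction pattern for multi-state agents can be sketched independently of any SDK as a small state machine. The states and instruction strings below are illustrative (a hypothetical customer-support flow), not part of the Live API itself; in a real session, `transition` is where the updated instruction would be pushed to the API.

```python
from dataclasses import dataclass, field

@dataclass
class LiveAgent:
    """Sketch of a multi-state voice agent that swaps its system
    instruction mid-session (states and prompts are illustrative)."""
    state: str = "greeting"
    instructions: dict[str, str] = field(default_factory=lambda: {
        "greeting": "Welcome the caller and identify their issue.",
        "troubleshoot": "Walk through diagnostic steps one at a time.",
        "escalate": "Collect contact details and hand off to a human.",
    })

    @property
    def system_instruction(self) -> str:
        return self.instructions[self.state]

    def transition(self, new_state: str) -> str:
        """Move to a new state and return the instruction that a real
        session would now send to the Live API."""
        if new_state not in self.instructions:
            raise ValueError(f"unknown state: {new_state}")
        self.state = new_state
        return self.system_instruction

agent = LiveAgent()
agent.transition("troubleshoot")
```

Keeping the instruction set explicit like this is what makes long customer-support or gaming sessions tractable: each state change is a single, auditable instruction swap rather than an ever-growing prompt.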
THE EVOLVING LANDSCAPE OF REAL-TIME VOICE AI
Developing real-time voice agents presents unique challenges, demanding specialized infrastructure that balances latency, cost, and output quality. While traditional speech-to-text followed by LLM processing remains viable, the trend is shifting towards end-to-end audio-to-audio architectures for greater efficiency. This evolution necessitates robust voice activity detection, context management, and sophisticated networking protocols like WebRTC to handle the demands of natural, conversational AI interactions within stringent latency requirements.
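The latency trade-off between the two architectures can be made concrete with a back-of-the-envelope budget. Every number below is an assumed ballpark figure for illustration, not a measurement, and the sub-second target is a common rule of thumb for conversational turn-taking rather than a published requirement.

```python
# Illustrative voice-to-voice latency budgets in milliseconds
# (all component figures are assumptions, not measurements).
TARGET_MS = 800  # rough target for a natural-feeling reply

cascaded = {          # speech-to-text -> LLM -> text-to-speech
    "vad_endpointing": 200,
    "speech_to_text": 150,
    "llm_first_token": 300,
    "text_to_speech": 120,
    "network_webrtc": 60,
}
end_to_end = {        # native audio-to-audio model
    "vad_endpointing": 200,
    "model_first_audio": 350,
    "network_webrtc": 60,
}

def total(pipeline: dict[str, int]) -> int:
    """Sum the serial stages of one conversational turn."""
    return sum(pipeline.values())

print(total(cascaded), total(end_to_end))  # compare each against TARGET_MS
```

Under these assumed numbers the cascaded pipeline overshoots the target while the end-to-end path clears it, which is the efficiency argument driving the shift described above.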
THE VISION FOR A UNIFIED GEMINI MODEL
Google's long-term vision for Gemini centers on creating a single, unified model that integrates a wide array of capabilities. This approach contrasts with splintering functionalities into separate models, aiming instead to merge different strengths, such as reasoning and multimodal understanding, to achieve emergent performance gains. The integration of these capabilities is seen as a key driver for innovation, leading to unexpected yet powerful outcomes. The focus remains on bringing diverse functionalities back into the mainline Gemini model.
EMERGENT USE CASES AND FUTURE POTENTIAL
The potential applications for advanced AI models like Gemini are rapidly expanding. Gemini Diffusion, for instance, hints at the future of generative user interfaces, where UIs can be created dynamically based on user interactions. Proactive audio features, which allow AI to ignore irrelevant background noise, and speaker diarization, enabling the recognition of different voices, are pushing the boundaries of natural human-computer interaction. Asynchronous function calling also promises to streamline complex AI workflows.
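The asynchronous function-calling idea can be sketched with standard `asyncio`: the agent kicks off a slow tool call, keeps the conversation moving, and weaves the result in when it lands. The tool name and conversational lines are illustrative stand-ins, not any real API.

```python
import asyncio

async def slow_lookup(query: str) -> str:
    """Stand-in for a slow external tool (e.g. an order-status check)."""
    await asyncio.sleep(0.05)
    return f"result for {query!r}"

async def converse() -> list[str]:
    """Start the tool call without blocking, keep talking, then
    append the result once it arrives."""
    task = asyncio.create_task(slow_lookup("order status"))
    transcript = ["Let me check that for you while we keep talking."]
    transcript.append("Anything else I can help with meanwhile?")
    transcript.append(await task)  # tool result arrives later
    return transcript

transcript = asyncio.run(converse())
print(transcript[-1])
```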
COMMUNITY FEEDBACK AND DEVELOPER PARTNERSHIPS
Developer feedback has been instrumental in shaping the Gemini Live API and its related tools. Partnerships with organizations like Daily have provided valuable insights, particularly regarding the integration of voice and audio components. The development of open-source frameworks like Pipecat, which support Gemini models, further empowers the community to build sophisticated voice orchestration systems. This collaborative ecosystem is crucial for scaling AI development and fostering innovation.
LOOKING AHEAD: WISH LIST FOR FUTURE INNOVATIONS
Speculation about future AI advancements often fuels excitement, with developers expressing desires for even more powerful and accessible tools. The wish list for upcoming Google IO events includes further integration of capabilities into the core Gemini model, enhanced multilingual support to cater to a global user base, and potentially the arrival of next-generation Gemini versions. The overarching goal remains to make advanced AI capabilities more seamless, versatile, and universally available to developers worldwide.
Common Questions
What were the key announcements for Gemini at Google IO?
Key announcements for Gemini at Google IO included thinking budgets for 2.5 Pro, thought summaries, native audio output with multilingual capabilities, and the URL context tool. Implicit context caching was also highlighted as a significant cost-saving feature for developers.
Topics
Mentioned in this video
Google: The company hosting Google IO and developing Gemini.
Gemini: Google's AI model family, with discussions on its multimodal and reasoning capabilities.
An anticipated future version of the Gemini model.
Gemini 2.5 Pro: An updated version of Google's Gemini model, now with thinking budgets and the upcoming option to disable reasoning.
An application where thought summaries are now live.
A Google product that uses text-to-speech (TTS) models, powering early versions of audio output.
Shopify: A company that showcased a demo at Next involving setting up DNS using Cloudflare.
Gemini Diffusion: A diffusion language model from Google, highlighted as an 'underrated' pick with potential for generative UI.
Google DeepMind: A research lab that is part of Google, contributing to Gemini development.
A tool or platform developed by Google that people are using.
Cloudflare: A company mentioned in the context of a Shopify demo for setting up DNS.