Key Moments

Stanford CS153 Frontier Systems | Mati Staniszewski from ElevenLabs on The Future of Voice Systems

Stanford Online
Education · 7 min read · 67 min video
May 4, 2026 · 793 views
TL;DR

ElevenLabs' voice technology achieved near-perfect AI dubbing in 2024, but its future lies in real-time, emotionally intelligent voice agents; fully natural interaction remains the current limitation.

Key Insights

1. ElevenLabs started with a focus on fixing AI dubbing, inspired by the poor quality of dubbed movies in Poland, where one male voice narrates all characters.
2. The company's initial approach was a cascaded system of transcription, LLM-based translation, and text-to-speech, but translation quality remained a bottleneck until 2024.
3. In 2022, ElevenLabs pivoted to improving text-to-speech quality, specifically voice replication and natural delivery, inspired by open-source models like Tortoise.
4. By 2024, ElevenLabs achieved AI localization, enabling high-quality dubbing in original voices, demonstrated by dubbing Javier Milei's UN speech into English.
5. The future of voice agents, according to ElevenLabs, lies in fused models for speed or cascaded models for reliability, with a current emphasis on emotional expressivity and controllable interactions.
6. ElevenLabs reached over $330 million in revenue in 2025, growing to over $430 million ARR with a team of over 450 people, and prices its services based on the value delivered.

From Polish Dubbing to Global Voice Solutions

ElevenLabs' journey began with a deeply personal problem: the dismal quality of movie dubbing in Poland, where a single, monotone male voice narrates all characters, stripping films of emotional nuance. This frustration, shared by co-founders Mati Staniszewski and Piotr Dabkowski, who came from Google and Palantir, fueled their mission to revolutionize audio and voice technology: advance foundational research in audio and voice while building applied AI products. Initially, they considered tackling AI dubbing for all languages, a complex system requiring transcription, translation, and text-to-speech. Early market research, however, revealed a more immediate need: simpler voiceover corrections and the ability to replace voice segments within existing recordings, even in the original language. This led to a strategic pivot toward improving the text-to-speech (TTS) component as a more accessible entry point. The company also embraced a product-led growth (PLG) motion, engaging closely with creators and developers, initially through platforms like Discord, to iterate rapidly on real-world use cases and feedback. This community-centric approach remains a cornerstone of their development, fostering contributions and incorporating user data to refine models and identify new applications.

Innovating the Last Mile: Text-to-Speech Advancement

In 2022, as the AI landscape was largely dominated by discussions of crypto and the metaverse, ElevenLabs identified text-to-speech as the most critical component to perfect. The state of the art in TTS at the time could neither replicate voice characteristics accurately nor maintain the natural flow and emotional intonation of human speech across longer passages. Drawing inspiration from emerging transformer architectures and diffusion models, and crucially from open-source advancements like James Betker's Tortoise TTS, ElevenLabs focused on improving generation and voice replication capabilities. Their key innovations centered on creating more flexible voice models that could better capture subtle nuances and context, moving away from hard-coded parameters like age and accent. This strategic prioritization of TTS paid off, enabling them to build a robust foundation for future expansion into more complex audio applications, all while bootstrapping the company on modest initial compute resources.

A Phased Approach to AI Audio Capabilities

ElevenLabs strategically rolled out its capabilities in phases, reflecting the complexity of audio processing. In 2022, the primary breakthrough was in high-quality, natural-sounding English text-to-speech. This was followed in 2023 by advancements in cross-language narration, voice recreation for users, a voice marketplace, and specialized tools for authors creating audiobooks. The significant leap forward occurred in 2024 with the integration of improved transcription, LLM-based translation, and advanced speech generation models, culminating in what they term 'AI localization.' This enabled the high-fidelity dubbing of speeches and conversations into different languages while preserving the original speaker's distinct voice and emotional delivery. Examples include Javier Milei's UN speech and conversations with world leaders like Volodymyr Zelenskyy and Narendra Modi.
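The cascaded localization pipeline described above (transcription, LLM translation, voice-preserving speech generation) can be sketched as three independently swappable stages. This is an illustrative sketch only; the stage functions below are hypothetical placeholders, not ElevenLabs APIs.

```python
# Hedged sketch of a cascaded "AI localization" pipeline:
# transcription -> LLM translation -> voice-preserving speech generation.
# Every stage here is a stand-in for a real model call.

def transcribe(audio: bytes) -> str:
    # Placeholder: a real system would run a speech-to-text model here.
    return "Hola a todos"

def translate(text: str, target_lang: str) -> str:
    # Placeholder: a real system would call an LLM for context-aware translation.
    return {"en": "Hello everyone"}.get(target_lang, text)

def synthesize(text: str, voice_profile: str) -> str:
    # Placeholder: a real system would render speech in the original speaker's
    # cloned voice; we return a descriptive string instead of audio.
    return f"[{voice_profile} voice] {text}"

def localize(audio: bytes, voice_profile: str, target_lang: str) -> str:
    """Chain the three stages; each can be improved or swapped independently,
    which is why better LLM translation in 2024 unlocked the whole pipeline."""
    transcript = transcribe(audio)
    translated = translate(transcript, target_lang)
    return synthesize(translated, voice_profile)

print(localize(b"...", "speaker-a", "en"))
```

The point of the sketch is the modularity: because each stage is separate, a quality jump in any one model (as happened with translation) lifts the end-to-end result without retraining the others.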

The Future of Voice Agents: Cascaded vs. Fused

Looking ahead, the frontier of voice AI for ElevenLabs lies in creating sophisticated voice agents capable of real-time, emotionally intelligent interaction. They are exploring two primary architectural approaches: cascaded systems and fused models. Cascaded systems, which maintain separate models for speech-to-text, LLMs, and text-to-speech, offer higher reliability and modularity, making them suitable for enterprise applications requiring precision, such as customer support or financial transactions. Fused models, conversely, aim to combine these functions into a single, end-to-end model, promising lower latency and faster responses, potentially ideal for companion or informal interactions. While fused models show promise for speed, ElevenLabs currently favors the cascaded approach for business applications due to its robustness and controllability. A key area of development is enhancing emotional expressivity and controllability, enabling agents to respond with nuanced emotions—excitement, reassurance, or empathy—based on user input. This requires significant investment in data labeling and model training to accurately interpret and generate emotional speech.
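One turn of the cascaded agent architecture described above can be sketched as separate STT, LLM, and TTS stages, with the LLM stage also choosing an emotion label that conditions delivery. All function names and the emotion-tag convention here are assumptions for illustration, not ElevenLabs' actual interfaces.

```python
# Illustrative sketch (not ElevenLabs code) of one cascaded voice-agent turn:
# speech-to-text, an LLM that picks both wording and an emotion label, and
# text-to-speech conditioned on that label.
from dataclasses import dataclass

@dataclass
class AgentReply:
    text: str
    emotion: str  # e.g. "reassuring", "excited", "empathetic"

def speech_to_text(audio: bytes) -> str:
    # Placeholder STT stage.
    return "I was double charged for my order"

def plan_reply(user_text: str) -> AgentReply:
    # Placeholder LLM stage: choose both the words and the delivery.
    if "charged" in user_text:
        return AgentReply("I'm sorry about that, let me fix it right away.",
                          "reassuring")
    return AgentReply("Could you tell me a bit more?", "neutral")

def text_to_speech(reply: AgentReply) -> str:
    # Placeholder TTS stage: a real model would render audio whose prosody
    # matches the emotion tag; we return a descriptive string instead.
    return f"<{reply.emotion}> {reply.text}"

def agent_turn(audio: bytes) -> str:
    # Three separate models per turn: this is what makes the cascaded design
    # modular and auditable, at the cost of latency a fused model would avoid.
    return text_to_speech(plan_reply(speech_to_text(audio)))

print(agent_turn(b"..."))
```

A fused model would collapse these three calls into one speech-in, speech-out model, trading the per-stage control shown here for lower latency.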

Business Growth and Collaborative Leadership

ElevenLabs has experienced explosive growth, reaching over $430 million in ARR by 2025, a testament to their product-market fit and strategic execution. The company scaled to over 450 employees, with key hubs in London, New York, Warsaw, and San Francisco, maintaining a culture of small, empowered teams focused on rapid iteration and customer problem-solving. Their business model balances enterprise solutions with a significant PLG component, ensuring broad accessibility. Pricing is firmly anchored in the value delivered to customers, aiming to capture a fraction of that value rather than being cost-driven. This philosophy, coupled with close collaboration with major businesses, allows for predictable revenue forecasting. Notably, ElevenLabs also champions a collaborative approach within the AI ecosystem, even with competitors, exemplified by their support for initiatives like Sesame and by open-sourcing certain technologies. Staniszewski highlighted this ethos, stating that partnerships and shared progress are crucial for advancing the frontier, a perspective that contrasts with some more insular industry dynamics.

Addressing Security, Ethics, and Global Perspectives

Security and ethical considerations are paramount for ElevenLabs. They build safety features directly into their models, including content traceability, moderation against fraud, and watermarking systems to identify AI-generated audio. They advocate for robust security measures beyond voice authentication, given the ease of voice replication, and have even developed creative 'counter-offensive' uses for voice agents against scammers. Regarding global deployments, ElevenLabs is aligned with Western allies, adhering to legal guidance and actively combating 'distillation attacks.' While acknowledging excellent audio models emerging from regions like China, they focus on outcompeting through superior service and proprietary technology. They also emphasize the importance of open-source contributions from Western labs to foster innovation globally, aiming to provide the same powerful tools to individual creators as to large enterprises.

On-Device Models and the Future Platform

ElevenLabs is making strides in bringing its models on-device, aiming to offer constrained, high-quality TTS capabilities for broader accessibility. However, they acknowledge a quality gap remains between on-device and cloud-based solutions, particularly concerning interactivity, emotional transfer, and advanced features. Their strategy prioritizes achieving top-tier quality before fully committing to on-device deployment. Looking five years ahead, ElevenLabs envisions itself as a go-to platform for businesses and creators, providing not just advanced audio models but comprehensive tooling for applied AI. This involves deep customization for specific business needs, integration with various communication channels (phone, chat, email), database connectivity, and robust evaluation frameworks. They see a future where 3-5 dominant platforms facilitate conversational interactions between businesses and audiences, and they aim to be a leader in this space, enabling seamless application development and deployment.

Real-World Impact and Creative Applications

Beyond commercial applications, ElevenLabs is deeply involved in profoundly impactful projects. They have helped nearly 10,000 individuals who lost their voice due to conditions like ALS or throat cancer to synthesize new voices, enabling them to communicate naturally again. In a more unconventional application, they collaborated with the Ukrainian government on initiatives like the Diia citizen app, integrating voice capabilities to make government services more accessible, especially during the conflict. This included developing systems for mass communication and citizen support through voice interfaces. Furthermore, in the creative industry, while studios are cautiously adopting AI voiceovers, ElevenLabs is focused on a 'middle-to-middle' approach, where AI tools augment, rather than completely replace, human creativity. This involves enabling finer control over AI narration, like directing emotional delivery, and exploring applications in AI localization and interactive movie experiences. They believe AI will increasingly handle tedious tasks like scratch work and post-production repairs, freeing up human artists for higher-level creative contributions, provided the economic models and IP considerations are ethically resolved.

ElevenLabs: Frontier Systems & AI Voice

Practical takeaways from this episode

Do This

Focus on being extremely problem-obsessed, understanding the customer's exact pain points.
Leverage community and creators early for product feedback and use case discovery.
Prioritize improving the core 'last mile' of technology (e.g., text-to-speech) for impactful product launches.
Innovate on voice characteristics and contextuality for more natural and expressive AI speech.
Consider collaboration and partnerships over strict competition, especially in frontier fields.
Price based on the value delivered to the customer, not the cost of operation.
Build safety features directly into AI models during development.
Explore both cascaded and fused model architectures, choosing based on reliability vs. speed needs.
Focus on high-quality, controllable, and expressive AI voice models.
When developing on-device models, prioritize quality over immediate accessibility.

Avoid This

Avoid the standard corporate approach to meetings and email communication if aiming for agility.
Don't underestimate the potential of technology initially adopted by gaming communities.
Do not focus on fixing all components of a complex pipeline at once; prioritize key areas.
Avoid relying solely on voice authentication for security in the future.
Don't approach AI development solely from a 'cost' perspective; focus on customer value.
Be wary of 'end-to-end' AI solutions that lack iterative refinement or initial creative input.
Don't assume traditional patent strategies are always beneficial for rapidly evolving tech.

Common Questions

What inspired the founders to start ElevenLabs?
The founders were inspired by the poor quality of dubbed foreign films in Poland, where a single monotone voice narrates all characters. They envisioned a future where content could be accessed in any language with natural tonality and emotion.

Mentioned in this video

Companies
ElevenLabs

A company specializing in frontier audio and speech AI, focusing on text-to-speech, transcription, and AI dubbing technologies. They offer a platform for businesses and creators to transform how they interact with audiences.

Google

A former employer of Mati Staniszewski and Piotr Dabkowski, from which they drew lessons for their startup. Also mentioned as a hyperscaler with advanced text-to-speech capabilities.

Palantir

A former employer of Mati Staniszewski and Piotr Dabkowski, from which they drew lessons for their startup.

Netflix

A company whose intellectual property approach to AI models is contrasted with the Western approach.

OpenAI

Company where James Bartlett worked on ChatGPT's advanced voice mode. Also mentioned in the context of closed-source models lagging behind open-source at one point.

ServiceNow

A platform from which ElevenLabs' tooling can pull data for personalized interactions.

Revolut

A business customer that uses different AI models depending on the use case.

Disney

A company whose intellectual property approach to AI models is contrasted with the Western approach.

Deutsche Telekom

A business customer that uses different AI models depending on the use case.

Anthropic

A company mentioned for its significant ARR growth, contrasted with ElevenLabs' growth.

Sesame

A company collaborating with ElevenLabs, developing speech models. The CEO, Brendan, will be a future speaker in the class.

Ubiquity6

A former company where Ankit (Andrew's former co-founder and CTO) worked.

Salesforce

A customer relationship management platform from which ElevenLabs' tooling can pull data for personalized interactions.
