Key Moments

Stanford CS153 Frontier Systems | Mati Staniszewski from ElevenLabs on The Future of Voice Systems

Stanford Online
Education · 7 min read · 67 min video
May 4, 2026 · 793 views
TL;DR

ElevenLabs' voice technology achieved near-perfect AI dubbing in 2024, but its future lies in real-time, emotionally intelligent voice agents; fully natural interaction remains the current limitation.

Key Insights

1. ElevenLabs started with a focus on fixing AI dubbing, inspired by the poor quality of dubbed movies in Poland, where one male voice narrates all characters.
2. The company's initial approach was a cascaded system of transcription, LLM-based translation, and text-to-speech, but translation quality remained a bottleneck until 2024.
3. In 2022, ElevenLabs pivoted to improving text-to-speech quality, specifically voice replication and natural delivery, inspired by open-source models like Tortoise.
4. By 2024, ElevenLabs achieved AI localization, enabling high-quality dubbing in original voices, demonstrated by dubbing Javier Milei's UN speech into English.
5. The future of voice agents, according to ElevenLabs, lies in fused models for speed or cascaded models for reliability, with a current emphasis on emotional expressivity and controllable interactions.
6. ElevenLabs reached over $330 million in revenue in 2025, growing to over $430 million ARR with a team of over 450 people, and prices its services based on the value delivered.

From Polish Dubbing to Global Voice Solutions

ElevenLabs' journey began with a deeply personal problem: the dismal quality of movie dubbing in Poland, where a single, monotone male voice narrates all characters, stripping films of emotional nuance. This frustration, shared by co-founders Mati Staniszewski and Piotr Dabkowski, who came from Google and Palantir, fueled their mission to revolutionize audio and voice technology: advance foundational research in audio and voice while building applied AI products. Initially, they considered tackling AI dubbing for all languages, a complex system requiring transcription, translation, and text-to-speech. Early market research, however, revealed a more immediate need: simpler voiceover corrections and the ability to replace voice segments within existing recordings, even in the original language. This led to a strategic pivot toward improving the text-to-speech (TTS) component as a more accessible entry point. The company also embraced a product-led growth (PLG) motion, engaging closely with creators and developers, initially through platforms like Discord, to iterate rapidly on real-world use cases and feedback. This community-centric approach remains a cornerstone of their development, fostering contributions and incorporating user data to refine models and identify new applications.

Innovating the Last Mile: Text-to-Speech Advancement

In 2022, as the AI landscape was largely dominated by discussions of crypto and the metaverse, ElevenLabs identified text-to-speech as the most critical component to perfect. The state of the art in TTS at the time could neither replicate voice characteristics accurately nor maintain the natural flow and emotional intonation of human speech across longer passages. Drawing inspiration from emerging transformer architectures and diffusion models, and crucially from open-source advancements like James Betker's Tortoise TTS, ElevenLabs focused on improving generation and voice replication capabilities. Their key innovations centered on creating more flexible voice models that could better capture subtle nuances and context, moving away from hard-coded parameters like age and accent. This strategic prioritization of TTS paid off, enabling them to build a robust foundation for future expansion into more complex audio applications, all while bootstrapping the company on modest initial compute resources.

A Phased Approach to AI Audio Capabilities

ElevenLabs strategically rolled out its capabilities in phases, reflecting the complexity of audio processing. In 2022, the primary breakthrough was in high-quality, natural-sounding English text-to-speech. This was followed in 2023 by advancements in cross-language narration, voice recreation for users, a voice marketplace, and specialized tools for authors creating audiobooks. The significant leap forward occurred in 2024 with the integration of improved transcription, LLM-based translation, and advanced speech generation models, culminating in what they term 'AI localization.' This enabled the high-fidelity dubbing of speeches and conversations into different languages while preserving the original speaker's distinct voice and emotional delivery. Examples include Javier Milei's UN speech and conversations with world leaders like Volodymyr Zelenskyy and Narendra Modi.
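The cascaded localization pipeline described above (transcription, LLM translation, voice-preserving speech generation) can be sketched as three independently swappable stages. This is an illustrative sketch only; the stage functions below are hypothetical placeholders, not ElevenLabs APIs.

```python
# Hedged sketch of a cascaded "AI localization" pipeline:
# transcription -> LLM translation -> voice-preserving speech generation.
# Every stage here is a stand-in for a real model call.

def transcribe(audio: bytes) -> str:
    # Placeholder: a real system would run a speech-to-text model here.
    return "Hola a todos"

def translate(text: str, target_lang: str) -> str:
    # Placeholder: a real system would call an LLM for context-aware translation.
    return {"en": "Hello everyone"}.get(target_lang, text)

def synthesize(text: str, voice_profile: str) -> str:
    # Placeholder: a real system would render speech in the original speaker's
    # cloned voice; we return a descriptive string instead of audio.
    return f"[{voice_profile} voice] {text}"

def localize(audio: bytes, voice_profile: str, target_lang: str) -> str:
    """Chain the three stages; each can be improved or swapped independently,
    which is why better LLM translation in 2024 unlocked the whole pipeline."""
    transcript = transcribe(audio)
    translated = translate(transcript, target_lang)
    return synthesize(translated, voice_profile)

print(localize(b"...", "speaker-a", "en"))
```

The point of the sketch is the modularity: because each stage is separate, a quality jump in any one model (as happened with translation) lifts the end-to-end result without retraining the others.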

The Future of Voice Agents: Cascaded vs. Fused

Looking ahead, the frontier of voice AI for ElevenLabs lies in creating sophisticated voice agents capable of real-time, emotionally intelligent interaction. They are exploring two primary architectural approaches: cascaded systems and fused models. Cascaded systems, which maintain separate models for speech-to-text, LLMs, and text-to-speech, offer higher reliability and modularity, making them suitable for enterprise applications requiring precision, such as customer support or financial transactions. Fused models, conversely, aim to combine these functions into a single, end-to-end model, promising lower latency and faster responses, potentially ideal for companion or informal interactions. While fused models show promise for speed, ElevenLabs currently favors the cascaded approach for business applications due to its robustness and controllability. A key area of development is enhancing emotional expressivity and controllability, enabling agents to respond with nuanced emotions—excitement, reassurance, or empathy—based on user input. This requires significant investment in data labeling and model training to accurately interpret and generate emotional speech.
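One turn of the cascaded agent architecture described above can be sketched as separate STT, LLM, and TTS stages, with the LLM stage also choosing an emotion label that conditions delivery. All function names and the emotion-tag convention here are assumptions for illustration, not ElevenLabs' actual interfaces.

```python
# Illustrative sketch (not ElevenLabs code) of one cascaded voice-agent turn:
# speech-to-text, an LLM that picks both wording and an emotion label, and
# text-to-speech conditioned on that label.
from dataclasses import dataclass

@dataclass
class AgentReply:
    text: str
    emotion: str  # e.g. "reassuring", "excited", "empathetic"

def speech_to_text(audio: bytes) -> str:
    # Placeholder STT stage.
    return "I was double charged for my order"

def plan_reply(user_text: str) -> AgentReply:
    # Placeholder LLM stage: choose both the words and the delivery.
    if "charged" in user_text:
        return AgentReply("I'm sorry about that, let me fix it right away.",
                          "reassuring")
    return AgentReply("Could you tell me a bit more?", "neutral")

def text_to_speech(reply: AgentReply) -> str:
    # Placeholder TTS stage: a real model would render audio whose prosody
    # matches the emotion tag; we return a descriptive string instead.
    return f"<{reply.emotion}> {reply.text}"

def agent_turn(audio: bytes) -> str:
    # Three separate models per turn: this is what makes the cascaded design
    # modular and auditable, at the cost of latency a fused model would avoid.
    return text_to_speech(plan_reply(speech_to_text(audio)))

print(agent_turn(b"..."))
```

A fused model would collapse these three calls into one speech-in, speech-out model, trading the per-stage control shown here for lower latency.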

Business Growth and Collaborative Leadership

ElevenLabs has experienced explosive growth, reaching over $430 million in ARR by 2025, a testament to their product-market fit and strategic execution. The company scaled to over 450 employees, with key hubs in London, New York, Warsaw, and San Francisco, maintaining a culture of small, empowered teams focused on rapid iteration and customer problem-solving. Their business model balances enterprise solutions with a significant PLG component, ensuring broad accessibility. Pricing is firmly anchored in the value delivered to customers, aiming to capture a fraction of that value rather than being cost-driven. This philosophy, coupled with close collaboration with major businesses, allows for predictable revenue forecasting. Notably, ElevenLabs also champions a collaborative approach within the AI ecosystem, even with competitors, exemplified by their support for initiatives like Sesame and by open-sourcing certain technologies. Staniszewski highlighted this ethos, stating that partnerships and shared progress are crucial for advancing the frontier, a perspective that contrasts with some more insular industry dynamics.

Addressing Security, Ethics, and Global Perspectives

Security and ethical considerations are paramount for ElevenLabs. They build safety features directly into their models, including content traceability, moderation against fraud, and watermarking systems to identify AI-generated audio. They advocate for robust security measures beyond voice authentication, given the ease of voice replication, and have even developed creative 'counter-offensive' uses for voice agents against scammers. Regarding global deployments, ElevenLabs is aligned with Western allies, adhering to legal guidance and actively combating 'distillation attacks.' While acknowledging excellent audio models emerging from regions like China, they focus on outcompeting through superior service and proprietary technology. They also emphasize the importance of open-source contributions from Western labs to foster innovation globally, aiming to provide the same powerful tools to individual creators as to large enterprises.

On-Device Models and the Future Platform

ElevenLabs is making strides in bringing its models on-device, aiming to offer constrained, high-quality TTS capabilities for broader accessibility. However, they acknowledge a quality gap remains between on-device and cloud-based solutions, particularly concerning interactivity, emotional transfer, and advanced features. Their strategy prioritizes achieving top-tier quality before fully committing to on-device deployment. Looking five years ahead, ElevenLabs envisions itself as a go-to platform for businesses and creators, providing not just advanced audio models but comprehensive tooling for applied AI. This involves deep customization for specific business needs, integration with various communication channels (phone, chat, email), database connectivity, and robust evaluation frameworks. They see a future where 3-5 dominant platforms facilitate conversational interactions between businesses and audiences, and they aim to be a leader in this space, enabling seamless application development and deployment.

Real-World Impact and Creative Applications

Beyond commercial applications, ElevenLabs is deeply involved in profoundly impactful projects. They have helped nearly 10,000 individuals who lost their voice due to conditions like ALS or throat cancer to synthesize new voices, enabling them to communicate naturally again. In a more unconventional application, they collaborated with the Ukrainian government on initiatives like the Diia citizen app, integrating voice capabilities to make government services more accessible, especially during the conflict. This included developing systems for mass communication and citizen support through voice interfaces. Furthermore, in the creative industry, while studios are cautiously adopting AI voiceovers, ElevenLabs is focused on a 'middle-to-middle' approach, where AI tools augment, rather than completely replace, human creativity. This involves enabling finer control over AI narration, like directing emotional delivery, and exploring applications in AI localization and interactive movie experiences. They believe AI will increasingly handle tedious tasks like scratch work and post-production repairs, freeing up human artists for higher-level creative contributions, provided the economic models and IP considerations are ethically resolved.

ElevenLabs: Frontier Systems & AI Voice

Practical takeaways from this episode

Do This

Focus on being extremely problem-obsessed, understanding the customer's exact pain points.
Leverage community and creators early for product feedback and use case discovery.
Prioritize improving the core 'last mile' of technology (e.g., text-to-speech) for impactful product launches.
Innovate on voice characteristics and contextuality for more natural and expressive AI speech.
Consider collaboration and partnerships over strict competition, especially in frontier fields.
Price based on the value delivered to the customer, not the cost of operation.
Build safety features directly into AI models during development.
Explore both cascaded and fused model architectures, choosing based on reliability vs. speed needs.
Focus on high-quality, controllable, and expressive AI voice models.
When developing on-device models, prioritize quality over immediate accessibility.

Avoid This

Avoid the standard corporate approach to meetings and email communication if aiming for agility.
Don't underestimate the potential of technology initially adopted by gaming communities.
Do not focus on fixing all components of a complex pipeline at once; prioritize key areas.
Avoid relying solely on voice authentication for security in the future.
Don't approach AI development solely from a 'cost' perspective; focus on customer value.
Be wary of 'end-to-end' AI solutions that lack iterative refinement or initial creative input.
Don't assume traditional patent strategies are always beneficial for rapidly evolving tech.

Common Questions

What inspired the founders to start ElevenLabs?
The founders were inspired by the poor quality of dubbed foreign films in Poland, where a single monotone voice narrates all characters. They envisioned a future where content could be accessed in any language with natural tonality and emotion.

Mentioned in this video

Companies
ElevenLabs

A company specializing in frontier audio and speech AI, focusing on text-to-speech, transcription, and AI dubbing technologies. They offer a platform for businesses and creators to transform how they interact with audiences.

Google

A former employer of Mati Staniszewski and Piotr Dabkowski, from which they drew lessons for their startup. Also mentioned as a hyperscaler with advanced text-to-speech capabilities.

Palantir

A former employer of Mati Staniszewski and Piotr Dabkowski, from which they drew lessons for their startup.

Netflix

A company whose intellectual property approach to AI models is contrasted with the Western approach.

OpenAI

Company where James Bartlett worked on ChatGPT's advanced voice mode. Also mentioned in the context of closed-source models lagging behind open-source at one point.

ServiceNow

A platform from which ElevenLabs' tooling can pull data for personalized interactions.

Revolut

A business customer that uses different AI models depending on the use case.

Disney

A company whose intellectual property approach to AI models is contrasted with the Western approach.

Deutsche Telekom

A business customer that uses different AI models depending on the use case.

Anthropic

A company mentioned for its significant ARR growth, contrasted with ElevenLabs' growth.

Sesame

A company collaborating with ElevenLabs, developing speech models. The CEO, Brendan, will be a future speaker in the class.

Ubiquity6

A former company where Ankit (Andrew's former co-founder and CTO) worked.

Salesforce

A customer relationship management platform from which ElevenLabs' tooling can pull data for personalized interactions.
