Stanford CS153 Frontier Systems | Mati Staniszewski from ElevenLabs on The Future of Voice Systems
Key Moments
ElevenLabs' voice technology achieved near-perfect AI dubbing in 2024, but its future lies in real-time, emotionally intelligent voice agents, where natural interaction remains the key limitation.
Key Insights
ElevenLabs started with a focus on fixing AI dubbing, inspired by the poor quality of dubbed movies in Poland where one male voice narrates all characters.
The company's initial approach was a cascaded system of transcription, LLM-based translation, and text-to-speech, but translation quality remained a bottleneck until 2024.
In 2022, ElevenLabs pivoted to focus on improving text-to-speech quality, specifically voice replication and natural delivery, inspired by open-source models like Tortoise.
By 2024, ElevenLabs achieved AI localization, enabling high-quality dubbing in original voices, demonstrated by dubbing Javier Milei's UN speech into English.
The future of voice agents, according to ElevenLabs, lies in fused models for speed or cascaded models for reliability, with a current emphasis on emotional expressivity and controllable interactions.
ElevenLabs surpassed $330 million in revenue in 2025, growing past $430 million in ARR with a team of over 450 people, and prioritizes value-based pricing for its services.
From Polish Dubbing to Global Voice Solutions
ElevenLabs' journey began with a deeply personal problem: the dismal quality of movie dubbing in Poland, where a single, monotone male voice narrates all characters, stripping films of emotional nuance. This frustration, shared by co-founders Mati Staniszewski and Piotr Dabkowski, who previously worked at Google and Palantir, fueled their mission to revolutionize audio and voice technology. They aimed to advance foundational research in audio and voice and build applied AI products on top of it. Initially, they considered tackling AI dubbing for all languages, a complex system requiring transcription, translation, and text-to-speech. However, early market research revealed a more immediate need: simpler voiceover corrections and the ability to replace voice segments within existing recordings, even in the original language. This led to a strategic pivot, focusing on improving the text-to-speech (TTS) component as a more accessible entry point. The company also embraced a Product-Led Growth (PLG) motion, engaging closely with creators and developers, initially through platforms like Discord, to rapidly iterate based on real-world use cases and feedback. This community-centric approach remains a cornerstone of their development, fostering contributions and incorporating user data to refine models and identify new applications.
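The dubbing system described above can be sketched as a three-stage cascade. The function names, stubbed outputs, and `voice_id` parameter below are illustrative placeholders, not ElevenLabs' actual APIs; in a real system each stage would call a dedicated model.

```python
from dataclasses import dataclass

@dataclass
class DubbingResult:
    transcript: str
    translation: str
    audio: bytes

# Placeholder stages -- stand-ins for speech-to-text, an LLM translator,
# and voice-cloning text-to-speech, respectively.
def transcribe(audio: bytes) -> str:
    return "Cześć, jak się masz?"          # stubbed Polish transcript

def translate(text: str, target_lang: str) -> str:
    return "Hi, how are you?"              # stubbed LLM translation

def synthesize(text: str, voice_id: str) -> bytes:
    return text.encode("utf-8")            # stubbed TTS in the original voice

def dub(source_audio: bytes, target_lang: str, voice_id: str) -> DubbingResult:
    """Cascaded dubbing: transcription -> translation -> text-to-speech."""
    transcript = transcribe(source_audio)
    translation = translate(transcript, target_lang)
    audio = synthesize(translation, voice_id)
    return DubbingResult(transcript, translation, audio)

result = dub(b"...", target_lang="en", voice_id="speaker-original")
print(result.translation)  # -> Hi, how are you?
```

The modularity is the point: any single stage (for instance, the LLM translator that was the bottleneck until 2024) can be swapped out without retraining the others.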
Innovating the Last Mile: Text-to-Speech Advancement
In 2022, as the AI landscape was largely dominated by discussions of crypto and the metaverse, ElevenLabs identified text-to-speech as the most critical component to perfect. The state-of-the-art in TTS at the time lacked the ability to replicate voice characteristics accurately or maintain the natural flow and emotional intonation of human speech across longer passages. Drawing inspiration from emerging transformer architectures and diffusion models, and crucially, from open-source advancements like James Betker's Tortoise TTS, ElevenLabs focused on improving generation quality and voice replication. Their key innovations centered on creating more flexible voice models that could better capture subtle nuances and context, moving away from hard-coded parameters like age and accent. This strategic prioritization of TTS paid off, enabling them to build a robust foundation for future expansion into more complex audio applications, all while bootstrapping the company on modest initial compute resources.
A Phased Approach to AI Audio Capabilities
ElevenLabs strategically rolled out its capabilities in phases, reflecting the complexity of audio processing. In 2022, the primary breakthrough was in high-quality, natural-sounding English text-to-speech. This was followed in 2023 by advancements in cross-language narration, voice recreation for users, a voice marketplace, and specialized tools for authors creating audiobooks. The significant leap forward occurred in 2024 with the integration of improved transcription, LLM-based translation, and advanced speech generation models, culminating in what they term 'AI localization.' This enabled the high-fidelity dubbing of speeches and conversations into different languages while preserving the original speaker's distinct voice and emotional delivery. Examples include Javier Milei's UN speech and conversations with world leaders like Volodymyr Zelenskyy and Narendra Modi.
The Future of Voice Agents: Cascaded vs. Fused
Looking ahead, the frontier of voice AI for ElevenLabs lies in creating sophisticated voice agents capable of real-time, emotionally intelligent interaction. They are exploring two primary architectural approaches: cascaded systems and fused models. Cascaded systems, which maintain separate models for speech-to-text, LLMs, and text-to-speech, offer higher reliability and modularity, making them suitable for enterprise applications requiring precision, such as customer support or financial transactions. Fused models, conversely, aim to combine these functions into a single, end-to-end model, promising lower latency and faster responses, potentially ideal for companion or informal interactions. While fused models show promise for speed, ElevenLabs currently favors the cascaded approach for business applications due to its robustness and controllability. A key area of development is enhancing emotional expressivity and controllability, enabling agents to respond with nuanced emotions—excitement, reassurance, or empathy—based on user input. This requires significant investment in data labeling and model training to accurately interpret and generate emotional speech.
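The controllability that favors the cascaded approach can be pictured as an explicit emotion-selection step between the LLM reply and TTS delivery. Everything below (the keyword rules, the style tags, the `agent_turn` helper) is a hypothetical sketch of the idea, not ElevenLabs' implementation, which relies on trained models rather than keyword matching.

```python
# Hypothetical sketch: a cascaded agent turn with an inspectable
# emotion tag chosen before synthesis.

EMOTION_RULES = {
    "sorry": "empathy",
    "problem": "reassurance",
    "great": "excitement",
}

def select_emotion(user_text: str) -> str:
    """Pick a delivery style from simple keyword rules (illustrative only)."""
    lowered = user_text.lower()
    for keyword, emotion in EMOTION_RULES.items():
        if keyword in lowered:
            return emotion
    return "neutral"

def agent_turn(user_text: str) -> dict:
    # Real cascade: speech-to-text -> LLM -> emotion tag -> text-to-speech.
    reply = f"I hear you: {user_text}"       # stand-in for an LLM response
    emotion = select_emotion(user_text)
    return {"reply": reply, "emotion": emotion}

turn = agent_turn("There is a problem with my order")
print(turn["emotion"])  # -> reassurance
```

Because the emotion tag is an explicit intermediate value, it can be logged, audited, or overridden per deployment, which is exactly the kind of control a fused end-to-end model makes harder to expose.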
Business Growth and Collaborative Leadership
ElevenLabs has experienced explosive growth, reaching over $430 million in ARR by early 2025, a testament to their product-market fit and strategic execution. The company scaled to over 450 employees, with key hubs in London, New York, Warsaw, and San Francisco, maintaining a culture of small, empowered teams focused on rapid iteration and customer problem-solving. Their business model balances enterprise solutions with a significant PLG component, ensuring broad accessibility. Pricing is firmly anchored in the value delivered to customers, aiming to capture a fraction of that value rather than being cost-driven. This philosophy, coupled with close collaboration with major businesses, allows for predictable revenue forecasting. Notably, ElevenLabs also champions a collaborative approach within the AI ecosystem, even with competitors, exemplified by their support for initiatives like Sesame and by open-sourcing certain technologies. Staniszewski highlighted this ethos, stating that partnerships and shared progress are crucial for advancing the frontier, a perspective that contrasts with some more insular industry dynamics.
Addressing Security, Ethics, and Global Perspectives
Security and ethical considerations are paramount for ElevenLabs. They build safety features directly into their models, including content traceability, moderation against fraud, and watermarking systems to identify AI-generated audio. They advocate for robust security measures beyond voice authentication, given the ease of voice replication, and have even developed creative 'counter-offensive' uses for voice agents against scammers. Regarding global deployments, ElevenLabs is aligned with Western allies, adhering to legal guidance and actively combating 'distillation attacks.' While acknowledging excellent audio models emerging from regions like China, they focus on outcompeting through superior service and proprietary technology. They also emphasize the importance of open-source contributions from Western labs to foster innovation globally, aiming to provide the same powerful tools to individual creators as to large enterprises.
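The idea of watermarking AI-generated audio can be illustrated with a toy scheme that hides a bit pattern in the least significant bits of PCM samples. This is a classroom illustration only; production watermarks such as the ones the talk alludes to must survive compression, resampling, and re-recording, which a naive LSB scheme does not.

```python
def embed_watermark(samples: list[int], bits: list[int]) -> list[int]:
    """Write each watermark bit into the LSB of successive 16-bit samples."""
    marked = list(samples)
    for i, bit in enumerate(bits):
        marked[i] = (marked[i] & ~1) | bit
    return marked

def extract_watermark(samples: list[int], n_bits: int) -> list[int]:
    """Read the watermark back from the first n_bits samples."""
    return [s & 1 for s in samples[:n_bits]]

pcm = [1000, -2001, 502, 33, -48, 1203, 77, -900]   # toy PCM samples
mark = [1, 0, 1, 1, 0, 1, 0, 0]                      # payload identifying AI audio
watermarked = embed_watermark(pcm, mark)
print(extract_watermark(watermarked, len(mark)))     # recovers the bit pattern
```

Robust schemes instead spread the payload across perceptual features of the signal, trading capacity for survivability, but the detection contract is the same: an extractor that answers "was this audio generated by us?"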
On-Device Models and the Future Platform
ElevenLabs is making strides in bringing its models on-device, aiming to offer constrained, high-quality TTS capabilities for broader accessibility. However, they acknowledge a quality gap remains between on-device and cloud-based solutions, particularly concerning interactivity, emotional transfer, and advanced features. Their strategy prioritizes achieving top-tier quality before fully committing to on-device deployment. Looking five years ahead, ElevenLabs envisions itself as a go-to platform for businesses and creators, providing not just advanced audio models but comprehensive tooling for applied AI. This involves deep customization for specific business needs, integration with various communication channels (phone, chat, email), database connectivity, and robust evaluation frameworks. They see a future where 3-5 dominant platforms facilitate conversational interactions between businesses and audiences, and they aim to be a leader in this space, enabling seamless application development and deployment.
Real-World Impact and Creative Applications
Beyond commercial applications, ElevenLabs takes on profoundly impactful projects. They have helped nearly 10,000 individuals who lost their voice to conditions like ALS or throat cancer synthesize new voices, enabling them to communicate naturally again. In a more unconventional application, they collaborated with the Ukrainian government on initiatives like the Diia citizen app, integrating voice capabilities to make government services more accessible, especially during the conflict. This included developing systems for mass communication and citizen support through voice interfaces. Furthermore, in the creative industry, while studios are cautiously adopting AI voiceovers, ElevenLabs is focused on a 'middle-to-middle' approach, where AI tools augment, rather than completely replace, human creativity. This involves enabling finer control over AI narration, like directing emotional delivery, and exploring applications in AI localization and interactive movie experiences. They believe AI will increasingly handle tedious tasks like scratch work and post-production repairs, freeing up human artists for higher-level creative contributions, provided the economic models and IP considerations are ethically resolved.
Common Questions
The founders were inspired by the poor quality of dubbed foreign films in Poland, where a single monotone voice narrates all characters. They envisioned a future where content could be accessed in any language with natural tonality and emotion.
Mentioned in this video
●A company specializing in frontier audio and speech AI, focusing on text-to-speech, transcription, and AI dubbing technologies; it offers a platform for businesses and creators to transform how they interact with audiences.
●A former employer of Mati Staniszewski and Piotr Dabkowski, from which they drew lessons for their startup; also mentioned as a hyperscaler with advanced text-to-speech capabilities.
●A former employer of Mati Staniszewski and Piotr Dabkowski, from which they drew lessons for their startup.
●A company whose intellectual property approach to AI models is contrasted with the Western approach.
●The company where James Betker worked on ChatGPT's advanced voice mode; also mentioned in the context of closed-source models at one point lagging behind open-source.
●A platform from which ElevenLabs' tooling can pull data for personalized interactions.
●A business customer that uses different AI models depending on the use case.
●A company mentioned for its significant ARR growth, contrasted with ElevenLabs' growth.
●A company collaborating with ElevenLabs on speech models; its CEO, Brendan, will be a future speaker in the class.
●A former company where Ankit (Andrew's former co-founder and CTO) worked.
●A customer relationship management platform from which ElevenLabs' tooling can pull data for personalized interactions.
●A communication platform ElevenLabs initially explored for running the company and for hosting a text-to-speech bot that gained traction.
●A communication platform ElevenLabs adopted after experimenting with Discord, finding it easier for internal communication.
●An open-source text-to-speech model created by James Betker, known for its human-like quality on short fragments but also for slow generation and instability.
●A central citizen app in Ukraine that provides mobile access to government services and information, enhanced with voice capabilities by ElevenLabs.
●A mutual friend who facilitated the introduction between the host and Mati Staniszewski.
●Creator of the open-source Tortoise TTS model, who previously worked at Google and later at OpenAI on ChatGPT's advanced voice mode.
●A podcast host with whom ElevenLabs worked to dub conversations with world leaders, demonstrating advanced AI localization.
●An Argentinian politician whose speech was dubbed into English by ElevenLabs, showcasing its AI localization capabilities.
●The President of Ukraine, whose conversation with Javier Milei was dubbed by ElevenLabs.
●The Prime Minister of India, whose conversation was dubbed by ElevenLabs.
●A person whose voice ElevenLabs has worked with for licensing purposes.
●An actor whose voice ElevenLabs has worked with for licensing purposes.