Build an AI Voice Translator: Keep Your Voice in Any Language! (Python + Gradio Tutorial)

AssemblyAI
Science & Technology · 4 min read · 25 min video
Jun 25, 2024 · 27,819 views
TL;DR

Build a voice translator app using Python, AssemblyAI, and ElevenLabs to clone your voice.

Key Insights

1

The tutorial demonstrates how to build a voice translator that transcribes English speech, translates it to other languages, and then synthesizes the translated text using the user's own cloned voice.

2

Key technologies used are AssemblyAI for speech-to-text transcription, Python's translate module for text translation, and ElevenLabs for text-to-speech with voice cloning.

3

The Gradio library is utilized for creating a user-friendly web interface for the application, allowing for easy input of audio and display of translated audio outputs.

4

ElevenLabs offers different voice cloning options, including instant voice cloning (1 minute of audio) and professional voice cloning (30+ minutes of audio) for a more accurate replication of the user's voice.

5

The application can be customized with different Gradio interface layouts, from a simplified version to a more complex one with additional features like audio playback and download.

6

Potential use cases include sending personalized voicemails in a recipient's language, language learning practice, and real-time voice translation for calls.

INTRODUCTION TO THE VOICE TRANSLATOR APP

This tutorial showcases the creation of a voice translator that allows users to record themselves speaking English, which is then translated into multiple languages. The unique aspect of this app is its ability to use the user's own cloned voice for the synthesized speech in different languages, creating an uncanny yet exciting user experience. The process was described as surprisingly easy yet mind-blowing, involving a few key technological components.

CORE TECHNOLOGIES AND WORKFLOW

The voice translator app is built using three primary technologies. First, AssemblyAI is used for transcribing the initial English speech into text. Second, a Python translate module handles the translation of the English text into desired target languages. Finally, ElevenLabs is employed to convert the translated text into audio, crucially using the user's cloned voice. This three-step process forms the backbone of the application's functionality.

USER INTERFACE WITH GRADIO

The Gradio library is central to building the web interface for this application. The tutorial explains two methods: the simpler `gradio.Interface` for straightforward applications and `gradio.Blocks` for more customized layouts. For this tutorial, the `gradio.Interface` approach is demonstrated to keep the focus on the core functionality. The interface includes an audio input component (set to microphone input) and multiple audio output components, each labeled with a target language like Spanish, Turkish, or Japanese.
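A minimal sketch of that wiring, assuming Gradio 4.x; the `voice_to_voice` callback and the component labels are illustrative placeholders, not the tutorial's exact code:

```python
# Minimal Gradio wiring sketch: microphone in, one audio player per language.
# All names here are illustrative; `voice_to_voice` stands in for the app's
# main callback, which would return one audio file path per output component.
import gradio as gr

def voice_to_voice(audio_file):
    # Real app: transcribe -> translate -> synthesize, returning three paths.
    raise NotImplementedError

demo = gr.Interface(
    fn=voice_to_voice,
    inputs=gr.Audio(sources=["microphone"], type="filepath"),
    outputs=[
        gr.Audio(label="Spanish"),
        gr.Audio(label="Turkish"),
        gr.Audio(label="Japanese"),
    ],
)

if __name__ == "__main__":
    demo.launch()
```

With `type="filepath"`, Gradio hands the callback a path to the recorded file rather than raw audio data, which is what the transcription step expects.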

IMPLEMENTING ASSEMBLYAI FOR TRANSCRIPTION

The first functional component implemented is the audio transcription using AssemblyAI. The recorded audio file path is passed to a dedicated transcription function. This function initializes an AssemblyAI transcriber using an API key and calls the `transcribe` method on the audio file. Error handling is included by checking the transcription status response from AssemblyAI to ensure the process either completes successfully or raises an appropriate error, returning the transcription text.
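A sketch of that transcription step, assuming the `assemblyai` Python SDK; the function name and the injectable `transcriber` parameter are illustrative (the injection just lets the sketch run without network access):

```python
def transcribe_audio(audio_file, transcriber=None):
    """Transcribe a recorded audio file and return the text.

    By default this builds an AssemblyAI Transcriber; any object with a
    `.transcribe(path)` method can be injected for offline testing.
    """
    if transcriber is None:
        import assemblyai as aai
        aai.settings.api_key = "YOUR_ASSEMBLYAI_API_KEY"  # placeholder key
        transcriber = aai.Transcriber()

    transcript = transcriber.transcribe(audio_file)
    # The SDK reports failures through the transcript's status/error fields.
    if transcript.status == "error":
        raise RuntimeError(transcript.error)
    return transcript.text
```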

TEXT TRANSLATION WITH PYTHON MODULE

Following transcription, the English text is translated into various languages. The tutorial utilizes Python's `translate` module, highlighting its flexibility with different providers (like the free default or paid options such as Microsoft Translate). For demonstration, separate translator instances are created for Spanish, Turkish, and Japanese, each configured with the source language (English) and the target language code. The `.translate()` method is then called on the English text for each instance.
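The translation step can be sketched as follows, assuming the `translate` package; the `make_translator` factory parameter is an illustrative addition so the sketch can run without the package or a network call:

```python
def translate_text(english_text, target_langs=("es", "tr", "ja"), make_translator=None):
    """Translate English text into each target language code.

    `make_translator` defaults to the `translate` package's Translator and is
    injectable for offline testing; names here are illustrative.
    """
    if make_translator is None:
        from translate import Translator
        make_translator = lambda lang: Translator(from_lang="en", to_lang=lang)

    # One translator instance per target language, as described above.
    return {lang: make_translator(lang).translate(english_text) for lang in target_langs}
```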

ELEVENLABS FOR VOICE CLONING AND SPEECH SYNTHESIS

The final key step involves generating audio in the user's cloned voice using ElevenLabs. Users first need to obtain an API key and clone their voice, which can be done instantly with one minute of audio or professionally with 30+ minutes for higher fidelity. The tutorial shows how to integrate ElevenLabs' `client.text_to_speech` function, specifying the cloned voice ID, a multilingual model (such as `eleven_multilingual_v2`), and recommended settings for stability, similarity, and style exaggeration. The output audio file path is then returned.
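A sketch of that synthesis step, assuming the current ElevenLabs Python SDK; the voice ID and API key are placeholders, and the injectable `client` parameter is an illustrative addition so the sketch can run without the SDK:

```python
def text_to_speech(text, output_path, voice_id="YOUR_CLONED_VOICE_ID", client=None):
    """Synthesize `text` with a cloned voice and write an MP3 to `output_path`.

    By default this builds an ElevenLabs client; any object exposing
    `.text_to_speech.convert(...)` that yields audio byte chunks works.
    """
    if client is None:
        from elevenlabs.client import ElevenLabs
        client = ElevenLabs(api_key="YOUR_ELEVENLABS_API_KEY")  # placeholder

    # eleven_multilingual_v2 handles non-English text; the voice's stability,
    # similarity, and style settings can also be passed here.
    audio = client.text_to_speech.convert(
        voice_id=voice_id,
        model_id="eleven_multilingual_v2",
        text=text,
    )
    with open(output_path, "wb") as f:
        for chunk in audio:  # the SDK streams the audio as byte chunks
            f.write(chunk)
    return output_path
```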

INTEGRATING FUNCTIONS AND HANDLING OUTPUTS

All three core functions—transcription, translation, and text-to-speech—are orchestrated within a main `voice_to_voice` function. This function takes the audio input, calls each service sequentially, and collects the resulting audio file paths for each translated language. These paths are then returned to the Gradio interface, which automatically displays them as playable audio components for the user to experience.
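The orchestration can be sketched like this; the three steps are passed in as callables with hypothetical signatures so the pipeline's shape stays clear (the tutorial's own function calls the services directly):

```python
def voice_to_voice(audio_file, transcribe, translate_all, synthesize):
    """Run transcription -> translation -> cloned-voice synthesis.

    `transcribe`, `translate_all`, and `synthesize` are injected callables
    standing in for the AssemblyAI, translate, and ElevenLabs steps.
    """
    english_text = transcribe(audio_file)
    translations = translate_all(english_text)  # e.g. {"es": "...", "tr": "..."}
    # One output audio path per target language, in a fixed order so Gradio
    # can map them onto its audio output components.
    return tuple(
        synthesize(text, f"output_{lang}.mp3")
        for lang, text in translations.items()
    )
```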

DEPLOYMENT AND DEMONSTRATION

After setting up the API keys for AssemblyAI and ElevenLabs, the application is ready to be run. The user can record audio via their microphone, submit it, and wait for the translated and cloned-voice audio outputs to appear in the specified language tabs. The tutorial emphasizes the 'eerie' yet exciting nature of hearing oneself speak in unfamiliar languages and suggests the code is available on GitHub for both simplified and more complex interface versions.

EXPLORING USE CASES AND FUTURE POSSIBILITIES

The tutorial concludes by discussing potential applications of this technology. Examples given include sending personalized WhatsApp voicemails in a friend's language, practicing pronunciation by mimicking one's own synthesized voice in a foreign language, and the future of real-time translation for phone calls, similar to Samsung's efforts. The presenter encourages viewers to share their ideas for utilizing this technology.

Building Your AI Voice Translator

Practical takeaways from this episode

Do This

Use Gradio for building the user interface.
Integrate AssemblyAI for accurate audio transcription.
Leverage Python's translate module for multi-language translation.
Utilize ElevenLabs for text-to-speech generation with your own cloned voice.
Clone your voice with ElevenLabs using instant or professional voice cloning for best results.
Convert generated audio file paths using pathlib for Gradio compatibility.
Store your AssemblyAI and ElevenLabs API keys securely.

Avoid This

Do not rely solely on `gradio.Interface` if complex layouts are needed; consider `gradio.Blocks` instead.
Do not forget to install the necessary libraries: Gradio, AssemblyAI, and ElevenLabs.
Do not use default voice IDs in ElevenLabs if you want to use your cloned voice.
Do not pass raw file system paths directly to Gradio's audio components without path conversion.
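The path-conversion takeaway above can be sketched with the standard library alone; the helper name is illustrative:

```python
from pathlib import Path

def as_gradio_audio(raw_path):
    """Wrap a raw file-system path in a pathlib.Path before handing it to a
    Gradio audio component, as the takeaways above suggest."""
    return Path(raw_path)
```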

Common Questions

How do you build an AI voice translator?

You can build an AI voice translator by combining three key technologies: AssemblyAI for speech-to-text transcription, a Python translation module for language conversion, and ElevenLabs to synthesize the translated text into audio using your cloned voice.
