Build an AI Voice Translator: Keep Your Voice in Any Language! (Python + Gradio Tutorial)

AssemblyAI
Science & Technology · 4 min read · 25 min video
Jun 25, 2024 · 27,819 views
TL;DR

Build a voice translator app using Python, AssemblyAI, and ElevenLabs to clone your voice.

Key Insights

1

The tutorial demonstrates how to build a voice translator that transcribes English speech, translates it to other languages, and then synthesizes the translated text using the user's own cloned voice.

2

Key technologies used are AssemblyAI for speech-to-text transcription, Python's translate module for text translation, and ElevenLabs for text-to-speech with voice cloning.

3

The Gradio library is utilized for creating a user-friendly web interface for the application, allowing for easy input of audio and display of translated audio outputs.

4

ElevenLabs offers different voice cloning options, including instant voice cloning (1 minute of audio) and professional voice cloning (30+ minutes of audio) for a more accurate replication of the user's voice.

5

The application can be customized with different Gradio interface layouts, from a simplified version to a more complex one with additional features like audio playback and download.

6

Potential use cases include sending personalized voicemails in a recipient's language, language learning practice, and real-time voice translation for calls.

INTRODUCTION TO THE VOICE TRANSLATOR APP

This tutorial showcases the creation of a voice translator that allows users to record themselves speaking English, which is then translated into multiple languages. The unique aspect of this app is its ability to use the user's own cloned voice for the synthesized speech in different languages, creating an uncanny yet exciting user experience. The process was described as surprisingly easy yet mind-blowing, involving a few key technological components.

CORE TECHNOLOGIES AND WORKFLOW

The voice translator app is built using three primary technologies. First, AssemblyAI is used for transcribing the initial English speech into text. Second, a Python translate module handles the translation of the English text into desired target languages. Finally, ElevenLabs is employed to convert the translated text into audio, crucially using the user's cloned voice. This three-step process forms the backbone of the application's functionality.

USER INTERFACE WITH GRADIO

The Gradio library is central to building the web interface for this application. The tutorial explains two methods: the simpler `gradio.Interface` for straightforward applications and `gradio.Blocks` for more customized layouts. For this tutorial, the `gradio.Interface` approach is demonstrated to keep the focus on the core functionality. The interface includes an audio input component (set to microphone input) and multiple audio output components, each labeled with a target language like Spanish, Turkish, or Japanese.
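A minimal sketch of that wiring, assuming Gradio 4.x; the `voice_to_voice` callback and the component labels are illustrative placeholders, not the tutorial's exact code:

```python
# Minimal Gradio wiring sketch: microphone in, one audio player per language.
# All names here are illustrative; `voice_to_voice` stands in for the app's
# main callback, which would return one audio file path per output component.
import gradio as gr

def voice_to_voice(audio_file):
    # Real app: transcribe -> translate -> synthesize, returning three paths.
    raise NotImplementedError

demo = gr.Interface(
    fn=voice_to_voice,
    inputs=gr.Audio(sources=["microphone"], type="filepath"),
    outputs=[
        gr.Audio(label="Spanish"),
        gr.Audio(label="Turkish"),
        gr.Audio(label="Japanese"),
    ],
)

if __name__ == "__main__":
    demo.launch()
```

With `type="filepath"`, Gradio hands the callback a path to the recorded file rather than raw audio data, which is what the transcription step expects.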

IMPLEMENTING ASSEMBLYAI FOR TRANSCRIPTION

The first functional component implemented is the audio transcription using AssemblyAI. The recorded audio file path is passed to a dedicated transcription function. This function initializes an AssemblyAI transcriber using an API key and calls the `transcribe` method on the audio file. Error handling is included by checking the transcription status response from AssemblyAI to ensure the process either completes successfully or raises an appropriate error, returning the transcription text.
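A sketch of that transcription step, assuming the `assemblyai` Python SDK; the function name and the injectable `transcriber` parameter are illustrative (the injection just lets the sketch run without network access):

```python
def transcribe_audio(audio_file, transcriber=None):
    """Transcribe a recorded audio file and return the text.

    By default this builds an AssemblyAI Transcriber; any object with a
    `.transcribe(path)` method can be injected for offline testing.
    """
    if transcriber is None:
        import assemblyai as aai
        aai.settings.api_key = "YOUR_ASSEMBLYAI_API_KEY"  # placeholder key
        transcriber = aai.Transcriber()

    transcript = transcriber.transcribe(audio_file)
    # The SDK reports failures through the transcript's status/error fields.
    if transcript.status == "error":
        raise RuntimeError(transcript.error)
    return transcript.text
```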

TEXT TRANSLATION WITH PYTHON MODULE

Following transcription, the English text is translated into various languages. The tutorial utilizes Python's `translate` module, highlighting its flexibility with different providers (like the free default or paid options such as Microsoft Translate). For demonstration, separate translator instances are created for Spanish, Turkish, and Japanese, each configured with the source language (English) and the target language code. The `.translate()` method is then called on the English text for each instance.
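The translation step can be sketched as follows, assuming the `translate` package; the `make_translator` factory parameter is an illustrative addition so the sketch can run without the package or a network call:

```python
def translate_text(english_text, target_langs=("es", "tr", "ja"), make_translator=None):
    """Translate English text into each target language code.

    `make_translator` defaults to the `translate` package's Translator and is
    injectable for offline testing; names here are illustrative.
    """
    if make_translator is None:
        from translate import Translator
        make_translator = lambda lang: Translator(from_lang="en", to_lang=lang)

    # One translator instance per target language, as described above.
    return {lang: make_translator(lang).translate(english_text) for lang in target_langs}
```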

ELEVENLABS FOR VOICE CLONING AND SPEECH SYNTHESIS

The final key step involves generating audio in the user's cloned voice using ElevenLabs. Users first need to obtain an API key and clone their voice, which can be done instantly with one minute of audio or professionally with 30+ minutes for higher fidelity. The tutorial shows how to integrate ElevenLabs' `client.text_to_speech` function, specifying the cloned voice ID, a multilingual model (such as `eleven_multilingual_v2`), and recommended settings for stability, similarity, and style exaggeration. The output audio file path is then returned.
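A sketch of that synthesis step, assuming the current ElevenLabs Python SDK; the voice ID and API key are placeholders, and the injectable `client` parameter is an illustrative addition so the sketch can run without the SDK:

```python
def text_to_speech(text, output_path, voice_id="YOUR_CLONED_VOICE_ID", client=None):
    """Synthesize `text` with a cloned voice and write an MP3 to `output_path`.

    By default this builds an ElevenLabs client; any object exposing
    `.text_to_speech.convert(...)` that yields audio byte chunks works.
    """
    if client is None:
        from elevenlabs.client import ElevenLabs
        client = ElevenLabs(api_key="YOUR_ELEVENLABS_API_KEY")  # placeholder

    # eleven_multilingual_v2 handles non-English text; the voice's stability,
    # similarity, and style settings can also be passed here.
    audio = client.text_to_speech.convert(
        voice_id=voice_id,
        model_id="eleven_multilingual_v2",
        text=text,
    )
    with open(output_path, "wb") as f:
        for chunk in audio:  # the SDK streams the audio as byte chunks
            f.write(chunk)
    return output_path
```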

INTEGRATING FUNCTIONS AND HANDLING OUTPUTS

All three core functions—transcription, translation, and text-to-speech—are orchestrated within a main `voice_to_voice` function. This function takes the audio input, calls each service sequentially, and collects the resulting audio file paths for each translated language. These paths are then returned to the Gradio interface, which automatically displays them as playable audio components for the user to experience.
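The orchestration can be sketched like this; the three steps are passed in as callables with hypothetical signatures so the pipeline's shape stays clear (the tutorial's own function calls the services directly):

```python
def voice_to_voice(audio_file, transcribe, translate_all, synthesize):
    """Run transcription -> translation -> cloned-voice synthesis.

    `transcribe`, `translate_all`, and `synthesize` are injected callables
    standing in for the AssemblyAI, translate, and ElevenLabs steps.
    """
    english_text = transcribe(audio_file)
    translations = translate_all(english_text)  # e.g. {"es": "...", "tr": "..."}
    # One output audio path per target language, in a fixed order so Gradio
    # can map them onto its audio output components.
    return tuple(
        synthesize(text, f"output_{lang}.mp3")
        for lang, text in translations.items()
    )
```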

DEPLOYMENT AND DEMONSTRATION

After setting up the API keys for AssemblyAI and ElevenLabs, the application is ready to be run. The user can record audio via their microphone, submit it, and wait for the translated and cloned-voice audio outputs to appear in the specified language tabs. The tutorial emphasizes the 'eerie' yet exciting nature of hearing oneself speak in unfamiliar languages and suggests the code is available on GitHub for both simplified and more complex interface versions.

EXPLORING USE CASES AND FUTURE POSSIBILITIES

The tutorial concludes by discussing potential applications of this technology. Examples given include sending personalized WhatsApp voicemails in a friend's language, practicing pronunciation by mimicking one's own synthesized voice in a foreign language, and the future of real-time translation for phone calls, similar to Samsung's efforts. The presenter encourages viewers to share their ideas for utilizing this technology.

Building Your AI Voice Translator

Practical takeaways from this episode

Do This

Use Gradio for building the user interface.
Integrate AssemblyAI for accurate audio transcription.
Leverage Python's translate module for multi-language translation.
Utilize ElevenLabs for text-to-speech generation with your own cloned voice.
Clone your voice with ElevenLabs using instant or professional voice cloning for best results.
Convert generated audio file paths using pathlib for Gradio compatibility.
Store your AssemblyAI and ElevenLabs API keys securely.

Avoid This

Do not rely solely on `gradio.Interface` if complex layouts are needed; consider `gradio.Blocks` instead.
Do not forget to install the necessary libraries: Gradio, AssemblyAI, and ElevenLabs.
Do not use default voice IDs in ElevenLabs if you want to use your cloned voice.
Do not pass raw file system paths directly to Gradio's audio components without path conversion.
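The path-conversion takeaway above can be sketched with the standard library alone; the helper name is illustrative:

```python
from pathlib import Path

def as_gradio_audio(raw_path):
    """Wrap a raw file-system path in a pathlib.Path before handing it to a
    Gradio audio component, as the takeaways above suggest."""
    return Path(raw_path)
```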

Common Questions

How do you build an AI voice translator?

You can build an AI voice translator by combining three key technologies: AssemblyAI for speech-to-text transcription, a Python translation module for language conversion, and ElevenLabs to synthesize the translated text into audio using your cloned voice.
