Transcribe a live phone call with Python - Flask tutorial
Build a real-time call transcription app with Python, Flask, Twilio, and AssemblyAI.
Key Insights
Integrate Python, Flask, Twilio, and AssemblyAI for real-time call transcription.
Use ngrok to expose the local Flask application to Twilio without deploying to the cloud.
Configure Twilio to stream audio to a Flask WebSocket endpoint.
Process audio data using AssemblyAI's real-time transcription API.
Handle different Twilio WebSocket message types (connected, start, media, stop).
Decode the base64-encoded audio from Twilio into raw μ-law bytes, the encoding passed on to AssemblyAI.
Automate Twilio number configuration using the Twilio REST client and ngrok.
PROJECT OVERVIEW AND ARCHITECTURE
This tutorial demonstrates how to build a real-time phone call transcription application using Python, Flask, Twilio, and AssemblyAI. The core concept involves configuring a Twilio number to forward incoming call audio streams to a local Flask application via an ngrok tunnel. AssemblyAI's real-time transcription service then processes this audio, returning partial and final transcripts that are displayed in the console.
SETUP AND DEPENDENCY INSTALLATION
To begin, users need accounts with AssemblyAI, Twilio, and ngrok, along with Python installed. Key steps include creating a project directory, setting up a .env file for API keys and credentials (ngrok auth token, Twilio Account SID, API key, API secret, and AssemblyAI API key), and creating a .gitignore file to exclude sensitive information and virtual environments from version control.
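As a minimal sketch of the credential setup, the function below checks that every required value is present before the app starts. It assumes the `.env` values have already been loaded into the process environment (for example with python-dotenv's `load_dotenv()`), and the variable names are hypothetical placeholders rather than names taken from the tutorial.

```python
import os

# Hypothetical variable names; use whatever names your .env file defines.
REQUIRED_VARS = (
    "NGROK_AUTHTOKEN",
    "TWILIO_ACCOUNT_SID",
    "TWILIO_API_KEY_SID",
    "TWILIO_API_SECRET",
    "ASSEMBLYAI_API_KEY",
)


def load_settings(env=os.environ):
    """Return the required credentials, or raise if any are missing or empty."""
    missing = [name for name in REQUIRED_VARS if not env.get(name)]
    if missing:
        raise RuntimeError("Missing environment variables: " + ", ".join(missing))
    return {name: env[name] for name in REQUIRED_VARS}
```

Failing early with the list of missing names tends to be easier to debug than an authentication error in the middle of a call.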
FLASK APPLICATION SKELETON AND ENDPOINTS
A Flask application is created using `flask`, with `flask-sock` adding WebSocket support. Two main endpoints are defined: `/` (the incoming call route), which answers Twilio's initial POST request with TwiML instructions, and `/websocket`, which receives the audio stream. The TwiML returned from `/` is what directs the call audio to the application's WebSocket.
TWILIO INTEGRATION AND AUDIO STREAMING
The Flask application's TwiML response is modified to include a `<Connect>` tag containing a `<Stream>` instruction. This tells Twilio to send the incoming audio stream to a WebSocket endpoint (e.g., `wss://{request.host}/websocket`). The WebSocket endpoint in Flask receives this audio data, which Twilio sends base64 encoded. Decoding the payload yields raw μ-law audio bytes, which can be passed on to AssemblyAI without further conversion.
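The TwiML described above might be built roughly as follows. This is a sketch, not the tutorial's exact code: the helper name `build_twiml`, the `<Say>` greeting, and the default route are assumptions for illustration.

```python
from xml.sax.saxutils import quoteattr


def build_twiml(host, websocket_route="/websocket"):
    """Return TwiML that tells Twilio to stream call audio to our WebSocket."""
    stream_url = f"wss://{host}{websocket_route}"
    return (
        "<Response>"
        # Assumed greeting; any message (or none) would work here.
        "<Say>Speak to see your speech transcribed in the console.</Say>"
        f"<Connect><Stream url={quoteattr(stream_url)} /></Connect>"
        "</Response>"
    )
```

In the Flask route, the string would typically be returned with an XML content type, for example `Response(build_twiml(request.host), mimetype="text/xml")`.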
ASSEMBLYAI REAL-TIME TRANSCRIPTION
A separate Python file (`twilio_transcriber.py`) is created to handle AssemblyAI integration. This file defines handler functions for the transcriber's events (`on_open`, `on_data`, `on_error`, `on_close`) and a `TwilioTranscriber` class that inherits from `RealtimeTranscriber` in the `assemblyai` package. The `on_data` function differentiates between partial and final transcripts when printing to the console: partial transcripts end with a carriage return so that each one overwrites the previous line in place, while final transcripts end the line so the next utterance starts on a new one.
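The in-place printing can be sketched without the SDK. The `Transcript` class below is a hypothetical stand-in for the SDK's transcript objects, and its `is_final` flag is an assumption standing in for however the SDK marks final results.

```python
import sys
from dataclasses import dataclass


@dataclass
class Transcript:
    """Stand-in for the SDK's transcript objects (hypothetical, for illustration)."""
    text: str
    is_final: bool


def on_data(transcript, out=sys.stdout):
    """Overwrite the current line for partials; end the line for finals."""
    if not transcript.text:
        return  # ignore empty results
    # "\r" moves the cursor back to the line start so the next write overwrites it.
    end = "\r\n" if transcript.is_final else "\r"
    out.write(transcript.text + end)
    out.flush()
```

Writing to an injectable `out` stream instead of calling `print` directly is a design choice made here only so the behaviour can be checked without a terminal.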
CONNECTING TWILIO AUDIO TO ASSEMBLYAI
In the Flask `transcription_websocket` function, an instance of `TwilioTranscriber` is created and connected to AssemblyAI's real-time service. As audio data (media messages) is received from Twilio, it is decoded from base64 into raw μ-law bytes and streamed to AssemblyAI using the transcriber's `stream` method. The transcriber then handles printing the transcribed text as it's generated.
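A minimal sketch of the per-message handling follows. It assumes Twilio's JSON message shape (an `event` field, with the audio under `media.payload`) and only requires that `transcriber` expose a `stream` method; the function name and the log messages are illustrative.

```python
import base64
import json


def handle_twilio_message(raw_message, transcriber):
    """Dispatch one Twilio stream message. Returns False once the stream stops."""
    data = json.loads(raw_message)
    event = data.get("event")
    if event == "connected":
        print("Twilio connected")
    elif event == "start":
        print("Twilio started sending audio")
    elif event == "media":
        # The payload is base64 text; decoding gives raw mu-law audio bytes.
        audio = base64.b64decode(data["media"]["payload"])
        transcriber.stream(audio)
    elif event == "stop":
        print("Twilio stopped sending audio")
        return False
    return True
```

Inside the WebSocket handler, this could be driven by a loop such as `while handle_twilio_message(ws.receive(), transcriber): pass`, followed by closing the transcriber session once the loop exits.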
AUTOMATING TWILIO NUMBER CONFIGURATION
To streamline the setup, the tutorial shows how to automate the process of updating the Twilio number's voice webhook. By loading Twilio and ngrok credentials, and using the `twilio.rest.Client`, the application can programmatically find the correct Twilio number and update its `voice_url` to the dynamically generated ngrok tunnel URL. This eliminates the need for manual configuration in the Twilio console.
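The webhook update could look roughly like this. It is a sketch under assumptions: `client` is expected to be a `twilio.rest.Client`, `public_url` is the ngrok tunnel's public URL however it was obtained, and the helper name `point_number_at_tunnel` is hypothetical.

```python
def point_number_at_tunnel(client, phone_number, public_url, route="/"):
    """Set the voice webhook of `phone_number` to the tunnel URL; return that URL."""
    voice_url = public_url.rstrip("/") + route
    # Look through the account's numbers for the one we want to reconfigure.
    for number in client.incoming_phone_numbers.list():
        if number.phone_number == phone_number:
            number.update(voice_url=voice_url)
            return voice_url
    raise LookupError(f"No Twilio number matching {phone_number}")
```

Because the function only relies on `incoming_phone_numbers.list()` and each number's `update` method, it can be exercised with a stub client before pointing it at a real account.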
RUNNING AND TESTING THE APPLICATION
Before running, ensure any existing ngrok tunnels are closed. The Flask application is started with `python main.py`. Users then call their Twilio number; the audio is streamed to the application, transcribed by AssemblyAI, and displayed in the terminal in real time. When the call ends, the transcriber session closes.
Common Questions
You can transcribe phone calls in real time by using Twilio to receive calls, ngrok to expose your local server, Flask to build a web application with WebSocket support, and AssemblyAI's real-time transcription service to process the audio stream.
Mentioned in this video
ngrok: A tunneling service used to expose the local Flask application to the internet for Twilio to access.
flask-sock: A library used to add WebSocket support to Flask applications.
A type of machine learning model discussed as an advancing area of AI.
Flask: A micro web framework for Python used to create the web application that handles incoming calls and WebSockets.
TwiML: Twilio Markup Language, used to instruct Twilio on how to handle calls, such as saying messages or connecting to WebSockets.
μ-law: The audio encoding format used by Twilio for transmitting audio data, compatible with AssemblyAI.