What technologies are used for real-time call transcription?

The core technologies used are Twilio for call handling, WebSockets for real-time data streaming, Node.js with Express for the server, and AssemblyAI for the AI-powered transcription.

How do I connect Twilio to my local server for call streaming?

You need to use a tool like ngrok to create a public URL that forwards to your local development server, allowing Twilio to send call data to your application.

What is TwiML and how is it used here?

TwiML (Twilio Markup Language) is used to instruct Twilio on how to handle incoming calls. In this tutorial, it's used to tell Twilio to start streaming the call audio to a specific WebSocket endpoint.

How does AssemblyAI receive audio data?

Twilio sends the call audio to the server over a WebSocket. The server converts it from mu-law encoding to PCM, then sends it as base64-encoded chunks to AssemblyAI's WebSocket endpoint.

How is the transcribed text displayed on a website?

A second WebSocket connection is established between the server and the client-side HTML. As transcriptions are received from AssemblyAI, they are sent through this WebSocket to be displayed on the web page in real-time.

What are the requirements for sending audio to AssemblyAI?

AssemblyAI requires audio data to be in PCM encoding. Additionally, you need to send chunks of audio that meet a minimum duration, typically 100 milliseconds, to ensure effective transcription.

Transcribe Twilio Phone Calls in Real-Time with AssemblyAI | JavaScript WebSockets Tutorial

AssemblyAI

People & Blogs|3 min read|23 min video

Feb 23, 2022|19,784 views|246|15

TL;DR

Transcribe Twilio phone calls in real-time using AssemblyAI, JavaScript, and WebSockets.

Key Insights

The tutorial demonstrates real-time transcription of Twilio phone calls by streaming audio data via WebSockets.

It involves setting up a Node.js Express server to handle Twilio Media Streams and WebSocket connections.

Twilio's TwiML is used to instruct Twilio to stream incoming call audio to the WebSocket endpoint.

AssemblyAI's Real-Time Streaming API is integrated to process the audio stream and return transcriptions.

The process requires converting Twilio's audio format (mu-law) to a format compatible with AssemblyAI (PCM).

The transcriptions are then displayed in real-time on a web page using another WebSocket connection.

SERVER SETUP AND WEBSOCKET CONNECTION

The initial step involves setting up a basic Node.js web server using Express and the 'ws' library for handling WebSocket connections. This server is configured to listen on port 8080. A connection event listener is established to log when a new WebSocket client connects. Additionally, a simple GET route is created for the home page, which currently returns 'Hello, World!' to confirm the server is running. This foundational setup allows for real-time data exchange, crucial for the subsequent streaming and transcription processes.

TWILIO MEDIA STREAM INTEGRATION

To enable Twilio call transcription, the tutorial leverages Twilio Media Streams and TwiML. A POST endpoint is defined in the Express server to receive incoming requests from Twilio when a call is made. This endpoint must return a TwiML response. The TwiML instructs Twilio to start a stream and send the audio data to the previously established WebSocket endpoint. For testing purposes, the ngrok tool is used to expose the local server to a public URL, which is then configured in the Twilio dashboard to trigger the webhook.
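As a rough illustration of the TwiML response described above, the sketch below builds the XML as a plain string. The function name `buildStreamTwiml` and its `host` parameter are hypothetical, and the `<Say>` prompt and `<Pause>` length are illustrative choices rather than values confirmed by the tutorial. `<Start>`, `<Stream>`, `<Say>`, and `<Pause>` are standard TwiML verbs.

```javascript
// Build the TwiML that tells Twilio to stream call audio to a WebSocket.
// `host` is a hypothetical parameter: the public hostname (for example the
// one ngrok assigns) where the WebSocket server is reachable.
function buildStreamTwiml(host) {
  return [
    "<Response>",
    "  <Start>",
    `    <Stream url="wss://${host}" />`,
    "  </Start>",
    // Illustrative prompt and pause so the call stays open while audio streams.
    "  <Say>Start speaking to see your audio transcribed.</Say>",
    '  <Pause length="30" />',
    "</Response>",
  ].join("\n");
}

// In an Express POST handler this string would be returned as XML, e.g.:
//   res.set("Content-Type", "text/xml");
//   res.send(buildStreamTwiml(req.headers.host));
console.log(buildStreamTwiml("example.ngrok.io"));
```

Building the URL from the request's host header means the same handler works unchanged whenever ngrok assigns a new public address.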

HANDLING TWILIO'S TWIML AND EVENTS

The server needs to process incoming messages from Twilio over the WebSocket connection. These messages are in JSON format and carry one of four event types: 'connected', 'start', 'media', and 'stop'. The code parses each message and uses a switch statement to handle each event. The 'connected' event arrives first, once the WebSocket link is established. The 'start' and 'stop' events mark the beginning and end of the stream. Each 'media' event contains the actual audio payload, which needs further processing before it can be transcribed.
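A minimal sketch of that switch statement follows. The function name `handleTwilioMessage` and the returned description strings are hypothetical and exist only to make the behaviour easy to inspect. The fields `event`, `start.streamSid`, and `media.payload` are the ones Twilio Media Streams messages carry.

```javascript
// Dispatch one Twilio Media Streams message by its `event` field.
// Returns a short description; a real handler would forward the audio instead.
function handleTwilioMessage(rawMessage) {
  const msg = JSON.parse(rawMessage);
  switch (msg.event) {
    case "connected":
      return "connected: Twilio opened the WebSocket";
    case "start":
      return `start: stream ${msg.start.streamSid} began`;
    case "media": {
      // msg.media.payload holds the base64-encoded audio for this chunk.
      const audio = Buffer.from(msg.media.payload, "base64");
      return `media: ${audio.length} bytes of audio`;
    }
    case "stop":
      return "stop: the call stream ended";
    default:
      return `ignored: unknown event ${msg.event}`;
  }
}

// With the 'ws' library this would be wired up roughly as:
//   ws.on("message", (data) => console.log(handleTwilioMessage(data)));
console.log(handleTwilioMessage(JSON.stringify({ event: "connected" })));
// prints "connected: Twilio opened the WebSocket"
```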

ASSEMBLYAI REAL-TIME TRANSCRIPTION

The core of the real-time transcription is achieved by integrating AssemblyAI's Streaming Transcription API. A separate WebSocket connection is established to AssemblyAI using its specific endpoint and an API key. When the server receives a 'media' event from Twilio, the audio payload is extracted. This tutorial highlights a critical step: converting the audio format from Twilio's mu-law encoding to the PCM format required by AssemblyAI. The 'wavefile' Node.js package is used for this conversion.
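To show what the mu-law to PCM conversion actually does, here is a hand-written G.711 mu-law decoder. This is not the tutorial's method, which delegates the conversion to a package. It is a dependency-free sketch of the same transformation, and the function names `muLawByteToPcm16` and `muLawPayloadToPcm` are hypothetical.

```javascript
const MU_LAW_BIAS = 0x84; // bias added by the G.711 mu-law encoder

// Decode one 8-bit mu-law byte into a signed 16-bit linear PCM sample.
function muLawByteToPcm16(byte) {
  const u = ~byte & 0xff;           // mu-law bytes are stored bit-inverted
  const exponent = (u >> 4) & 0x07; // 3-bit segment number
  const mantissa = u & 0x0f;        // 4-bit step within the segment
  const magnitude = ((mantissa << 3) + MU_LAW_BIAS) << exponent;
  // After inversion, a set top bit marks a negative sample.
  return (u & 0x80) ? MU_LAW_BIAS - magnitude : magnitude - MU_LAW_BIAS;
}

// Convert a base64 mu-law payload into a Buffer of 16-bit little-endian PCM.
function muLawPayloadToPcm(base64Payload) {
  const muLaw = Buffer.from(base64Payload, "base64");
  const pcm = Buffer.alloc(muLaw.length * 2); // each byte becomes a 2-byte sample
  for (let i = 0; i < muLaw.length; i++) {
    pcm.writeInt16LE(muLawByteToPcm16(muLaw[i]), i * 2);
  }
  return pcm;
}

const decoded = muLawPayloadToPcm(Buffer.from([0xff, 0x00, 0x80]).toString("base64"));
console.log(decoded.readInt16LE(0), decoded.readInt16LE(2), decoded.readInt16LE(4));
// prints: 0 -32124 32124
```

Each mu-law byte expands to two bytes of PCM, so a decoded payload is twice the size of the one Twilio sent.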

AUDIO DATA FORMATTING AND CHUNKING

Twilio sends audio data in 20-millisecond chunks, but AssemblyAI's API requires larger chunks, typically at least 100 milliseconds, for effective transcription. The tutorial addresses this by buffering these smaller chunks. Once a sufficient amount of audio data is collected (e.g., exceeding 100ms), it's combined into a single buffer, encoded, and then sent to the AssemblyAI WebSocket. This process ensures that the transcription service receives adequate audio data to accurately transcribe speech.
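The buffering step can be sketched as a small accumulator. This assumes 8 kHz, 16-bit mono PCM (the result of decoding Twilio's 8 kHz mu-law audio), so 100 ms corresponds to 1600 bytes. The names `createAudioChunker` and `send`, and the `audio_data` field in the outgoing JSON, are assumptions for illustration and are not taken from the source.

```javascript
// 8 kHz, 16-bit mono PCM: 8000 samples/s * 2 bytes = 16 bytes per millisecond.
const BYTES_PER_MS = (8000 * 2) / 1000;
const MIN_CHUNK_BYTES = 100 * BYTES_PER_MS; // 100 ms -> 1600 bytes

// Collect small PCM buffers and call `send` once at least `minBytes` are queued.
// `send` is a hypothetical callback, e.g. (msg) => assemblySocket.send(msg).
function createAudioChunker(send, minBytes = MIN_CHUNK_BYTES) {
  let pending = [];
  let pendingBytes = 0;
  return function push(pcmBuffer) {
    pending.push(pcmBuffer);
    pendingBytes += pcmBuffer.length;
    if (pendingBytes < minBytes) return false; // keep buffering
    const combined = Buffer.concat(pending, pendingBytes);
    // Assumed message shape: a JSON object carrying base64-encoded audio.
    send(JSON.stringify({ audio_data: combined.toString("base64") }));
    pending = [];
    pendingBytes = 0;
    return true;
  };
}

const sentMessages = [];
const pushChunk = createAudioChunker((msg) => sentMessages.push(msg));
for (let i = 0; i < 5; i++) pushChunk(Buffer.alloc(320)); // five 20 ms chunks
console.log(sentMessages.length); // prints 1
```

Counting bytes rather than chunks keeps the threshold correct even if the size of incoming chunks varies.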

DISPLAYING TRANSCRIPTIONS ON THE WEBPAGE

After AssemblyAI processes the audio stream, it returns transcription results, including partial and final results. The code parses these responses, sorts them based on their 'audio_start' timestamp to ensure correct chronological order, and aggregates them into a single text output. To display this in real-time on the user's webpage, another WebSocket connection is used. The server broadcasts the generated text to all connected clients (the browser in this case) via this WebSocket, updating the web page dynamically as the call progresses.
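One way to picture the sort-and-broadcast step is the sketch below. It assumes each result carries `audio_start` and `text`, as described above. The names `createTranscriptAggregator` and `broadcastTranscript`, and the `interim-transcription` event label, are hypothetical.

```javascript
// Keep the latest text per `audio_start` and rebuild the transcript in time order.
function createTranscriptAggregator() {
  const textByStart = new Map();
  return function add(result) {
    // A later result with the same audio_start replaces the earlier partial.
    textByStart.set(result.audio_start, result.text);
    return [...textByStart.keys()]
      .sort((a, b) => a - b)
      .map((start) => textByStart.get(start))
      .filter(Boolean) // skip empty partial results
      .join(" ");
  };
}

const OPEN = 1; // readyState value of an open WebSocket

// Send the transcript to every open client. `clients` is any iterable of
// objects exposing readyState and send(), such as a 'ws' server's client set.
function broadcastTranscript(clients, text) {
  const message = JSON.stringify({ event: "interim-transcription", text });
  let delivered = 0;
  for (const client of clients) {
    if (client.readyState === OPEN) {
      client.send(message);
      delivered++;
    }
  }
  return delivered;
}

const addResult = createTranscriptAggregator();
addResult({ audio_start: 1200, text: "world" });
console.log(addResult({ audio_start: 0, text: "hello" })); // prints "hello world"
```

Keying by `audio_start` means a final result overwrites the partial results for the same stretch of audio, so the page shows each phrase only once.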

IMPLEMENTING THE FRONTEND INTERFACE

A simple HTML file (index.html) is created to house the frontend logic. This HTML includes a script that establishes its own WebSocket connection to the server on port 8080. An 'onmessage' event handler is attached to this WebSocket. When the server sends transcription data, this handler receives it, extracts the text, and updates the 'innerHTML' of a designated HTML element on the page. This creates a seamless real-time display of the transcribed phone call content directly in the browser.
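The client-side handler might look roughly like the following sketch. The element id `transcription-container`, the function name `renderTranscript`, and the `interim-transcription` message shape are all assumptions. The sketch also sets `textContent` instead of `innerHTML`, a deliberate swap because the transcript is plain text.

```javascript
// Apply one server message to a page element. `element` is anything with a
// textContent property, so the function can be exercised outside a browser.
function renderTranscript(element, rawMessage) {
  const msg = JSON.parse(rawMessage);
  if (msg.event === "interim-transcription") {
    element.textContent = msg.text;
  }
  return element;
}

// Browser wiring, skipped when no DOM is available (for example under Node.js).
if (typeof document !== "undefined") {
  const output = document.getElementById("transcription-container");
  const socket = new WebSocket("ws://localhost:8080");
  socket.onmessage = (event) => renderTranscript(output, event.data);
}
```

Keeping the message handling in a plain function separates it from the socket setup, so it can be checked without opening a connection.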

Real-Time Call Transcription Steps

Practical takeaways from this episode

Do This

Set up a Node.js server using Express and WebSockets (ws).

Configure Twilio to stream incoming call audio to your WebSocket endpoint.

Use ngrok to expose your local server to the public internet for Twilio webhooks.

Integrate AssemblyAI's real-time streaming transcription API.

Handle audio data conversion from Twilio's mu-law format to PCM using the 'wavefile' package.

Send audio chunks to AssemblyAI, ensuring they meet the minimum duration requirement (e.g., 100ms).

Display the transcribed text on a website using another WebSocket connection.

Avoid This

Do not forget to install necessary npm packages (ws, express, wavefile).

Do not use a local server URL for Twilio webhooks; use a public URL from ngrok.

Do not send raw Twilio audio data directly to AssemblyAI without conversion.

Do not send audio chunks smaller than 100ms to AssemblyAI.

Do not forget to save your JavaScript code and restart the server.

Common Questions

This tutorial demonstrates how to transcribe phone calls in real-time using Twilio to stream the audio and AssemblyAI to perform the transcription via WebSockets.