Your Ground Truth Is Wrong: Evaluating STT with truth files & semantic WER | AssemblyAI Workshop
Key Moments
Traditional word error rate benchmarks are broken, often penalizing advanced ASR models more than simpler ones, and AssemblyAI's new 'Truth File Corrector' tool can fix this.
Key Insights
Word error rate (WER) is flawed because it treats all errors equally, from minor punctuation differences to critical hallucinations or incorrect names.
AssemblyAI's Universal 3 Pro model has scored worse on WER in some customer benchmarks than older models precisely because of its higher accuracy: it captures words that human transcribers missed, and those correct words are counted as insertion errors.
The new AssemblyAI 'Truth File Corrector' tool allows users to compare ASR transcriptions with human ground truth and create more accurate reference files for better evaluation.
Semantic Word Error Rate (SWER) aims to address issues where words are semantically equivalent (e.g., 'going to' vs. 'gonna') but are penalized by traditional WER.
A/B testing in production, focusing on user satisfaction and task completion, is a more reliable indicator of ASR performance than WER alone.
For streaming ASR, emission latency (speed of word-by-word output) and time to complete transcript (speed of final output after speaking) are more valuable metrics than time to first token.
Word error rate's fundamental flaws are misleading modern ASR evaluations
For years, Word Error Rate (WER) has been the standard metric for evaluating Automatic Speech Recognition (ASR) systems. However, the workshop highlights significant shortcomings in WER that are becoming increasingly problematic as ASR models advance. A key issue is that WER treats all errors – substitutions, deletions, and insertions – with equal weight. This means a minor error like 'okay' being transcribed as 'ok' is penalized the same as a critical hallucination or a misidentified proper noun. This undifferentiated approach fails to reflect the real-world impact of transcription errors. For instance, sophisticated models like OpenAI's Whisper are known for generating 'hallucinations,' fabricating words not present in the audio, which is treated identically to harmless errors that retain the original meaning. This can lead to a skewed perception of model performance, where more advanced models might appear worse under flawed benchmarks.
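The equal weighting is visible directly in WER's definition: total edits (substitutions, deletions, insertions) divided by reference length, with every edit costing exactly 1. A minimal sketch with plain dynamic programming, no external libraries; the example sentences are invented:

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: (substitutions + deletions + insertions) / reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # Levenshtein distance over words; every edit type costs exactly 1.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + sub)  # substitution
    return d[-1][-1] / len(ref)

# A harmless stylistic variant and a dangerous drug-name swap score identically:
harmless = wer("okay sounds good", "ok sounds good")
critical = wer("take warfarin daily", "take ibuprofen daily")
assert harmless == critical  # both are 1/3
```

The final assertion is the whole problem in one line: the metric cannot tell a formatting preference from a clinically dangerous substitution.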
Advanced models are penalized by human-labeled 'ground truth' errors
A surprising trend observed with AssemblyAI's Universal 3 Pro model is that it sometimes scored worse on customer benchmarks than its predecessor, Universal 2. This counterintuitive result puzzled the team, given the new model's superior capabilities. Upon rigorous manual inspection, the Applied AI team discovered that Universal 3 Pro's high accuracy meant it was correctly transcribing elements that human transcribers had actually missed or omitted in the 'ground truth' files. These missed elements often appeared as 'insertions' in the advanced model's output but were erroneously flagged as errors. This phenomenon reveals a critical vulnerability in the benchmarking process: when the reference ('ground truth') data itself is flawed, advanced models that are more accurate to the audio may incorrectly receive lower scores. This highlights the urgent need to ensure the integrity and accuracy of the ground truth files used for evaluation.
Introducing the truth file corrector to fix flawed benchmarks
To address the issues of inaccurate ground truth files, AssemblyAI has developed and launched the 'Truth File Corrector' tool. This tool empowers users to upload their audio files and existing human-labeled ground truth files. It then uses a high-performance ASR model, like Universal 3 Pro, to generate a new transcription. The tool presents side-by-side comparisons of the original ground truth and the new transcription, allowing users to easily review discrepancies. Through an interactive interface, users can confirm whether the ASR's correction is accurate or if the original ground truth was correct. By clicking buttons, users can build an updated, more accurate ground truth file. This corrected file can then be used in existing benchmarking pipelines, leading to more reliable and accurate evaluations of ASR models. The tool is integrated into the AssemblyAI dashboard, simplifying the process of improving evaluation datasets.
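The actual tool lives in the AssemblyAI dashboard, but the underlying review loop can be sketched roughly as follows. Here `difflib` stands in for the tool's side-by-side diff view and the `accept` callback for its confirm buttons; all names and sentences are hypothetical:

```python
import difflib

def review_discrepancies(truth_words, asr_words, accept):
    """Walk word-level diffs between a human truth file and an ASR transcript.

    `accept(truth_span, asr_span)` returns True to take the ASR version
    (mirroring the tool's confirm buttons), False to keep the original truth.
    Returns the corrected ground-truth word list.
    """
    corrected = []
    sm = difflib.SequenceMatcher(a=truth_words, b=asr_words)
    for op, i1, i2, j1, j2 in sm.get_opcodes():
        if op == "equal":
            corrected.extend(truth_words[i1:i2])
        else:
            t_span, a_span = truth_words[i1:i2], asr_words[j1:j2]
            corrected.extend(a_span if accept(t_span, a_span) else t_span)
    return corrected

# The human transcriber dropped a word that the model caught:
truth = "we will ship friday".split()
asr = "we will probably ship friday".split()
fixed = review_discrepancies(truth, asr, accept=lambda t, a: True)
assert fixed == ["we", "will", "probably", "ship", "friday"]
```

The corrected word list can then be written back out and dropped into an existing benchmarking pipeline in place of the original truth file.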
Semantic equivalence and normalizing variations improve accuracy
Beyond outright errors, traditional WER struggles with semantic equivalence – instances where different transcriptions convey the same meaning. Examples include 'going to' versus 'gonna,' 'okay' versus 'OK,' or variations in naming conventions like 'Mr. Smith' versus 'Mister Smith.' While these don't change the core meaning, they are treated as errors by WER. The workshop introduced the concept of Semantic Word Error Rate (SWER) and the use of semantic word lists to mitigate this. These lists can be used to map semantically similar terms, preventing them from being unfairly penalized. While tools like the Whisper normalizer can handle some basic transformations (e.g., 'don't' to 'do not'), they often fail to capture domain-specific jargon, industry variations, or formatting nuances (like 'healthcare' versus 'health care') that can differ between transcriptions and human labels.
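One way to apply a semantic word list is to normalize both the reference and the hypothesis to a shared canonical form before computing WER. A minimal sketch; the mapping table is an invented stand-in for a real domain-specific list:

```python
import re

# Hypothetical semantic word list: maps variants to one canonical form so
# meaning-preserving differences are not counted as errors.
SEMANTIC_MAP = {
    "gonna": "going to",
    "ok": "okay",
    "mister": "mr",
    "health care": "healthcare",
}

def normalize(text: str) -> str:
    """Lowercase, strip punctuation (keeping apostrophes), apply the word list."""
    text = text.lower()
    text = re.sub(r"[^\w\s']", "", text)
    for variant, canonical in SEMANTIC_MAP.items():
        # Whole-word match only, so 'ok' does not rewrite the inside of 'okay'.
        text = re.sub(rf"\b{re.escape(variant)}\b", canonical, text)
    return " ".join(text.split())

print(normalize("We're gonna check Mister Smith's health care plan"))
# we're going to check mr smith's healthcare plan
```

Running both the ground truth and the ASR output through `normalize()` before scoring keeps these variants from inflating the error rate, which is exactly where generic normalizers like Whisper's stop short on domain-specific jargon.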
Production A/B testing offers a true measure of performance
The most robust method for evaluating ASR performance in real-world scenarios is A/B testing in production. This approach directly measures how transcription quality impacts key business outcomes, such as customer satisfaction, support ticket volume, or task completion rates. Instead of relying solely on offline metrics like WER, A/B testing involves segmenting live traffic and comparing the performance of different ASR models or configurations. By observing metrics like how often users need to correct transcripts, or whether specific ASR configurations lead to more successful customer interactions, businesses gain a clear understanding of which model provides the most value. This outcome-based evaluation is far more impactful than synthetic benchmarks, as it directly correlates ASR performance with user experience and business goals.
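The comparison between two ASR configurations can be made concrete with a standard two-proportion z-test on a task-completion metric. A stdlib-only sketch; the traffic split and all counts are invented:

```python
from math import erf, sqrt

def two_proportion_z(success_a, n_a, success_b, n_b):
    """Two-sided z-test: is model B's completion rate different from model A's?"""
    p_a, p_b = success_a / n_a, success_b / n_b
    pooled = (success_a + success_b) / (n_a + n_b)
    se = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    # Two-sided p-value from the standard normal CDF.
    p_value = 2 * (1 - 0.5 * (1 + erf(abs(z) / sqrt(2))))
    return z, p_value

# Hypothetical 50/50 rollout; "success" = user finished the task without
# manually correcting the transcript.
z, p = two_proportion_z(success_a=840, n_a=1000, success_b=885, n_b=1000)
if p < 0.05:
    print(f"significant difference (z={z:.2f}, p={p:.4f})")
```

The point is that the decision criterion is the business outcome itself, not an offline WER score: whichever model moves completion rate wins, regardless of how the two compare on a synthetic benchmark.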
Missed Entity Rate focuses on critical information extraction
Word Error Rate treats all words equally, which is problematic when certain entities are crucial for a given use case. Missed Entity Rate (MER) addresses this by focusing on the accurate transcription of named entities, such as names, medical terms, credit card numbers, or email addresses. The process involves extracting these entities from ground truth data and assessing how many were correctly captured by the ASR model. For example, in a medical context, correctly transcribing a drug name is paramount. MER can quantify how often a model misses these critical terms, providing a more targeted evaluation than general WER. The workflow can involve using an LLM to identify entities in both the ground truth and the ASR output, then comparing them. A MER of zero indicates all critical entities were captured, highlighting superior performance in information-sensitive applications.
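Given an entity list, the metric itself is simple to compute. A minimal sketch with naive lowercase substring matching; in the workflow described above an LLM would extract the entity list, whereas here it is supplied directly, and the medical example is invented:

```python
def missed_entity_rate(truth_entities, hypothesis: str) -> float:
    """Fraction of critical ground-truth entities absent from the ASR output.

    0.0 means every critical entity was captured. Matching here is a naive
    case-insensitive substring check; a real pipeline would fuzzy-match.
    """
    text = hypothesis.lower()
    missed = [e for e in truth_entities if e.lower() not in text]
    return len(missed) / len(truth_entities)

entities = ["warfarin", "metformin", "dr patel"]
hyp = "dr patel stopped the warfarin and increased the medication dose"
print(missed_entity_rate(entities, hyp))  # 'metformin' was dropped -> 1/3
```

A transcript could score well on overall WER while still missing the one drug name that matters; MER surfaces exactly that failure.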
Evaluating streaming ASR requires different latency metrics
Evaluating streaming Automatic Speech Recognition (ASR) presents unique challenges compared to asynchronous processing. While accuracy metrics like WER and MER remain the same, latency metrics differ significantly. Metrics like 'time to first byte' or 'time to first token,' popular in LLMs, are less valuable for streaming ASR. Instead, AssemblyAI recommends focusing on 'emission latency' and 'time to complete transcript.' Emission latency measures how quickly individual words are transcribed after being spoken, crucial for real-time applications where immediate word-level feedback is needed. 'Time to complete transcript' measures how quickly the system provides the full, finalized transcription after the user has finished speaking. This is especially critical for voice agents, as it determines how fast the agent can process the user's input and respond, directly impacting user experience. For voice agent use cases, prioritizing these specific latency metrics and outcome-based A/B testing is essential for effective evaluation.
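Both latency metrics fall out directly from word-level timestamps. A minimal sketch; all timestamps are invented and the variable names are hypothetical:

```python
def emission_latencies(word_end_times, emit_times):
    """Per-word emission latency: delay between a word finishing in the
    audio and the streaming API emitting it."""
    return [emit - end for end, emit in zip(word_end_times, emit_times)]

def time_to_complete(speech_end: float, final_transcript_time: float) -> float:
    """Delay between the user finishing speaking and the finalized transcript."""
    return final_transcript_time - speech_end

# Hypothetical timestamps in seconds for a three-word utterance.
spoken_ends = [0.4, 0.9, 1.3]    # when each word finished in the audio
emitted_at = [0.7, 1.2, 1.55]    # when the partial transcript showed it
finalized_at = 1.9               # when the final transcript arrived

per_word = emission_latencies(spoken_ends, emitted_at)
ttc = time_to_complete(speech_end=spoken_ends[-1],
                       final_transcript_time=finalized_at)
```

For a voice agent, `ttc` is the number that gates responsiveness: the agent cannot start reasoning about the user's request until the finalized transcript lands, so shaving it down matters more than a fast first token.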
Common Questions
Why is traditional word error rate (WER) considered flawed?
WER treats all errors equally, meaning minor differences like 'okay' vs. 'ok' are penalized the same as critical errors like misnamed companies or drug names, and it doesn't account for semantic equivalence where the meaning is preserved. It also penalizes accurate models for inserting correct words that human transcribers missed.
Mentioned in this video
AssemblyAI: The company hosting the workshop, focused on AI solutions for speech-to-text.
OpenAI: Mentioned as the developer of the Whisper model, known for its advanced capabilities but also for hallucinations.
A platform mentioned as a source for open-source audio files and transcriptions that can be used with the truth file corrector.
Whisper: An open-source speech-to-text model developed by OpenAI, used as a benchmark comparison.
Universal 2: AssemblyAI's previous speech-to-text model, shown to be outperformed by Universal 3 Pro.
An LLM used by AssemblyAI for evaluating transcription outputs and acting as a judge in benchmarking.
AssemblyAI's product that acts as a pass-through for frontier LLMs, used for simplifying API calls for LLM evaluations.
A software development kit available on GitHub for benchmarking speech-to-text models.
Mentioned as a tool that can be used with the benchmarking SDK for generating benchmarks via LLMs.