Your Ground Truth Is Wrong: Evaluating STT with truth files & semantic WER | AssemblyAI Workshop
Key Moments
Traditional word error rate benchmarks are broken, often penalizing advanced ASR models more than simpler ones, and AssemblyAI's new 'Truth File Corrector' tool can fix this.
Key Insights
Word error rate (WER) is flawed because it treats all errors equally, from minor punctuation differences to critical hallucinations or incorrect names.
AssemblyAI's Universal 3 Pro model has scored worse on WER in some customer benchmarks than older models precisely because of its higher accuracy: it captures words that human transcribers missed, and those correct words are counted as insertion errors.
The new AssemblyAI 'Truth File Corrector' tool allows users to compare ASR transcriptions with human ground truth and create more accurate reference files for better evaluation.
Semantic Word Error Rate (SWER) aims to address issues where words are semantically equivalent (e.g., 'going to' vs. 'gonna') but are penalized by traditional WER.
A/B testing in production, focusing on user satisfaction and task completion, is a more reliable indicator of ASR performance than WER alone.
For streaming ASR, emission latency (speed of word-by-word output) and time to complete transcript (speed of final output after speaking) are more valuable metrics than time to first token.
Word error rate's fundamental flaws are misleading modern ASR evaluations
For years, Word Error Rate (WER) has been the standard metric for evaluating Automatic Speech Recognition (ASR) systems. However, the workshop highlights significant shortcomings in WER that are becoming increasingly problematic as ASR models advance. A key issue is that WER treats all errors – substitutions, deletions, and insertions – with equal weight. This means a minor error like 'okay' being transcribed as 'ok' is penalized the same as a critical hallucination or a misidentified proper noun. This undifferentiated approach fails to reflect the real-world impact of transcription errors. For instance, sophisticated models like OpenAI's Whisper are known for generating 'hallucinations,' fabricating words not present in the audio, which is treated identically to harmless errors that retain the original meaning. This can lead to a skewed perception of model performance, where more advanced models might appear worse under flawed benchmarks.
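The equal weighting is visible directly in WER's definition: total edits (substitutions, deletions, insertions) divided by reference length, with every edit costing exactly 1. A minimal sketch with plain dynamic programming, no external libraries; the example sentences are invented:

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: (substitutions + deletions + insertions) / reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # Levenshtein distance over words; every edit type costs exactly 1.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + sub)  # substitution
    return d[-1][-1] / len(ref)

# A harmless stylistic variant and a dangerous drug-name swap score identically:
harmless = wer("okay sounds good", "ok sounds good")
critical = wer("take warfarin daily", "take ibuprofen daily")
assert harmless == critical  # both are 1/3
```

The final assertion is the whole problem in one line: the metric cannot tell a formatting preference from a clinically dangerous substitution.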
Advanced models are penalized by human-labeled 'ground truth' errors
A surprising trend observed with AssemblyAI's Universal 3 Pro model is that it sometimes scored worse on customer benchmarks than its predecessor, Universal 2. This counterintuitive result puzzled the team, given the new model's superior capabilities. Upon rigorous manual inspection, the Applied AI team discovered that Universal 3 Pro's high accuracy meant it was correctly transcribing elements that human transcribers had actually missed or omitted in the 'ground truth' files. These missed elements often appeared as 'insertions' in the advanced model's output but were erroneously flagged as errors. This phenomenon reveals a critical vulnerability in the benchmarking process: when the reference ('ground truth') data itself is flawed, advanced models that are more accurate to the audio may incorrectly receive lower scores. This highlights the urgent need to ensure the integrity and accuracy of the ground truth files used for evaluation.
Introducing the truth file corrector to fix flawed benchmarks
To address the issues of inaccurate ground truth files, AssemblyAI has developed and launched the 'Truth File Corrector' tool. This tool empowers users to upload their audio files and existing human-labeled ground truth files. It then uses a high-performance ASR model, like Universal 3 Pro, to generate a new transcription. The tool presents side-by-side comparisons of the original ground truth and the new transcription, allowing users to easily review discrepancies. Through an interactive interface, users can confirm whether the ASR's correction is accurate or if the original ground truth was correct. By clicking buttons, users can build an updated, more accurate ground truth file. This corrected file can then be used in existing benchmarking pipelines, leading to more reliable and accurate evaluations of ASR models. The tool is integrated into the AssemblyAI dashboard, simplifying the process of improving evaluation datasets.
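The actual tool lives in the AssemblyAI dashboard, but the underlying review loop can be sketched roughly as follows. Here `difflib` stands in for the tool's side-by-side diff view and the `accept` callback for its confirm buttons; all names and sentences are hypothetical:

```python
import difflib

def review_discrepancies(truth_words, asr_words, accept):
    """Walk word-level diffs between a human truth file and an ASR transcript.

    `accept(truth_span, asr_span)` returns True to take the ASR version
    (mirroring the tool's confirm buttons), False to keep the original truth.
    Returns the corrected ground-truth word list.
    """
    corrected = []
    sm = difflib.SequenceMatcher(a=truth_words, b=asr_words)
    for op, i1, i2, j1, j2 in sm.get_opcodes():
        if op == "equal":
            corrected.extend(truth_words[i1:i2])
        else:
            t_span, a_span = truth_words[i1:i2], asr_words[j1:j2]
            corrected.extend(a_span if accept(t_span, a_span) else t_span)
    return corrected

# The human transcriber dropped a word that the model caught:
truth = "we will ship friday".split()
asr = "we will probably ship friday".split()
fixed = review_discrepancies(truth, asr, accept=lambda t, a: True)
assert fixed == ["we", "will", "probably", "ship", "friday"]
```

The corrected word list can then be written back out and dropped into an existing benchmarking pipeline in place of the original truth file.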
Semantic equivalence and normalizing variations improve accuracy
Beyond outright errors, traditional WER struggles with semantic equivalence – instances where different transcriptions convey the same meaning. Examples include 'going to' versus 'gonna,' 'okay' versus 'OK,' or variations in naming conventions like 'Mr. Smith' versus 'Mister Smith.' While these don't change the core meaning, they are treated as errors by WER. The workshop introduced the concept of Semantic Word Error Rate (SWER) and the use of semantic word lists to mitigate this. These lists can be used to map semantically similar terms, preventing them from being unfairly penalized. While tools like the Whisper normalizer can handle some basic transformations (e.g., 'don't' to 'do not'), they often fail to capture domain-specific jargon, industry variations, or formatting nuances (like 'healthcare' versus 'health care') that can differ between transcriptions and human labels.
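One way to apply a semantic word list is to normalize both the reference and the hypothesis to a shared canonical form before computing WER. A minimal sketch; the mapping table is an invented stand-in for a real domain-specific list:

```python
import re

# Hypothetical semantic word list: maps variants to one canonical form so
# meaning-preserving differences are not counted as errors.
SEMANTIC_MAP = {
    "gonna": "going to",
    "ok": "okay",
    "mister": "mr",
    "health care": "healthcare",
}

def normalize(text: str) -> str:
    """Lowercase, strip punctuation (keeping apostrophes), apply the word list."""
    text = text.lower()
    text = re.sub(r"[^\w\s']", "", text)
    for variant, canonical in SEMANTIC_MAP.items():
        # Whole-word match only, so 'ok' does not rewrite the inside of 'okay'.
        text = re.sub(rf"\b{re.escape(variant)}\b", canonical, text)
    return " ".join(text.split())

print(normalize("We're gonna check Mister Smith's health care plan"))
# we're going to check mr smith's healthcare plan
```

Running both the ground truth and the ASR output through `normalize()` before scoring keeps these variants from inflating the error rate, which is exactly where generic normalizers like Whisper's stop short on domain-specific jargon.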
Production A/B testing offers a true measure of performance
The most robust method for evaluating ASR performance in real-world scenarios is A/B testing in production. This approach directly measures how transcription quality impacts key business outcomes, such as customer satisfaction, support ticket volume, or task completion rates. Instead of relying solely on offline metrics like WER, A/B testing involves segmenting live traffic and comparing the performance of different ASR models or configurations. By observing metrics like how often users need to correct transcripts, or whether specific ASR configurations lead to more successful customer interactions, businesses gain a clear understanding of which model provides the most value. This outcome-based evaluation is far more impactful than synthetic benchmarks, as it directly correlates ASR performance with user experience and business goals.
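The comparison between two ASR configurations can be made concrete with a standard two-proportion z-test on a task-completion metric. A stdlib-only sketch; the traffic split and all counts are invented:

```python
from math import erf, sqrt

def two_proportion_z(success_a, n_a, success_b, n_b):
    """Two-sided z-test: is model B's completion rate different from model A's?"""
    p_a, p_b = success_a / n_a, success_b / n_b
    pooled = (success_a + success_b) / (n_a + n_b)
    se = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    # Two-sided p-value from the standard normal CDF.
    p_value = 2 * (1 - 0.5 * (1 + erf(abs(z) / sqrt(2))))
    return z, p_value

# Hypothetical 50/50 rollout; "success" = user finished the task without
# manually correcting the transcript.
z, p = two_proportion_z(success_a=840, n_a=1000, success_b=885, n_b=1000)
if p < 0.05:
    print(f"significant difference (z={z:.2f}, p={p:.4f})")
```

The point is that the decision criterion is the business outcome itself, not an offline WER score: whichever model moves completion rate wins, regardless of how the two compare on a synthetic benchmark.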
Missed Entity Rate focuses on critical information extraction
Word Error Rate treats all words equally, which is problematic when certain entities are crucial for a given use case. Missed Entity Rate (MER) addresses this by focusing on the accurate transcription of named entities, such as names, medical terms, credit card numbers, or email addresses. The process involves extracting these entities from ground truth data and assessing how many were correctly captured by the ASR model. For example, in a medical context, correctly transcribing a drug name is paramount. MER can quantify how often a model misses these critical terms, providing a more targeted evaluation than general WER. The workflow can involve using an LLM to identify entities in both the ground truth and the ASR output, then comparing them. A MER of zero indicates all critical entities were captured, highlighting superior performance in information-sensitive applications.
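Given an entity list, the metric itself is simple to compute. A minimal sketch with naive lowercase substring matching; in the workflow described above an LLM would extract the entity list, whereas here it is supplied directly, and the medical example is invented:

```python
def missed_entity_rate(truth_entities, hypothesis: str) -> float:
    """Fraction of critical ground-truth entities absent from the ASR output.

    0.0 means every critical entity was captured. Matching here is a naive
    case-insensitive substring check; a real pipeline would fuzzy-match.
    """
    text = hypothesis.lower()
    missed = [e for e in truth_entities if e.lower() not in text]
    return len(missed) / len(truth_entities)

entities = ["warfarin", "metformin", "dr patel"]
hyp = "dr patel stopped the warfarin and increased the medication dose"
print(missed_entity_rate(entities, hyp))  # 'metformin' was dropped -> 1/3
```

A transcript could score well on overall WER while still missing the one drug name that matters; MER surfaces exactly that failure.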
Evaluating streaming ASR requires different latency metrics
Evaluating streaming Automatic Speech Recognition (ASR) presents unique challenges compared to asynchronous processing. While accuracy metrics like WER and MER remain the same, latency metrics differ significantly. Metrics like 'time to first byte' or 'time to first token,' popular in LLMs, are less valuable for streaming ASR. Instead, AssemblyAI recommends focusing on 'emission latency' and 'time to complete transcript.' Emission latency measures how quickly individual words are transcribed after being spoken, crucial for real-time applications where immediate word-level feedback is needed. 'Time to complete transcript' measures how quickly the system provides the full, finalized transcription after the user has finished speaking. This is especially critical for voice agents, as it determines how fast the agent can process the user's input and respond, directly impacting user experience. For voice agent use cases, prioritizing these specific latency metrics and outcome-based A/B testing is essential for effective evaluation.
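Both latency metrics fall out directly from word-level timestamps. A minimal sketch; all timestamps are invented and the variable names are hypothetical:

```python
def emission_latencies(word_end_times, emit_times):
    """Per-word emission latency: delay between a word finishing in the
    audio and the streaming API emitting it."""
    return [emit - end for end, emit in zip(word_end_times, emit_times)]

def time_to_complete(speech_end: float, final_transcript_time: float) -> float:
    """Delay between the user finishing speaking and the finalized transcript."""
    return final_transcript_time - speech_end

# Hypothetical timestamps in seconds for a three-word utterance.
spoken_ends = [0.4, 0.9, 1.3]    # when each word finished in the audio
emitted_at = [0.7, 1.2, 1.55]    # when the partial transcript showed it
finalized_at = 1.9               # when the final transcript arrived

per_word = emission_latencies(spoken_ends, emitted_at)
ttc = time_to_complete(speech_end=spoken_ends[-1],
                       final_transcript_time=finalized_at)
```

For a voice agent, `ttc` is the number that gates responsiveness: the agent cannot start reasoning about the user's request until the finalized transcript lands, so shaving it down matters more than a fast first token.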
Common Questions
Why is traditional word error rate (WER) considered flawed?
WER treats all errors equally, meaning minor differences like 'okay' vs. 'ok' are penalized the same as critical errors like misnamed companies or drug names, and it doesn't account for semantic equivalence where the meaning is preserved. It also penalizes accurate models for inserting correct words that human transcribers missed.
Mentioned in this video
AssemblyAI: The company hosting the workshop, focused on AI solutions for speech-to-text.
OpenAI: Mentioned as the developer of the Whisper model, known for its advanced capabilities but also for hallucinations.
A platform mentioned as a source for open-source audio files and transcriptions that can be used with the truth file corrector.
Whisper: An open-source speech-to-text model developed by OpenAI, used as a benchmark comparison.
Universal 2: AssemblyAI's previous speech-to-text model, shown to be outperformed by Universal 3 Pro.
An LLM used by AssemblyAI for evaluating transcription outputs and acting as a judge in benchmarking.
AssemblyAI's product that acts as a pass-through for frontier LLMs, used for simplifying API calls for LLM evaluations.
A software development kit available on GitHub for benchmarking speech-to-text models.
Mentioned as a tool that can be used with the benchmarking SDK for generating benchmarks via LLMs.