
Voice AI: Beyond Transcription with Granola, CoLoop & EdgeTier

AssemblyAI
Science & Technology | 6 min read | 56 min video
May 11, 2026 | 152 views
TL;DR

Voice AI is moving beyond simple transcription to provide structured insights, but domain-specific accuracy and multilingual support remain significant challenges.

Key Insights

1. Companies spend significant time post-processing transcribed audio, augmenting it with semantic context (like discussion guides) to improve accuracy, especially in technical domains.

2. EdgeTier processes 10,000-20,000 contact center conversations daily across multiple channels, differentiating on near real-time insights and proactive alerts.

3. Granola uses real-time transcription in its desktop app to build user confidence and enable features like "what did I just say?", while its iOS app uses asynchronous batch processing due to network constraints.

4. Accurate speaker diarization is crucial, as misattributions can lead to confusion, particularly in meetings where many people speak or in contact centers with multiple agents on one call.

5. Global language support, especially for low-resource languages, is a major hurdle; models struggle with dialects and mixed-language conversations, requiring specialized solutions.

6. Advancements in AI models, particularly LLMs in decoders and promptable, smaller models, are enabling better handling of noisy audio, mixed languages, and user-controlled output.

Beyond basic transcription: the need for richer context and accuracy

The panel discussion highlighted a clear shift in voice AI from basic transcription to generating structured outputs, actionable insights, and richer product experiences. Companies like CoLoop, EdgeTier, and Granola apply voice AI to diverse problems, from analyzing customer interviews to optimizing contact center operations and creating meeting summaries. However, achieving high accuracy, especially in complex or technical domains, remains a significant challenge. CoLoop, for instance, augments transcriptions from providers like AssemblyAI with semantic context drawn from project setup, discussion guides, and keywords. This is crucial both for correctly assigning speakers (diarization) and for transcribing specialized terminology, particularly in fields like pharmaceuticals where precise language is vital. The panelists agreed that while transcription is foundational, the true value lies in what happens next: analyzing, structuring, and deriving meaning from the spoken word.
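
To make this concrete, here is a minimal sketch of feeding discussion-guide vocabulary into a transcription request as boosted keywords, using AssemblyAI's Python SDK. The term list and file name are illustrative assumptions; this is not CoLoop's actual pipeline.

```python
# Sketch: bias transcription toward domain vocabulary harvested from a
# project's discussion guide. Requires `pip install assemblyai`.
import assemblyai as aai

aai.settings.api_key = "YOUR_API_KEY"

# Illustrative terms; in practice these would come from the project setup
# and discussion guide.
domain_terms = ["pharmacokinetics", "biosimilar", "adalimumab", "titration"]

config = aai.TranscriptionConfig(
    speaker_labels=True,       # request diarization alongside the text
    word_boost=domain_terms,   # nudge recognition toward the domain terms
    boost_param="high",
)

transcript = aai.Transcriber().transcribe("interview_audio.mp3", config=config)

for utterance in transcript.utterances:
    print(f"Speaker {utterance.speaker}: {utterance.text}")
```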

Processing high-volume data for real-time insights

EdgeTier operates at a massive scale, processing 10,000 to 20,000 contact center conversations daily across calls, emails, chats, and surveys. Their pipeline involves ingesting data from various CRM and contact center platforms, cleaning it into a unified format, and storing it in a PostgreSQL database. Post-processing layers add semantics like emotion detection and keywords. A key differentiator for EdgeTier is 'near real-time' insights, enabling them to assess conversations from the last 30 minutes and compare themes against historical data to detect unusual patterns or emerging issues. This allows for proactive alerts and fast action, critical for high-volume operations. The UI/UX for exploring this vast amount of data is also a major focus, needing to be both easy to use and flexible enough to answer diverse ad-hoc queries.
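
The "last 30 minutes versus history" comparison can be pictured as a simple spike detector over theme counts. The sketch below is a generic z-score approach under assumed data shapes, not EdgeTier's implementation.

```python
# Sketch: flag themes whose frequency in the most recent 30-minute window
# spikes relative to a historical baseline of past windows.
from collections import Counter
import statistics

def spike_alerts(recent_themes, historical_windows, z_threshold=3.0):
    """recent_themes: theme labels from the last 30 minutes.
    historical_windows: list of Counters, one per past 30-minute window."""
    if not historical_windows:
        return []
    alerts = []
    for theme, count in Counter(recent_themes).items():
        history = [window.get(theme, 0) for window in historical_windows]
        baseline = statistics.mean(history)
        spread = statistics.pstdev(history) or 1.0  # avoid dividing by zero
        z = (count - baseline) / spread
        if z > z_threshold:
            alerts.append((theme, count, round(z, 1)))
    return alerts

# Example: a theme like "refund delay" appearing 40 times against a quiet
# baseline would surface here as a proactive alert.
```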

Real-time versus asynchronous processing: product-driven decisions

Granola's approach to voice AI is dictated by its product needs and user experience. Their desktop application uses real-time transcription to provide immediate feedback to users, reassuring them that their meetings are being captured accurately and enabling features like asking "what did I just say?" This real-time aspect was a deliberate product decision to build user confidence early on. In contrast, their iOS app employs asynchronous batch processing for transcription. This is a pragmatic choice due to potential network constraints and the less predictable environments users might be in when using their phones, such as a coffee shop. While users don't typically fixate on the raw transcription itself, its quality is paramount as it forms the basis for downstream insights and summaries. Ensuring accurate transcription, including industry-specific keywords and names, is therefore a continuous effort.
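
A simplified sketch of that product-driven split might look like this; every function here is a hypothetical stub standing in for real streaming and batch transcription calls, not Granola's code.

```python
# Sketch: stream on desktop for instant feedback; record-then-batch on
# mobile where the network is unpredictable. All helpers are hypothetical.

def realtime_transcribe(chunk: bytes) -> str:
    """Hypothetical stub for a streaming speech-to-text call."""
    return f"<partial transcript of {len(chunk)} bytes>"

def batch_transcribe(path: str) -> list[str]:
    """Hypothetical stub for an asynchronous batch transcription job."""
    return [f"<full transcript of {path}>"]

def transcribe_session(chunks, platform: str, stable_network: bool):
    if platform == "desktop" and stable_network:
        # Desktop: emit text as it arrives, enabling "what did I just say?"
        for chunk in chunks:
            yield realtime_transcribe(chunk)
    else:
        # Mobile (e.g., coffee-shop Wi-Fi): save locally, transcribe later.
        path = "meeting_recording.m4a"  # assume chunks were spooled to disk
        yield from batch_transcribe(path)
```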

The critical importance of speaker diarization and identification

Accurate speaker diarization—identifying who said what—is universally recognized as essential. For CoLoop, it's imperative for assigning utterances to either moderators or participants, especially in qualitative research where precise attribution is key. EdgeTier faces similar challenges in contact centers, where differentiating between customer and agent speech is fundamental for analyzing customer sentiment and agent performance. Initial assumptions that the first speaker is always the agent fail when outbound calls or call transfers occur, necessitating sophisticated LLM-based methods to correctly identify speakers across multiple participants, supervisors, and even third parties in a three-way call. Granola also highlights the importance of proper diarization in meetings, where misattributing actions or statements can lead to significant confusion.
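
One way to implement the LLM-based role assignment described above is to show the model the opening diarized utterances and ask it to classify each anonymous speaker ID, rather than assuming the first speaker is the agent. The llm() call below is a hypothetical stub.

```python
# Sketch: classify diarized speakers by role using an LLM, robust to
# outbound calls, transfers, and three-way calls.

def llm(prompt: str) -> str:
    """Hypothetical stand-in for a chat-completion API call."""
    raise NotImplementedError

def label_speaker_roles(utterances, sample_size=20):
    """utterances: list of (speaker_id, text) pairs from diarization."""
    sample = "\n".join(f"{spk}: {text}" for spk, text in utterances[:sample_size])
    prompt = (
        "The excerpt below opens a contact-center call with anonymous "
        "speaker IDs. Label each ID as AGENT, CUSTOMER, SUPERVISOR, or "
        "THIRD_PARTY, judging by greetings, account questions, holds, and "
        "transfers. Reply as 'ID: ROLE' lines.\n\n" + sample
    )
    return llm(prompt)
```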

Addressing global language needs and noisy environments

A major area of excitement, and a significant challenge, is global language support, particularly for low-resource languages. Companies like CoLoop struggle to transcribe some Southeast Asian languages, including those spoken in the Philippines, accurately. EdgeTier likewise runs into trouble with dialects and regional variants (such as Flemish, the Belgian variety of Dutch, tripping up language detection). While multilingual models are improving, handling mixed-language conversations within a single utterance or call remains complex. Noisy environments and choppy audio present similar difficulties. Models are becoming more robust, and promptable models that let users ask to 'ignore background noise' or 'transcribe everything' are emerging, but truly flawless transcription in suboptimal conditions is still an aspiration. Some companies, like Granola, intentionally do not store audio, which limits their ability to re-evaluate and correct transcriptions; they instead rely on LLMs to interpret and smooth over minor inaccuracies in downstream analysis.
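
One generic pattern for coping with shaky language detection is confidence-based routing: fall back to a dialect-tuned model when the detector is unsure. The sketch below uses hypothetical stubs and is not any panelist's actual system.

```python
# Sketch: route audio to a specialized model when language detection
# confidence is low -- a common symptom of dialects or code-switching.

def detect_language(audio_path: str) -> tuple[str, float]:
    """Hypothetical: return (language_code, confidence)."""
    raise NotImplementedError

def transcribe_general(audio_path: str, language: str) -> str:
    """Hypothetical general-purpose multilingual model."""
    raise NotImplementedError

def transcribe_specialized(audio_path: str, language: str) -> str:
    """Hypothetical model tuned for a specific dialect or language family."""
    raise NotImplementedError

def transcribe_with_routing(audio_path: str, min_confidence: float = 0.8) -> str:
    language, confidence = detect_language(audio_path)
    if confidence >= min_confidence:
        return transcribe_general(audio_path, language)
    return transcribe_specialized(audio_path, language)
```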

The future of Voice AI: real-time agents and on-device processing

The panel explored future directions, including real-time voice agents and on-device processing. CoLoop sees potential in real-time transcription for stakeholder involvement during research calls, supporting moderators with live insights, and even developing AI interviewers. EdgeTier is exploring real-time alerts for supervisors and agents within their systems, though their current focus remains on 'near real-time' post-call analysis due to integration complexities. The rise of smaller, more efficient AI models opens possibilities for on-device processing, potentially reducing reliance on cloud APIs and offering faster, more private interactions. However, for applications where data quality is paramount, like high-cost professional transcription services, cloud-based, high-quality models will likely remain dominant for some time. The debate continues on how to balance the cost, speed, and quality of AI models, whether deployed locally or via API.
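
The cost/speed/quality balance the panel debated can be framed as a routing policy between an on-device model and a cloud API; the thresholds below are purely illustrative assumptions.

```python
# Sketch: pick a transcription backend based on privacy, accuracy, and
# job size. Thresholds are illustrative, not recommendations.

def choose_backend(audio_seconds: float, private: bool, premium_quality: bool) -> str:
    if private:
        return "on-device"  # audio never leaves the machine
    if premium_quality:
        return "cloud"      # e.g., professional-grade transcription
    # Short jobs: a small local model beats a network round-trip.
    return "on-device" if audio_seconds < 60 else "cloud"
```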

Navigating trust, security, and user adoption

Trust and reliability are key concerns, especially when dealing with sensitive customer data. EdgeTier found that lowering buyer risk, through a trial period and non-disruptive integration, was crucial for initial traction. They also noted a surprising variance in customer security postures, from stringent multi-month security reviews to casual password sharing. For Granola, the decision not to store audio data is a deliberate choice to build user trust and security. The complexity of advanced voice AI systems also presents an adoption challenge: companies may build sophisticated interfaces that users struggle to learn and use effectively. This is driving interest in agentic approaches and simpler entry points, such as MCP servers that let AI assistants query the data conversationally, or integrations within everyday workflows like Slack and email, exposing the power of voice AI without overwhelming the end user.

Common Questions

What makes transcription difficult in technical domains?

The main challenge lies in handling specific terminology accurately. Companies like CoLoop use semantic context from project setups and LLM passes to augment transcriptions and correct terms that sound similar but have different meanings in technical fields.
