
Voice AI: Beyond Transcription with Granola, CoLoop & EdgeTier

AssemblyAI
Science & Technology | 6 min read | 56 min video
May 11, 2026 | 152 views
TL;DR

Voice AI is moving beyond simple transcription to provide structured insights, but domain-specific accuracy and multilingual support remain significant challenges.

Key Insights

1. Companies spend significant time post-processing transcribed audio, augmenting it with semantic context (like discussion guides) to improve accuracy, especially in technical domains.

2. EdgeTier processes 10,000-20,000 contact center conversations daily across multiple channels, differentiating on near real-time insights and proactive alerts.

3. Granola uses real-time transcription in its desktop app to build user confidence and enable features like "what did I just say?", while its iOS app uses asynchronous batch processing due to network constraints.

4. Accurate speaker diarization is crucial, as misattributions can lead to confusion, particularly in meetings where many people speak or in contact centers with multiple agents on one call.

5. Global language support, especially for low-resource languages, is a major hurdle; models struggle with dialects and mixed-language conversations, requiring specialized solutions.

6. Advancements in AI models, particularly LLMs in decoders and promptable, smaller models, are enabling better handling of noisy audio, mixed languages, and user-controlled output.

Beyond basic transcription: the need for richer context and accuracy

The panel discussion highlighted a clear shift in voice AI from basic transcription to generating structured outputs, actionable insights, and richer product experiences. Companies like CoLoop, EdgeTier, and Granola apply voice AI to diverse problems, from analyzing customer interviews to optimizing contact center operations and creating meeting summaries. However, achieving high accuracy, especially in complex or technical domains, remains a significant challenge. CoLoop, for instance, augments transcriptions from providers like AssemblyAI with semantic context drawn from project setup, discussion guides, and keywords. This is crucial both for correctly assigning speakers (diarization) and for transcribing specialized terminology, particularly in fields like pharmaceuticals where precise language is vital. The panelists agreed that while transcription is foundational, the true value lies in what happens next: analyzing, structuring, and deriving meaning from the spoken word.
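
To make this concrete, here is a minimal sketch of feeding discussion-guide vocabulary into a transcription request as boosted keywords, using AssemblyAI's Python SDK. The term list and file name are illustrative assumptions; this is not CoLoop's actual pipeline.

```python
# Sketch: bias transcription toward domain vocabulary harvested from a
# project's discussion guide. Requires `pip install assemblyai`.
import assemblyai as aai

aai.settings.api_key = "YOUR_API_KEY"

# Illustrative terms; in practice these would come from the project setup
# and discussion guide.
domain_terms = ["pharmacokinetics", "biosimilar", "adalimumab", "titration"]

config = aai.TranscriptionConfig(
    speaker_labels=True,       # request diarization alongside the text
    word_boost=domain_terms,   # nudge recognition toward the domain terms
    boost_param="high",
)

transcript = aai.Transcriber().transcribe("interview_audio.mp3", config=config)

for utterance in transcript.utterances:
    print(f"Speaker {utterance.speaker}: {utterance.text}")
```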

Processing high-volume data for real-time insights

EdgeTier operates at a massive scale, processing 10,000 to 20,000 contact center conversations daily across calls, emails, chats, and surveys. Their pipeline involves ingesting data from various CRM and contact center platforms, cleaning it into a unified format, and storing it in a PostgreSQL database. Post-processing layers add semantics like emotion detection and keywords. A key differentiator for EdgeTier is 'near real-time' insights, enabling them to assess conversations from the last 30 minutes and compare themes against historical data to detect unusual patterns or emerging issues. This allows for proactive alerts and fast action, critical for high-volume operations. The UI/UX for exploring this vast amount of data is also a major focus, needing to be both easy to use and flexible enough to answer diverse ad-hoc queries.
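
The "last 30 minutes versus history" comparison can be pictured as a simple spike detector over theme counts. The sketch below is a generic z-score approach under assumed data shapes, not EdgeTier's implementation.

```python
# Sketch: flag themes whose frequency in the most recent 30-minute window
# spikes relative to a historical baseline of past windows.
from collections import Counter
import statistics

def spike_alerts(recent_themes, historical_windows, z_threshold=3.0):
    """recent_themes: theme labels from the last 30 minutes.
    historical_windows: list of Counters, one per past 30-minute window."""
    if not historical_windows:
        return []
    alerts = []
    for theme, count in Counter(recent_themes).items():
        history = [window.get(theme, 0) for window in historical_windows]
        baseline = statistics.mean(history)
        spread = statistics.pstdev(history) or 1.0  # avoid dividing by zero
        z = (count - baseline) / spread
        if z > z_threshold:
            alerts.append((theme, count, round(z, 1)))
    return alerts

# Example: a theme like "refund delay" appearing 40 times against a quiet
# baseline would surface here as a proactive alert.
```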

Real-time versus asynchronous processing: product-driven decisions

Granola's approach to voice AI is dictated by its product needs and user experience. Their desktop application uses real-time transcription to provide immediate feedback to users, reassuring them that their meetings are being captured accurately and enabling features like asking "what did I just say?" This real-time aspect was a deliberate product decision to build user confidence early on. In contrast, their iOS app employs asynchronous batch processing for transcription. This is a pragmatic choice due to potential network constraints and the less predictable environments users might be in when using their phones, such as a coffee shop. While users don't typically fixate on the raw transcription itself, its quality is paramount as it forms the basis for downstream insights and summaries. Ensuring accurate transcription, including industry-specific keywords and names, is therefore a continuous effort.
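
A simplified sketch of that product-driven split might look like this; every function here is a hypothetical stub standing in for real streaming and batch transcription calls, not Granola's code.

```python
# Sketch: stream on desktop for instant feedback; record-then-batch on
# mobile where the network is unpredictable. All helpers are hypothetical.

def realtime_transcribe(chunk: bytes) -> str:
    """Hypothetical stub for a streaming speech-to-text call."""
    return f"<partial transcript of {len(chunk)} bytes>"

def batch_transcribe(path: str) -> list[str]:
    """Hypothetical stub for an asynchronous batch transcription job."""
    return [f"<full transcript of {path}>"]

def transcribe_session(chunks, platform: str, stable_network: bool):
    if platform == "desktop" and stable_network:
        # Desktop: emit text as it arrives, enabling "what did I just say?"
        for chunk in chunks:
            yield realtime_transcribe(chunk)
    else:
        # Mobile (e.g., coffee-shop Wi-Fi): save locally, transcribe later.
        path = "meeting_recording.m4a"  # assume chunks were spooled to disk
        yield from batch_transcribe(path)
```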

The critical importance of speaker diarization and identification

Accurate speaker diarization—identifying who said what—is universally recognized as essential. For CoLoop, it's imperative for assigning utterances to either moderators or participants, especially in qualitative research where precise attribution is key. EdgeTier faces similar challenges in contact centers, where differentiating between customer and agent speech is fundamental for analyzing customer sentiment and agent performance. Initial assumptions that the first speaker is always the agent fail when outbound calls or call transfers occur, necessitating sophisticated LLM-based methods to correctly identify speakers across multiple participants, supervisors, and even third parties in a three-way call. Granola also highlights the importance of proper diarization in meetings, where misattributing actions or statements can lead to significant confusion.
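
One way to implement the LLM-based role assignment described above is to show the model the opening diarized utterances and ask it to classify each anonymous speaker ID, rather than assuming the first speaker is the agent. The llm() call below is a hypothetical stub.

```python
# Sketch: classify diarized speakers by role using an LLM, robust to
# outbound calls, transfers, and three-way calls.

def llm(prompt: str) -> str:
    """Hypothetical stand-in for a chat-completion API call."""
    raise NotImplementedError

def label_speaker_roles(utterances, sample_size=20):
    """utterances: list of (speaker_id, text) pairs from diarization."""
    sample = "\n".join(f"{spk}: {text}" for spk, text in utterances[:sample_size])
    prompt = (
        "The excerpt below opens a contact-center call with anonymous "
        "speaker IDs. Label each ID as AGENT, CUSTOMER, SUPERVISOR, or "
        "THIRD_PARTY, judging by greetings, account questions, holds, and "
        "transfers. Reply as 'ID: ROLE' lines.\n\n" + sample
    )
    return llm(prompt)
```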

Addressing global language needs and noisy environments

A major area of excitement, and a significant challenge, is global language support, particularly for low-resource languages. Companies like CoLoop struggle to transcribe some Southeast Asian languages, including those spoken in the Philippines, accurately. EdgeTier likewise runs into trouble with dialects and regional variants (such as Flemish, the Belgian variety of Dutch, tripping up language detection). While multilingual models are improving, handling mixed-language conversations within a single utterance or call remains complex. Noisy environments and choppy audio present similar difficulties. Models are becoming more robust, and promptable models that let users ask to 'ignore background noise' or 'transcribe everything' are emerging, but truly flawless transcription in suboptimal conditions is still an aspiration. Some companies, like Granola, intentionally do not store audio, which limits their ability to re-evaluate and correct transcriptions; they instead rely on LLMs to interpret and smooth over minor inaccuracies in downstream analysis.
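
One generic pattern for coping with shaky language detection is confidence-based routing: fall back to a dialect-tuned model when the detector is unsure. The sketch below uses hypothetical stubs and is not any panelist's actual system.

```python
# Sketch: route audio to a specialized model when language detection
# confidence is low -- a common symptom of dialects or code-switching.

def detect_language(audio_path: str) -> tuple[str, float]:
    """Hypothetical: return (language_code, confidence)."""
    raise NotImplementedError

def transcribe_general(audio_path: str, language: str) -> str:
    """Hypothetical general-purpose multilingual model."""
    raise NotImplementedError

def transcribe_specialized(audio_path: str, language: str) -> str:
    """Hypothetical model tuned for a specific dialect or language family."""
    raise NotImplementedError

def transcribe_with_routing(audio_path: str, min_confidence: float = 0.8) -> str:
    language, confidence = detect_language(audio_path)
    if confidence >= min_confidence:
        return transcribe_general(audio_path, language)
    return transcribe_specialized(audio_path, language)
```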

The future of Voice AI: real-time agents and on-device processing

The panel explored future directions, including real-time voice agents and on-device processing. CoLoop sees potential in real-time transcription for stakeholder involvement during research calls, supporting moderators with live insights, and even developing AI interviewers. EdgeTier is exploring real-time alerts for supervisors and agents within their systems, though their current focus remains on 'near real-time' post-call analysis due to integration complexities. The rise of smaller, more efficient AI models opens possibilities for on-device processing, potentially reducing reliance on cloud APIs and offering faster, more private interactions. However, for applications where data quality is paramount, like high-cost professional transcription services, cloud-based, high-quality models will likely remain dominant for some time. The debate continues on how to balance the cost, speed, and quality of AI models, whether deployed locally or via API.
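
The cost/speed/quality balance the panel debated can be framed as a routing policy between an on-device model and a cloud API; the thresholds below are purely illustrative assumptions.

```python
# Sketch: pick a transcription backend based on privacy, accuracy, and
# job size. Thresholds are illustrative, not recommendations.

def choose_backend(audio_seconds: float, private: bool, premium_quality: bool) -> str:
    if private:
        return "on-device"  # audio never leaves the machine
    if premium_quality:
        return "cloud"      # e.g., professional-grade transcription
    # Short jobs: a small local model beats a network round-trip.
    return "on-device" if audio_seconds < 60 else "cloud"
```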

Navigating trust, security, and user adoption

Trust and reliability are key concerns, especially when dealing with sensitive customer data. EdgeTier found that lowering buyer risk, through a trial period and non-disruptive integration, was crucial for initial traction. They also noted a surprising variance in customer security postures, from stringent multi-month security reviews to casual password sharing. For Granola, the decision not to store audio data is a deliberate choice to build user trust and security. The complexity of advanced voice AI systems also presents an adoption challenge: companies may build sophisticated interfaces that users struggle to learn and use effectively. This is driving interest in agentic approaches and simpler entry points, such as MCP servers that let AI assistants query the data conversationally, or integrations within everyday workflows like Slack and email, exposing the power of voice AI without overwhelming the end user.

Common Questions

What makes transcription difficult in technical domains?

The main challenge lies in handling specific terminology accurately. Companies like CoLoop use semantic context from project setups and LLM passes to augment transcriptions and correct terms that sound similar but have different meanings in technical fields.
