Key Moments
AI Dev 26 x SF | Adit Abraham: Better Agents with Better Data
AI agents can now perform complex tasks end-to-end, but silent data extraction errors in PDFs can lead to critical failures in high-stakes fields like healthcare, where 80-90% accuracy is insufficient.
Key Insights
AI agents are shifting from information synthesis (chatbots, search) to action-based systems capable of executing real human work end-to-end, requiring reading, decision-making, and writing.
PDFs are challenging because they were designed for print, not data interpretation: they often lack explicit structure, which leads to errors such as misread charts or incorrect reading order.
Vision-language models (VLMs) represent a significant advancement in document processing, improving accuracy with handwritten text and complex layouts compared to traditional CV methods.
Agentic OCR, using techniques like speculative decoding, aims to improve accuracy and determinism by letting models iteratively apply token-level corrections instead of regenerating the whole output.
For optimal LLM reasoning, data formatting is crucial: HTML is better for complex tables with merged cells, while Markdown is more token-efficient for simpler tabular data.
A file-based architecture where agents navigate a file system is emerging as an alternative to RAG, reducing friction and enabling agents to determine when they need additional information.
The evolving landscape of AI applications: from chatbots to action-oriented agents
The AI landscape has rapidly evolved from simple chatbots and search tools in 2023, focused on information synthesis and retrieval, to sophisticated action-based systems. These new agents are designed to execute complex, end-to-end human tasks, which includes not only understanding content but also making decisions and performing actions like writing or editing. As the scope of agent capabilities expands, the impact of even minor errors becomes more significant. Frontier models still struggle with real-world documents, where silent failures in data extraction, such as misinterpreting table contexts or incorrect reading order, can have profound consequences. This is particularly critical in high-stakes domains like healthcare, where an 80-90% accuracy rate is insufficient when patient outcomes are on the line.
Why PDFs remain a significant challenge for AI
Despite decades of effort, PDFs remain a formidable challenge for AI systems because of their fundamental design. Created for precise printing, PDFs often function like 'whiteboards' with arbitrary layouts, which makes it difficult for AI to recover semantic meaning and structure. Cues that humans read intuitively as relationships between text blocks, such as gaps between paragraphs or indentation, are lost in simple digital parsing. Charts and diagrams may encode complex data tables that require nuanced interpretation. Models can also miss subtle visual elements such as redlining in contracts, which dramatically alters the meaning of the content. Older methods that rely on file metadata are insufficient, especially for scanned or image-based PDFs, which is why more advanced processing techniques are needed.
Leveraging vision-language models and traditional CV for robust document understanding
The advent of vision-language models (VLMs) marked a step change in document processing. They can read diverse content, including handwritten text, in some cases more accurately than human readers. Combined with traditional computer vision (CV) techniques, VLMs provide a dual approach to document complexity. VLMs excel at interpreting content and nuance, while traditional CV methods such as object detection and table segmentation are deterministic and well suited to understanding layout and spatial relationships. The combination allows more reliable data extraction, especially with skewed scans, merged cells, or difficult reading orders. The goal is to preserve the visual structure that encodes meaning, so that AI agents can reason about documents with the same fidelity as humans.
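As a rough illustration of the deterministic, layout-driven side of this approach, the sketch below orders text blocks by their bounding boxes. It is not from the talk: the `Block` type, the `reading_order` function, and the assumption of a page with at most two columns are all hypothetical simplifications.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Block:
    """Hypothetical layout block: text plus a bounding box in image coordinates."""
    text: str
    x0: float  # left edge
    y0: float  # top edge (y grows downward)
    x1: float  # right edge
    y1: float  # bottom edge


def reading_order(blocks, page_width):
    """Order blocks for a page assumed to have at most two columns.

    Blocks that cross the page midline are treated as full-width (titles,
    wide tables). Between two full-width blocks, the left column is read
    top to bottom before the right column.
    """
    mid = page_width / 2
    ordered, left, right = [], [], []

    def flush():
        # Emit the pending band: whole left column first, then the right one.
        ordered.extend(left)
        ordered.extend(right)
        left.clear()
        right.clear()

    for block in sorted(blocks, key=lambda b: (b.y0, b.x0)):
        if block.x0 < mid < block.x1:
            flush()
            ordered.append(block)
        elif (block.x0 + block.x1) / 2 < mid:
            left.append(block)
        else:
            right.append(block)
    flush()
    return ordered


page = [
    Block("Title", 10, 0, 90, 10),
    Block("Left 1", 10, 20, 45, 40),
    Block("Right 1", 55, 20, 90, 40),
    Block("Left 2", 10, 50, 45, 70),
    Block("Right 2", 55, 50, 90, 70),
]
print([b.text for b in reading_order(page, page_width=100)])
```

A naive top-to-bottom sort would interleave the two columns ("Left 1", "Right 1", "Left 2", ...), which is the kind of silent reading-order error described above. Real layouts need far more than a midline test; the point is only that geometric rules give the same answer on every run.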
Agentic OCR and the quest for deterministic, high-fidelity extraction
To address the limitations of traditional optical character recognition (OCR), the concept of 'agentic OCR' has emerged. This technique, which often draws on principles of speculative decoding, focuses on iterative refinement at the token level. Instead of producing its output in a single pass, the system makes token-level edits, correcting individual characters or words to reach a more accurate and deterministic result. Because only the affected tokens change, document characteristics such as bounding boxes and overall structure are preserved, avoiding the distortion that can come with complete re-rendering. Agentic OCR moves towards an 'agent in the loop' paradigm, in which AI agents perform verification and correction and human intervention is reserved for the most complex cases. The result is higher-quality output, with human effort focused on the genuinely hard cases flagged by confidence scoring.
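The edit-in-place idea can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the system described in the talk and not an implementation of speculative decoding: `Token`, `Edit`, `refine`, and the toy `toy_proposer` (standing in for a model call) are hypothetical names.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Token:
    """Hypothetical OCR token: recognized text, its box, and a confidence."""
    text: str
    bbox: tuple  # (x0, y0, x1, y1)
    confidence: float


@dataclass(frozen=True)
class Edit:
    """A proposed correction to the text of one token."""
    index: int
    new_text: str
    confidence: float


def apply_edits(tokens, edits):
    """Return a new token list with edits applied; boxes and order are untouched."""
    out = list(tokens)
    for e in edits:
        out[e.index] = replace(out[e.index], text=e.new_text, confidence=e.confidence)
    return out


def refine(tokens, propose_edits, max_rounds=3):
    """Apply proposed token-level edits until none remain or rounds run out."""
    for _ in range(max_rounds):
        edits = propose_edits(tokens)
        if not edits:
            break
        tokens = apply_edits(tokens, edits)
    return tokens


def needs_review(tokens, threshold=0.9):
    """Indices of tokens whose confidence is low enough to route to a human."""
    return [i for i, t in enumerate(tokens) if t.confidence < threshold]


# Toy stand-in for a model: fixes two known misreads via a lookup table.
CORRECTIONS = {"1nvoice": "Invoice", "T0tal": "Total"}


def toy_proposer(tokens):
    return [Edit(i, CORRECTIONS[t.text], 0.99)
            for i, t in enumerate(tokens) if t.text in CORRECTIONS]


raw = [
    Token("1nvoice", (10, 10, 80, 24), 0.62),
    Token("T0tal", (10, 40, 50, 54), 0.70),
    Token("42.00", (60, 40, 100, 54), 0.85),
]
fixed = refine(raw, toy_proposer)
print([t.text for t in fixed], needs_review(fixed))
```

Every corrected token keeps its original bounding box, so downstream citations back to the page stay valid, and the token the proposer never touched is the only one left for human review.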
Optimizing data formatting for downstream AI consumers
A critical yet often overlooked part of AI pipeline development is formatting extracted data for its intended consumer, whether that is an LLM or an embedding model. The ideal format depends on the data's complexity. For large, simple tables without merged cells, Markdown is an effective, token-efficient format that LLMs reason over well. For tables with merged cells, where row and column spans carry meaning, HTML is the better format because it can express those spans. Choosing between Markdown and HTML dynamically, based on table complexity such as the presence of spans, gives the best results. When designing Retrieval-Augmented Generation (RAG) systems, it is also important to account for the limitations of embedding models, which may not capture the nuanced meaning of dense tables as well as LLMs do. Summarizing table contents into natural language for retrieval, and then passing the original ground-truth HTML structure to the LLM for reasoning, can significantly improve accuracy and prevent silent failures.
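The format choice described above can be sketched as a small dispatch on whether any cell carries a span. This is an illustrative sketch, not a published API: the `Cell` type and the `format_table` function are hypothetical, and the first row is assumed to be the header.

```python
from dataclasses import dataclass
from html import escape


@dataclass(frozen=True)
class Cell:
    """Hypothetical table cell; spans default to 1 (no merging)."""
    text: str
    rowspan: int = 1
    colspan: int = 1


def has_spans(rows):
    return any(c.rowspan > 1 or c.colspan > 1 for row in rows for c in row)


def to_markdown(rows):
    """Render a span-free table as a pipe table (first row is the header)."""
    def line(cells):
        return "| " + " | ".join(c.text.replace("|", "\\|") for c in cells) + " |"

    header, *body = rows
    separator = "| " + " | ".join("---" for _ in header) + " |"
    return "\n".join([line(header), separator, *(line(r) for r in body)])


def to_html(rows):
    """Render a table as HTML, keeping rowspan/colspan attributes."""
    parts = ["<table>"]
    for r, row in enumerate(rows):
        tag = "th" if r == 0 else "td"
        cells = []
        for c in row:
            attrs = ""
            if c.rowspan > 1:
                attrs += f' rowspan="{c.rowspan}"'
            if c.colspan > 1:
                attrs += f' colspan="{c.colspan}"'
            cells.append(f"<{tag}{attrs}>{escape(c.text)}</{tag}>")
        parts.append("<tr>" + "".join(cells) + "</tr>")
    parts.append("</table>")
    return "\n".join(parts)


def format_table(rows):
    """Use HTML only when spans would otherwise be lost; Markdown elsewhere."""
    return to_html(rows) if has_spans(rows) else to_markdown(rows)


simple = [[Cell("Item"), Cell("Qty")], [Cell("Widget"), Cell("2")]]
merged = [[Cell("Q1 results", colspan=2)], [Cell("Revenue"), Cell("10")]]
print(format_table(simple))
print(format_table(merged))
```

Markdown pipe tables have no way to express a merged cell, so flattening `merged` into Markdown would silently drop the fact that "Q1 results" covers both columns.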
Expanding agent capabilities beyond extraction to editing and creation
The next frontier for AI agents extends far beyond simple parsing and extraction. Modern applications require agents to perform multi-step workflows, including classification, document splitting, and even editing and content creation. For instance, an agent might need to classify documents, split large mail records into distinct parts, or fill out forms and generate new documents like slide decks or reports. Reducto offers specialized endpoints for these tasks: 'Parse' for faithful document representation, 'Split' for segmenting documents, 'Structured Extraction' for mapping data to schemas, and an 'Editing' endpoint for modifying documents. These capabilities allow for more sophisticated agentic workflows, enabling AI to handle end-to-end tasks that mimic human productivity.
Agent harnesses and the pursuit of superhuman performance through iterative evaluation
Agent harnesses offer a powerful mechanism for models to iteratively improve their own outputs, moving towards human-level or even superhuman performance. Features like Reducto's 'Deep Extract' utilize a parent agent to coordinate sub-agents, each equipped with specific validation rules. These agents repeatedly audit results, ensuring logical consistency, such as verifying that line items sum up to the total on an invoice. When compute time is not a constraint, these harnesses can achieve performance that surpasses human-in-the-loop processes, which are susceptible to human error, fatigue, or bias. This iterative self-correction is crucial for tackling complex tasks, including the accurate extraction of data from time-series charts, where specialized models can decompose the problem, interpret axes, and re-render outputs for verification, ultimately generating structured data tables from visual representations.
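The invoice check mentioned above makes a convenient example of a validation rule inside an audit loop. The sketch below is an assumption-laden simplification, not Reducto's Deep Extract: the parent and sub-agent structure is reduced to a single retry loop, and `extract_with_audit`, `check_line_items_sum`, and `toy_extract` are hypothetical names.

```python
from decimal import Decimal


def check_line_items_sum(invoice):
    """Return an error message, or None if the line items add up to the total."""
    total = Decimal(invoice["total"])
    items_sum = sum((Decimal(i["amount"]) for i in invoice["line_items"]), Decimal("0"))
    if items_sum != total:
        return f"line items sum to {items_sum}, but total is {total}"
    return None


def audit(result, rules):
    """Run every rule and collect the failures."""
    return [msg for rule in rules if (msg := rule(result)) is not None]


def extract_with_audit(extract, document, rules, max_attempts=3):
    """Re-run extraction, feeding back audit failures, until all rules pass."""
    feedback = []
    for attempt in range(1, max_attempts + 1):
        result = extract(document, feedback)
        feedback = audit(result, rules)
        if not feedback:
            return result, attempt
    raise ValueError(f"validation still failing after {max_attempts} attempts: {feedback}")


def toy_extract(document, feedback):
    """Stand-in for a model call: misreads the total until it receives feedback."""
    items = [{"amount": "100.00"}, {"amount": "25.00"}]
    total = "125.00" if feedback else "120.00"
    return {"line_items": items, "total": total}


result, attempts = extract_with_audit(toy_extract, "invoice.pdf", [check_line_items_sum])
print(result["total"], attempts)
```

`Decimal` is used rather than `float` so that currency amounts compare exactly. The loop trades extra compute for consistency, which matches the point above that harnesses do best when compute time is not a constraint.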
The paramount importance of end-to-end pipeline evaluation and a file-system architecture for agents
Effective pipeline development hinges on robust evaluation at every stage. A single end-to-end input-output check is not enough; evaluating each step (parsing, retrieval, formatting) as well as overall system performance is critical to prevent cascading failures, because early-stage parsing errors compound in every later step. The paradigm is also shifting away from traditional RAG towards file-based architectures. In this model, agents are given a file system and tools to navigate it, and they determine for themselves what information they need, which removes the friction of fixed chunking and embedding limitations. Coupled with careful metadata (such as bounding boxes for traceability), this approach lets agents search, plan, and execute tasks robustly. Ultimately, building systems where agents can not only retrieve but also write, edit, and produce deliverables is key to enabling end-to-end work that mirrors human processes of synthesis and creation.
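One way to picture the file-based model is as a small set of tools scoped to a workspace directory. The sketch below is hypothetical: the `FileTools` class and its method names are assumptions made for illustration, and wiring these functions into an actual agent framework is left out.

```python
from pathlib import Path


class FileTools:
    """Hypothetical tool set an agent could call to explore parsed documents."""

    def __init__(self, root):
        self.root = Path(root).resolve()

    def _resolve(self, relative):
        # Refuse any path that escapes the workspace (for example "../secret").
        path = (self.root / relative).resolve()
        if path != self.root and self.root not in path.parents:
            raise PermissionError(f"{relative!r} is outside the workspace")
        return path

    def list_files(self, relative="."):
        """List files under a directory, as paths relative to the workspace."""
        base = self._resolve(relative)
        return sorted(p.relative_to(self.root).as_posix()
                      for p in base.rglob("*") if p.is_file())

    def read_file(self, relative, start=0, max_lines=200):
        """Read a window of lines so the agent can page through large files."""
        text = self._resolve(relative).read_text(encoding="utf-8", errors="replace")
        return "\n".join(text.splitlines()[start:start + max_lines])

    def grep(self, needle, relative="."):
        """Case-insensitive search; returns (file, line number, line) tuples."""
        hits = []
        for name in self.list_files(relative):
            text = self._resolve(name).read_text(encoding="utf-8", errors="replace")
            for number, line in enumerate(text.splitlines(), start=1):
                if needle.lower() in line.lower():
                    hits.append((name, number, line.strip()))
        return hits
```

With tools like these, the agent decides when to list, search, or read more, rather than receiving a fixed set of pre-chunked passages. The line numbers returned by `grep` play a role similar to the bounding-box metadata mentioned above: they let an answer be traced back to its source.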
Common Questions
The primary bottleneck is providing high-quality, relevant data to the AI agents. Ensuring the data is accurately parsed, structured, and formatted for consumption by the agent is crucial for effective performance.
Mentioned in this video
Reducto: a company focused on building agentic document extraction for AI teams. They emphasize giving better data to agents and have processed over 3 billion documents.
Mentioned as an example of a new category of AI application companies that Reducto works with.
Mentioned as a provider of frontier models that still exhibit errors in real-world document processing, such as pulling context from the wrong rows in tables.
VLMs are presented as the most significant advancement for solving document processing challenges, capable of reading handwritten text and improving accuracy.
HTML (HyperText Markup Language): found to be better than Markdown for models reasoning over complex tables with merged cells.
Embedding models: models that create numerical representations of data, discussed in contrast to language models; they capture the nuances of dense tables less well than LLMs.
Deep Extract: a feature that acts as an agent harness for structured extraction, composed of a parent agent and sub-agents with validation rules to iteratively audit and refine results.
CLI (Command Line Interface): mentioned as a tool provided to agents in a file-based architecture, enabling them to navigate file systems and decide whether they need additional information.