Why are PDFs still challenging for AI to process accurately?

PDFs were originally designed for printing and lack inherent structure for machine readability. They can contain complex layouts, handwritten text, merged cells, and challenging reading orders, making accurate parsing and extraction difficult.

What is 'agentic OCR' and how does it improve accuracy?

Agentic OCR uses techniques like speculative decoding to achieve more accurate results than traditional OCR. It lets models iteratively correct their own errors and can act as an 'agent in the loop' for verification, catching mistakes that human reviewers might miss.

How should data be formatted for different AI consumers?

The ideal format depends on the consumer. Markdown is effective for simple tables, while HTML is better for complex tables with merged cells. Summarizing tabular data into natural language for embedding models can improve retrieval.

What are the key components of Reducto's platform for document processing?

Reducto offers four main endpoints: parse (faithfully capturing document structure), split (breaking documents into subregions), structured extraction (mapping data to a schema), and editing (inserting content into documents).

How can AI agents achieve superhuman performance?

By using sophisticated agent harnesses with iterative evaluation and validation loops, like Deep Extract. When not restricted by compute time, these systems can continuously refine their outputs to surpass human-level accuracy on many tasks.

Why is evaluating AI pipelines at every stage so important?

Errors at the beginning of a pipeline, such as in parsing or retrieval, can compound and lead to cascading failures. Continuous evaluation ensures each component functions correctly, leading to a more reliable end-to-end system.
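As a rough illustration of how errors compound, the sketch below multiplies per-stage accuracies under the simplifying assumption that stages fail independently. The stage names and the 95% figures are hypothetical, not measurements from the talk.

```python
from math import prod

# Hypothetical per-stage accuracies; illustrative numbers only.
stage_accuracy = {
    "parsing": 0.95,
    "retrieval": 0.95,
    "formatting": 0.95,
    "reasoning": 0.95,
}


def end_to_end_accuracy(stages: dict[str, float]) -> float:
    """Probability that every stage succeeds, assuming independent failures."""
    return prod(stages.values())


def weakest_stage(stages: dict[str, float]) -> str:
    """Name of the stage with the lowest accuracy (first one wins on ties)."""
    return min(stages, key=stages.get)


overall = end_to_end_accuracy(stage_accuracy)
print(f"end-to-end accuracy: {overall:.3f}")  # prints: end-to-end accuracy: 0.815
```

Under these assumptions, four stages that each look good in isolation still yield an end-to-end result that is right only about four times in five, which is why a single final check hides where the loss occurs.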

What is the shift happening away from traditional RAG?

There's a move towards file-based architectures where agents are given a file system and tools. This allows agents to autonomously navigate and select necessary information, reducing friction compared to traditional chunking and embedding limitations in RAG.

AI Dev 26 x SF | Adit Abraham: Better Agents with Better Data

DeepLearning.AI

Education | 6 min read | 29 min video

May 20, 2026 | 275 views

TL;DR

AI agents can now perform complex tasks end-to-end, but silent data extraction errors in PDFs can lead to critical failures in high-stakes fields like healthcare, where 80-90% accuracy is insufficient.

Key Insights

AI agents are shifting from information synthesis (chatbots, search) to action-based systems capable of executing real human work end-to-end, requiring reading, decision-making, and writing.

PDFs are challenging because they were designed for print, not data interpretation, often lacking intuitive structure and leading to errors like misinterpreting charts or reading order.

Vision-language models (VLMs) represent a significant advancement in document processing, improving accuracy with handwritten text and complex layouts compared to traditional CV methods.

Agentic OCR, using techniques like speculative decoding, aims to improve accuracy and determinism by letting models iteratively correct errors through token-level edits.

For optimal LLM reasoning, data formatting is crucial: HTML is better for complex tables with merged cells, while Markdown is token-efficient for simpler tabular data.

A file-based architecture where agents navigate a file system is emerging as an alternative to RAG, reducing friction and enabling agents to determine when they need additional information.

The evolving landscape of AI applications: from chatbots to action-oriented agents

The AI landscape has rapidly evolved from simple chatbots and search tools in 2023, focused on information synthesis and retrieval, to sophisticated action-based systems. These new agents are designed to execute complex, end-to-end human tasks, which includes not only understanding content but also making decisions and performing actions like writing or editing. As the scope of agent capabilities expands, the impact of even minor errors becomes more significant. Frontier models still struggle with real-world documents, where silent failures in data extraction, such as misinterpreting table contexts or incorrect reading order, can have profound consequences. This is particularly critical in high-stakes domains like healthcare, where an 80-90% accuracy rate is insufficient when patient outcomes are on the line.

Why PDFs remain a significant challenge for AI

Despite decades of effort, PDFs continue to pose a formidable challenge for AI systems due to their fundamental design. Originally created for precise printing, PDFs often function like 'whiteboards' with arbitrary layouts, making it difficult for AI to discern semantic meaning and structure. Elements like gaps between paragraphs, indentation, or visual cues humans intuitively understand to denote relationships between text blocks are lost in simple digital parsing. Furthermore, charts or diagrams within PDFs may represent complex data tables that require nuanced interpretation. Models can struggle with subtle visual elements, such as redlining in contracts, which dramatically alters the meaning of the content. Historical methods relying on file metadata are insufficient, especially with scanned or image-based PDFs, underscoring the need for more advanced processing techniques.

Leveraging vision-language models and traditional CV for robust document understanding

The advent of vision-language models (VLMs) has marked a step change in document processing: they can read diverse content, including handwritten text that they sometimes decipher better than human readers. VLMs, combined with traditional computer vision (CV) techniques, provide a powerful dual approach to tackling document complexity. While VLMs excel at interpreting the content and nuances of various inputs, traditional CV methods, such as object detection and table segmentation, offer determinism and are effective for understanding document layout and spatial relationships. This combination allows for more reliable data extraction, especially in scenarios involving skewed scans, merged cells, or challenging reading orders. The goal is to preserve the visual structure that encodes meaning, ensuring that AI agents can reason about documents with the same fidelity as humans.

Agentic OCR and the quest for deterministic, high-fidelity extraction

To address the limitations of traditional optical character recognition (OCR), the concept of 'agentic OCR' has emerged. This technique, often leveraging principles of speculative decoding, focuses on iterative refinement at the token level. Instead of a single pass, the system makes token-level edits, correcting individual characters or words to achieve a more accurate and deterministic output. This process not only enhances accuracy but also preserves crucial document characteristics like bounding boxes and overall structure, avoiding the distortion that can occur with complete re-rendering. Agentic OCR represents a move towards an 'agent in the loop' paradigm, where AI agents perform verification and correction, reducing the reliance on human intervention for all but the most complex cases. This leads to higher quality outputs and allows humans to focus their efforts on the truly challenging scenarios identified through confidence scoring.
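The token-level correction loop described above can be sketched as follows. This is a minimal illustration, not Reducto's implementation and not speculative decoding itself: the `Token` type, the `refine_tokens` function, the confidence threshold, and the lookup-table verifier (a stand-in for a VLM call) are all assumptions made for the example. The point it shows is that only the text of a low-confidence token changes, while its bounding box is left alone.

```python
from dataclasses import dataclass, replace
from typing import Callable, Optional


@dataclass(frozen=True)
class Token:
    text: str
    bbox: tuple[int, int, int, int]  # (x0, y0, x1, y1) in page pixels
    confidence: float


# A verifier inspects one token in context and returns a corrected string,
# or None to keep the token unchanged. Hypothetical interface.
Verifier = Callable[[list[Token], int], Optional[str]]


def refine_tokens(tokens: list[Token], verifier: Verifier,
                  threshold: float = 0.9, max_passes: int = 3) -> list[Token]:
    """Re-check low-confidence tokens until a pass makes no edits."""
    current = list(tokens)
    for _ in range(max_passes):
        changed = False
        for i, tok in enumerate(current):
            if tok.confidence >= threshold:
                continue  # trusted tokens are never touched
            fix = verifier(current, i)
            if fix is not None and fix != tok.text:
                # Edit the text only; the bounding box is preserved.
                # Confidence 1.0 marks the token as verified in this sketch.
                current[i] = replace(tok, text=fix, confidence=1.0)
                changed = True
        if not changed:
            break
    return current


def toy_verifier(tokens: list[Token], i: int) -> Optional[str]:
    """Stand-in for a model call: fixes two known OCR confusions."""
    known_fixes = {"T0tal": "Total", "1O0.00": "100.00"}
    return known_fixes.get(tokens[i].text)


page = [
    Token("T0tal", (10, 10, 60, 24), 0.62),
    Token("due:", (64, 10, 100, 24), 0.98),
    Token("1O0.00", (104, 10, 160, 24), 0.55),
]
fixed = refine_tokens(page, toy_verifier)
print(" ".join(t.text for t in fixed))  # prints: Total due: 100.00
```

Tokens the verifier cannot fix keep their low confidence, so they remain available for routing to a human reviewer.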

Optimizing data formatting for downstream AI consumers

A critical yet often overlooked aspect of AI pipeline development is formatting extracted data for its intended consumer, whether that be an LLM or an embedding model. The ideal format depends on the data's complexity. For large, simple tables without merged cells, Markdown is an effective, token-efficient format that LLMs can reason on well. However, for tables with merged cells, where row and column spans are vital, HTML is a superior format. Dynamically choosing between Markdown and HTML based on table complexity, such as the presence of spans, ensures optimal performance. Furthermore, when designing Retrieval Augmented Generation (RAG) systems, it's crucial to consider the limitations of embedding models. Embedding models may not capture the nuanced meaning of dense tables as well as LLMs. Therefore, summarizing table contents into natural language for retrieval and then passing the original ground truth HTML structure to the LLM for reasoning can significantly improve accuracy and prevent silent failures.
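The routing rule above (Markdown for span-free tables, HTML when spans are present) can be sketched as below. The `Cell` type and the function names `render_for_llm` and `index_record` are hypothetical, introduced only for this example; the summary passed to `index_record` is assumed to be written elsewhere, for instance by an LLM.

```python
from dataclasses import dataclass
from html import escape


@dataclass
class Cell:
    text: str
    rowspan: int = 1
    colspan: int = 1


Table = list[list[Cell]]  # the first row is treated as the header


def has_spans(table: Table) -> bool:
    """True if any cell covers more than one row or column."""
    return any(c.rowspan > 1 or c.colspan > 1 for row in table for c in row)


def to_markdown(table: Table) -> str:
    def line(cells: list[Cell]) -> str:
        # Escape pipes so cell text cannot break the column layout.
        return "| " + " | ".join(c.text.replace("|", "\\|") for c in cells) + " |"

    header, *body = table
    separator = "| " + " | ".join("---" for _ in header) + " |"
    return "\n".join([line(header), separator, *(line(r) for r in body)])


def to_html(table: Table) -> str:
    rows = []
    for r, row in enumerate(table):
        tag = "th" if r == 0 else "td"
        cells = []
        for c in row:
            attrs = ""
            if c.rowspan > 1:
                attrs += f' rowspan="{c.rowspan}"'
            if c.colspan > 1:
                attrs += f' colspan="{c.colspan}"'
            cells.append(f"<{tag}{attrs}>{escape(c.text)}</{tag}>")
        rows.append("<tr>" + "".join(cells) + "</tr>")
    return "<table>" + "".join(rows) + "</table>"


def render_for_llm(table: Table) -> str:
    """HTML keeps span information; Markdown is cheaper when there is none."""
    return to_html(table) if has_spans(table) else to_markdown(table)


def index_record(table: Table, summary: str) -> dict[str, str]:
    """Embed the natural-language summary, but hand the LLM the faithful table."""
    return {"embed_text": summary, "llm_payload": render_for_llm(table)}


simple = [[Cell("Item"), Cell("Qty")], [Cell("Widget"), Cell("2")]]
merged = [[Cell("Q1", colspan=2)], [Cell("Jan"), Cell("Feb")]]
print(render_for_llm(simple))
print(render_for_llm(merged))
```

In a retrieval setup, `embed_text` would go to the embedding model and `llm_payload` would be attached to the hit that reaches the LLM, so retrieval quality and reasoning fidelity are handled separately.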

Expanding agent capabilities beyond extraction to editing and creation

The next frontier for AI agents extends far beyond simple parsing and extraction. Modern applications require agents to perform multi-step workflows, including classification, document splitting, and even editing and content creation. For instance, an agent might need to classify documents, split large mail records into distinct parts, or fill out forms and generate new documents like slide decks or reports. Reducto offers specialized endpoints for these tasks: 'Parse' for faithful document representation, 'Split' for segmenting documents, 'Structured Extraction' for mapping data to schemas, and an 'Editing' endpoint for modifying documents. These capabilities allow for more sophisticated agentic workflows, enabling AI to handle end-to-end tasks that mimic human productivity.

Agent harnesses and the pursuit of superhuman performance through iterative evaluation

Agent harnesses offer a powerful mechanism for models to iteratively improve their own outputs, moving towards human-level or even superhuman performance. Features like Reducto's 'Deep Extract' use a parent agent to coordinate sub-agents, each equipped with specific validation rules. These agents repeatedly audit results for logical consistency, such as verifying that line items sum to the total on an invoice. When compute time is not a constraint, these harnesses can achieve performance that surpasses human-in-the-loop processes, which are susceptible to human error, fatigue, or bias. This iterative self-correction is crucial for complex tasks such as extracting data from time-series charts. There, specialized models can decompose the problem, interpret the axes, and re-render their output for verification, ultimately producing structured data tables from visual representations.
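The audit-and-retry idea can be sketched with the invoice rule mentioned above. This is a simplified, single-loop illustration under stated assumptions, not how Deep Extract is built: it has no parent agent or sub-agents, and `extract_with_audit`, the rule signature, and the toy extractor are hypothetical names.

```python
from decimal import Decimal
from typing import Callable

Invoice = dict  # e.g. {"line_items": [{"amount": "12.50"}], "total": "12.50"}
Rule = Callable[[Invoice], list[str]]  # returns problem messages, empty if OK


def line_items_match_total(invoice: Invoice) -> list[str]:
    """Validation rule: the line items must sum to the stated total."""
    items_sum = sum((Decimal(i["amount"]) for i in invoice["line_items"]),
                    Decimal("0"))
    total = Decimal(invoice["total"])
    if items_sum != total:
        return [f"line items sum to {items_sum}, but total is {total}"]
    return []


def extract_with_audit(extract: Callable[[list[str]], Invoice],
                       rules: list[Rule],
                       max_rounds: int = 3) -> tuple[Invoice, list[str]]:
    """Re-run extraction with the audit feedback until all rules pass."""
    if max_rounds < 1:
        raise ValueError("max_rounds must be at least 1")
    feedback: list[str] = []
    for _ in range(max_rounds):
        result = extract(feedback)
        feedback = [msg for rule in rules for msg in rule(result)]
        if not feedback:
            break
    # Any remaining feedback means the result still fails validation.
    return result, feedback


def toy_extract(feedback: list[str]) -> Invoice:
    """Stand-in for a model call: misreads 60.00 as 6.00 on the first try."""
    second_amount = "60.00" if feedback else "6.00"
    return {"line_items": [{"amount": "40.00"}, {"amount": second_amount}],
            "total": "100.00"}


result, problems = extract_with_audit(toy_extract, [line_items_match_total])
print(result["line_items"][1]["amount"], problems)  # prints: 60.00 []
```

Returning the leftover problems alongside the result lets a caller escalate cases that never pass validation instead of silently accepting them.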

The paramount importance of end-to-end pipeline evaluation and a file-system architecture for agents

Effective pipeline development hinges on robust evaluation at every stage. A single input-output check on the final answer is insufficient; each step (parsing, retrieval, formatting, and overall system performance) needs its own evaluation to prevent cascading failures, because early-stage parsing errors compound through every later step. Furthermore, the paradigm is shifting away from traditional RAG towards file-based architectures. In this model, agents are given a file system and tools to navigate it and decide what information they need, removing the friction of fixed chunking and the limitations of embeddings. This approach, coupled with careful metadata (like bounding boxes for traceability), allows agents to search, plan, and execute tasks robustly. Ultimately, building systems where agents can not only retrieve but also write, edit, and produce deliverables is key to enabling end-to-end work, mirroring human processes of synthesis and creation.
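To make the file-based idea concrete, here is a minimal sketch of the kind of tools an agent might be handed in place of a fixed chunk-and-embed index. The tool names (`list_files`, `grep`, `read_lines`) and the sample file are assumptions for illustration; a production version would also need to confine paths to the sandbox root.

```python
import tempfile
from pathlib import Path


def list_files(root: Path) -> list[str]:
    """All files under root, as sorted POSIX-style relative paths."""
    return sorted(p.relative_to(root).as_posix()
                  for p in root.rglob("*") if p.is_file())


def grep(root: Path, needle: str) -> list[tuple[str, int, str]]:
    """Case-insensitive search returning (file, line number, line text)."""
    hits = []
    for rel in list_files(root):
        text = (root / rel).read_text(encoding="utf-8", errors="ignore")
        for number, line in enumerate(text.splitlines(), start=1):
            if needle.lower() in line.lower():
                hits.append((rel, number, line.strip()))
    return hits


def read_lines(root: Path, rel: str, start: int, end: int) -> str:
    """Return lines start..end (1-indexed, inclusive) of one file."""
    lines = (root / rel).read_text(encoding="utf-8").splitlines()
    return "\n".join(lines[start - 1:end])


# Demo on a throwaway directory with one hypothetical parsed document.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "contracts").mkdir()
    (root / "contracts" / "msa.md").write_text(
        "Term: 12 months\nTermination: 30 days notice\n", encoding="utf-8")
    print(list_files(root))
    print(grep(root, "termination"))
    print(read_lines(root, "contracts/msa.md", 1, 1))
```

With tools like these the agent decides for itself which file and which lines to read next, and the returned file names and line numbers give a simple form of traceability back to the source.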

Best Practices for Building Better AI Agents

Practical takeaways from this episode

Do This

Decompose problems and use the right tool for the right task; specialization wins.

Add layers for agentic verification (VLMs, traditional CV) to catch last-mile mistakes.

Format data for its intended consumer (e.g., markdown for simple tables, HTML for complex ones).

Use routing, orchestration, and classification to break down problems into subproblems.

Evaluate your pipeline at every single stage with trusted data.

Understand that agents need more than just retrieval; consider tools like editing for end-to-end work.

Avoid This

Don't let data be the bottleneck for your agents.

Don't default to a one-size-fits-all approach for data processing.

Don't rely solely on first-pass accuracy; implement verification layers.

Don't stop at parsing and extraction; consider classification, splitting, and editing for complex workflows.

Don't treat evaluations as single-shot measurements; consider failure modes throughout the pipeline.

Don't assume retrieval is sufficient for agent functionality; explore additional tools for real-world tasks.

Common Questions

The primary bottleneck is providing high-quality, relevant data to the AI agents. Ensuring the data is accurately parsed, structured, and formatted for consumption by the agent is crucial for effective performance.

Topics

AI Agents, AI & Machine Learning, Technology & Innovation, Language Models, Data Quality, AI Evaluation, Computer Vision, Data Extraction, Agentic Systems, Document Processing

Mentioned in this video

Companies

Reducto

A company focused on building agentic document extraction for AI teams. They emphasize giving better data to agents and have processed over 3 billion documents.

Rogo

Mentioned as an example of a new category of AI application companies that Reducto works with.

Mercury

Mentioned as an example of a new category of AI application companies that Reducto works with.

Anthropic

Mentioned as a provider of frontier models that still exhibit errors in real-world document processing, such as pulling context from the wrong rows in tables.

Software & Apps

Harvey

Mentioned as an example of a new category of AI application companies that Reducto works with.

Large Language Models

Vision-language models (VLMs), language models that also take images as input, are presented as the most significant advancement for solving document processing challenges, capable of reading handwritten text and improving accuracy.

HTML

HyperText Markup Language, found to be better for models to reason on complex tables with merged cells compared to markdown.

embedding models

Models used for creating numerical representations of data, discussed in contrast to language models, with limitations in capturing nuances of dense tables compared to LLMs.

Deep Extract

A feature that acts as an agent harness for structured extraction, composed of a parent agent and sub-agents with validation rules to iteratively audit and refine results.

CLI

Command Line Interface, mentioned as a tool provided to agents in a file-based architecture, enabling them to navigate file systems and decide if they need additional information.

Organizations

Scale AI

Mentioned as an example of a new category of AI application companies that Reducto works with.

Andreessen Horowitz

An investor in Reducto, which has raised $108 million to date.

Benchmark

An investor in Reducto, which has raised $108 million to date.

Concepts

Human-in-the-Loop

A common approach in AI development where humans review and correct AI outputs. Discussed as having potential issues like laziness and missing information, which agents may avoid.
