[Paper Club] DocETL: Agentic Query Rewriting + Eval for Complex Document Processing w Shreya Shankar
Key Moments
DocETL uses LLMs and an agentic framework for complex document processing via logical rewriting and plan evaluation.
Key Insights
DocETL formalizes document processing using concepts from database and Pandas operations, defining operators like map, reduce, and resolve.
The system employs 'rewrites' to optimize pipelines, transforming simple operations into more complex sequences for better performance.
Optimization involves generation and validation agents that automatically rewrite and evaluate pipelines for accuracy and efficiency.
Key operators include map for feature extraction, reduce for aggregation, and resolve for de-duplication and entity matching.
LLM-centric improvements like gleaning and duplicate key resolution address challenges in LLM output consistency and accuracy.
Projection synthesis enables hyper-optimization by breaking down complex tasks into smaller, chained LLM calls.
The optimization process involves generating numerous pipeline variants and selecting the best-performing ones based on evaluation metrics.
INTRODUCTION TO DOCETL'S CONCEPTUAL FRAMEWORK
DocETL is a novel framework designed to leverage Large Language Models (LLMs) for robust document processing. It addresses the challenge of obtaining accurate outputs from LLMs for complex tasks by introducing a formal vocabulary and an agent-based system. The core idea is to apply data pipeline and database concepts to unstructured documents, moving beyond simple prompt-based extraction to a more structured and optimized approach. The system defines operators and directives, focusing on logical rewriting and agent-guided plan evaluation to enhance overall performance and accuracy.
CORE OPERATORS FOR DOCUMENT MANIPULATION
DocETL builds upon familiar data processing paradigms with several key operators. The 'map' operator is the workhorse, performing semantic projections that extract new features or information from documents, such as misconduct instances and officer names. 'Reduce' aggregates the mapped data into coherent outputs, such as summaries or reports. The 'resolve' operator is crucial for handling ambiguity and de-duplication, identifying when different textual references (e.g., slightly different spellings of a name) actually refer to the same entity, a notoriously difficult task.
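A minimal sketch of these three operators over plain Python lists. In the real system each operator is backed by an LLM call driven by a user-written prompt; here deterministic functions stand in so the flow is easy to follow, and the matching heuristic is purely illustrative:

```python
# Toy versions of DocETL-style map / resolve / reduce. Deterministic
# functions stand in for the LLM calls behind each operator.

def map_op(docs, extract):
    """Semantic projection: derive new fields from each document."""
    return [dict(doc, **extract(doc)) for doc in docs]

def resolve_op(rows, key, same):
    """Entity resolution: group rows whose key values refer to one entity."""
    groups = []  # list of (canonical_value, member_rows)
    for row in rows:
        for rep, members in groups:
            if same(rep, row[key]):
                members.append(row)
                break
        else:
            groups.append((row[key], [row]))
    return groups

def reduce_op(groups, agg):
    """Aggregation: fold each resolved group into a single output record."""
    return [agg(rep, members) for rep, members in groups]

docs = [{"name": "J. Smith"}, {"name": "John Smith"}, {"name": "A. Jones"}]
rows = map_op(docs, lambda d: {"surname": d["name"].split()[-1]})
# Crude stand-in for an LLM "same entity?" check: same surname + initial.
same = lambda a, b: a.split()[-1] == b.split()[-1] and a[0] == b[0]
reports = reduce_op(resolve_op(rows, "name", same),
                    lambda rep, g: {"officer": rep, "mentions": len(g)})
# reports == [{'officer': 'J. Smith', 'mentions': 2},
#             {'officer': 'A. Jones', 'mentions': 1}]
```

The point of the sketch is the shape of the dataflow, not the matcher: in DocETL the `same` predicate would itself be an LLM judgment.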
AUXILIARY OPERATORS AND DATA DECOMPOSITION
Beyond the primary operators, DocETL includes auxiliary functions like 'unnest' to flatten nested data structures and 'split' for chunking documents. 'Split' can be implemented in various ways, from simple token-based chunking to more sophisticated paragraph- or section-based methods, and even LLM-based splitting. The 'gather' operator is vital for providing context across document chunks, enabling LLMs to understand abbreviations or nuances introduced in earlier sections. These tools are essential for preparing data for processing, especially with large documents that exceed context-window limits.
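An illustrative split-and-gather pair, assuming the simplest strategy described above (fixed-size word chunks, with each chunk carrying its predecessor as peripheral context so a per-chunk LLM call can resolve references made earlier in the document):

```python
# Naive token(word)-based split plus a gather step that attaches the
# preceding chunk as context for each per-chunk LLM call.

def split_op(text, size):
    """Split a document into fixed-size word chunks."""
    words = text.split()
    return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]

def gather_op(chunks):
    """Pair each chunk with the chunk before it as peripheral context."""
    return [{"context": chunks[i - 1] if i > 0 else "", "chunk": c}
            for i, c in enumerate(chunks)]

chunks = split_op("a b c d e f g", 3)
units = gather_op(chunks)
# chunks == ['a b c', 'd e f', 'g']
# units[1] == {'context': 'a b c', 'chunk': 'd e f'}
```

Real splits would respect paragraph or section boundaries, and gather can carry summaries rather than raw text; this only shows the operator contract.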
REWRITING DIRECTIVES FOR PIPELINE OPTIMIZATION
The 'rewrite' directives in DocETL focus on optimizing the processing pipeline itself. This involves more than just processing the document; it's about restructuring the sequence of operations to improve results. Key directives include data decomposition, where large documents are broken down into smaller, manageable chunks to improve LLM attention and summarization quality. This process acknowledges that chunking is not always optimal and depends heavily on the specific task and data. Multi-level aggregation, akin to hierarchical summarization, allows for processing large datasets by summarizing progressively larger segments.
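Multi-level aggregation can be sketched as a fold that never hands more than `fan_in` items to any single call, so each call stays well inside the context window. The summarizer below is a deterministic stand-in (it keeps each input's first word) for what would be an LLM summarization call:

```python
# Hierarchical summarization: fold fan_in items per call until one
# summary remains. summarize() is a toy stand-in for an LLM call.

def summarize(texts):
    """Stand-in summarizer: keep the first word of each input."""
    return " ".join(t.split()[0] for t in texts)

def multilevel_reduce(items, fan_in):
    """Aggregate in levels so no single call sees more than fan_in items."""
    while len(items) > 1:
        items = [summarize(items[i:i + fan_in])
                 for i in range(0, len(items), fan_in)]
    return items[0]

final = multilevel_reduce(
    ["alpha one", "bravo two", "charlie three", "delta four"], fan_in=2)
# level 1: ["alpha bravo", "charlie delta"]; level 2: "alpha charlie"
```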
LLM-CENTRIC IMPROVEMENTS AND DUPLICATE RESOLUTION
DocETL incorporates several LLM-centric enhancements to boost performance. 'Gleaning' is an iterative process in which an LLM is prompted to improve its previous output based on feedback from a validator, akin to a self-correction loop; it can involve one or more refinement steps. 'Duplicate key resolution' addresses the non-canonical nature of LLM outputs, ensuring that semantically equivalent values (like variations of a person's name) are correctly identified and consolidated. Precision in this step is critical to avoid incorrectly merging distinct entities.
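The gleaning loop can be sketched as validate-then-refine with a round budget. Both the validator and the refiner would be LLM calls in practice; the stubs below simulate a hypothetical task where the validator insists on at least three extracted instances:

```python
# Gleaning: re-prompt with validator feedback until the validator
# passes or the round budget runs out.

def glean(draft, validate, refine, max_rounds=3):
    """Iteratively refine `draft` using feedback from `validate`."""
    for _ in range(max_rounds):
        ok, feedback = validate(draft)
        if ok:
            break
        draft = refine(draft, feedback)
    return draft

# Deterministic stand-ins for the two LLM calls:
validate = lambda out: (len(out) >= 3, "too few instances extracted")
refine = lambda out, fb: out + [f"instance {len(out) + 1}"]

result = glean(["instance 1"], validate, refine)
# result == ['instance 1', 'instance 2', 'instance 3']
```

One design point the loop makes visible: the validator's feedback string is fed back into the refinement prompt, which is what distinguishes gleaning from blind retrying.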
OPTIMIZATION THROUGH AGENTIC EVALUATION AND REWRITING
The core optimization mechanism in DocETL relies on two types of agents: generation agents and validation agents. Generation agents propose new pipeline structures and rewrites, while validation agents assess their performance. This iterative process involves creating custom validation prompts, sampling outputs, rewriting sub-pipelines, and recursively refining optimizations. The system then evaluates numerous candidate plans on sample data, using pairwise comparisons to identify the most effective pipeline. This approach automates the complex task of pipeline design and tuning, significantly reducing manual effort and often exploring hundreds of pipeline variants.
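The selection step can be sketched as a pairwise tournament: run each candidate pipeline on sample data and keep whichever output a judge prefers. The plans and the judge below are deterministic stand-ins for generated pipelines and an LLM validation agent:

```python
# Plan selection by pairwise comparison on sample data. The judge
# stands in for an LLM validation agent comparing two outputs.

def pick_best(plans, sample, judge):
    """Keep a running champion; challengers must beat it head-to-head."""
    best = plans[0]
    for challenger in plans[1:]:
        if judge(challenger(sample), best(sample)):
            best = challenger
    return best

plans = [lambda s: s[:1], lambda s: s, lambda s: s[:2]]  # toy pipelines
judge = lambda a, b: len(a) > len(b)  # toy judge: prefer fuller output
winner = pick_best(plans, [1, 2, 3], judge)
# winner([1, 2, 3]) == [1, 2, 3]
```

A real optimizer re-runs this over many sampled inputs and hundreds of generated variants; the tournament structure is the part that carries over.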
PROJECTION SYNTHESIS AND REAL-WORLD APPLICATION
Projection synthesis is a hyper-optimization technique that breaks down complex tasks into a sequence of simpler LLM calls, analogous to chaining operations. For example, one map operation can first filter a document for relevant information before a second, more focused map operation extracts specific details like misconduct instances. This is particularly useful for tasks requiring high precision, such as analyzing police misconduct reports. The paper highlights case studies demonstrating how DocETL can generate optimized pipelines at a fraction of the cost and time of manual development, even for datasets involving hundreds or thousands of documents.
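The rewrite described above can be sketched as two chained map steps: a cheap relevance filter followed by a focused extraction. Both steps are deterministic stand-ins for what would be LLM calls, and the sentence-level heuristics are purely illustrative:

```python
# One broad map rewritten as two chained calls: filter, then extract.

def filter_step(doc):
    """Pass 1 (cheap): keep only sentences that look misconduct-related."""
    return [s for s in doc.split(". ") if "misconduct" in s.lower()]

def extract_step(sentences):
    """Pass 2 (focused): pull the name (here: first word) from each hit."""
    return [s.split()[0] for s in sentences]

doc = "Smith committed misconduct. The weather was fine. Jones denied misconduct"
names = extract_step(filter_step(doc))
# names == ['Smith', 'Jones']
```

Splitting the work this way lets the second call attend to far less text, which is where the precision gain comes from.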
EVALUATION AND THE ROLE OF LLM AS JUDGE
A critical aspect of DocETL's success is its reliance on LLMs as judges for evaluating pipeline performance. While initially met with skepticism, the paper demonstrates that with carefully crafted prompts and potentially binary classification metrics, LLMs can effectively assess output quality, precision, and recall. This is particularly effective for tasks with clear success criteria, like extracting a specific number of entities or identifying instances of misconduct. Human annotators can provide initial guidance or feedback, but the automated evaluation allows for rapid exploration and optimization of candidate pipelines at scale.
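If the judge is restricted to binary "same entity: yes/no" verdicts, precision and recall fall out of counting those verdicts. A sketch, with a deterministic stand-in for the judge prompt:

```python
# Scoring an extraction against gold labels using only binary verdicts
# from a judge (stand-in for an LLM yes/no comparison prompt).

def precision_recall(predicted, gold, judge_same):
    """Precision: fraction of predictions matching some gold item.
    Recall: fraction of gold items matched by some prediction."""
    if not predicted or not gold:
        return 0.0, 0.0
    tp_pred = sum(any(judge_same(p, g) for g in gold) for p in predicted)
    tp_gold = sum(any(judge_same(p, g) for p in predicted) for g in gold)
    return tp_pred / len(predicted), tp_gold / len(gold)

judge = lambda a, b: a.lower() == b.lower()  # toy "same entity?" verdict
p, r = precision_recall(["J. Smith", "A. Jones"], ["j. smith", "b. lee"], judge)
# p == 0.5 and r == 0.5
```

Because each judge call is a single yes/no, the metric stays interpretable even when the judge itself is an LLM, which is part of what made the approach credible despite early skepticism.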
Common Questions
What is DocETL?
DocETL is a framework that uses Large Language Models (LLMs) to process documents. It addresses the inaccuracy of LLM outputs for complex tasks by providing a formal vocabulary with operators and directives to optimize document processing pipelines.
Topics Mentioned in This Video
Map: An operator in DocETL that applies an LLM-powered projection to extract or create new features from documents.
Mentioned in the context of reducing the search space for the resolve operator.
Gather: An auxiliary operator that collects individual chunks and provides contextual information, such as abbreviations, to LLMs to improve understanding.
Chain-of-thought prompting: A prompting technique where an LLM breaks down complex tasks into smaller steps, discussed as a parallel to projection synthesis.
Police misconduct reports: A case study used throughout the presentation to illustrate the functionalities and benefits of DocETL.
Resolve: An operator designed to handle de-duplication and entity resolution, ensuring that different mentions of the same entity are recognized as identical.
Large Language Models, the core technology that DocETL leverages for document processing.
Mentioned alongside ORCA 3 regarding model capabilities in verifying output.
Mentioned as a cheaper alternative LLM for optimization tasks, with a lower cost compared to GPT-4o.
Pandas: A data manipulation and analysis library, concepts from which are applied to document processing in DocETL.
An LLM used in an ensemble with GPT-3.5 to match GPT-4's judging performance.
Used in an ensemble with Command R+ to match GPT-4's judging performance.
Mentioned in the context of synthetic data generation and model capabilities in verifying output.
DocETL: The framework discussed in the paper, designed for agentic query rewriting and evaluation in complex document processing.
Mentioned as an example LLM when discussing how to ask for error message fixes, similar to the gleaning improvement method.
Reduce: An operator that aggregates extracted data from multiple documents into a coherent, human-readable report.
Mentioned in passing as an example of a dataset that could potentially be used with DocETL.
The institution where the journalism team works, running tasks that demonstrated the effectiveness of LLM judges.