Why RL Won — Kyle Corbitt, OpenPipe (acq. CoreWeave)
Key Moments
OpenPipe pivoted from model distillation to RL for AI agents, and was later acquired by CoreWeave.
Key Insights
OpenPipe initially focused on distilling expensive GPT-4 workflows into smaller models but pivoted to RL for agent training as frontier model prices dropped.
Reinforcement Learning (RL) is becoming crucial for AI agents, especially for complex and agentic tasks.
While LoRAs offer advantages in training and inference flexibility, their perceived 'uncoolness' was tied to the overall decline in fine-tuning interest.
Fine-tuning is primarily driven by cost, latency, or quality requirements, especially when forced to use smaller models.
The complexity of building robust, reproducible environments for RL training remains a significant challenge.
The 'Ruler' library by OpenPipe simplifies RL by using LLMs for relative reward assignment, significantly de-risking the reward problem.
The acquisition by CoreWeave, driven by the Weights & Biases team, aims to integrate OpenPipe's RL expertise into their ecosystem.
Continuous learning, where agents learn from real-world experience, is a core vision for OpenPipe; it promises to increase reliability and unlock far more AI inference.
FROM STARTUP SCHOOL TO OPENPIPE'S ORIGINS
Kyle Corbitt, co-founder and CEO of OpenPipe, shared his journey starting from Y Combinator's Startup School. After leaving YC, he explored various ideas in AI, eventually co-founding OpenPipe in early 2023 with his brother. Their initial idea was sparked by the high cost of GPT-4, aiming to distill its capabilities into smaller, more cost-effective models for production workflows. This focused on providing a managed and clean distillation process, offering a clear value proposition to early adopters struggling with the expense of frontier models.
THE SHIFT FROM DISTILLATION TO REINFORCEMENT LEARNING
OpenPipe's initial business model faced challenges as frontier model prices significantly decreased. This led them to pivot towards Reinforcement Learning (RL) for AI agent training. They recognized a shift in the AI landscape, anticipating that RL would become critical for developing sophisticated agents capable of complex tasks. This pivot was a significant bet on the future of AI, moving from task-specific model optimization to training intelligent agents that could learn and adapt through experience.
THE ROLE AND EVOLUTION OF FINE-TUNING TECHNIQUES
The conversation touched on various fine-tuning techniques, including LoRAs, which offer advantages like reduced memory usage during training and multiplexing capabilities at inference. Despite initial market skepticism, LoRAs are seen as valuable for lightweight customization. The primary drivers for fine-tuning remain cost, latency, and quality or consistency, particularly when an application must run on smaller models. Overall investment in fine-tuning, however, rose and fell with the broader interest in customizing models rather than simply prompting frontier ones.
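To make the "lightweight customization" point concrete, here is a minimal sketch (not OpenPipe code; function names are invented) of the parameter arithmetic behind LoRA: instead of training a full d × k weight update, LoRA trains a rank-r product B·A with B of shape d × r and A of shape r × k.

```python
# Illustrative arithmetic only: LoRA replaces a full-rank weight update
# dW (d x k) with a low-rank product B @ A, where B is (d x r) and
# A is (r x k) for a small rank r.

def full_update_params(d: int, k: int) -> int:
    """Trainable parameters for a dense update of a d x k weight matrix."""
    return d * k

def lora_update_params(d: int, k: int, r: int) -> int:
    """Trainable parameters for a rank-r LoRA adapter on the same matrix."""
    return r * (d + k)

# A 4096 x 4096 attention projection, adapted at rank 8:
full = full_update_params(4096, 4096)     # 16,777,216 params
lora = lora_update_params(4096, 4096, 8)  # 65,536 params
print(f"LoRA trains {lora / full:.2%} of the full-update parameters")
```

Because each adapter is so small, many adapters can be kept in memory and swapped per-request against one shared base model, which is the multiplexing advantage mentioned above.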
THE PROMISE AND CHALLENGES OF REINFORCEMENT LEARNING
Reinforcement Learning is highlighted as a key technology for advanced AI agents, especially for 'agentic' tasks. While the big AI labs invest heavily in RL environments and see significant results, applying it effectively to task-specific customization presents challenges. The most significant hurdle is the creation of robust, reproducible, and realistic environments for training. This requires simulating complex systems, including potential failure modes and diverse user interactions, which is a substantial infrastructural undertaking.
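A hypothetical toy illustrates what "reproducible environment" means in practice (the environment and its contents are invented for illustration; real agent environments wrap entire systems such as APIs, browsers, or databases behind a similar contract):

```python
import random
from typing import Tuple

class SupportTicketEnv:
    """Toy single-step environment: the agent must answer a simulated ticket."""

    def __init__(self, seed: int):
        # Seeding every source of randomness is what makes a rollout
        # reproducible; controlling flaky external state is the hard
        # part in real environments.
        self.rng = random.Random(seed)
        self.tickets = ["refund request", "login failure", "billing question"]

    def reset(self) -> str:
        self.ticket = self.rng.choice(self.tickets)
        return self.ticket

    def step(self, action: str) -> Tuple[float, bool]:
        # Reward 1.0 if the reply mentions the ticket topic, else 0.0.
        reward = 1.0 if self.ticket.split()[0] in action else 0.0
        return reward, True  # single-step episode

# Identical seeds yield identical episodes, so training runs can be replayed:
a, b = SupportTicketEnv(seed=7), SupportTicketEnv(seed=7)
assert a.reset() == b.reset()
```

Simulating failure modes and diverse users amounts to expanding what `reset` and `step` can produce while keeping this determinism, which is why the infrastructure burden grows so quickly.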
OPENPIPE'S 'RULER' AND SOLVING THE REWARD PROBLEM
To address the complexities of RL, OpenPipe released 'Ruler,' a library that simplifies reward assignment. Leveraging the insight from GRPO that relative judgment within a group of rollouts is often sufficient, Ruler uses LLMs to evaluate and rank multiple agent runs against each other. This approach bypasses the need for absolute reward functions, making the reward problem significantly more manageable. This innovation has been instrumental in enabling agents to perform at state-of-the-art levels, even with less powerful judge models, and has made RL more accessible for a wider range of tasks.
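The relative-reward idea can be sketched as follows (function names and the stub judge are invented for illustration; consult OpenPipe's Ruler repo for the actual API). An LLM judge scores a group of rollouts against one another, and a GRPO-style advantage only needs each rollout's standing within its group:

```python
import statistics
from typing import Callable, List

def group_relative_advantages(scores: List[float]) -> List[float]:
    """GRPO-style advantage: score minus group mean, scaled by group std."""
    mean = statistics.mean(scores)
    std = statistics.pstdev(scores) or 1.0  # avoid divide-by-zero on ties
    return [(s - mean) / std for s in scores]

def rank_rollouts(
    rollouts: List[str], judge: Callable[[List[str]], List[float]]
) -> List[float]:
    """Ask a judge to score rollouts relative to one another."""
    return group_relative_advantages(judge(rollouts))

# Stub standing in for an LLM judge call: longer answers score higher.
fake_judge = lambda rollouts: [float(len(r)) for r in rollouts]

advantages = rank_rollouts(["ok", "a detailed answer", "medium reply"], fake_judge)
print(advantages)  # the best rollout gets the largest positive advantage
```

Because only the ordering within the group matters, a judge that cannot assign calibrated absolute scores can still produce a usable training signal, which is why weaker judge models suffice.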
THE ACQUISITION BY COREWEAVE AND FUTURE VISIONS
OpenPipe was acquired by CoreWeave, driven by the Weights & Biases team, to expand their offerings in the AI tooling space. The company's vision is to create a world where every agent learns continuously from real-world experience. This focus on continuous learning aims to solve the reliability gap that prevents many AI prototypes from reaching production. By enabling agents to learn from their mistakes and adapt, OpenPipe seeks to unlock a significant portion of the current AI inference market that remains stuck in the proof-of-concept stage due to reliability concerns.
CONTINUAL LEARNING AND THE RELIABILITY IMPERATIVE
The core objective for OpenPipe, now within CoreWeave, is to build the software infrastructure that facilitates agents learning continually from their real-world interactions. This approach is analogous to training human employees, where feedback and correction refine behavior over time. Solving this reliability issue is seen as key to deploying AI agents at scale, moving beyond controlled environments to production systems. This would unlock vast potential for AI inference currently hindered by the challenge of ensuring consistent and dependable performance.
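The employee-training analogy suggests a loop like the following (a hypothetical outline with all names invented, not OpenPipe's implementation): production rollouts are periodically collected, graded, and fed back into training.

```python
from typing import Callable, List

def continual_learning_step(policy, rollouts: List[dict],
                            judge: Callable, trainer: Callable):
    """One improvement cycle: grade real traffic, then train on what worked."""
    scored = [(r, judge(r)) for r in rollouts]        # grade real behavior
    good = [r for r, score in scored if score > 0.5]  # keep successful runs
    return trainer(policy, good)                      # update the agent

# Toy instantiation: the "policy" is just a counter of lessons absorbed.
judge = lambda r: 1.0 if r["resolved"] else 0.0
trainer = lambda policy, batch: policy + len(batch)

policy = 0
rollouts = [{"resolved": True}, {"resolved": False}, {"resolved": True}]
policy = continual_learning_step(policy, rollouts, judge, trainer)
print(policy)  # → 2
```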
ADDRESSING REWARD HACKING AND THE YC ETHOS
Reward hacking, a potential issue in RL, is considered manageable by OpenPipe. They believe that when it occurs, it's easily detectable and can be corrected by adjusting the reward function, often using LLMs as judges. Reflecting on his entrepreneurial journey, Corbitt values YC's advice to 'hold your problem tight and your solution loosely,' emphasizing adaptability. While generally aligned with YC's rapid iteration approach, he acknowledges a potential benefit in pursuing more ambitious, long-term visions with less focus on immediate shipping, depending on the founder's intrinsic vision and taste.
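One hedged sketch of the detection idea (names and thresholds invented for illustration): compare the primary reward with an independent LLM-judge verdict, and flag rollouts where they disagree sharply for review.

```python
from typing import Callable, List

def flag_reward_hacks(rollouts: List[str],
                      reward_fn: Callable[[str], float],
                      judge_fn: Callable[[str], float],
                      gap: float = 0.5) -> List[str]:
    """Return rollouts where the reward function and the judge disagree."""
    flagged = []
    for r in rollouts:
        # High reward but low judged quality suggests the reward is gamed.
        if reward_fn(r) - judge_fn(r) > gap:
            flagged.append(r)
    return flagged

# Toy example: an agent that games a keyword-based reward.
reward_fn = lambda r: 1.0 if "resolved" in r else 0.0      # gameable proxy
judge_fn = lambda r: 1.0 if "refund issued" in r else 0.0  # stub LLM judge

rollouts = ["ticket resolved (no action taken)", "refund issued, ticket resolved"]
print(flag_reward_hacks(rollouts, reward_fn, judge_fn))
# → ['ticket resolved (no action taken)']
```

Once a hack is flagged, the fix described above is to tighten the reward function (or the judge's rubric) and retrain.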
Mentioned in This Episode
Common Questions
What was OpenPipe's original focus?
OpenPipe initially focused on distilling workflows from expensive and powerful models like GPT-4 into smaller, more cost-effective models for production use.
Mentioned in this video
A portfolio company that works with agents on their internal tool call loops, particularly for less sexy enterprise use cases like chatbots.
An example of an academic environment for RL research, featuring clones of popular websites like Reddit and Wikipedia.
A library released by OpenPipe that enables universal LLM-elicited rewards, simplifying the reward assignment problem in RL.
A company focused on fine-tuning and distillation of large language models, recently acquired by CoreWeave.
Direct Preference Optimization (DPO), a fine-tuning algorithm that learns directly from preference pairs as an alternative to RL, mentioned in comparison with GRPO.