Why RL Won — Kyle Corbitt, OpenPipe (acq. CoreWeave)
Key Moments
OpenPipe pivoted from model distillation to RL for AI agents, and was later acquired by CoreWeave.
Key Insights
OpenPipe initially focused on distilling expensive GPT-4 workflows into smaller models but pivoted to RL for agent training as frontier model prices dropped.
Reinforcement Learning (RL) is becoming crucial for AI agents, especially for complex and agentic tasks.
While LoRAs offer advantages in training and inference flexibility, their perceived 'uncoolness' was tied to the overall decline in fine-tuning interest.
Fine-tuning is primarily driven by cost, latency, or quality requirements, especially when forced to use smaller models.
The complexity of building robust, reproducible environments for RL training remains a significant challenge.
The 'Ruler' library by OpenPipe simplifies RL by using LLMs for relative reward assignment, significantly de-risking the reward problem.
The acquisition by CoreWeave, driven by the Weights & Biases team, aims to integrate OpenPipe's RL expertise into their ecosystem.
Continuous learning, where agents learn from real-world experience, is a core vision for OpenPipe; it promises to increase reliability and unlock far more AI inference.
FROM STARTUP SCHOOL TO OPENPIPE'S ORIGINS
Kyle Corbitt, co-founder and CEO of OpenPipe, shared his journey starting from Y Combinator's Startup School. After leaving YC, he explored various ideas in AI, eventually co-founding OpenPipe in early 2023 with his brother. Their initial idea was sparked by the high cost of GPT-4, aiming to distill its capabilities into smaller, more cost-effective models for production workflows. This focused on providing a managed and clean distillation process, offering a clear value proposition to early adopters struggling with the expense of frontier models.
THE SHIFT FROM DISTILLATION TO REINFORCEMENT LEARNING
OpenPipe's initial business model faced challenges as frontier model prices significantly decreased. This led them to pivot towards Reinforcement Learning (RL) for AI agent training. They recognized a shift in the AI landscape, anticipating that RL would become critical for developing sophisticated agents capable of complex tasks. This pivot was a significant bet on the future of AI, moving from task-specific model optimization to training intelligent agents that could learn and adapt through experience.
THE ROLE AND EVOLUTION OF FINE-TUNING TECHNIQUES
The conversation touched on various fine-tuning techniques, including LoRAs, which offer advantages like reduced memory usage during training and multiplexing capabilities at inference. Despite initial market skepticism, LoRAs are seen as valuable for lightweight customization. The primary drivers for fine-tuning remain cost, latency, and quality or consistency, particularly when an application must run on smaller models. Overall investment in fine-tuning, however, rose and fell with the broader interest in customizing models rather than simply prompting frontier ones.
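To make the "lightweight customization" point concrete, here is a minimal sketch (not OpenPipe code; function names are invented) of the parameter arithmetic behind LoRA: instead of training a full d × k weight update, LoRA trains a rank-r product B·A with B of shape d × r and A of shape r × k.

```python
# Illustrative arithmetic only: LoRA replaces a full-rank weight update
# dW (d x k) with a low-rank product B @ A, where B is (d x r) and
# A is (r x k) for a small rank r.

def full_update_params(d: int, k: int) -> int:
    """Trainable parameters for a dense update of a d x k weight matrix."""
    return d * k

def lora_update_params(d: int, k: int, r: int) -> int:
    """Trainable parameters for a rank-r LoRA adapter on the same matrix."""
    return r * (d + k)

# A 4096 x 4096 attention projection, adapted at rank 8:
full = full_update_params(4096, 4096)     # 16,777,216 params
lora = lora_update_params(4096, 4096, 8)  # 65,536 params
print(f"LoRA trains {lora / full:.2%} of the full-update parameters")
```

Because each adapter is so small, many adapters can be kept in memory and swapped per-request against one shared base model, which is the multiplexing advantage mentioned above.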
THE PROMISE AND CHALLENGES OF REINFORCEMENT LEARNING
Reinforcement Learning is highlighted as a key technology for advanced AI agents, especially for 'agentic' tasks. While the big AI labs invest heavily in RL environments and see significant results, applying it effectively to task-specific customization presents challenges. The most significant hurdle is the creation of robust, reproducible, and realistic environments for training. This requires simulating complex systems, including potential failure modes and diverse user interactions, which is a substantial infrastructural undertaking.
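A hypothetical toy illustrates what "reproducible environment" means in practice (the environment and its contents are invented for illustration; real agent environments wrap entire systems such as APIs, browsers, or databases behind a similar contract):

```python
import random
from typing import Tuple

class SupportTicketEnv:
    """Toy single-step environment: the agent must answer a simulated ticket."""

    def __init__(self, seed: int):
        # Seeding every source of randomness is what makes a rollout
        # reproducible; controlling flaky external state is the hard
        # part in real environments.
        self.rng = random.Random(seed)
        self.tickets = ["refund request", "login failure", "billing question"]

    def reset(self) -> str:
        self.ticket = self.rng.choice(self.tickets)
        return self.ticket

    def step(self, action: str) -> Tuple[float, bool]:
        # Reward 1.0 if the reply mentions the ticket topic, else 0.0.
        reward = 1.0 if self.ticket.split()[0] in action else 0.0
        return reward, True  # single-step episode

# Identical seeds yield identical episodes, so training runs can be replayed:
a, b = SupportTicketEnv(seed=7), SupportTicketEnv(seed=7)
assert a.reset() == b.reset()
```

Simulating failure modes and diverse users amounts to expanding what `reset` and `step` can produce while keeping this determinism, which is why the infrastructure burden grows so quickly.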
OPENPIPE'S 'RULER' AND SOLVING THE REWARD PROBLEM
To address the complexities of RL, OpenPipe released 'Ruler,' a library that simplifies reward assignment. Leveraging the insight from GRPO that relative judgment within a group of rollouts is often sufficient, Ruler uses LLMs to evaluate and rank multiple agent runs against each other. This approach bypasses the need for absolute reward functions, making the reward problem significantly more manageable. This innovation has been instrumental in enabling agents to perform at state-of-the-art levels, even with less powerful judge models, and has made RL more accessible for a wider range of tasks.
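The relative-reward idea can be sketched as follows (function names and the stub judge are invented for illustration; consult OpenPipe's Ruler repo for the actual API). An LLM judge scores a group of rollouts against one another, and a GRPO-style advantage only needs each rollout's standing within its group:

```python
import statistics
from typing import Callable, List

def group_relative_advantages(scores: List[float]) -> List[float]:
    """GRPO-style advantage: score minus group mean, scaled by group std."""
    mean = statistics.mean(scores)
    std = statistics.pstdev(scores) or 1.0  # avoid divide-by-zero on ties
    return [(s - mean) / std for s in scores]

def rank_rollouts(
    rollouts: List[str], judge: Callable[[List[str]], List[float]]
) -> List[float]:
    """Ask a judge to score rollouts relative to one another."""
    return group_relative_advantages(judge(rollouts))

# Stub standing in for an LLM judge call: longer answers score higher.
fake_judge = lambda rollouts: [float(len(r)) for r in rollouts]

advantages = rank_rollouts(["ok", "a detailed answer", "medium reply"], fake_judge)
print(advantages)  # the best rollout gets the largest positive advantage
```

Because only the ordering within the group matters, a judge that cannot assign calibrated absolute scores can still produce a usable training signal, which is why weaker judge models suffice.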
THE ACQUISITION BY COREWEAVE AND FUTURE VISIONS
OpenPipe was acquired by CoreWeave, driven by the Weights & Biases team, to expand their offerings in the AI tooling space. The company's vision is to create a world where every agent learns continuously from real-world experience. This focus on continuous learning aims to solve the reliability gap that prevents many AI prototypes from reaching production. By enabling agents to learn from their mistakes and adapt, OpenPipe seeks to unlock a significant portion of the current AI inference market that remains stuck in the proof-of-concept stage due to reliability concerns.
CONTINUAL LEARNING AND THE RELIABILITY IMPERATIVE
The core objective for OpenPipe, now within CoreWeave, is to build the software infrastructure that facilitates agents learning continually from their real-world interactions. This approach is analogous to training human employees, where feedback and correction refine behavior over time. Solving this reliability issue is seen as key to deploying AI agents at scale, moving beyond controlled environments to production systems. This would unlock vast potential for AI inference currently hindered by the challenge of ensuring consistent and dependable performance.
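The employee-training analogy suggests a loop like the following (a hypothetical outline with all names invented, not OpenPipe's implementation): production rollouts are periodically collected, graded, and fed back into training.

```python
from typing import Callable, List

def continual_learning_step(policy, rollouts: List[dict],
                            judge: Callable, trainer: Callable):
    """One improvement cycle: grade real traffic, then train on what worked."""
    scored = [(r, judge(r)) for r in rollouts]        # grade real behavior
    good = [r for r, score in scored if score > 0.5]  # keep successful runs
    return trainer(policy, good)                      # update the agent

# Toy instantiation: the "policy" is just a counter of lessons absorbed.
judge = lambda r: 1.0 if r["resolved"] else 0.0
trainer = lambda policy, batch: policy + len(batch)

policy = 0
rollouts = [{"resolved": True}, {"resolved": False}, {"resolved": True}]
policy = continual_learning_step(policy, rollouts, judge, trainer)
print(policy)  # → 2
```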
ADDRESSING REWARD HACKING AND THE YC ETHOS
Reward hacking, a potential issue in RL, is considered manageable by OpenPipe. They believe that when it occurs, it's easily detectable and can be corrected by adjusting the reward function, often using LLMs as judges. Reflecting on his entrepreneurial journey, Corbitt values YC's advice to 'hold your problem tight and your solution loosely,' emphasizing adaptability. While generally aligned with YC's rapid iteration approach, he acknowledges a potential benefit in pursuing more ambitious, long-term visions with less focus on immediate shipping, depending on the founder's intrinsic vision and taste.
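One hedged sketch of the detection idea (names and thresholds invented for illustration): compare the primary reward with an independent LLM-judge verdict, and flag rollouts where they disagree sharply for review.

```python
from typing import Callable, List

def flag_reward_hacks(rollouts: List[str],
                      reward_fn: Callable[[str], float],
                      judge_fn: Callable[[str], float],
                      gap: float = 0.5) -> List[str]:
    """Return rollouts where the reward function and the judge disagree."""
    flagged = []
    for r in rollouts:
        # High reward but low judged quality suggests the reward is gamed.
        if reward_fn(r) - judge_fn(r) > gap:
            flagged.append(r)
    return flagged

# Toy example: an agent that games a keyword-based reward.
reward_fn = lambda r: 1.0 if "resolved" in r else 0.0      # gameable proxy
judge_fn = lambda r: 1.0 if "refund issued" in r else 0.0  # stub LLM judge

rollouts = ["ticket resolved (no action taken)", "refund issued, ticket resolved"]
print(flag_reward_hacks(rollouts, reward_fn, judge_fn))
# → ['ticket resolved (no action taken)']
```

Once a hack is flagged, the fix described above is to tighten the reward function (or the judge's rubric) and retrain.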
Mentioned in This Episode
Common Questions
What was OpenPipe's original focus?
OpenPipe initially focused on distilling workflows from expensive and powerful models like GPT-4 into smaller, more cost-effective models for production use.
Mentioned in this video
A portfolio company that works with agents on their internal tool call loops, particularly for less sexy enterprise use cases like chatbots.
An example of an academic environment for RL research, featuring clones of popular websites like Reddit and Wikipedia.
A library released by OpenPipe that enables universal LLM-elicited rewards, simplifying the reward assignment problem in RL.
A company focused on fine-tuning and distillation of large language models, recently acquired by CoreWeave.
Direct Preference Optimization (DPO), a fine-tuning algorithm that learns directly from preference pairs as an alternative to RL, mentioned in comparison with GRPO.