The Powerful Alternative To Fine-Tuning
Key Moments
Harnesses atop LLMs auto-improve, beating costly fine-tuning.
Key Insights
Poetic builds a recursively self-improving meta-system that outputs specialized reasoning 'harnesses' on top of one or more language models to solve hard problems.
This approach is cheaper and more scalable than fine-tuning from scratch, and remains compatible with new frontier models as they arrive.
The team has demonstrated strong results (ARC AGI V2, Humanity's Last Exam) at a fraction of traditional cost, leveraging a small, focused team.
The core value is automated optimization of prompts, data, and reasoning strategies, not just manual prompt editing; the system creates robust, tunable architectures.
Startup-ready access is being offered via poetic.ai for teams facing hard, reliability-challenging AI problems.
THE PROBLEM WITH FINE-TUNING AND THE NEED FOR SPEED
The traditional path of fine-tuning large models is costly and quickly becomes outpaced by faster model releases. The guest highlights that retraining from scratch demands hundreds of millions of dollars and months of effort, and new frontier models can render those gains obsolete almost instantly. Poetic offers a radically faster alternative by building on top of existing models and evolving capabilities without expensive retraining, addressing the 'bitter lesson' of losing ground to newer models.
POETIC'S CORE: RECURSIVE SELF-IMPROVEMENT AND THE META SYSTEM
At the heart of Poetic is a recursively self-improving meta-system that can generate and optimize entire reasoning pipelines, or harnesses, tailored to a given hard problem. This automation produces systems that consistently outperform the base models and remain compatible with future iterations. The approach shifts focus from training more data to evolving the reasoning architecture itself, enabling rapid, cost-effective improvements.
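To make the idea concrete, here is a minimal sketch of what such a meta-optimization loop over harness configurations could look like. It is illustrative only, not Poetic's actual system: the `run_harness(config, problem)` call, the scored practice problems, and the strategy names are all assumptions.

```python
# A minimal sketch of a meta-loop that improves a harness config, not the model.
# Assumes a hypothetical `run_harness(config, problem)` returning a numeric score
# and a small set of practice problems; these are placeholders, not a real API.
import random

def propose_variants(config: dict, n: int = 4) -> list[dict]:
    """Generate candidate harness configs by mutating prompt/strategy knobs."""
    variants = []
    for _ in range(n):
        v = dict(config)
        v["temperature"] = round(random.uniform(0.0, 1.0), 2)
        v["strategy"] = random.choice(["decompose", "self-critique", "vote-of-3"])
        variants.append(v)
    return variants

def evaluate(config: dict, problems: list, run_harness) -> float:
    """Average score of a harness config on a held-out problem set."""
    return sum(run_harness(config, p) for p in problems) / len(problems)

def improve(config: dict, problems: list, run_harness, rounds: int = 5) -> dict:
    """Keep the best-scoring config each round; the surrounding loop does the improving."""
    best, best_score = config, evaluate(config, problems, run_harness)
    for _ in range(rounds):
        for candidate in propose_variants(best):
            score = evaluate(candidate, problems, run_harness)
            if score > best_score:
                best, best_score = candidate, score
    return best
```

The point of the sketch is the division of labor: the base model is never retrained, only the configuration wrapped around it is searched and scored.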
HARNESSING VS TRAINING: WHY POETIC SEES FRONTIER MODELS AS STILTS
Frontier models act as foundational stilts that Poetic uses to reach higher performance without rebuilding from scratch. The harness sits on top of these models and can be adapted to new models without changing the underlying deployment. By contrast with repeated full-model training, Poetic continuously optimizes the surrounding system—prompts, data handling, and reasoning strategies—so any new base model yields immediate gains without a full rewrite.
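The sketch below illustrates why this decouples the harness from any particular model: the harness is written against a thin model interface, so a newer frontier model can be dropped in without touching the harness logic. The `BaseModel` protocol and its `complete` method are assumptions for illustration, not a real vendor API.

```python
# Illustrative only: a harness coded against a swappable model interface.
from typing import Protocol

class BaseModel(Protocol):
    def complete(self, prompt: str) -> str: ...

class Harness:
    def __init__(self, model: BaseModel, system_prompt: str):
        self.model = model
        self.system_prompt = system_prompt

    def solve(self, problem: str) -> str:
        # Draft an answer, then ask the same model to critique and revise it.
        draft = self.model.complete(f"{self.system_prompt}\n\nProblem: {problem}")
        critique = self.model.complete(f"Find flaws in this answer:\n{draft}")
        return self.model.complete(
            f"Problem: {problem}\nDraft: {draft}\nCritique: {critique}\nRevised answer:"
        )

# Swapping in a newer base model is a one-line change:
# harness = Harness(model=NewFrontierModel(), system_prompt=tuned_prompt)
```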
COST ADVANTAGE AND SCALABILITY: UNDER $100K FOR HARD PROBLEMS
The company emphasizes a dramatic cost advantage: the Humanity's Last Exam run cost well under six figures, with optimization spend under $100k, achieved by a team of just seven researchers. They note being roughly half the cost of competing approaches (e.g., Gemini 3 DeepThink) because they build on a cheaper model (Gemini 3 Pro) and skip full-scale retraining. The result is scalable, repeatable progress on hard tasks.
PROOF OF CONCEPT: ARC AGI V2 AND HUMANITY'S LAST EXAM RESULTS
Poetic has repeatedly outpaced contemporaries on difficult benchmarks. On ARC AGI V2, they surpassed prior leaders within days, leveraging cheaper underlying models yet achieving higher official verification scores. In Humanity's Last Exam, they achieved 55%—nearly two points above the previous state-of-the-art—on a 2,500-question challenge designed for expert domains. These results illustrate the system's ability to push hard problems beyond traditional baselines.
HOW THE POETIC META SYSTEM WORKS: PROMPTS, DATA, AND AUTOMATION
The Poetic stack combines code, prompts, and data into automated reasoning systems. The meta-system can optimize not only prompts but also deeper reasoning strategies and data generation, including context stuffing and example generation. Rather than hand-tuning, the system analyzes data and failure modes to extract robust, reusable reasoning patterns, enabling faster iteration and higher-quality outputs with less human intervention.
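The sketch below shows one simplified version of failure-driven optimization of this kind, using a placeholder `ask_llm` call. It treats "context stuffing" as appending worked examples for the cases the current prompt gets wrong; it is a hedged illustration of the general technique, not the Poetic meta-system itself.

```python
# A hedged sketch of failure-driven prompt/example optimization.
# `ask_llm` and the (question, expected answer) dataset are placeholders.
def mine_failures(prompt: str, dataset: list[tuple[str, str]], ask_llm) -> list[tuple[str, str]]:
    """Collect the (question, expected) pairs the current prompt gets wrong."""
    return [(q, a) for q, a in dataset if ask_llm(f"{prompt}\n\nQ: {q}\nA:").strip() != a]

def stuff_context(prompt: str, failures: list[tuple[str, str]], k: int = 3) -> str:
    """Append worked examples for the hardest cases ('context stuffing')."""
    examples = "\n".join(f"Q: {q}\nA: {a}" for q, a in failures[:k])
    return f"{prompt}\n\nWorked examples:\n{examples}"

def optimize_prompt(prompt: str, dataset, ask_llm, rounds: int = 3) -> str:
    """Alternate between finding failure modes and folding fixes back into the prompt."""
    for _ in range(rounds):
        failures = mine_failures(prompt, dataset, ask_llm)
        if not failures:
            break
        prompt = stuff_context(prompt, failures)
    return prompt
```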
STARTUP ACCESS: HOW TO TRY POETIC AND SIGN UP
For startups interested in deploying Poetic, the company invites early-access inquiries via poetic.ai. They're seeking hard, reliability-challenging problems where existing approaches fall short, and offer to work with teams to enhance or replace their current agents with harnesses that scale with newer models. The pitch emphasizes readiness for practical deployment and collaboration with eager, high-potential ventures.
CAREER PATHS AND ADVICE: FROM CORPORATE RESEARCHER TO AGI-FOCUSED BUILDER
Ian Fischer shares a personal trajectory from co-founding Portable (a mobile cross-platform tool) to Google and DeepMind, then pivoting to AI robotics and machine learning research. His advice to engineers is pragmatic: try things with AI every day, push boundaries, and build the things you envision. He illustrates this with a weekend experiment building an iPhone app using GPT-5, underscoring how rapidly capabilities are advancing and how inclusive tooling has become.
Benchmark results and costs (selected data points)
Data extracted from this episode
| Benchmark / Model | Score (%) | Notes / Cost |
|---|---|---|
| ARC AGI V2 – Poetic harness on Gemini 3 Pro | 54 | Cost: $32 per problem |
| ARC AGI V2 baseline – Gemini 3 DeepThink | 45 | |
| Humanity's Last Exam – Poetic harness | 55 | Cost: < $100k |
| Humanity's Last Exam – Claude Opus 4.6 (Anthropic) | 53.1 | |
Common Questions
How does Poetic differ from fine-tuning a model?
Poetic offers a recursively self-improving meta-system that generates task-specific harnesses on top of one or more language models. Instead of retraining or fine-tuning a model for each task, Poetic optimizes the reasoning strategies and prompts automatically, so the harness continues to improve as new models come out.
Mentioned in this video
ARC AGI V2 – Benchmark used to compare Poetic's results against other approaches.
Claude Opus 4.6 – Anthropic's model cited as the previous state-of-the-art on Humanity's Last Exam.
Gemini 3 DeepThink – Baseline model whose performance was surpassed by the Poetic harness on ARC AGI V2.
Gemini 3 Pro – Base model used with the Poetic harness to achieve higher performance.
GPT-5 – A language model used to help build an iPhone app.
Humanity's Last Exam – A benchmark composed of 2,500 hard questions across domains.
Referenced as a popular paper related to automated prompt optimization.