The Powerful Alternative To Fine-Tuning
Key Moments
Harnesses atop LLMs auto-improve, beating costly fine-tuning.
Key Insights
Poetic builds a recursively self-improving meta-system that outputs specialized reasoning 'harnesses' on top of one or more language models to solve hard problems.
This approach is cheaper and more scalable than fine-tuning from scratch, and remains compatible with new frontier models as they arrive.
The team has demonstrated strong results (ARC AGI V2, Humanity's Last Exam) at a fraction of traditional cost, leveraging a small, focused team.
The core value is automated optimization of prompts, data, and reasoning strategies, not just manual prompt editing; the system creates robust, tunable architectures.
Startup-ready access is being offered via poetic.ai for teams facing hard, reliability-challenging AI problems.
THE PROBLEM WITH FINE-TUNING AND THE NEED FOR SPEED
The traditional path of fine-tuning large models is costly and quickly becomes outpaced by faster model releases. The guest highlights that retraining from scratch demands hundreds of millions of dollars and months of effort, and new frontier models can render those gains obsolete almost instantly. Poetic offers a radically faster alternative by building on top of existing models and evolving capabilities without expensive retraining, addressing the 'bitter lesson' of losing ground to newer models.
POETIC'S CORE: RECURSIVE SELF-IMPROVEMENT AND THE META SYSTEM
At the heart of Poetic is a recursively self-improving meta-system that can generate and optimize entire reasoning pipelines, or harnesses, tailored to a given hard problem. This automation produces systems that consistently outperform the base models and remain compatible with future iterations. The approach shifts focus from training more data to evolving the reasoning architecture itself, enabling rapid, cost-effective improvements.
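To make the idea concrete, here is a minimal sketch of what such a meta-optimization loop over harness configurations could look like. It is illustrative only, not Poetic's actual system: the `run_harness(config, problem)` call, the scored practice problems, and the strategy names are all assumptions.

```python
# A minimal sketch of a meta-loop that improves a harness config, not the model.
# Assumes a hypothetical `run_harness(config, problem)` returning a numeric score
# and a small set of practice problems; these are placeholders, not a real API.
import random

def propose_variants(config: dict, n: int = 4) -> list[dict]:
    """Generate candidate harness configs by mutating prompt/strategy knobs."""
    variants = []
    for _ in range(n):
        v = dict(config)
        v["temperature"] = round(random.uniform(0.0, 1.0), 2)
        v["strategy"] = random.choice(["decompose", "self-critique", "vote-of-3"])
        variants.append(v)
    return variants

def evaluate(config: dict, problems: list, run_harness) -> float:
    """Average score of a harness config on a held-out problem set."""
    return sum(run_harness(config, p) for p in problems) / len(problems)

def improve(config: dict, problems: list, run_harness, rounds: int = 5) -> dict:
    """Keep the best-scoring config each round; the surrounding loop does the improving."""
    best, best_score = config, evaluate(config, problems, run_harness)
    for _ in range(rounds):
        for candidate in propose_variants(best):
            score = evaluate(candidate, problems, run_harness)
            if score > best_score:
                best, best_score = candidate, score
    return best
```

The point of the sketch is the division of labor: the base model is never retrained, only the configuration wrapped around it is searched and scored.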
HARNESSING VS TRAINING: WHY POETIC SEES FRONTIER MODELS AS STILTS
Frontier models act as foundational stilts that Poetic uses to reach higher performance without rebuilding from scratch. The harness sits on top of these models and can be adapted to new models without changing the underlying deployment. By contrast with repeated full-model training, Poetic continuously optimizes the surrounding system—prompts, data handling, and reasoning strategies—so any new base model yields immediate gains without a full rewrite.
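The sketch below illustrates why this decouples the harness from any particular model: the harness is written against a thin model interface, so a newer frontier model can be dropped in without touching the harness logic. The `BaseModel` protocol and its `complete` method are assumptions for illustration, not a real vendor API.

```python
# Illustrative only: a harness coded against a swappable model interface.
from typing import Protocol

class BaseModel(Protocol):
    def complete(self, prompt: str) -> str: ...

class Harness:
    def __init__(self, model: BaseModel, system_prompt: str):
        self.model = model
        self.system_prompt = system_prompt

    def solve(self, problem: str) -> str:
        # Draft an answer, then ask the same model to critique and revise it.
        draft = self.model.complete(f"{self.system_prompt}\n\nProblem: {problem}")
        critique = self.model.complete(f"Find flaws in this answer:\n{draft}")
        return self.model.complete(
            f"Problem: {problem}\nDraft: {draft}\nCritique: {critique}\nRevised answer:"
        )

# Swapping in a newer base model is a one-line change:
# harness = Harness(model=NewFrontierModel(), system_prompt=tuned_prompt)
```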
COST ADVANTAGE AND SCALABILITY: UNDER $100K FOR HARD PROBLEMS
The company emphasizes a dramatic cost advantage: the Humanity's Last Exam run cost well under six figures, with optimization spend under $100k, achieved by a team of just seven researchers. They note being roughly half the cost of competing approaches (e.g., Gemini 3 DeepThink) because they build on a cheaper model (Gemini 3 Pro) and skip full-scale retraining. The result is scalable, repeatable progress on hard tasks.
PROOF OF CONCEPT: ARC AGI V2 AND HUMANITY'S LAST EXAM RESULTS
Poetic has repeatedly outpaced contemporaries on difficult benchmarks. On ARC AGI V2, they surpassed prior leaders within days, leveraging cheaper underlying models yet achieving higher official verification scores. In Humanity's Last Exam, they achieved 55%—nearly two points above the previous state-of-the-art—on a 2,500-question challenge designed for expert domains. These results illustrate the system's ability to push hard problems beyond traditional baselines.
HOW THE POETIC META SYSTEM WORKS: PROMPTS, DATA, AND AUTOMATION
The Poetic stack combines code, prompts, and data into automated reasoning systems. The meta-system can optimize not only prompts but also deeper reasoning strategies and data generation, including context stuffing and example generation. Rather than hand-tuning, the system analyzes data and failure modes to extract robust, reusable reasoning patterns, enabling faster iteration and higher-quality outputs with less human intervention.
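The sketch below shows one simplified version of failure-driven optimization of this kind, using a placeholder `ask_llm` call. It treats "context stuffing" as appending worked examples for the cases the current prompt gets wrong; it is a hedged illustration of the general technique, not the Poetic meta-system itself.

```python
# A hedged sketch of failure-driven prompt/example optimization.
# `ask_llm` and the (question, expected answer) dataset are placeholders.
def mine_failures(prompt: str, dataset: list[tuple[str, str]], ask_llm) -> list[tuple[str, str]]:
    """Collect the (question, expected) pairs the current prompt gets wrong."""
    return [(q, a) for q, a in dataset if ask_llm(f"{prompt}\n\nQ: {q}\nA:").strip() != a]

def stuff_context(prompt: str, failures: list[tuple[str, str]], k: int = 3) -> str:
    """Append worked examples for the hardest cases ('context stuffing')."""
    examples = "\n".join(f"Q: {q}\nA: {a}" for q, a in failures[:k])
    return f"{prompt}\n\nWorked examples:\n{examples}"

def optimize_prompt(prompt: str, dataset, ask_llm, rounds: int = 3) -> str:
    """Alternate between finding failure modes and folding fixes back into the prompt."""
    for _ in range(rounds):
        failures = mine_failures(prompt, dataset, ask_llm)
        if not failures:
            break
        prompt = stuff_context(prompt, failures)
    return prompt
```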
STARTUP ACCESS: HOW TO TRY POETIC AND SIGN UP
For startups interested in deploying Poetic, the company invites early-access inquiries via poetic.ai. They're seeking hard, reliability-challenging problems where existing approaches fall short, and offer to work with teams to enhance or replace their current agents with harnesses that scale with newer models. The pitch emphasizes readiness for practical deployment and collaboration with eager, high-potential ventures.
CAREER PATHS AND ADVICE: FROM CORPORATE RESEARCHER TO AGI-FOCUSED BUILDER
Ian Fischer shares a personal trajectory from co-founding Portable (a mobile cross-platform tool) to Google and DeepMind, then pivoting to AI robotics and machine learning research. His advice to engineers is pragmatic: try things with AI every day, push boundaries, and build the things you envision. He illustrates this with a weekend experiment building an iPhone app using GPT-5, underscoring how rapidly capabilities are advancing and how inclusive tooling has become.
Benchmark results and costs (selected data points)
Data extracted from this episode
| Benchmark / Model | Score (%) | Notes / Cost |
|---|---|---|
| ARC AGI V2 – Poetic harness on Gemini 3 Pro | 54 | Cost: $32 per problem |
| ARC AGI V2 baseline – Gemini 3 DeepThink | 45 | |
| Humanity's Last Exam – Poetic harness | 55 | Cost: < $100k |
| Humanity's Last Exam – Claude Opus 4.6 (Anthropic) | 53.1 | |
Common Questions
How does Poetic differ from fine-tuning a model?
Poetic offers a recursively self-improving meta-system that generates task-specific harnesses on top of one or more language models. Instead of retraining or fine-tuning a model for each task, Poetic optimizes the reasoning strategies and prompts automatically, so the harness continues to improve as new models come out.
Mentioned in this video
ARC AGI V2 – Benchmark used to compare Poetic's results against other approaches.
Claude Opus 4.6 – Anthropic's model cited as the previous state-of-the-art on Humanity's Last Exam.
Gemini 3 DeepThink – Baseline model whose performance was surpassed by the Poetic harness on ARC AGI V2.
Gemini 3 Pro – Base model used with the Poetic harness to achieve higher performance.
GPT-5 – A language model used to help build an iPhone app.
Humanity's Last Exam – A benchmark composed of 2,500 hard questions across domains.
Referenced as a popular paper related to automated prompt optimization.