Extreme Harness Engineering for the 1B token/day Dark Factory — Ryan Lopopolo, OpenAI Frontier

Key Moments
AI agents can now build complex software products, generating over a million lines of code with minimal human input; the key is not raw code quality but how the agents are 'harnessed'.
Key Insights
OpenAI's Frontier team developed an internal tool over 5 months, generating over 1 million lines of code with zero human-written code, resulting in a 10x faster development cycle compared to traditional methods.
The system evolved through multiple GPT-5 model iterations (5.1-5.4), requiring significant adaptation in build systems, moving from Makefiles to Bazel, Turbo, and Nx to meet sub-minute build-time objectives.
Human involvement has shifted from direct code review to post-merge analysis, with synchronous human attention identified as the primary bottleneck, leading to systems designed for agent autonomy.
The 'harness engineering' approach emphasizes defining non-functional requirements (observability, reliability, documentation) as text-based inputs that agents can directly process and enforce.
Symphony, an Elixir-based framework, demonstrates a novel approach to distributing software and ideas as 'ghost libraries,' enabling agents to reproduce complex systems from specifications with high fidelity.
The Frontier platform aims to enable enterprises to deploy observable, safe, and controllable AI agents, integrating with existing company infrastructure and security tooling, with a focus on agent SDKs and customizable safety specs.
AI agents can now build complex software without human code
Ryan Lopopolo from OpenAI's Frontier team discusses the emergence of 'harness engineering,' a paradigm shift where AI agents, specifically through OpenAI's Codex, are used to build complex software products. His team developed an internal tool over five months with their primary constraint being not to write any code themselves. This approach resulted in a codebase exceeding one million lines, achieving a development speed 10 times faster than traditional methods. The core idea is to leverage the advanced coding capabilities of AI models by providing them with the necessary 'harness'—the framework and tools—to perform tasks, effectively collapsing user journeys and product requirements into code.
Adapting to evolving AI capabilities and build system demands
The development process was iterative, progressing through multiple GPT-5 model generations (5.1 to 5.4). Each model iteration presented unique quirks and working styles, necessitating continuous adaptation of the codebase. A significant challenge was managing build times, especially after the introduction of background shells in model 5.3, which reduced the model's patience for long-running blocking scripts. The solution involved rapidly iterating through build systems, including Makefiles, Bazel, Turbo, and Nx, to ensure builds completed in under one minute. This was crucial for maintaining agent productivity, illustrating how the development environment must be as agile as the AI models it supports. Rapid iteration was affordable because of the low cost of tokens and the high parallelism of the models.
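The talk doesn't show the build configuration itself; as a rough sketch of the underlying principle (keep total wall-clock build time under the agent's patience budget by running independent package builds in parallel), here is a minimal Python illustration. The package names, durations, and `BUILD_BUDGET_SECONDS` constant are assumptions for demonstration, not the team's actual setup.

```python
import time
from concurrent.futures import ThreadPoolExecutor

BUILD_BUDGET_SECONDS = 60  # the sub-minute target described in the talk


def build_package(job):
    """Stand-in for a real per-package build step: (name, duration)."""
    name, seconds = job
    time.sleep(seconds)  # simulates compile/bundle work
    return name


def parallel_build(packages):
    """Build independent packages concurrently; return names and wall-clock time."""
    start = time.monotonic()
    with ThreadPoolExecutor(max_workers=len(packages)) as pool:
        built = list(pool.map(build_package, packages.items()))
    return built, time.monotonic() - start
```

With four independent 0.1-second "builds," the wall-clock time is roughly 0.1 seconds rather than 0.4, which is the same leverage Turbo- or Nx-style task graphs provide for real monorepos.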
Shifting human roles to oversight and strategic direction
With AI agents handling the bulk of code generation, the role of human engineers has transformed. The primary bottleneck has shifted from direct code creation and review to synchronous human attention. Most code review now occurs post-merge, with human focus directed towards understanding where the agent makes mistakes and identifying areas for automation to prevent future time expenditure. This systemic shift requires a 'systems thinking mindset,' continuously evaluating agent performance and confidence in automation. For instance, the team invested heavily in providing agents with observability tools, such as traces and metrics, to ensure modularity, reliability, and code diagnosability, thereby reducing the need for constant human terminal supervision during development.
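The observability tooling mentioned above isn't specified in detail; a minimal, hypothetical sketch of the idea (every agent step emits a trace record that humans can review post-merge instead of watching the terminal) might look like the decorator below. `TRACES`, `traced`, and `apply_patch` are illustrative names, not the team's API.

```python
import functools
import time

TRACES = []  # stand-in for a real trace/metrics backend


def traced(step):
    """Record a span for each agent step, including failures, for later review."""
    def wrap(fn):
        @functools.wraps(fn)
        def inner(*args, **kwargs):
            start = time.monotonic()
            try:
                result = fn(*args, **kwargs)
                TRACES.append({"step": step, "ok": True,
                               "ms": (time.monotonic() - start) * 1000})
                return result
            except Exception:
                TRACES.append({"step": step, "ok": False,
                               "ms": (time.monotonic() - start) * 1000})
                raise
        return inner
    return wrap


@traced("apply_patch")
def apply_patch(diff):
    """Toy agent step: 'succeeds' when the diff looks like a unified diff."""
    return diff.startswith("--- ")
```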
Defining 'skills' and 'scaffolding' for agent autonomy
A key aspect of this approach is creating explicit 'skills' and 'scaffolds' that guide the AI agents. Instead of pre-defining strict scaffolds for agents to operate within, the focus shifted to providing a flexible framework where the agent, as the 'harness,' can make intelligent choices based on context. This includes using short markdown files for specifications (e.g., `spec.md`, `agent.mmd`) and structured 'skills' like 'Core Beliefs.md' or 'Tech Tracker.md.' These act as hooks for the agent (Codex) to review business logic, assess it against defined guardrails, and propose follow-up work. This method makes it cheaper to inject new knowledge and instructions into the system, ensuring agents can adapt and enforce process knowledge, such as requiring timeouts for network calls and updating documentation accordingly.
Dynamic interaction and feedback loops for continuous improvement
The system incorporates dynamic feedback loops to refine agent behavior. Initially, code-writing agents were too easily 'bullied' by review agents, leading to convergence issues. To counter this, prompts were adjusted to allow agents to push back on or defer feedback, mirroring how human engineers handle review comments. Review agents were also instructed to bias toward merging and to limit how many critical issues they surface. This flexibility is crucial because AI agents, by default, seek to follow instructions precisely. The process involves capturing instances where agents deviate from non-functional requirements, signaled by PR comments, failed builds, or misalignment with documentation, and funneling this information back into the system to improve future agent performance. This continuous 'gardening' of the codebase and agent behavior aims to maintain invariants and reduce code dispersion.
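The "bias toward merging" rule, combined with the P0/P2 priority scheme mentioned later in the entity list, suggests a simple merge gate: block only on issues more severe than a threshold. The sketch below is an assumed formalization of that policy, not code from the talk; `ReviewIssue` and `should_merge` are hypothetical names.

```python
from dataclasses import dataclass

# Lower number = more severe: P0 would "nuke the codebase" if merged; P2 is minor.
P0, P1, P2 = 0, 1, 2


@dataclass
class ReviewIssue:
    priority: int
    note: str


def should_merge(issues, threshold=P2):
    """Bias toward merging: block only on issues more severe than the threshold."""
    return all(issue.priority >= threshold for issue in issues)
```

Raising or lowering `threshold` is the single knob that trades review strictness against merge throughput, which is why instructing review agents to surface nothing above P2 effectively means "merge unless it's severe."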
Symphony: Distributing software and automating complex system generation
Symphony, an Elixir-based framework, represents a significant advancement in automating complex system generation. It allows for the creation of 'ghost libraries': specifications that agents can use to reproduce systems locally. The process involves agents analyzing existing code, generating a spec, and then having another agent implement that spec. The loop continues with review agents ensuring fidelity to the original system. Elixir and the Erlang VM were chosen for their robust process supervision and GenServer capabilities, ideal for orchestrating numerous asynchronous tasks. This approach lets humans focus on truly novel, 'hard and new' problems, trusting the agents to handle better-understood tasks, whether mundane or complex.
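The spec-implement-review cycle described above can be sketched as a small control loop. Symphony itself is Elixir; this Python version is only a structural illustration with the three agents passed in as plain callables, and the function name and retry budget are assumptions.

```python
def ghost_library_loop(source_code, spec_agent, impl_agent, review_agent,
                       max_rounds=3):
    """Symphony-style loop: spec the original, implement the spec, review for fidelity."""
    spec = spec_agent(source_code)               # agent 1: distill code into a spec
    for _ in range(max_rounds):
        candidate = impl_agent(spec)             # agent 2: reproduce the system from the spec
        if review_agent(source_code, candidate): # agent 3: check fidelity to the original
            return candidate
    raise RuntimeError("no faithful implementation within the round budget")
```

With toy agents (a reviewer that demands an exact match and an implementer whose second attempt succeeds), the loop converges on the second round; in the real system the same shape runs with model calls and supervised Erlang processes instead of lambdas.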
OpenAI Frontier: Enterprise-grade AI deployment and management
OpenAI Frontier is positioned as an enterprise platform for deploying AI agents safely and at scale, offering a suite of tools for AI transformation. Key components include an Agents SDK for building custom agents, and a platform that integrates with native enterprise identity management, security tooling, and workspace applications. A central 'control dashboard' provides IT, GRC, and security teams with oversight into agent deployment, individual agent trajectories, and adherence to regulatory requirements. The platform emphasizes making complex agents easy to compose safely, with features like the GPT OSS safeguard model allowing for customizable safety specs to prevent data exfiltration and ensure compliance with specific company policies. The goal is to provide a robust, observable, and controllable environment for AI deployment.
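The talk describes customizable safety specs without showing their format; one hedged way to picture them is as data (patterns plus a verdict) consulted by an enforcement layer before agent output leaves the boundary. Everything below, including the spec structure and the example patterns, is an assumption for illustration, not the GPT OSS safeguard model's actual interface.

```python
import re

# Hypothetical safety spec: policy as data, editable per company without code changes.
SAFETY_SPEC = {
    "block_patterns": [
        r"\bAKIA[0-9A-Z]{16}\b",            # AWS-style access key ID
        r"\bBEGIN( RSA)? PRIVATE KEY\b",    # PEM private key header
    ],
    "verdict_on_match": "block",
}


def screen_outbound(text, spec=SAFETY_SPEC):
    """Return 'block' if agent output matches an exfiltration pattern, else 'allow'."""
    for pattern in spec["block_patterns"]:
        if re.search(pattern, text):
            return spec["verdict_on_match"]
    return "allow"
```

Keeping the policy as data rather than code is what makes it customizable: GRC or security teams can extend the pattern list to cover proprietary identifiers without redeploying the agent.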
The future of software engineering: Agents as teammates
The overarching theme is the collaborative potential between humans and AI agents, fostering a paradigm where agents act as teammates. This involves building trust through mechanisms like clear documentation, automated testing, and observable agent trajectories, similar to how a human teammate would present their work. The process of internalizing dependencies, exemplified by potentially in-housing libraries like DataDog or Temporal, reduces reliance on external plugins and simplifies the system. The efficiency gained allows human engineers to tackle the most challenging problems—those that are 'pure whitespace' or require deep refactoring—while agents handle the more structured or repetitive tasks. This shift not only enhances productivity but also fundamentally redefines the practice of software engineering by integrating AI deeply into the development lifecycle, enabling continuous self-improvement and adaptation.
Common Questions
What is harness engineering, and why is it crucial?
Harness engineering involves the systems thinking and tooling necessary to deploy AI agents effectively. It's crucial because it allows complex user journeys to be collapsed into code, letting AI models handle the 'wiring' and execute work via prompts.
Topics
Mentioned in this video
A harness used for building AI products, enabling communication through prompts to let models handle wiring and operation. It's integral to Ryan Lopopolo's approach to agent-driven development.
Mentioned as part of Ryan Lopopolo's background, indicating experience with enterprise customers.
Mentioned as part of Ryan Lopopolo's background, indicating experience with enterprise customers.
A platform for code hosting and collaboration, integrated into the development workflow with features like PRs and CLIs.
Mentioned as a potential alternative to GitHub for code hosting, indicating the spec's adaptability.
Mentioned as an example of a company that would likely need OpenAI Frontier's enterprise AI solutions.
A platform for orchestrating workflows and long-running processes, mentioned as a core inspiration for Symphony and its focus on process supervision and resumability.
The virtual machine for Elixir, praised for its concurrency model and features like resumability, valuable for agent orchestration.
Mentioned as a service that is still paid for, even as dependencies are increasingly internalized.
The company where Ryan Lopopolo works, developing AI models and platforms like Frontier and Codex.
Referred to as GPT-5, with iterations 5.1, 5.2, 5.3, and 5.4, indicating advancements in OpenAI's models used in the development process.
A build system mentioned in the context of adapting the codebase for faster build times, alongside Nx.
Mentioned in the context of front-end architecture and complexity, specifically within an Electron single-app setup.
A framework used for building the application, noted for its main and renderer processes and its capability for MVC-style decomposition.
Mentioned as a 'tiny little bit of Python glue' used to spin up local development stacks.
A file used to define agent configurations and behaviors, mentioned alongside spec.md.
A markdown file or skill used to track and assess business logic against documented guardrails, proposing follow-up work for the agent.
Used as a communication channel where agents can be directed to perform tasks, such as updating documentation or fixing issues.
Mentioned in the context of packages within the repository's architecture.
The issue tracker used by the team, favored for its integration and ease of use.
The command-line interface for GitHub, used for interacting with repositories, creating pull requests, and viewing web UIs, noted for its token efficiency.
A code formatter mentioned in the context of CLIs and how agents can interact with them, focusing on the outcome (formatted or not) rather than individual file formatting steps.
A package manager mentioned in relation to its distributed script runner and the challenge of parsing large amounts of text from test suites.
Mentioned as a potential alternative to Linear for issue tracking, highlighting the flexibility of the spec to accommodate different tools.
A command-line tool discussed for its extensive flags and potential for being turned into micro-SaaS products.
A faster AI model mentioned as being useful for quick changes, documentation updates, and transforming feedback into lints, though its application for high-level reasoning is still being explored.
A linter mentioned in the context of adapting AI feedback into codebase infrastructure, specifically for transforming feedback into lints.
A company mentioned alongside Bolt and Replit as solving the zero-to-one product idea problem with AI, distinct from coding agents.
Mentioned alongside Lovable and Bolt as a company addressing the zero-to-one product idea challenge with AI, differentiating from coding agents.
A model within OpenAI Frontier that interfaces with safety specs, allowing enterprises to instrument agents to prevent data exfiltration and manage internal company information.
A dashboarding tool mentioned in the context of agents authoring JSON for dashboards and responding to alerts.
Mentioned as an example of a language that leverages shared types to reduce complexity, similar to how Elixir's runtime features aid process orchestration.
Mentioned as a technology that previously enabled shared types across front-end and back-end, now superseded or complemented by other approaches.
An open-source monitoring and alerting toolkit mentioned as an example of a tool run locally to enable a full development loop.
A programming language chosen for Symphony due to its process supervision and gen servers, which are well-suited for the type of process orchestration required.
A model related to safety specifications for enterprises, allowing customization of agent behavior to avoid exfiltration and manage proprietary information.
A tool used in the Symphony process for managing disconnected code, implementing specs, and reviewing implementations.
A web testing framework discussed in the context of integrating with the Electron app and the challenges of MCPs (injected context) that the agent might forget how to use.
Mentioned as a platform similar to Cursor that developers might use, where a similar level of review compression is expected.
Referred to as 'chat', used alongside specific models like 5.4 for tasks, and as a component within the broader AI workflow.
A system developed for iterative spec-driven development, leveraging Elixir and BEAM for process orchestration, aiming to remove human context-switching.
Mentioned as a previous model iteration, contrasting with the capabilities of newer models like 5.4.
A metric or assessment used by agents to evaluate business logic against guardrails, influencing proposed follow-up work.
A priority designation used by review agents, indicating that issues surfaced should not be greater than P2 to bias toward merging.
The highest priority level, indicating a critical issue that would 'nuke the codebase' if merged.