Building AGI with OpenAI's Structured Outputs API

Latent Space Podcast
Science & Technology · 4 min read · 73 min video
Sep 17, 2024
TL;DR

OpenAI's Michelle on Structured Outputs API: Enabling agents, use cases, and the future of AI development.

Key Insights

1. OpenAI's Structured Outputs API aims to simplify agent development by ensuring reliable, structured data exchange with models.
2. Structured Outputs is designed as a more purpose-built solution for structured responses compared to Function Calling, which is for tool execution.
3. The API combines engineering constraints with model training to improve adherence to formats and reduce errors like excessive whitespace.
4. A new 'refusal' field in the API allows models to decline harmful or policy-violating requests while maintaining a clear developer experience.
5. OpenAI is continuously working on improving model performance, reducing latency, and expanding the capabilities of its API platform.
6. The API's roadmap includes exploring custom grammars beyond JSON schema and enhancing features for AI agents and enterprise use.

THE EVOLUTION OF OPENAI'S API AND STRUCTURAL OUTPUTS

The discussion begins by tracing the speaker's career path through notable tech companies like Coinbase, Stripe, and ultimately OpenAI, highlighting experiences with scaling challenges and product development. Joining OpenAI before the ChatGPT launch, the speaker was drawn to products like GitHub Copilot. The narrative then shifts to the genesis of Structured Outputs, stemming from the introduction of JSON mode at Dev Day last year. JSON mode was an initial step to constrain model output to JSON, but it had limitations, often leading to developers wanting more precise control over keys and values, which then paved the way for the more robust Structured Outputs API.

STRUCTURED OUTPUTS VS. FUNCTION CALLING VS. JSON MODE

A key distinction is made between various API features. JSON mode is for basic JSON output constraints. Function Calling is specifically designed for enabling models to call external tools or functions, providing arguments for actual actions. Structured Outputs, however, is presented as a new response format optimized for getting the model to respond to a user in a structured way, distinct from tool invocation. While Function Calling has been adapted for structured responses, the new format is intended to provide more of the model's 'voice' and programmatic control for developers needing exact outputs for integration.
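The distinction above can be sketched as two request payloads. This is a minimal, illustrative sketch, assuming the Chat Completions API's `response_format` and `tools` parameters; the `support_reply` schema and `get_weather` tool are invented for the example, not from the episode.

```python
# A structured *response* to the user: Structured Outputs sets
# response_format to a JSON schema with "strict": true, and the
# model's reply text must conform to it.
structured_response_format = {
    "type": "json_schema",
    "json_schema": {
        "name": "support_reply",  # illustrative name
        "strict": True,
        "schema": {
            "type": "object",
            "properties": {
                "answer": {"type": "string"},
                "confidence": {"type": "number"},
            },
            "required": ["answer", "confidence"],
            "additionalProperties": False,
        },
    },
}

# Function Calling, by contrast, describes a *tool* the model may invoke:
# the model emits arguments for your code to execute, not a user-facing reply.
weather_tool = {
    "type": "function",
    "function": {
        "name": "get_weather",  # illustrative tool
        "description": "Look up current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
            "additionalProperties": False,
        },
    },
}
```

Either payload would be passed to the same endpoint, e.g. `client.chat.completions.create(..., response_format=structured_response_format)` versus `client.chat.completions.create(..., tools=[weather_tool])`; the choice signals whether the structured data is the answer itself or an instruction to run a tool.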

DESIGN PRINCIPLES AND TECHNICAL IMPLEMENTATION

The development of Structured Outputs involved both engineering and research. The engineering side focuses on constraining the model's output, for example, by limiting available tokens to fit a schema. The research aspect involves training the model to better understand and adhere to desired formats. This dual approach addresses issues like models outputting excessive whitespace, which can occur with purely engineering-driven constraints. The API aims to be developer-friendly, with SDK integrations allowing the use of Pydantic or Zod objects, abstracting away serialization complexities for a smoother user experience.
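The Pydantic integration mentioned above can be sketched as follows. This assumes the openai-python beta helper `client.beta.chat.completions.parse` as it existed around the time of the episode; the `Step`/`MathReply` models and the `solve` helper are illustrative, and the API call requires an API key, so it is wrapped in a function that is never invoked here.

```python
from pydantic import BaseModel


# Illustrative Pydantic models: the SDK serializes these into a strict
# JSON schema, so the developer never hand-writes the schema.
class Step(BaseModel):
    explanation: str
    output: str


class MathReply(BaseModel):
    steps: list[Step]
    final_answer: str


def solve(question: str) -> MathReply:
    """Sketch only: requires the openai package and OPENAI_API_KEY."""
    from openai import OpenAI  # imported lazily so the sketch runs without it

    client = OpenAI()
    completion = client.beta.chat.completions.parse(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": question}],
        response_format=MathReply,  # a Pydantic class, not a raw schema dict
    )
    return completion.choices[0].message.parsed


# The schema the SDK derives can be inspected locally without any API call:
schema = MathReply.model_json_schema()
```

Passing the class directly and getting back a typed `.parsed` object is the abstraction the section describes: serialization and validation are handled by the SDK rather than the developer.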

HANDLING ERRORS AND MAINTAINING SAFETY

A significant feature highlighted is the 'refusal' field within the API. This allows the model to refuse requests that might be harmful or violate policies, even when operating under a specific schema. This is crucial for safety and maintaining model integrity. The decision to use a refusal field instead of standard HTTP error codes is explained by the unique nature of AI errors, which don't always fit traditional Web 2.0 paradigms and involve model-specific behaviors. This provides a clearer developer experience for handling such refusals gracefully.
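The branching this implies on the client side can be sketched like so. The `refusal` attribute mirrors the field on a Chat Completions message as described above; the `handle_message` helper and the stand-in response objects are invented for illustration, not real API calls.

```python
from types import SimpleNamespace


def handle_message(message):
    """Branch on the refusal field instead of treating it as an HTTP error.

    `message` mimics the shape of a completion message, which carries either
    normal content or a populated `refusal` string, but not both.
    """
    if getattr(message, "refusal", None):
        # The model declined (e.g. a harmful request); surface it gracefully
        # instead of trying to parse content that was never produced.
        return {"ok": False, "reason": message.refusal}
    return {"ok": True, "data": message.content}


# Illustrative stand-ins for two possible API responses:
normal = SimpleNamespace(content='{"answer": "42"}', refusal=None)
refused = SimpleNamespace(content=None, refusal="I can't help with that request.")
```

Because the request itself succeeded (the model ran and returned a well-formed message), a 200-with-refusal is a better fit than a 4xx/5xx status, which is the rationale the section gives.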

USE CASES AND THE FUTURE OF AGENTS

Structured Outputs is presented as a foundational building block for agentic applications, aiming to increase reliability in chained LLM calls from 95% to near 100%. Use cases include extracting structured data from unstructured text, dynamic UI generation using recursive schemas, and enabling more robust math tutoring systems by specifying step-by-step reasoning. The API's ability to handle nested structures makes it ideal for generating complex UIs. The speaker emphasizes that this feature is designed to make agentic workflows more stable and programmable, reducing the friction for developers building sophisticated AI applications.
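The recursive-schema idea for dynamic UI generation can be sketched as a self-referencing JSON schema. The `"$ref": "#"` self-reference follows the pattern OpenAI's Structured Outputs documentation uses for recursion; the component vocabulary (`div`, `button`, `text`) and the sample tree are invented for the example.

```python
# Each UI component may contain children of the same shape; the items
# entry points back at the whole schema ("$ref": "#"), allowing
# arbitrarily nested UI trees under strict mode.
ui_schema = {
    "name": "ui_component",  # illustrative name
    "strict": True,
    "schema": {
        "type": "object",
        "properties": {
            "type": {"type": "string", "enum": ["div", "button", "text"]},
            "label": {"type": "string"},
            "children": {"type": "array", "items": {"$ref": "#"}},
        },
        "required": ["type", "label", "children"],
        "additionalProperties": False,
    },
}


def depth(node: dict) -> int:
    """Walk a UI tree produced under this schema and report nesting depth."""
    kids = node.get("children", [])
    return 1 + (max(map(depth, kids)) if kids else 0)


# A sample tree a model might emit under the schema above:
sample = {
    "type": "div", "label": "root",
    "children": [
        {"type": "button", "label": "ok", "children": []},
        {"type": "div", "label": "panel",
         "children": [{"type": "text", "label": "hi", "children": []}]},
    ],
}
```

Because every node is guaranteed to match the same shape, rendering code can recurse over the output without defensive parsing, which is what makes chained or agentic uses of this pattern reliable.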

MODEL SELECTION, FINE-TUNING, AND API ROADMAP

OpenAI offers various models, with the recommendation to start with GPT-4o mini for cost-effectiveness and scale up to GPT-4o when higher performance is needed. The fine-tuning API is also highlighted as a powerful tool for achieving specific performance goals, with recent improvements making it more accessible. The roadmap includes exploring custom grammars beyond JSON schema, enhancing agent capabilities, and continuing to improve reliability and reduce latency. Features like parallel function calling, batch processing for cost savings, and advancements in the Vision and Whisper APIs are also discussed as ongoing developments.

Common Questions

What is Structured Outputs, and how does it differ from JSON mode?

Structured Outputs is an OpenAI API feature that constrains model responses to strictly adhere to a defined JSON schema. This drastically improves reliability for developers integrating LLMs into applications, ensuring outputs match expected formats and types, unlike previous methods like JSON mode.

Topics

Mentioned in this video

Software & Apps
GPT-4o mini

OpenAI's cheapest model and the recommended 'workhorse' for many use cases, a good cost-effective starting point.

Python

A programming language mentioned in the context of potential future grammar support for structured outputs and available for notebook-based solutions.

Vision API

OpenAI's API for processing visual inputs, usable across various OpenAI services and offering capabilities for data extraction from images.

Batch API

A cost-effective API for bulk processing tasks with a 24-hour turnaround, ideal for user activation flows, evaluations, and other non-time-sensitive jobs.

Hacker News

A platform where Michelle discovered GitHub Copilot and later learned about Stripe and Coinbase, influencing her career path.

Readwise

A company Michelle co-founded at the University of Waterloo as part of an entrepreneurship co-op program.

ChatGPT

A product that excites many joining OpenAI, though Michelle's initial draw was GitHub Copilot. It also presented scaling challenges for the API due to tied accounts.

Zod

A TypeScript library for schema declaration and validation, mentioned as a way to use structured outputs easily by passing Zod objects.

Assistant API

An OpenAI API that offers hosted tools like file search and code interpreter, and supports statefulness for conversational AI applications.

GPT-4 Turbo

A powerful model available through OpenAI's API, with specific versions like 'gpt-4-turbo' and 'gpt-4-turbo-preview' offering different capabilities and tuning.

Clubhouse

A company Michelle worked at as one of the first backend engineers, experiencing rapid growth and scaling challenges, including meltdowns during high-profile events.

Visual Basic

A programming language Michelle learned during a rough early job at a bank, which she did not enjoy.

Git

A version control system Michelle learned during her internship at Coinbase, highlighting her early engineering development.

GitHub Copilot

A product that deeply impressed Michelle, leading her to join OpenAI, due to its high quality and transformative potential.

DALL-E

An OpenAI product showcased at a launch event that Michelle attended, which she found cool but less compelling for her career focus than Copilot.

Node.js SDK

The software development kit where the 'runs' beta feature for function calling was first implemented, allowing for closing the loop in conversations.

llama.cpp

A project where constrained grammar mechanisms, like using Backus-Naur form, were first observed, influencing discussions around grammar in LLMs.

Code Interpreter

A tool offered through the Assistant API that allows models to execute code, simplifying complex tasks for developers.

GPT-3.5

An earlier generation of OpenAI models, which the discussion notes is nearing the end of its run for certain applications.

GPT-4o

An OpenAI model recommended for advanced use cases when GPT-4o mini doesn't meet performance needs.

Whisper

OpenAI's speech-to-text model, with discussions covering its API limitations (lack of diarization) and potential future improvements.

LiveKit

A real-time communication framework OpenAI uses for the ChatGPT app, suggesting a direction for real-time, interactive APIs like speech-to-speech.
