Goodfire AI’s Bet: Interpretability as the Next Frontier of Model Design — Myra Deng & Mark Bissell

Latent Space Podcast
Science & Technology · 3 min read · 69 min video
Feb 5, 2026

TL;DR

Goodfire AI pioneers mechanistic interpretability for safer, more powerful AI, securing a $150M Series B.

Key Insights

1. Interpretability is broadly defined, encompassing more than just post-hoc analysis; it aims for intentional AI design throughout the lifecycle.

2. Goodfire focuses on applying interpretability to real-world problems, translating frontier research into production workflows and APIs.

3. Mechanistic interpretability has applications beyond language models, including scientific discovery (healthcare, materials science) and visual domains (world models, diffusion).

4. Techniques like steering and sparse autoencoders (SAEs) aim to provide surgical control over model behavior, though challenges such as feature representation and transferability remain.

5. The field of interpretability is growing rapidly, and a lower barrier to entry is attracting talent from diverse scientific backgrounds.

6. There is a strong emphasis on scalable oversight and human involvement in the future of AI, aiming for intentional design rather than solely reactive training methods.

DEFINING INTERPRETABILITY AND GOODFIRE'S MISSION

Interpretability, a term with diverse definitions, is central to Goodfire AI's mission. Goodfire views itself as an AI research company specializing in interpretability methods to understand, learn from, and design AI models. Their vision extends beyond a black-box approach, aiming to bring interpretability to the entire AI development lifecycle, from data curation during training to understanding internal representations post-training. This perspective positions interpretability as key to unlocking the next frontier of safe and powerful AI.

FROM RESEARCH TO PRODUCTION: GOODFIRE'S APPROACH

Goodfire bridges the gap between academic interpretability research and practical, real-world applications. They focus on developing repeatable production workflows and APIs, moving interpretability out of the research lab and into enterprise deployments. This involves deep engagement with customers to understand pressing issues and then applying cutting-edge interpretability techniques. Failures and shortcomings encountered in these applications inform their research agenda, driving advancements in areas like foundational interpretability models and control mechanisms.

APPLICATIONS: ADDRESSING CHALLENGES AND UNLOCKING NEW CAPABILITIES

Interpretability aims to tackle various AI challenges, from unintended side effects like reward hacking and hallucinations to specific biases (e.g., political bias). Techniques like activation steering can be used for 'surgical edits,' allowing for precise modifications of model behavior. This is crucial for tasks like unlearning undesirable traits or enhancing desired ones. Furthermore, interpretability is vital for scientific discovery, enabling understanding of complex models in domains like genomics and medical imaging to identify novel biomarkers or validate biological relevance.
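The "surgical edit" idea can be sketched with a difference-of-means steering vector. Everything below is synthetic and illustrative, not Goodfire's implementation: real steering vectors are extracted from a model's residual stream on contrastive prompts.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model = 16

# Hypothetical cached activations for prompts that do / don't express a
# concept; in practice these come from forward passes of the model.
acts_with = rng.normal(0.5, 1.0, size=(32, d_model))
acts_without = rng.normal(-0.5, 1.0, size=(32, d_model))

# Difference-of-means vector pointing toward the concept.
steer = acts_with.mean(axis=0) - acts_without.mean(axis=0)
steer /= np.linalg.norm(steer)

def apply_steering(hidden_state, vector, alpha=4.0):
    """Add the scaled concept direction to a layer's hidden state."""
    return hidden_state + alpha * vector

h = rng.normal(size=(d_model,))
h_steered = apply_steering(h, steer)
```

The edit touches only one direction of the hidden state, which is what makes steering "surgical" compared with fine-tuning all weights; a negative alpha is one way to suppress, rather than amplify, a concept (as in unlearning undesirable traits).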

EXPLORING INTERPRETABILITY TECHNIQUES AND CHALLENGES

Key techniques discussed include Sparse Autoencoders (SAEs) for feature extraction and probes for classification. While SAEs aim to capture concepts sparsely, challenges arise when the learned feature space isn't as clean or accurate as expected for downstream tasks. Probes trained on raw activations sometimes outperform SAE-based probes on specific detection tasks. Practical applications, such as identifying PII in multilingual e-commerce data, surface complexities like synthetic-to-real transfer and token-level precision, demanding robust solutions.
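A minimal sketch of the two probing pipelines compared above, on synthetic data. The SAE here is untrained, with random weights, purely to show the shapes involved; real SAEs are trained to reconstruct model activations under a sparsity penalty.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_sae, n = 8, 32, 200

# Toy stand-ins for residual-stream activations with one linearly
# decodable property (e.g. "contains PII").
X = rng.normal(size=(n, d_model))
y = (X[:, 0] > 0).astype(float)

# Untrained SAE encoder: an overcomplete ReLU basis.
W_enc = rng.normal(scale=0.3, size=(d_model, d_sae))

def sae_features(x):
    return np.maximum(x @ W_enc, 0.0)  # sparse, non-negative codes

def fit_linear_probe(feats, labels):
    """Least-squares probe with a bias term (logistic is more typical)."""
    A = np.hstack([feats, np.ones((len(feats), 1))])
    w, *_ = np.linalg.lstsq(A, labels, rcond=None)
    return lambda f: np.hstack([f, np.ones((len(f), 1))]) @ w > 0.5

raw_probe = fit_linear_probe(X, y)                  # probe on raw activations
sae_probe = fit_linear_probe(sae_features(X), y)    # probe on SAE features

raw_acc = np.mean(raw_probe(X) == (y > 0.5))
sae_acc = np.mean(sae_probe(sae_features(X)) == (y > 0.5))
```

With trained SAEs the practical question is whether probing sparse features matches probing raw activations; as the episode notes, raw-activation probes sometimes win on narrow detection tasks.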

STEERING AND MODEL DESIGN: ENABLING INTENTIONALITY

Steering, demonstrated on Kimi K2, a trillion-parameter model, allows for real-time modification of model behavior through concept manipulation. This goes beyond stylistic changes, aiming for more sophisticated control. Goodfire sees steering and related techniques as a move toward intentional AI design, in contrast with purely data-driven or reactive training methods like RL. The goal is to enable experts to impart intent and control into models, creating a more robust human-AI interface, especially for mission-critical applications.

BROADENING HORIZONS: INTERPRETABILITY ACROSS DOMAINS

Goodfire's interpretability work extends beyond language models to visual domains (world models, diffusion models) and scientific applications. In the healthcare sector, they partner with institutions like Mayo Clinic to extract knowledge from complex biological models, aiming to accelerate drug discovery and disease treatment. The underlying ML techniques are transferable across these domains, even handling different data modalities like 3D scans in medical imaging. The core challenge remains making opaque models legible and controllable, especially in risk-averse fields like healthcare.

THE FUTURE OF INTERPRETABILITY AND AI SAFETY

The field of interpretability is rapidly advancing, with a growing community and accessible research resources. Goodfire emphasizes a grounded approach to AI safety and alignment, focusing on scalable oversight and human involvement. They believe interpretability is key to technical alignment challenges, ensuring models behave as intended. The broader community, spanning academia and frontier labs, shares the goal of increasing model understanding, facilitating more reliable control and robust deployment of advanced AI systems.

Common Questions

What is interpretability in AI?

Interpretability in AI refers to methods and techniques that aim to provide an understanding of a machine learning model's internal workings, moving beyond a 'black box' approach. It seeks to explain how a model arrives at its outputs from its inputs.

Topics

Mentioned in this video

concept: SAEs (Sparse Autoencoders)

A type of interpretability technique that involves training an autoencoder to be sparse, aiming to capture concepts cleanly. Discussed as a foundational element but with noted shortcomings compared to raw activations in some tasks.

person: Neel Nanda

Interpretability researcher associated with the Gemini team, who has published notebooks on how to perform interpretability tasks.

person: Lee Sharkey

Author of the 'Open Problems in Mechanistic Interpretability' paper; works with Goodfire in their London office and contributes to the mechanistic interpretability conference.

organization: Chan Zuckerberg Initiative

Mentioned in the context of other podcasts discussing AI and its potential in research.

person: Dana Warcraft

A fellow at Goodfire who contributed to the paper on the equivalence of in-context learning and activation steering.

person: Eric Bigelow

A fellow at Goodfire who contributed to the paper on the equivalence of in-context learning and activation steering.

software: Neuronpedia

A tool noted for its excellent visualizations of neural network features.

concept: Double Descent

A phenomenon in machine learning where model performance initially improves, then degrades, and then improves again as model complexity increases. Discussed in the context of generalization vs. memorization and how interpretability might help navigate it.

concept: Subliminal Learning

A phenomenon where models learn hidden biases even when not explicitly trained on biased data, observed through training on distilled data. Discussed as a worrying area where interpretability is needed.

organization: MATS (ML Alignment & Theory Scholars)

A research training program that has been a pathway for many full-time staff joining Goodfire and a good starting point for those transitioning into interpretability.

organization: Arc Institute

An organization partnered with Goodfire for life sciences work.

company: Two Sigma

Company where Myra Deng previously worked.

software: Tinker

An open-source library from Thinking Machines that uses Rank One LoRA for fine-tuning and RL, mentioned as an alternative to steering for model adaptation.

organization: Goodfire

An AI research lab that focuses on using interpretability to understand, learn from, and design AI models, aiming to unlock the next frontier of safe and powerful AI.

concept: GPT-4o 'Glazegate'

The GPT-4o sycophancy controversy from the previous year, cited as an example of unintended consequences from model post-training.

person: Jan Leike

Co-proposed the question about scaling weak-to-strong generalization in AI.

person: Mark Bissell

Co-founder of Goodfire, previously worked at Palantir on the healthcare team.

film: Arrival

Based on Ted Chiang's short story 'Story of Your Life', it explores themes of alien intelligence and communication, relevant to the AI-human interface problem.

book: Exhalation

A short story by Ted Chiang about a robot performing interpretability on its own mind, highly relevant to the field.

company: Prima

A startup focused on neurodegenerative disease that partners with Goodfire, using interpretability to find novel biomarkers for Alzheimer's.

concept: CCP bias

A political bias observed in some models, referring to a bias related to the Chinese Communist Party, which Goodfire aims to extract or remove using interpretability techniques.

person: Myra Deng

Co-founder of Goodfire, previously worked at Two Sigma. Currently Head of Product.

organization: ICML

International Conference on Machine Learning, where Goodfire noted the 'actionable interpretability' theme at a workshop.

software: Kimi K2

Moonshot AI's one-trillion-parameter language model, on which steering was demonstrated.

company: Rakuten

A Japanese e-commerce company using Goodfire's interpretability techniques for guardrailing and inference-time monitoring of their language models to detect PII.

software: MedGemma

A medical-domain model (MedGemma 1.5 is mentioned), trained on 3D scans and medical knowledge, highlighting the application of transformer architectures in specialized fields.

tool: Gemma

Google's family of open-weight language models, widely used in interpretability research.
