Goodfire AI’s Bet: Interpretability as the Next Frontier of Model Design — Myra Deng & Mark Bissell
Key Moments
Goodfire AI pioneers mechanistic interpretability for safer, more powerful AI, securing a $150M Series B.
Key Insights
Interpretability is broadly defined, encompassing more than just post-hoc analysis; it aims for intentional AI design throughout the lifecycle.
Goodfire focuses on applying interpretability to real-world problems, translating frontier research into production workflows and APIs.
Mechanistic interpretability has applications beyond language models, including scientific discovery (healthcare, material science) and visual domains (world models, diffusion).
Techniques like steering and sparse autoencoders (SAEs) aim to provide surgical control over model behavior, though challenges such as feature representation quality and transferability remain.
The field of interpretability is rapidly growing, with a lower barrier to entry, attracting talent from diverse scientific backgrounds.
There's a strong emphasis on scalable oversight and human involvement in the future of AI, aiming for intentional design rather than solely reactive training methods.
DEFINING INTERPRETABILITY AND GOODFIRE'S MISSION
Interpretability, a term with diverse definitions, is central to Goodfire AI's mission. Goodfire views itself as an AI research company specializing in interpretability methods to understand, learn from, and design AI models. Their vision extends beyond a black-box approach, aiming to bring interpretability to the entire AI development lifecycle, from data curation during training to understanding internal representations post-training. This perspective positions interpretability as key to unlocking the next frontier of safe and powerful AI.
FROM RESEARCH TO PRODUCTION: GOODFIRE'S APPROACH
Goodfire bridges the gap between academic interpretability research and practical, real-world applications. They focus on developing repeatable production workflows and APIs, moving interpretability out of the research lab and into enterprise deployments. This involves deep engagement with customers to understand pressing issues and then applying cutting-edge interpretability techniques. Failures and shortcomings encountered in these applications inform their research agenda, driving advancements in areas like foundational interpretability models and control mechanisms.
APPLICATIONS: ADDRESSING CHALLENGES AND UNLOCKING NEW CAPABILITIES
Interpretability aims to tackle various AI challenges, from unintended side effects like reward hacking and hallucinations to specific biases (e.g., political bias). Techniques like activation steering can be used for 'surgical edits,' allowing for precise modifications of model behavior. This is crucial for tasks like unlearning undesirable traits or enhancing desired ones. Furthermore, interpretability is vital for scientific discovery, enabling understanding of complex models in domains like genomics and medical imaging to identify novel biomarkers or validate biological relevance.
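The episode does not walk through the mechanics of such edits, but a common starting point in the interpretability literature is deriving a concept direction from contrasting activations and then projecting it out of the hidden state. The sketch below is a minimal illustration of that idea, assuming a toy hidden layer and synthetic activations rather than anything from Goodfire's actual tooling.

```python
import torch

# Toy stand-in for one hidden layer of a transformer (hidden size 64).
# In practice these activations would be cached from a real model via hooks.
torch.manual_seed(0)
hidden_size = 64

# Hypothetical activation batches: prompts that exhibit an unwanted trait
# (e.g. a specific bias) vs. matched prompts that do not.
acts_with_trait = torch.randn(128, hidden_size) + 0.5
acts_without_trait = torch.randn(128, hidden_size)

# Difference-of-means gives a candidate "concept direction" for the trait.
concept_dir = acts_with_trait.mean(dim=0) - acts_without_trait.mean(dim=0)
concept_dir = concept_dir / concept_dir.norm()

def edit_activations(h: torch.Tensor, alpha: float = 1.0) -> torch.Tensor:
    """Surgically remove (alpha=1) or dampen the concept by projecting
    its component out of each hidden state."""
    projection = (h @ concept_dir).unsqueeze(-1) * concept_dir
    return h - alpha * projection

edited = edit_activations(acts_with_trait)
print("alignment before:", (acts_with_trait @ concept_dir).mean().item())
print("alignment after: ", (edited @ concept_dir).mean().item())
```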
EXPLORING INTERPRETABILITY TECHNIQUES AND CHALLENGES
Key techniques discussed include sparse autoencoders (SAEs) for feature extraction and probes for classification. While SAEs aim to capture concepts sparsely, challenges arise when the learned feature space is not as clean or accurate as expected for downstream tasks, even when built from clean datasets. Probes trained on raw activations sometimes outperform SAE-based probes on specific detection tasks. Practical applications of these techniques, such as identifying PII in multilingual e-commerce data, surface complexities like synthetic-to-real transfer and token-level precision, demanding robust solutions.
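As a rough illustration of the raw-activation-probe versus SAE-probe comparison, the sketch below trains a logistic-regression probe on synthetic activations and on features from a stand-in (untrained) SAE encoder. Everything here is assumed for illustration; it mirrors the shape of the comparison, not Goodfire's pipeline or results.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
d_model, d_sae, n = 64, 256, 512

# Synthetic residual-stream activations with a binary label (e.g. "contains PII").
# In a real pipeline these would be cached from a language model.
acts = torch.randn(n, d_model)
labels = (acts[:, 0] > 0).float()  # toy ground truth tied to one direction

# A pretrained SAE would normally be loaded here; this encoder is random,
# so it only illustrates the shape of the comparison, not real results.
sae_enc = nn.Linear(d_model, d_sae)
sae_features = torch.relu(sae_enc(acts)).detach()

def train_probe(x: torch.Tensor, y: torch.Tensor, steps: int = 200) -> float:
    """Fit a logistic-regression probe and return training accuracy."""
    probe = nn.Linear(x.shape[1], 1)
    opt = torch.optim.Adam(probe.parameters(), lr=1e-2)
    loss_fn = nn.BCEWithLogitsLoss()
    for _ in range(steps):
        opt.zero_grad()
        loss = loss_fn(probe(x).squeeze(-1), y)
        loss.backward()
        opt.step()
    preds = (probe(x).squeeze(-1) > 0).float()
    return (preds == y).float().mean().item()

print("raw-activation probe acc:", train_probe(acts, labels))
print("SAE-feature probe acc:   ", train_probe(sae_features, labels))
```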
STEERING AND MODEL DESIGN: ENABLING INTENTIONALITY
Steering, demonstrated on 'Kimi K2', a trillion-parameter model, allows real-time modification of model behavior through concept manipulation. This goes beyond stylistic changes, aiming for more sophisticated control. Goodfire sees steering and related techniques as a move toward intentional AI design, in contrast with purely data-driven or reactive training methods like RL. The goal is to let experts impart their intent and control into models, creating a more robust human-AI interface, especially for mission-critical applications.
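A minimal sketch of the general steering mechanism described here, assuming a toy block and a made-up concept vector rather than the actual Kimi K2 demo: a forward hook adds a scaled concept direction to a layer's output at inference time.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
d_model = 64

# Toy stand-in for one transformer block; a real demo would register the
# hook on a layer of an actual LLM rather than this small module.
block = nn.Sequential(nn.Linear(d_model, d_model), nn.GELU())

# Hypothetical unit-norm concept vector (e.g. an SAE feature direction or a
# difference-of-means direction); here it is random for illustration.
steer_vec = torch.randn(d_model)
steer_vec = steer_vec / steer_vec.norm()
strength = 4.0  # how hard to push activations toward the concept

def steering_hook(module, inputs, output):
    # Add the scaled concept direction to the block's output activations.
    return output + strength * steer_vec

handle = block.register_forward_hook(steering_hook)

x = torch.randn(2, d_model)   # pretend hidden states for two tokens
steered = block(x)            # hook fires here, shifting activations
handle.remove()
unsteered = block(x)

print("shift along concept:", ((steered - unsteered) @ steer_vec).mean().item())
```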
BROADENING HORIZONS: INTERPRETABILITY ACROSS DOMAINS
Goodfire's interpretability work extends beyond language models to visual domains (world models, diffusion models) and scientific applications. In the healthcare sector, they partner with institutions like Mayo Clinic to extract knowledge from complex biological models, aiming to accelerate drug discovery and disease treatment. The underlying ML techniques are transferable across these domains, even handling different data modalities like 3D scans in medical imaging. The core challenge remains making opaque models legible and controllable, especially in risk-averse fields like healthcare.
THE FUTURE OF INTERPRETABILITY AND AI SAFETY
The field of interpretability is rapidly advancing, with a growing community and accessible research resources. Goodfire emphasizes a grounded approach to AI safety and alignment, focusing on scalable oversight and human involvement. They believe interpretability is key to technical alignment challenges, ensuring models behave as intended. The broader community, spanning academia and frontier labs, shares the goal of increasing model understanding, facilitating more reliable control and robust deployment of advanced AI systems.
Common Questions
What is interpretability in AI?
Interpretability in AI refers to methods and techniques that aim to provide an understanding of what's happening inside a machine learning model's internal workings, moving beyond a 'black box' approach. It seeks to explain how a model arrives at its outputs from its inputs.
Topics
Mentioned in this video
A type of interpretability technique that involves training an autoencoder to be sparse, aiming to capture concepts cleanly. Discussed as a foundational element but with noted shortcomings compared to raw activations in some tasks.
Associated with the Gemini team, which has provided notebooks on how to perform interpretability tasks.
Author of the 'Open Problems in Mechanistic Interpretability' paper; works with Goodfire in their London office and contributes to the industry mechanistic-interpretability conference.
Mentioned in the context of other podcasts discussing AI and its potential in research.
A fellow at Goodfire who contributed to the paper on the equivalence of in-context learning and activation steering.
A fellow at Goodfire who contributed to the paper on the equivalence of in-context learning and activation steering.
A tool mentioned for its excellent work in visualizing neural network components.
A phenomenon in machine learning where model performance initially improves, then degrades, and then improves again as model complexity increases. Discussed in the context of generalization vs. memorization and how interpretability might help navigate it.
A phenomenon where models learn hidden biases even when not explicitly trained on biased data, observed through training on distilled data. Discussed as a worrying area where interpretability is needed.
A program, possibly an internship, that has been a pathway for many full-time staff joining Goodfire and is a great starting point for those transitioning into interpretability.
An organization partnered with Goodfire for life sciences work.
Company where Myra Deng previously worked.
An open-source library from Thinking Machines that uses Rank One LoRA for fine-tuning and RL, mentioned as an alternative to steering for model adaptation.
An AI research lab that focuses on using interpretability to understand, learn from, and design AI models, aiming to unlock the next frontier of safe and powerful AI.
Referred to as a major AI controversy from the previous year, implying unintended consequences from model post-training processes.
Co-posed the question about scaling weak-to-strong generalization in AI.
Co-founder of Goodfire, previously worked at Palantir on the healthcare team.
Based on Ted Chiang's short story 'Story of Your Life', it explores themes of alien intelligence and communication, relevant to the AI-human interface problem.
A short story by Ted Chiang about a robot performing interpretability on its own mind, highly relevant to the field.
A startup focused on neurodegenerative disease that partners with Goodfire, using interpretability to find novel biomarkers for Alzheimer's.
A political bias observed in some models, referring to a bias related to the Chinese Communist Party, which Goodfire aims to extract or remove using interpretability techniques.
Co-founder of Goodfire, previously worked at Two Sigma. Currently Head of Product.
International Conference on Machine Learning, where Goodfire noted the 'actionable interpretability' theme at a workshop.
A one-trillion-parameter large language model on which steering was demonstrated in real time.
A Japanese e-commerce company using Goodfire's interpretability techniques for guardrailing and inference time monitoring of their language models to detect PII.
A medical domain model, with MedMA 1.5 mentioned, trained on 3D scans and medical knowledge, highlighting the application of transformer architectures in specialized fields.