Stanford AA228V | Validation of Safety Critical Systems | Explainability

Key Moments
AI models can develop internal biases, like reconstructing ethnicity even when it's not an input feature, requiring advanced causal analysis beyond simple explanations to ensure safety and fairness.
Key Insights
The Shapley value, a concept from game theory, can quantify the contribution of individual features or time steps to an outcome, but its combinatorial cost (on the order of 40! orderings for a 40-step trajectory) makes it impractical in high-dimensional settings.
Policy visualization, by plotting state spaces and policy outputs, can reveal 'dead zones' where models perform poorly, although it's limited by the dimensionality of the state space.
Simple gradient-based saliency maps for image classification often fail due to numerical issues and the localized nature of gradients, producing noisy, hard-to-interpret visualizations.
Integrated gradients, by interpolating between a baseline and the actual input, offer an improvement over simple gradients for feature attribution in vision models and can be adapted for LLMs by operating on token embeddings.
Mechanistic interpretability aims to understand AI models by identifying internal concepts (like 'ethnicity') represented as 'directions in space' within high-dimensional embeddings, moving beyond correlation to causation.
Sparse autoencoders are used to find these 'dictionary elements' representing concepts, enabling the construction of causal graphs that can potentially reveal and mitigate hidden biases within LLMs.
Attributing failure to noise through Shapley values
The lecture begins by discussing the challenge of attributing failures in AI systems, using a cart-pole example where random noise can lead to system collapse. To understand which noise instances were most critical, a 'leave one out' analysis is proposed. This evolves into the concept of Shapley values, borrowed from game theory, which assigns a numerical value to each feature (or noise instance in this case) based on its contribution to the outcome across all possible combinations of features. While powerful for understanding contributions, the computational complexity of Shapley values, growing factorially with the number of features (e.g., 40 factorial for a 40-step trajectory), makes them impractical for high-dimensional problems. This method aims to answer 'why did this failure happen?' by quantifying the impact of individual elements, and subsequently 'what can we do about it?' by suggesting areas for mitigation. However, the 'so what' is that even theoretically sound attribution methods like Shapley values face significant scalability challenges when applied to real-world complex systems.
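As a concrete illustration, here is a minimal exact Shapley computation over a hypothetical three-step noise trajectory (the characteristic function and its values are invented for illustration, not taken from the lecture); the factorial loop over orderings is exactly what makes the method intractable at 40 steps.

```python
import math
from itertools import permutations

def shapley_values(players, value):
    """Exact Shapley values: average each player's marginal contribution
    over all n! orderings -- the factorial cost that makes this
    intractable for, e.g., a 40-step noise trajectory."""
    phi = {p: 0.0 for p in players}
    for order in permutations(players):
        coalition = frozenset()
        for p in order:
            phi[p] += value(coalition | {p}) - value(coalition)
            coalition = coalition | {p}
    n_fact = math.factorial(len(players))
    return {p: v / n_fact for p, v in phi.items()}

# Hypothetical characteristic function: "failure severity" caused by
# injecting noise at time steps 1-3; step 2 is the main culprit.
impact = {frozenset(): 0.0,
          frozenset({1}): 1.0, frozenset({2}): 3.0, frozenset({3}): 1.0,
          frozenset({1, 2}): 4.0, frozenset({1, 3}): 2.0,
          frozenset({2, 3}): 4.0, frozenset({1, 2, 3}): 5.0}
phi = shapley_values([1, 2, 3], lambda c: impact[c])
# Step 2 receives the largest share, and the shares sum to the
# total severity of 5.0 (the efficiency axiom).
```

Swapping the exact permutation loop for Monte Carlo sampling of orderings is the standard way to trade exactness for tractability when n grows.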
Visualizing policy behavior to identify model weaknesses
Policy visualization offers a more intuitive approach, especially for lower-dimensional systems like the cart-pole. By plotting the entire state space and the policy's action at each point, 'dead zones' or areas where the policy behaves erratically can be identified. For instance, failures might occur when the system enters a regime not well-represented in the training data, leading the model to behave unpredictably. This was illustrated with a cart-pole example where the neural network policy, trained via behavioral cloning, was only robust in the central state space, failing when noise pushed it into an unexplored region. The 'so what' here is that while visual methods can provide clear explanations for simple systems, their applicability diminishes rapidly with increasing state-space dimensionality, limiting their use for more complex AI.
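A sketch of the idea in NumPy (the policy here is a made-up stand-in, not the lecture's behavioral-cloning network): sweep a 2D slice of the cart-pole state space, record the action at every grid point, and look for regions with no corrective response.

```python
import numpy as np

# Hypothetical cart-pole policy: competent only near the upright
# equilibrium where its training data lived; outside that region it
# outputs no corrective force, creating a visible "dead zone".
def policy(theta, theta_dot):
    in_training_region = (np.abs(theta) < 0.5) & (np.abs(theta_dot) < 1.0)
    pd_action = np.clip(-10.0 * theta - 2.0 * theta_dot, -1.0, 1.0)
    return np.where(in_training_region, pd_action, 0.0)

# Sweep a 2D slice of the state space (pole angle x angular rate).
thetas = np.linspace(-1.0, 1.0, 41)
theta_dots = np.linspace(-2.0, 2.0, 41)
T, TD = np.meshgrid(thetas, theta_dots)
actions = policy(T, TD)

# "Dead zone" = grid cells where the policy produces no force; in a
# plot this would appear as a flat region surrounding the trained core.
dead_fraction = np.mean(actions == 0.0)
```

In practice one would render `actions` as a heatmap (e.g., with matplotlib's `pcolormesh`); the grid itself is the point, which is also why the approach scales poorly beyond a few state dimensions.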
From pixel gradients to semantic understanding in vision models
For vision models, early interpretability methods like saliency maps, which use gradients of the output with respect to input pixels, proved to be noisy and difficult to interpret. This is partly due to numerical properties of neural network outputs (like softmax) and the localized nature of pixel gradients. A more robust approach is integrated gradients, which computes gradients along a path from a baseline (e.g., a black image) to the actual input, providing a clearer attribution of important image regions. Further advancements led to methods like Grad-CAM, which uses gradients from later layers of a Convolutional Neural Network (CNN) to generate heatmaps highlighting semantically relevant regions. This moves from pixel-level explanations to more conceptual feature localization, indicating where the model is 'looking' (e.g., focusing on a dog's head or a cat's hindquarters). These methods help answer 'why did it fail?' by showing what the model focused on, and 'what can we do?' by suggesting interventions or data augmentation based on these visualizations. The 'so what' is that interpretability methods are evolving from raw pixel importance to higher-level semantic understanding, offering more actionable insights, but not without their own numerical and theoretical challenges.
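The core of integrated gradients is just an averaged gradient along the straight-line path from baseline to input. A minimal NumPy sketch on a toy differentiable function (F(x) = Σx² with gradient 2x, standing in for a real vision model and its autodiff):

```python
import numpy as np

def integrated_gradients(grad_f, x, baseline, steps=200):
    """Riemann-sum approximation of integrated gradients:
    IG_i = (x_i - x'_i) * integral_0^1 dF/dx_i(x' + a(x - x')) da."""
    alphas = (np.arange(steps) + 0.5) / steps          # midpoint rule
    path = baseline + alphas[:, None] * (x - baseline)  # points along path
    avg_grad = grad_f(path).mean(axis=0)                # average gradient
    return (x - baseline) * avg_grad

# Toy differentiable "model": F(x) = sum(x**2), so grad F = 2x.
grad_f = lambda X: 2.0 * X
x = np.array([1.0, -2.0, 3.0])
baseline = np.zeros(3)          # analogous to the all-black image
attr = integrated_gradients(grad_f, x, baseline)
# Completeness axiom: attributions sum to F(x) - F(baseline) = 14.
```

For this toy F the attributions come out to x_i² exactly; with a real network, `grad_f` would be a batched autodiff call and the same completeness check is a useful sanity test of the step count.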
Detecting spurious correlations mimicking human biases
A key challenge highlighted is how AI models can learn spurious correlations, akin to the anecdote of 'Clever Hans' the arithmetic horse. For instance, a model might classify birds based on a blue background in its training data or identify locations based on timestamps in images. In safety-critical systems, this can manifest as models relying on non-causal features that coincidentally correlate with desired outcomes, leading to unexpected failures when deployed in different environments. This is particularly concerning in domains like autonomous driving or aviation, where subtle biases can have severe consequences. The 'so what' is that explainability methods must go beyond superficial correlations to uncover underlying causal mechanisms, especially in systems where robustness and fairness are paramount.
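A tiny synthetic demonstration of this failure mode (the "blueness" feature and all numbers are invented): a classifier that keys on a spuriously correlated background feature aces its training set and collapses to chance the moment the correlation is broken at deployment.

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical "Clever Hans" setup: in training data, every bird photo
# happens to have a blue-sky background, so mean blueness alone
# separates the classes perfectly.
def make_data(n, bird_on_blue=True):
    label = rng.random(n) < 0.5                       # True = bird
    blue_bg = label if bird_on_blue else rng.random(n) < 0.5
    blueness = np.where(blue_bg, 0.8, 0.2) + 0.05 * rng.normal(size=n)
    return blueness, label

blue_tr, y_tr = make_data(10_000)
# "Model": a threshold on the spurious background feature.
threshold = 0.5
train_acc = np.mean((blue_tr > threshold) == y_tr)

# Deployment: backgrounds no longer correlate with the class.
blue_te, y_te = make_data(10_000, bird_on_blue=False)
test_acc = np.mean((blue_te > threshold) == y_te)
# train_acc is near 1.0; test_acc drops to roughly coin-flip.
```

The attribution methods above are exactly the tools for catching this before deployment: a saliency or Grad-CAM map would light up the background rather than the bird.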
The frontier of mechanistic interpretability in LLMs
The lecture pivots to the cutting edge of interpretability, focusing on Large Language Models (LLMs) and the concept of mechanistic interpretability. The core problem is that even if sensitive features like 'ethnicity' are removed from the input, LLMs might still implicitly reconstruct and use this information through internal representations. This requires understanding the 'model's internal world' rather than just its input-output behavior. Mechanistic interpretability seeks to identify specific concepts (like 'ethnicity', 'Golden Gate Bridge', or 'capital of Texas') as 'directions' within the high-dimensional embedding spaces of LLMs. Sparse autoencoders are a key technique used here, attempting to decompose complex embeddings into a sparse set of these concept-representing directions. This allows researchers to build 'causal circuits' within the model, analogous to Bayesian networks, to trace how concepts are activated and influence the final output. The ability to intervene on these internal directions (e.g., 'zero out' the ethnicity direction) offers a path to mitigating learned biases. The 'so what' is that understanding LLMs requires looking 'inside' the black box to map abstract concepts to internal computations, a complex but necessary step for building trustworthy AI.
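The shape of the sparse-autoencoder technique can be sketched in a few lines (untrained, randomly initialized weights; a real SAE is trained on millions of residual-stream activations, and all dimensions and indices here are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical setup: embeddings of dimension d are assumed to be
# sparse combinations of n_dict "concept directions" (d < n_dict,
# i.e., an overcomplete dictionary).
d, n_dict = 16, 64
W_dec = rng.normal(size=(n_dict, d))
W_dec /= np.linalg.norm(W_dec, axis=1, keepdims=True)   # unit directions
W_enc, b_enc = W_dec.T.copy(), np.zeros(n_dict)          # tied init

def sae(x):
    """One SAE forward pass: nonnegative sparse codes f, reconstruction."""
    f = np.maximum(x @ W_enc + b_enc, 0.0)   # ReLU encoder
    return f, f @ W_dec                       # linear decoder

# Synthesize an embedding that truly is a sum of 3 concept directions.
active = [5, 17, 42]
x = sum(W_dec[i] for i in active)
f, x_hat = sae(x)

# The training objective would be reconstruction error plus an L1
# penalty that pushes most entries of f to exactly zero.
loss = np.sum((x - x_hat) ** 2) + 1e-3 * np.sum(np.abs(f))

# Intervening on a concept: zero its code and decode again -- the
# "ablate a direction" operation described for mitigating biases.
f_ablated = f.copy()
f_ablated[42] = 0.0
x_edited = f_ablated @ W_dec
```

The last step is the payoff: once a direction is identified with a concept, editing the code vector gives a causal intervention on the model's internal state rather than on its inputs.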
Causal inference and the challenge of model explanations
The discussion touches upon the difference between correlation and causation, drawing parallels to historical debates about smoking and cancer. While observational data can reveal strong correlations (e.g., between smoking, genetics, and cancer), it struggles to establish causal directionality. Bayesian networks, while useful for modeling statistical dependencies, cannot inherently answer 'what if' intervention questions (e.g., 'what if I ban smoking?'). Causal graphs, on the other hand, aim to model these causal mechanisms, allowing for predictions under interventions and offering robustness to distributional shifts. Applying this to AI, the goal is to move beyond features that merely correlate with outcomes to understanding the causal pathways within a model. This allows for more robust explanations and interventions, enabling us to answer critical questions like 'why did the model make this decision?', 'how can we fix it?', and 'how can we guarantee the fix?' by demonstrating the removal of causal links.
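The conditioning-versus-intervention distinction can be made concrete with a small simulated structural causal model (all probabilities invented for illustration): observing smoking inflates apparent cancer risk through the genetic confounder, while do(smoke) severs the gene→smoking edge and gives the true causal effect.

```python
import numpy as np

rng = np.random.default_rng(1)
N = 200_000

def simulate(do_smoke=None):
    """Hypothetical structural causal model:
    gene -> smoking, gene -> cancer, smoking -> cancer."""
    gene = rng.random(N) < 0.3
    if do_smoke is None:
        # Observational regime: genetics influence smoking.
        smoke = rng.random(N) < np.where(gene, 0.8, 0.2)
    else:
        # Intervention do(smoke): the gene -> smoking edge is cut.
        smoke = np.full(N, do_smoke)
    p_cancer = 0.05 + 0.3 * smoke + 0.2 * gene
    cancer = rng.random(N) < p_cancer
    return gene, smoke, cancer

# Conditioning: P(cancer | smoke=1), inflated by the confounder.
_, s, c = simulate()
p_obs = c[s].mean()

# Intervening: P(cancer | do(smoke=1)), the causal quantity.
_, _, c_do = simulate(do_smoke=True)
p_do = c_do.mean()
# p_obs comes out noticeably larger than p_do: a Bayesian network fit
# to the observational data alone could not tell these two apart.
```

This is the same gap the lecture points at inside models: a feature that predicts the output (conditioning) is not the same as a feature that causes it (intervention), which is why circuit-level interventions matter.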
The future: Scaling circuits and formal verification
The lecture concludes by emphasizing that many of these challenges, particularly in mechanistic interpretability and causal inference within LLMs, represent the current frontier of AI research. Questions remain on how to effectively scale these circuit-tracing methods to understand the vast number of concepts and internal pathways within large models. Furthermore, connecting these interpretability insights to formal verification techniques, like the reachability analysis discussed earlier in the course, is a significant open problem. This integration is crucial for providing rigorous guarantees of safety and reliability in AI systems. The 'so what' is that understanding AI, especially at the scale of modern LLMs, requires developing sophisticated tools for mechanistic interpretability and causal inference, which are active areas of research with profound implications for the future of safe and trustworthy AI.
Explainability & Interpretability: Key Questions & Methods
Common Questions
The three key questions are: Why did the failure happen? What steps can be taken to mitigate this type of problem and understand how to modify the system or data? How can we guarantee to stakeholders that the problem is fixed and won't recur?
Topics
Mentioned in this video
Cart-pole: A system used as a concrete example to illustrate concepts of explainability and interpretability, similar to an inverted pendulum.
A verification technique used for the large system in project three, indicating a method for ensuring system reliability.
Hessian: The second-order derivative used in the Taylor-expansion approach for the medium system in project three, a route to improved performance.
A popular approach in neural network certification around 2020-2021, used for the small system in project three, indicating its popularity and application in the field.
Shapley values: A method from game theory for attributing contributions (as in group projects), applied here to quantify feature importance in trajectories and models.
Mentioned as the source of a practical tip on interpreting models by looking at worst-case performing samples.
Mentioned for his significant research on causal graphs, relevant to understanding causality beyond mere correlation.
The instructor of the CS 221M course on mechanistic interpretability, recommended for further study.
Mentioned as an example of a company where a chief engineer might work, highlighting the context of safety-critical systems and potential failures.
Mentioned as a company working on the frontier of LLM interpretability and developing models like sparse autoencoders.
A company mentioned as working on the frontier of LLM interpretability, particularly with sparse autoencoders and circuit tracing.