SAEs (Sparse Autoencoders)

Concept

An interpretability technique in which an autoencoder is trained on a model's internal activations with a sparsity constraint, so that each learned feature captures a distinct concept cleanly. Discussed as a foundational element of the field, though with noted shortcomings compared to using raw activations on some tasks.
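
The idea above can be sketched as follows. This is a minimal, illustrative NumPy implementation, not drawn from the video: activations x are mapped to a wider, nonnegative code f via a ReLU encoder, reconstructed by a linear decoder, and trained with reconstruction loss plus an L1 penalty on f that encourages sparsity. All names and hyperparameters (dimensions, `l1_coeff`, learning rate) are assumptions chosen for the sketch.

```python
import numpy as np

# Minimal sparse autoencoder (SAE) sketch. Illustrative only:
# dims and hyperparameters are arbitrary assumptions.
rng = np.random.default_rng(0)
d, m = 8, 32          # activation dim, overcomplete feature dim (m > d)
l1_coeff = 1e-3       # strength of the sparsity penalty
lr = 0.1              # gradient-descent step size

W_enc = rng.normal(scale=0.1, size=(d, m))
b_enc = np.zeros(m)
W_dec = rng.normal(scale=0.1, size=(m, d))
b_dec = np.zeros(d)

def forward(x):
    # ReLU makes the code nonnegative; the L1 term pushes it toward sparse.
    f = np.maximum(x @ W_enc + b_enc, 0.0)
    x_hat = f @ W_dec + b_dec
    return f, x_hat

def loss(x):
    f, x_hat = forward(x)
    recon = np.mean((x - x_hat) ** 2)       # reconstruction error
    sparsity = l1_coeff * np.mean(np.abs(f))  # L1 sparsity penalty
    return recon + sparsity

def train_step(x):
    # One manual gradient-descent step on a batch of "activations".
    global W_enc, b_enc, W_dec, b_dec
    n = x.shape[0]
    f, x_hat = forward(x)
    g_xhat = 2.0 * (x_hat - x) / (n * d)     # d(recon)/d(x_hat)
    gW_dec = f.T @ g_xhat
    gb_dec = g_xhat.sum(0)
    g_f = g_xhat @ W_dec.T + l1_coeff * np.sign(f) / (n * m)
    g_pre = g_f * (f > 0)                    # ReLU gradient
    gW_enc = x.T @ g_pre
    gb_enc = g_pre.sum(0)
    W_enc -= lr * gW_enc; b_enc -= lr * gb_enc
    W_dec -= lr * gW_dec; b_dec -= lr * gb_dec

x = rng.normal(size=(256, d))  # stand-in for real model activations
before = loss(x)
for _ in range(300):
    train_step(x)
after = loss(x)
```

In practice the input would be activations recorded from a trained model rather than random noise, and the hope is that individual coordinates of f correspond to human-interpretable concepts.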

Mentioned in 1 video