SAEs (Sparse Autoencoders)
Concept
An interpretability technique in which an autoencoder is trained on a model's internal activations with a sparsity constraint, the aim being that each learned feature captures a single concept cleanly. Discussed as a foundational element, but with noted shortcomings compared to working with raw activations on some tasks.
Mentioned in 1 video
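
The core mechanics can be sketched as a minimal example: an overcomplete ReLU encoder and a linear decoder, trained with a reconstruction loss plus an L1 penalty that encourages sparse feature activations. All dimensions, the penalty weight, and the random data below are illustrative assumptions, not the setup from the video.

```python
import numpy as np

rng = np.random.default_rng(0)

d_model = 8      # width of the activations being decomposed (assumed)
d_hidden = 32    # overcomplete dictionary of candidate features (assumed)
l1_coeff = 1e-3  # sparsity penalty weight (assumed)

W_enc = rng.normal(0, 0.1, (d_model, d_hidden))
b_enc = np.zeros(d_hidden)
W_dec = rng.normal(0, 0.1, (d_hidden, d_model))
b_dec = np.zeros(d_model)

def encode(x):
    # ReLU keeps feature activations non-negative; combined with the
    # L1 penalty below, most features end up exactly zero (sparse)
    return np.maximum(0.0, x @ W_enc + b_enc)

def decode(f):
    # Linear decoder reconstructs the original activation from features
    return f @ W_dec + b_dec

def loss(x):
    f = encode(x)
    recon = decode(f)
    mse = np.mean((recon - x) ** 2)            # reconstruction term
    sparsity = np.abs(f).sum(axis=-1).mean()   # L1 term over features
    return mse + l1_coeff * sparsity

# Stand-in batch of "model activations"; in practice these would be
# collected from a real network's residual stream or MLP layer
x = rng.normal(size=(4, d_model))
print(float(loss(x)))
```

Training would minimize this loss by gradient descent; afterwards, individual hidden features can be inspected as candidate human-interpretable concepts.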
