SAEs (Sparse Autoencoders)

Concept

An interpretability technique in which an autoencoder is trained on a model's internal activations with a sparsity constraint, so that each learned feature captures a distinct concept cleanly. Discussed as a foundational element of the field, though with noted shortcomings compared to using raw activations on some tasks.
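
The idea above can be sketched as follows. This is a minimal, illustrative NumPy implementation, not drawn from the video: activations x are mapped to a wider, nonnegative code f via a ReLU encoder, reconstructed by a linear decoder, and trained with reconstruction loss plus an L1 penalty on f that encourages sparsity. All names and hyperparameters (dimensions, `l1_coeff`, learning rate) are assumptions chosen for the sketch.

```python
import numpy as np

# Minimal sparse autoencoder (SAE) sketch. Illustrative only:
# dims and hyperparameters are arbitrary assumptions.
rng = np.random.default_rng(0)
d, m = 8, 32          # activation dim, overcomplete feature dim (m > d)
l1_coeff = 1e-3       # strength of the sparsity penalty
lr = 0.1              # gradient-descent step size

W_enc = rng.normal(scale=0.1, size=(d, m))
b_enc = np.zeros(m)
W_dec = rng.normal(scale=0.1, size=(m, d))
b_dec = np.zeros(d)

def forward(x):
    # ReLU makes the code nonnegative; the L1 term pushes it toward sparse.
    f = np.maximum(x @ W_enc + b_enc, 0.0)
    x_hat = f @ W_dec + b_dec
    return f, x_hat

def loss(x):
    f, x_hat = forward(x)
    recon = np.mean((x - x_hat) ** 2)       # reconstruction error
    sparsity = l1_coeff * np.mean(np.abs(f))  # L1 sparsity penalty
    return recon + sparsity

def train_step(x):
    # One manual gradient-descent step on a batch of "activations".
    global W_enc, b_enc, W_dec, b_dec
    n = x.shape[0]
    f, x_hat = forward(x)
    g_xhat = 2.0 * (x_hat - x) / (n * d)     # d(recon)/d(x_hat)
    gW_dec = f.T @ g_xhat
    gb_dec = g_xhat.sum(0)
    g_f = g_xhat @ W_dec.T + l1_coeff * np.sign(f) / (n * m)
    g_pre = g_f * (f > 0)                    # ReLU gradient
    gW_enc = x.T @ g_pre
    gb_enc = g_pre.sum(0)
    W_enc -= lr * gW_enc; b_enc -= lr * gb_enc
    W_dec -= lr * gW_dec; b_dec -= lr * gb_dec

x = rng.normal(size=(256, d))  # stand-in for real model activations
before = loss(x)
for _ in range(300):
    train_step(x)
after = loss(x)
```

In practice the input would be activations recorded from a trained model rather than random noise, and the hope is that individual coordinates of f correspond to human-interpretable concepts.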

Mentioned in 1 video