S
SAEs (Sparse Autoencoders)
Concept · Mentioned in 1 video
An interpretability technique that trains an autoencoder with a sparsity penalty on its hidden activations, aiming to decompose a model's internal activations into cleanly interpretable concepts. Discussed as a foundational element of interpretability work, though with noted shortcomings compared to working with raw activations on some tasks.
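A minimal sketch of the core idea, using NumPy and toy dimensions chosen for illustration (the layer sizes, L1 coefficient, and function names are assumptions, not from any specific implementation): an overcomplete hidden layer with a ReLU, trained against reconstruction error plus an L1 penalty that pushes most hidden activations to zero.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy sizes: the hidden layer is wider than the input
# ("overcomplete"), so the sparsity penalty is what makes features distinct.
d_model, d_hidden = 8, 32

W_enc = rng.normal(scale=0.1, size=(d_model, d_hidden))
b_enc = np.zeros(d_hidden)
W_dec = rng.normal(scale=0.1, size=(d_hidden, d_model))
b_dec = np.zeros(d_model)

def encode(x):
    # ReLU keeps feature activations non-negative; together with the
    # L1 penalty below, training drives most of them to exactly zero.
    return np.maximum(0.0, x @ W_enc + b_enc)

def decode(f):
    return f @ W_dec + b_dec

def sae_loss(x, l1_coeff=1e-3):
    f = encode(x)
    x_hat = decode(f)
    recon = np.mean((x - x_hat) ** 2)   # reconstruction error
    sparsity = np.mean(np.abs(f))       # L1 penalty: prefer few active features
    return recon + l1_coeff * sparsity

# A toy batch standing in for model activations.
x = rng.normal(size=(4, d_model))
loss = sae_loss(x)
```

In practice the weights are trained by gradient descent on `sae_loss` over large batches of real model activations; the sketch above only shows the forward pass and objective.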
