SAEs (Sparse Autoencoders)

Concept. Mentioned in 1 video.

An interpretability technique that trains an autoencoder under a sparsity constraint, so that only a few of its hidden features are active for any given input, with the aim of having each feature capture a single concept cleanly. Discussed as a foundational element, but with noted shortcomings compared with raw activations in some tasks.
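
As a rough illustration of the idea, the sketch below shows one common way a sparse autoencoder is set up: a ReLU encoder, a linear decoder, and a loss that adds an L1 penalty on the features to the reconstruction error. It is not taken from the video; the function names, layer sizes, and the `l1_coeff` value are all assumptions chosen for illustration, and only the forward pass and the objective are shown, not training.

```python
import random


def matvec(W, x):
    # W is a list of rows; returns the matrix-vector product W @ x.
    return [sum(w * v for w, v in zip(row, x)) for row in W]


def sae_forward(x, W_enc, b_enc, W_dec, b_dec):
    """Encode an activation vector x into non-negative features, then reconstruct x."""
    pre = [p + b for p, b in zip(matvec(W_enc, x), b_enc)]
    features = [max(0.0, p) for p in pre]  # ReLU keeps features non-negative
    recon = [r + b for r, b in zip(matvec(W_dec, features), b_dec)]
    return features, recon


def sae_loss(x, features, recon, l1_coeff=1e-3):
    """Mean squared reconstruction error plus an L1 sparsity penalty on the features."""
    mse = sum((a - b) ** 2 for a, b in zip(x, recon)) / len(x)
    l1 = sum(abs(f) for f in features)
    return mse + l1_coeff * l1  # l1_coeff is a hypothetical value, tuned in practice


def init_matrix(rows, cols, rng, scale=0.1):
    # Small random weights; a real setup would use a proper initialisation scheme.
    return [[rng.uniform(-scale, scale) for _ in range(cols)] for _ in range(rows)]


rng = random.Random(0)
d_model, d_hidden = 4, 16  # hypothetical sizes: more features than input dimensions
W_enc = init_matrix(d_hidden, d_model, rng)
W_dec = init_matrix(d_model, d_hidden, rng)
b_enc = [0.0] * d_hidden
b_dec = [0.0] * d_model

x = [0.5, -1.0, 0.25, 2.0]  # stand-in for one activation vector from a model
features, recon = sae_forward(x, W_enc, b_enc, W_dec, b_dec)
loss = sae_loss(x, features, recon)
```

Training would adjust the weights to lower this loss. The L1 term is what pushes most features toward zero for any given input, and the comparison with raw activations mentioned above amounts to using `x` directly instead of `features`.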