Sparse Autoencoders

Software / App | Mentioned in 1 video

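A technique used in mechanistic interpretability to identify features within a language model. A small autoencoder is trained to reconstruct MLP layer outputs through a hidden layer that is wider than its input by an expansion factor and is penalized for dense activations, so that only a few hidden units (the candidate features) are active for any given input.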
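As a rough illustration of how the sparsity term and the expansion factor enter the picture, here is a minimal NumPy sketch of the forward pass and loss only. The ReLU encoder, the L1 penalty, and the names `init_sae`, `sae_forward`, `sae_loss`, and `sparsity_coeff` are all assumptions made for this example, and random vectors stand in for real MLP layer outputs; actual implementations differ in the details.

```python
import numpy as np


def init_sae(d_in, expansion_factor, seed=0):
    """Create weights for a hidden layer that is expansion_factor times wider than the input."""
    rng = np.random.default_rng(seed)
    d_hidden = d_in * expansion_factor
    return {
        "W_enc": rng.normal(0.0, 0.1, size=(d_in, d_hidden)),
        "b_enc": np.zeros(d_hidden),
        "W_dec": rng.normal(0.0, 0.1, size=(d_hidden, d_in)),
        "b_dec": np.zeros(d_in),
    }


def sae_forward(params, x):
    """Encode x into non-negative feature activations, then decode back to the input space."""
    # ReLU encoder (an assumption of this sketch): activations are zero or positive.
    features = np.maximum(0.0, x @ params["W_enc"] + params["b_enc"])
    reconstruction = features @ params["W_dec"] + params["b_dec"]
    return features, reconstruction


def sae_loss(x, features, reconstruction, sparsity_coeff):
    """Reconstruction error plus an L1 penalty that pushes most activations toward zero."""
    recon_err = np.mean(np.sum((x - reconstruction) ** 2, axis=-1))
    l1_penalty = np.mean(np.sum(np.abs(features), axis=-1))
    return recon_err + sparsity_coeff * l1_penalty


# Random stand-in for a batch of 4 MLP output vectors of width 8 (not real model data).
mlp_out = np.random.default_rng(1).normal(size=(4, 8))
params = init_sae(d_in=8, expansion_factor=4)
features, recon = sae_forward(params, mlp_out)
loss = sae_loss(mlp_out, features, recon, sparsity_coeff=1e-3)
```

With an input width of 8 and an expansion factor of 4, `features` has 32 columns per example. Training would adjust the weights to lower `loss`, trading reconstruction accuracy against how many features fire.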