Sparse Autoencoders

Software / App

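A technique used in mechanistic interpretability to identify features within a language model. An autoencoder is trained to reconstruct MLP layer outputs through a hidden layer that is wider than its input by an expansion factor, with a sparsity penalty so that only a few hidden units are active for any given input.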
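
As a rough illustration of the idea, here is a minimal NumPy sketch of a sparse autoencoder's forward pass and loss. It is not taken from the video or from any particular library: the function names, the ReLU encoder, the L1 penalty, the `l1_coeff` value, and the random array standing in for MLP outputs are all assumptions chosen for clarity.

```python
import numpy as np

def init_sae(d_in, expansion_factor, seed=0):
    """Create weights for a hypothetical SAE; hidden width = d_in * expansion_factor."""
    rng = np.random.default_rng(seed)
    d_hidden = d_in * expansion_factor
    scale = 1.0 / np.sqrt(d_in)
    return {
        "W_enc": rng.normal(0.0, scale, size=(d_in, d_hidden)),
        "b_enc": np.zeros(d_hidden),
        "W_dec": rng.normal(0.0, scale, size=(d_hidden, d_in)),
        "b_dec": np.zeros(d_in),
    }

def sae_forward(params, x):
    """Encode activations into non-negative features, then reconstruct them."""
    features = np.maximum(0.0, x @ params["W_enc"] + params["b_enc"])  # ReLU
    recon = features @ params["W_dec"] + params["b_dec"]
    return features, recon

def sae_loss(params, x, l1_coeff=1e-3):
    """Reconstruction error plus an L1 penalty that pushes features toward sparsity."""
    features, recon = sae_forward(params, x)
    mse = np.mean(np.sum((recon - x) ** 2, axis=-1))
    l1 = np.mean(np.sum(np.abs(features), axis=-1))
    return mse + l1_coeff * l1

# Random data standing in for a batch of 8 MLP output vectors of width 16 (assumption).
x = np.random.default_rng(1).normal(size=(8, 16))
params = init_sae(d_in=16, expansion_factor=4)
features, recon = sae_forward(params, x)
loss = sae_loss(params, x)
```

In practice the weights would be trained by gradient descent on this loss over real model activations. Raising `l1_coeff` trades reconstruction accuracy for sparser features, and a larger expansion factor gives the autoencoder more candidate features to learn.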

Mentioned in 1 video