SAPE

Concept

A technique in interpretability that decomposes a model's representation into interpretable primitives, like concepts firing for specific entities.

Mentioned in 1 video