SAPE
Concept
A technique in interpretability that decomposes a model's representation into interpretable primitives, like concepts firing for specific entities.
Mentioned in 1 video
A technique in interpretability that decomposes a model's representation into interpretable primitives, like concepts firing for specific entities.