Subliminal Learning
Concept
A phenomenon in which a model acquires hidden biases even though it was never explicitly trained on biased data. It has been observed when models are trained on distilled data, that is, data generated by another model. It is discussed as a worrying area where interpretability is needed.
Mentioned in 1 video
