Subliminal Learning

Concept

A phenomenon in which a model acquires hidden biases even though it was never explicitly trained on biased data. It has been observed when a model is trained on distilled data, that is, data generated by another model. Discussed as a worrying area where interpretability is needed.

Mentioned in 1 video

Videos Mentioning Subliminal Learning

Goodfire AI’s Bet: Interpretability as the Next Frontier of Model Design — Myra Deng & Mark Bissell

Latent Space
