YaRN scaling

Concept

YaRN (Yet another RoPE extensioN) is a technique applied during pre-training of GPT OSS to extend the context window to 131,072 tokens. It works by rescaling the frequencies of the rotary positional embeddings (RoPE), letting the model attend over sequences far longer than the base RoPE configuration was built for.
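The core idea can be sketched as follows. This is a minimal illustration of YaRN-style ("NTK-by-parts") frequency rescaling, not GPT OSS's actual implementation; all parameter values (dimension, base, scale factor, the beta thresholds) are illustrative assumptions. High-frequency RoPE dimensions, which complete many rotations within the original context window, are left unchanged, while low-frequency dimensions are interpolated by the scale factor, with a linear ramp blending the two regimes.

```python
import math

def yarn_inv_freq(dim=64, base=10000.0, scale=4.0,
                  orig_max_pos=4096, beta_fast=32.0, beta_slow=1.0):
    """YaRN-style rescaling of RoPE inverse frequencies (illustrative sketch).

    High-frequency dims (many rotations over the original window) keep their
    original frequency; low-frequency dims are divided by `scale`; a linear
    ramp interpolates between the two regimes.
    """
    half = dim // 2
    inv_freq = [base ** (-2.0 * i / dim) for i in range(half)]

    def correction_dim(num_rotations):
        # Dimension index whose wavelength completes `num_rotations`
        # full turns over the original context window.
        return (dim * math.log(orig_max_pos / (num_rotations * 2 * math.pi))
                / (2 * math.log(base)))

    low = max(math.floor(correction_dim(beta_fast)), 0)
    high = min(math.ceil(correction_dim(beta_slow)), half - 1)

    scaled = []
    for i, f in enumerate(inv_freq):
        # Ramp t: 0 -> keep the original frequency (extrapolation),
        #         1 -> full interpolation (divide frequency by `scale`).
        t = min(max((i - low) / max(high - low, 1), 0.0), 1.0)
        scaled.append(f * (1 - t) + (f / scale) * t)
    return scaled
```

With these illustrative settings, the first (fastest) dimensions are untouched while the slowest dimensions are stretched by the full scale factor, which is what allows positions beyond the original training length to stay within the frequency range the model saw during training.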

Mentioned in 1 video