windowed attention

Concept

A workaround for the quadratic cost of full attention over long contexts: each token attends only to a fixed-size window of recent tokens instead of the entire sequence. Information from older tokens can still reach later positions, but only indirectly through stacked layers, which tends to degrade recall of distant context.
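As a rough illustration of the idea (not tied to any particular model), here is a minimal NumPy sketch of single-head attention with a causal sliding-window mask; the function names and the window size are illustrative choices, not an established API.

```python
import numpy as np

def sliding_window_mask(seq_len, window):
    # True where query position i may attend to key position j:
    # causal (j <= i) and within the last `window` positions.
    i = np.arange(seq_len)[:, None]
    j = np.arange(seq_len)[None, :]
    return (j <= i) & (j > i - window)

def windowed_attention(q, k, v, window):
    # q, k, v: arrays of shape (seq_len, d); single head, no batching.
    d = q.shape[-1]
    scores = q @ k.T / np.sqrt(d)
    mask = sliding_window_mask(q.shape[0], window)
    scores = np.where(mask, scores, -np.inf)  # block out-of-window keys
    # Softmax over the allowed keys; the diagonal is always allowed,
    # so every row has at least one finite score.
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v
```

With window = 2, token 4 attends only to tokens 3 and 4; token 0 can never be attended to by token 4 directly, which is the source of the degraded performance on older context.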

Mentioned in 1 video