windowed attention

Concept · Mentioned in 1 video

A workaround for long-context limitations in transformers in which each token attends only to a fixed-size window of recent tokens rather than the full sequence. Tokens outside the window are effectively discarded, which reduces compute and memory but degrades performance on tasks that depend on older context.
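The masking pattern can be sketched as follows; this is a minimal NumPy illustration (the function name and window size are chosen for the example, not taken from any particular implementation):

```python
import numpy as np

def sliding_window_mask(seq_len: int, window: int) -> np.ndarray:
    """Boolean mask where entry [i, j] is True if token i may attend to token j.

    Causal windowed attention: token i sees only tokens j with
    i - window < j <= i, i.e. itself and the (window - 1) preceding tokens.
    """
    idx = np.arange(seq_len)
    causal = idx[None, :] <= idx[:, None]          # no attending to the future
    recent = idx[:, None] - idx[None, :] < window  # only the last `window` tokens
    return causal & recent

mask = sliding_window_mask(seq_len=6, window=3)
# Token 5 can still see token 3, but token 2 has fallen out of its window.
```

In full attention every entry below the diagonal would be True; here, positions older than the window are masked out, which is exactly the discarded context responsible for the degraded long-range recall described above.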