Grouped-query attention (GQA)

Concept

An attention variant for Transformer models that shrinks the key-value (KV) cache: the full set of query heads is kept, but several query heads share a single key/value head. It sits between multi-head attention (one KV head per query head) and multi-query attention (one KV head for all query heads). The smaller KV cache cuts memory use and bandwidth during decoding, improving inference speed with little loss in quality.
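A minimal sketch of the idea in NumPy (not any particular library's implementation; shapes and head counts are illustrative assumptions): 8 query heads share 2 KV heads, so each KV head is broadcast to a group of 4 query heads, and the KV cache is 4x smaller than in standard multi-head attention.

```python
import numpy as np

def grouped_query_attention(q, k, v):
    """q: (n_q_heads, seq, d); k, v: (n_kv_heads, seq, d).
    Each KV head serves n_q_heads // n_kv_heads query heads."""
    n_q_heads, n_kv_heads, d = q.shape[0], k.shape[0], q.shape[-1]
    group = n_q_heads // n_kv_heads
    # Repeat each KV head so it is shared across its group of query heads;
    # only the small (n_kv_heads, seq, d) tensors would live in the KV cache.
    k = np.repeat(k, group, axis=0)
    v = np.repeat(v, group, axis=0)
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(d)
    scores -= scores.max(axis=-1, keepdims=True)  # softmax stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v

rng = np.random.default_rng(0)
seq, d = 5, 16
q = rng.standard_normal((8, seq, d))  # 8 query heads, all preserved
k = rng.standard_normal((2, seq, d))  # only 2 KV heads: 4x smaller KV cache
v = rng.standard_normal((2, seq, d))
out = grouped_query_attention(q, k, v)
print(out.shape)  # one output per query head: (8, 5, 16)
```

Setting the number of KV heads equal to the number of query heads recovers ordinary multi-head attention; setting it to 1 recovers multi-query attention.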
