Grouped-query attention (GQA)
Concept
An efficient attention scheme for Transformer models in which groups of query heads share a single key/value head. Using fewer key and value heads than query heads shrinks the KV cache, which lowers memory use and speeds up inference.
Mentioned in 1 video
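To make the cache saving concrete, here is a minimal sketch that estimates KV cache size and shows how query heads could map onto shared key/value heads. The function names and the model configuration (32 layers, 32 query heads, head dimension 128, 4096 tokens, 2 bytes per value) are hypothetical, chosen only for illustration, and the contiguous grouping of query heads is one common convention rather than the only possible layout.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len,
                   batch_size=1, bytes_per_value=2):
    """Bytes needed to cache keys and values for every layer and position."""
    # The factor of 2 covers the separate key and value tensors.
    return (2 * num_layers * num_kv_heads * head_dim
            * seq_len * batch_size * bytes_per_value)


def kv_head_for_query_head(query_head, num_query_heads, num_kv_heads):
    """Index of the shared key/value head that a given query head reads from."""
    if num_query_heads % num_kv_heads != 0:
        raise ValueError("num_query_heads must be divisible by num_kv_heads")
    group_size = num_query_heads // num_kv_heads
    # Assumption: query heads are grouped contiguously.
    return query_head // group_size


# Hypothetical configuration, not taken from any specific model.
NUM_LAYERS, NUM_QUERY_HEADS, HEAD_DIM, SEQ_LEN = 32, 32, 128, 4096
NUM_KV_HEADS = 8

# Standard multi-head attention: one key/value head per query head.
mha_bytes = kv_cache_bytes(NUM_LAYERS, NUM_QUERY_HEADS, HEAD_DIM, SEQ_LEN)
# Grouped-query attention: 8 key/value heads shared by 32 query heads.
gqa_bytes = kv_cache_bytes(NUM_LAYERS, NUM_KV_HEADS, HEAD_DIM, SEQ_LEN)

print(mha_bytes // gqa_bytes)  # reduction factor = query heads / KV heads
print([kv_head_for_query_head(q, NUM_QUERY_HEADS, NUM_KV_HEADS) for q in range(8)])
```

Under these assumptions the cache shrinks by the ratio of query heads to key/value heads. Setting the number of key/value heads equal to the number of query heads recovers standard multi-head attention, and setting it to 1 gives multi-query attention.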