Grouped-query attention (GQA)
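An efficient attention scheme used in Transformer models that shrinks the KV cache by sharing each key/value head across a group of query heads: the number of query heads stays the same, but there are fewer key and value heads. The smaller cache means less memory to store and read at each decoding step, which improves inference speed.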
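To make the cache saving concrete, here is a minimal sketch of the arithmetic. The layer count, head counts, head dimension, sequence length, and 2-byte element size are all assumptions chosen for illustration, not figures from any particular model, and the helper names `kv_cache_bytes` and `kv_head_for_query` are hypothetical.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len,
                   batch_size=1, bytes_per_elem=2):
    """Bytes needed to cache keys and values for every layer and position."""
    # Factor of 2: one cached tensor for keys, one for values.
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * batch_size * bytes_per_elem


def kv_head_for_query(query_head, n_query_heads, n_kv_heads):
    """Index of the key/value head shared by the given query head."""
    if n_query_heads % n_kv_heads != 0:
        raise ValueError("n_query_heads must be divisible by n_kv_heads")
    group_size = n_query_heads // n_kv_heads  # query heads per KV head
    return query_head // group_size


# Hypothetical configuration, for illustration only.
N_LAYERS, N_QUERY_HEADS, N_KV_HEADS = 32, 32, 8
HEAD_DIM, SEQ_LEN = 128, 4096

# Standard multi-head attention: one KV head per query head.
mha_bytes = kv_cache_bytes(N_LAYERS, N_QUERY_HEADS, HEAD_DIM, SEQ_LEN)
# Grouped-query attention: 8 KV heads shared by 32 query heads.
gqa_bytes = kv_cache_bytes(N_LAYERS, N_KV_HEADS, HEAD_DIM, SEQ_LEN)

print(mha_bytes // 2**20, "MiB vs", gqa_bytes // 2**20, "MiB")
# prints: 2048 MiB vs 512 MiB

# Query heads 0-3 read KV head 0, heads 4-7 read KV head 1, and so on.
mapping = [kv_head_for_query(q, N_QUERY_HEADS, N_KV_HEADS) for q in range(N_QUERY_HEADS)]
```

Under these assumptions the cache shrinks by the ratio of query heads to key/value heads (here 4x), while the query side of the computation is unchanged. Setting the number of key/value heads equal to the number of query heads recovers ordinary multi-head attention, and setting it to 1 gives multi-query attention.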