Grouped-query attention (GQA)

An efficient attention scheme used in Transformer models in which the query heads are divided into groups that each share a single key and value head. Keeping the full set of query heads while using fewer key/value heads shrinks the KV cache and speeds up inference, with quality close to standard multi-head attention. GQA interpolates between multi-head attention (one KV head per query head) and multi-query attention (one KV head shared by all query heads).
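A minimal NumPy sketch of the idea, under assumed shapes (single sequence, no causal mask): the key/value projections produce only `n_kv_heads` heads, and each is broadcast to its group of query heads. All names here are illustrative, not from any particular library.

```python
import numpy as np

def grouped_query_attention(x, wq, wk, wv, n_q_heads, n_kv_heads):
    """Grouped-query attention sketch (single sequence, no masking).

    x:      (seq_len, d_model) input
    wq:     (d_model, n_q_heads * d_head) query projection
    wk, wv: (d_model, n_kv_heads * d_head) key/value projections

    Each group of n_q_heads // n_kv_heads query heads shares one
    key/value head, so a KV cache would store n_kv_heads heads
    instead of n_q_heads.
    """
    seq_len, _ = x.shape
    d_head = wq.shape[1] // n_q_heads
    group = n_q_heads // n_kv_heads  # query heads per KV head

    q = (x @ wq).reshape(seq_len, n_q_heads, d_head)
    k = (x @ wk).reshape(seq_len, n_kv_heads, d_head)
    v = (x @ wv).reshape(seq_len, n_kv_heads, d_head)

    # Broadcast each KV head across its group of query heads.
    k = np.repeat(k, group, axis=1)  # (seq_len, n_q_heads, d_head)
    v = np.repeat(v, group, axis=1)

    # Standard scaled dot-product attention, per head.
    scores = np.einsum("qhd,khd->hqk", q, k) / np.sqrt(d_head)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    out = np.einsum("hqk,khd->qhd", weights, v)
    return out.reshape(seq_len, n_q_heads * d_head)
```

With `n_kv_heads == n_q_heads` this reduces to ordinary multi-head attention; with `n_kv_heads == 1` it becomes multi-query attention.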