Grouped-query attention (GQA)

Concept

An attention variant for Transformer models that shrinks the key-value (KV) cache: the full set of query heads is kept, but several query heads share a single key/value head. It sits between multi-head attention (one KV head per query head) and multi-query attention (one KV head for all query heads). The smaller KV cache cuts memory use and bandwidth during decoding, improving inference speed with little loss in quality.
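A minimal sketch of the idea in NumPy (not any particular library's implementation; shapes and head counts are illustrative assumptions): 8 query heads share 2 KV heads, so each KV head is broadcast to a group of 4 query heads, and the KV cache is 4x smaller than in standard multi-head attention.

```python
import numpy as np

def grouped_query_attention(q, k, v):
    """q: (n_q_heads, seq, d); k, v: (n_kv_heads, seq, d).
    Each KV head serves n_q_heads // n_kv_heads query heads."""
    n_q_heads, n_kv_heads, d = q.shape[0], k.shape[0], q.shape[-1]
    group = n_q_heads // n_kv_heads
    # Repeat each KV head so it is shared across its group of query heads;
    # only the small (n_kv_heads, seq, d) tensors would live in the KV cache.
    k = np.repeat(k, group, axis=0)
    v = np.repeat(v, group, axis=0)
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(d)
    scores -= scores.max(axis=-1, keepdims=True)  # softmax stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v

rng = np.random.default_rng(0)
seq, d = 5, 16
q = rng.standard_normal((8, seq, d))  # 8 query heads, all preserved
k = rng.standard_normal((2, seq, d))  # only 2 KV heads: 4x smaller KV cache
v = rng.standard_normal((2, seq, d))
out = grouped_query_attention(q, k, v)
print(out.shape)  # one output per query head: (8, 5, 16)
```

Setting the number of KV heads equal to the number of query heads recovers ordinary multi-head attention; setting it to 1 recovers multi-query attention.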
