Multi-query attention

The most aggressive of the efficient attention variants for Transformer models: all query heads share a single key-value head, rather than each head having its own. This shrinks the KV cache by a factor of the number of heads, which reduces memory traffic during decoding and improves inference throughput, especially at larger batch sizes.
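
The sharing described above can be sketched in a few lines of NumPy. This is a minimal illustration, not a production implementation: the shapes, the helper `softmax`, and the single-sequence (no batch, no mask) setup are simplifying assumptions. The key point is that `wk` and `wv` project to one head's width, so every query head attends over the same keys and values.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def multi_query_attention(x, wq, wk, wv, num_heads):
    """Multi-query attention: num_heads query heads, ONE shared K/V head.

    x:  (seq, d_model)
    wq: (d_model, num_heads * d_head)  -- per-head query projections
    wk: (d_model, d_head)              -- single shared key projection
    wv: (d_model, d_head)              -- single shared value projection
    """
    seq, _ = x.shape
    d_head = wk.shape[1]
    q = (x @ wq).reshape(seq, num_heads, d_head)  # (seq, heads, d_head)
    k = x @ wk                                    # (seq, d_head): one head
    v = x @ wv                                    # (seq, d_head): one head
    # Only k and v need caching at inference time, so the KV cache is
    # 1/num_heads the size of standard multi-head attention.
    scores = np.einsum('shd,td->sht', q, k) / np.sqrt(d_head)
    attn = softmax(scores, axis=-1)
    out = np.einsum('sht,td->shd', attn, v)
    return out.reshape(seq, num_heads * d_head)

rng = np.random.default_rng(0)
seq, d_model, heads, d_head = 4, 16, 4, 4
x = rng.standard_normal((seq, d_model))
out = multi_query_attention(
    x,
    rng.standard_normal((d_model, heads * d_head)),
    rng.standard_normal((d_model, d_head)),
    rng.standard_normal((d_model, d_head)),
    heads,
)
print(out.shape)  # (4, 16)
```

Grouped-query attention (GQA) is the middle ground: it uses a small number of shared K/V heads instead of exactly one, trading a slightly larger cache for quality closer to full multi-head attention.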