Multi-query attention
Concept · Mentioned in 1 video
The most aggressive of the efficient attention variants for Transformer models: all query heads share a single key-value head, rather than each head keeping its own keys and values. This shrinks the KV cache by a factor equal to the number of heads, which in turn improves inference speed at large batch sizes, where reading the KV cache dominates memory bandwidth.
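A minimal NumPy sketch of the idea (illustrative only; the function name, weight shapes, and dimensions below are assumptions, not from the source). Note that `Wk` and `Wv` each project to a single head, so a KV cache would store only `seq × d_head` values per layer instead of `seq × n_heads × d_head`:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def multi_query_attention(x, Wq, Wk, Wv, n_heads):
    """Multi-query attention: n_heads query heads share ONE key-value head.

    x:  (seq, d_model) input activations
    Wq: (d_model, n_heads * d_head) -- separate projection per query head
    Wk: (d_model, d_head)           -- single shared key head
    Wv: (d_model, d_head)           -- single shared value head
    """
    seq, _ = x.shape
    d_head = Wk.shape[1]
    q = (x @ Wq).reshape(seq, n_heads, d_head)   # (seq, H, d_head)
    k = x @ Wk                                   # (seq, d_head), shared by all heads
    v = x @ Wv                                   # (seq, d_head), shared by all heads
    # Every head attends against the same k; only q differs per head.
    scores = np.einsum('shd,td->hst', q, k) / np.sqrt(d_head)
    attn = softmax(scores, axis=-1)              # (H, seq, seq)
    out = np.einsum('hst,td->shd', attn, v)      # (seq, H, d_head)
    return out.reshape(seq, n_heads * d_head)
```

In standard multi-head attention, `Wk` and `Wv` would instead have shape `(d_model, n_heads * d_head)`; grouped-query attention sits in between, sharing each K/V head among a subset of query heads.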