Multi-query attention
Concept
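An efficient attention scheme for Transformer models in which all query heads share a single key-value head. It is the most aggressive form of key-value sharing: relative to standard multi-head attention, it shrinks the KV cache by a factor equal to the number of heads, which improves inference speed, particularly at larger batch sizes.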
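To make the single key-value head concrete, here is a minimal NumPy sketch of multi-query attention together with a rough KV cache size estimate. It is an illustration only: the function names, the weight names (w_q, w_k, w_v), the causal mask, and the single-sequence (no batch dimension) setup are all assumptions, not something taken from the source.

```python
import numpy as np


def softmax(x, axis=-1):
    # Numerically stable softmax; masked entries (-inf) become exactly 0.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def multi_query_attention(x, w_q, w_k, w_v, n_heads):
    """Causal multi-query attention for one sequence (hypothetical sketch).

    x:   (seq, d_model)
    w_q: (d_model, n_heads * d_head)  one query projection per head
    w_k: (d_model, d_head)            a single key head shared by all query heads
    w_v: (d_model, d_head)            a single value head shared by all query heads
    """
    seq, _ = x.shape
    d_head = w_k.shape[1]

    # Queries are split into heads: (n_heads, seq, d_head).
    q = (x @ w_q).reshape(seq, n_heads, d_head).transpose(1, 0, 2)
    # Keys and values have no head dimension: (seq, d_head).
    k = x @ w_k
    v = x @ w_v

    # Every query head attends over the same keys via broadcasting.
    scores = q @ k.T / np.sqrt(d_head)                    # (n_heads, seq, seq)
    future = np.triu(np.ones((seq, seq), dtype=bool), k=1)
    scores = np.where(future, -np.inf, scores)            # causal mask

    out = softmax(scores) @ v                             # (n_heads, seq, d_head)
    return out.transpose(1, 0, 2).reshape(seq, n_heads * d_head)


def kv_cache_bytes(n_layers, n_kv_heads, d_head, seq_len, batch, bytes_per_elem=2):
    # Factor 2 accounts for storing both keys and values.
    return 2 * n_layers * n_kv_heads * d_head * seq_len * batch * bytes_per_elem
```

For a hypothetical model with 32 layers, 32 heads of dimension 128, a 4096-token context, batch size 8, and 2-byte elements, `kv_cache_bytes` gives 16 GiB with 32 key-value heads and 512 MiB with one, a 32x reduction. Only the cache shrinks; the number of query heads, and therefore the output width, is unchanged.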