Multi-query attention

Concept

An efficient attention variant for Transformer models in which all query heads share a single key-value head. This is the most aggressive form of key-value head sharing, and it significantly reduces KV-cache size and improves inference speed at larger batch sizes.
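
Below is a minimal sketch of what a multi-query attention layer could look like in PyTorch. The class name, dimensions, and projection layout are illustrative assumptions rather than any particular model's implementation: queries keep one projection per head, while keys and values are projected to a single head that is broadcast across every query head.

```python
# Minimal multi-query attention (MQA) sketch in PyTorch.
# Names, shapes, and hyperparameters are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiQueryAttention(nn.Module):
    def __init__(self, d_model: int, n_heads: int):
        super().__init__()
        assert d_model % n_heads == 0
        self.n_heads = n_heads
        self.d_head = d_model // n_heads
        # Queries keep one projection per head...
        self.q_proj = nn.Linear(d_model, d_model)
        # ...but keys and values are projected to a single shared head,
        # which is what shrinks the KV cache.
        self.k_proj = nn.Linear(d_model, self.d_head)
        self.v_proj = nn.Linear(d_model, self.d_head)
        self.out_proj = nn.Linear(d_model, d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, t, _ = x.shape
        # Queries: (batch, n_heads, seq, d_head)
        q = self.q_proj(x).view(b, t, self.n_heads, self.d_head).transpose(1, 2)
        # Keys/values: (batch, 1, seq, d_head) -- one KV head for all query heads
        k = self.k_proj(x).unsqueeze(1)
        v = self.v_proj(x).unsqueeze(1)
        # Scaled dot-product attention; the singleton KV head broadcasts
        # against the n_heads query heads.
        scores = q @ k.transpose(-2, -1) / self.d_head ** 0.5
        attn = F.softmax(scores, dim=-1)
        out = attn @ v  # (batch, n_heads, seq, d_head)
        out = out.transpose(1, 2).reshape(b, t, -1)
        return self.out_proj(out)

if __name__ == "__main__":
    x = torch.randn(2, 16, 256)
    mqa = MultiQueryAttention(d_model=256, n_heads=8)
    print(mqa(x).shape)  # torch.Size([2, 16, 256])
    # Per token, the KV cache holds 2 * d_head values instead of
    # 2 * d_model, i.e. n_heads times smaller than standard multi-head attention.
```

Standard multi-head attention would cache n_heads key and value vectors per token; here only one of each is cached, so the KV cache shrinks by roughly a factor of n_heads under these assumptions.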

Mentioned in 1 video