Multi-query attention

Concept

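The most aggressive of the efficient attention schemes used in Transformer models: all query heads share a single key-value head. Compared with standard multi-head attention, this shrinks the KV cache by a factor equal to the number of query heads, which speeds up inference, particularly at larger batch sizes.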
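To make the KV cache saving concrete, here is a rough back-of-the-envelope sketch in Python. The function name `kv_cache_bytes` and the model dimensions (32 layers, 32 query heads, head dimension 128, 4096-token context, batch of 8, 2-byte fp16 values) are illustrative assumptions, not figures from any particular model.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len, batch_size,
                   bytes_per_elem=2):
    """Approximate KV cache size in bytes (hypothetical helper).

    The leading factor of 2 counts one key tensor plus one value tensor
    per layer. bytes_per_elem=2 assumes fp16/bf16 storage.
    """
    return (2 * num_layers * num_kv_heads * head_dim
            * seq_len * batch_size * bytes_per_elem)


# Assumed, illustrative model shape.
LAYERS, QUERY_HEADS, HEAD_DIM = 32, 32, 128
SEQ_LEN, BATCH = 4096, 8

# Multi-head attention: one KV head per query head.
mha = kv_cache_bytes(LAYERS, QUERY_HEADS, HEAD_DIM, SEQ_LEN, BATCH)
# Multi-query attention: a single KV head shared by all query heads.
mqa = kv_cache_bytes(LAYERS, 1, HEAD_DIM, SEQ_LEN, BATCH)

print(f"MHA cache: {mha / 2**30:.1f} GiB")   # 16.0 GiB
print(f"MQA cache: {mqa / 2**30:.1f} GiB")   # 0.5 GiB
print(f"Reduction factor: {mha // mqa}x")    # 32x, the number of query heads
```

Under these assumptions the cache grows linearly with batch size in both cases, so the saving matters most when many sequences are served at once.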
