MLA (Multi-head Latent Attention)

Concept

An attention mechanism introduced by DeepSeek (in DeepSeek-V2) that compresses keys and values into a single low-rank latent vector, which is projected back up to full keys and values during inference. Because only the compact latent needs to be cached per token, MLA sharply reduces KV cache size while preserving representational richness.
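A minimal NumPy sketch of the compression idea: hidden states are down-projected to a small latent, only that latent is cached, and per-head keys and values are reconstructed by up-projection. All dimensions and weight names here are illustrative assumptions, not DeepSeek's actual configuration.

```python
import numpy as np

# Illustrative dims (assumptions, not DeepSeek's real sizes).
rng = np.random.default_rng(0)
d_model, d_latent, n_heads, d_head = 64, 16, 4, 16

# Learned projections: one shared down-projection to the latent,
# separate up-projections for keys and values.
W_down = rng.standard_normal((d_model, d_latent)) * 0.1
W_up_k = rng.standard_normal((d_latent, n_heads * d_head)) * 0.1
W_up_v = rng.standard_normal((d_latent, n_heads * d_head)) * 0.1

h = rng.standard_normal((10, d_model))   # hidden states for 10 tokens

c_kv = h @ W_down                        # (10, d_latent) -- only this is cached
k = (c_kv @ W_up_k).reshape(10, n_heads, d_head)  # reconstructed keys
v = (c_kv @ W_up_v).reshape(10, n_heads, d_head)  # reconstructed values

# Standard attention would cache 2 * n_heads * d_head floats per token;
# MLA caches only d_latent floats per token.
print(f"cached floats per token: {d_latent} vs {2 * n_heads * d_head}")
```

In a real model the up-projections can be folded into the query and output projections so the full keys and values are never materialized; this sketch materializes them only to keep the shapes visible.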

Mentioned in 1 video