This article examines the evolution of KVCache in large language model inference, explaining how GQA, MLA, CSA/HCA, Gated DeltaNet, and related approaches compress GPU memory usage, improve throughput, and ease deployment pressure in long-context and multi-turn agent workloads. Keywords: KVCache, long context, inference optimization.
Technical Specifications Snapshot
| Parameter | Details |
|---|---|
| Domain | LLM inference architecture / KVCache optimization |
| Core Target | Decoder-only Transformer |
| Key Protocols / Mechanisms | Attention, GQA, MLA, Sparse Attention, Linear Attention |
| Reference Models | Qwen3-72B, DeepSeek-V3, Qwen3.5-397B, DeepSeek-V4 |
| Quantization Methods | FP8, FP4, vector quantization |
| Data Characteristics | 1M-token long context, per-request KVCache comparison |
| Source Popularity | The original article is a long-form technical analysis with multiple diagrams and model comparisons |
| Core Dependencies | CUDA kernels, inference frameworks, memory bandwidth, sparse scheduling |
KVCache has become the primary bottleneck in long-context inference
In decoder-only models, the component that continuously grows is the attention K/V cache. As multi-turn conversations, agent call chains, and long-document question answering become standard workloads, KVCache quickly shifts from an accelerator to the main memory burden.
This is not a single-point optimization problem. At its core, it is an ongoing tradeoff across compute, storage, and communication resources. Every inference optimization strategy ultimately answers the same question: how do you serve more tokens and more concurrent requests per GPU without materially degrading model quality?
AI Visual Insight: This header image highlights the article’s central theme: KVCache memory drops from hundreds of gigabytes to single-digit gigabytes. The implicit focus is architectural cost reduction for long-context inference, not merely operator-level micro-optimization.
The core formula for estimating KVCache size is straightforward
# KVCache estimation formula for traditional attention
# Illustrative example values for a hypothetical 70B-class dense model
num_layers, num_kv_heads, head_dim = 80, 64, 128
seq_len, batch_size, dtype_size = 32_768, 1, 2  # 32K context, BF16 = 2 bytes per element
# The factor of 2 accounts for one copy each of K and V
kv_bytes = 2 * num_layers * num_kv_heads * head_dim * seq_len * batch_size * dtype_size
print(f"KVCache bytes = {kv_bytes:,}")  # Print total memory usage
This formula shows that layer count, KV head count, head dimension, sequence length, batch size, and numeric precision each scale KVCache linearly.
Mainstream evolution paths all reduce the cache cost per token
The first path compresses the number of KV heads: MHA → MQA → GQA. MHA keeps independent K/V pairs for every query head. It offers strong expressiveness, but it also produces the largest cache. MQA makes all query heads share a single set of K/V tensors, which sharply reduces memory usage but can more easily hurt quality.
GQA is a more balanced engineering compromise. Multiple query heads share a smaller number of KV heads. For example, 64 query heads paired with 8 KV heads can reduce KVCache to one-eighth of the original size while preserving strong model quality.
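As a quick sanity check on that one-eighth figure, the sketch below compares per-token cache elements in one layer for an MHA layout against a GQA layout, using the same hypothetical head counts as the example above (64 query heads, 8 KV heads, head dimension 128):

```python
# Hypothetical head configuration; only the ratio matters here
num_q_heads, num_kv_heads, head_dim = 64, 8, 128
mha_elems_per_token = 2 * num_q_heads * head_dim   # MHA: K and V cached for every query head
gqa_elems_per_token = 2 * num_kv_heads * head_dim  # GQA: K and V cached only per KV head
print(gqa_elems_per_token / mha_elems_per_token)   # 0.125, i.e. one-eighth of the MHA cache
```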
AI Visual Insight: This figure shows the head-sharing relationships in MHA, MQA, and GQA. The key technical point is that the mapping between query heads and KV heads evolves from one-to-one into fully shared or group-shared layouts, directly changing the number of cached elements per token in each layer.
MLA replaces full K/V tensors with low-dimensional latent variables
If GQA compresses the number of heads, MLA compresses the representation dimension. Instead of caching full K/V tensors, it jointly maps them into a low-dimensional latent vector and decompresses them during computation.
Take DeepSeek-V3 as an example. A traditional design stores 14,336 dimensions per token per layer, while MLA stores only 576 dimensions. That is roughly a 96% reduction per layer. In other words, the architecture solves the cache problem upfront instead of relying on post hoc quantization as a remedy.
hidden_size = 7168
mha_kv_dim = hidden_size * 2 # Total K/V dimensions per token in traditional MHA
mla_kv_dim = 512 + 64 # MLA caches only the latent dimensions and RoPE-related dimensions
compress_ratio = 1 - mla_kv_dim / mha_kv_dim
print(round(compress_ratio, 4)) # Print compression ratio
This example shows that MLA’s main benefit comes from changing what gets cached, not simply reducing the number of bits.
Sparse attention starts by reducing the number of tokens that must be retained
Compressing head count and representation dimension only shrinks the cache footprint of each token. Sparse attention asks a more fundamental question: do all historical tokens deserve to be retained? Clearly, they do not.
SWA provides the simplest engineering answer. It keeps only a fixed recent window, for example the most recent 128 tokens, and discards everything outside that window. This gives stable memory usage and fast execution, but it can significantly weaken long-range dependencies.
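A minimal sketch of that retention policy, assuming a fixed 128-token window and treating each cached entry as an opaque (key, value) pair:

```python
from collections import deque

WINDOW = 128                      # fixed window size (illustrative)
kv_window = deque(maxlen=WINDOW)  # entries older than the window are evicted automatically

def swa_append(key, value):
    """Retain only the most recent WINDOW (key, value) pairs for attention."""
    kv_window.append((key, value))
    return list(kv_window)        # the next token attends only to this bounded history
```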
NSA and CSA/HCA turn selective retention into a systematic architecture
NSA matters because it acknowledges that long-range information is still important, but that only a highly relevant subset of it needs to be retained. It uses learned sparse indexing to select critical tokens, avoiding both full-history caching and full-history retrieval.
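The sketch below illustrates only the selection idea: score the query against one summary key per block of history and keep the top-scoring blocks. The real design learns the block summaries and the selection jointly, which this simplified version does not model.

```python
import numpy as np

def select_history_blocks(q, block_summary_keys, top_k=8):
    """Pick the top_k most relevant blocks of history for a query.

    block_summary_keys: (num_blocks, head_dim) array, one summary key per block.
    Returns block indices in ascending order; attention then reads KV only from them.
    """
    scores = block_summary_keys @ q    # one relevance score per history block
    top = np.argsort(scores)[-top_k:]  # indices of the highest-scoring blocks
    return np.sort(top)
```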
DeepSeek-V4’s CSA+HCA represents a more mature hierarchical hybrid design. Nearby tokens are preserved precisely with a sliding window. Distant tokens are compressed first and then handled at different granularities across layers. CSA compresses at roughly 4:1, while HCA compresses at roughly 128:1, balancing detail and capacity.
AI Visual Insight: This figure shows the combination of local sliding windows, long-range compression, and cross-layer heterogeneous strategies. The key detail is that different layers do not use a single uniform compression ratio. Instead, they apply multiscale caching to match information density at different distances.
def csa_hca_memory(seq_len, kv_dim=512):
    """Estimate total cached elements for a V4-style layout (multiply by bytes per element for memory)."""
    csa_part = 30 * kv_dim * seq_len / 4    # 30 CSA layers stored at a 4:1 compression ratio
    hca_part = 31 * kv_dim * seq_len / 128  # 31 HCA layers stored at a 128:1 compression ratio
    window_part = 61 * kv_dim * 128         # fixed 128-token sliding-window overhead across all 61 layers
    return csa_part + hca_part + window_part
This pseudocode shows how V4-style designs reconstruct the overall KVCache layout through a compressed segment plus a fixed sliding-window segment.
Linear attention attempts to eliminate O(n) cache growth at the root
Whether you use GQA, MLA, or sparse attention, the cache still depends on sequence length. The coefficient is smaller, but the dependence remains. Linear attention tries to replace token-by-token accumulated K/V tensors with a fixed-size hidden state.
Mamba and state space models (SSMs) represent this direction. They recursively compress historical information into a hidden state, changing cache complexity from O(n) to O(1). The advantage is that memory usage and generation speed no longer degrade linearly with context length. The drawback is weaker precise retrieval of long-range details.
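A minimal decode-step sketch of that recurrence, written for a single channel with a diagonal state transition; it illustrates the fixed-size-state idea only, not Mamba's exact selective formulation:

```python
import numpy as np

def ssm_decode_step(h, x, A, B, C):
    """One decode step: fold the new token into a fixed-size state, then read out.

    h: (d_state,) recurrent state -- the entire "cache", independent of sequence length.
    x: scalar input for this channel; A, B, C: (d_state,) parameters (diagonal transition).
    """
    h = A * h + B * x  # compress the new token into the existing state
    y = C @ h          # readout for the current position
    return h, y
```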
Gated DeltaNet and hybrid architectures are closer to production reality
Gated DeltaNet goes further by introducing explicit forgetting and corrective write mechanisms. It uses a fixed-size state matrix to maintain associative key-value memory. Because a matrix-valued state can hold far more information than a single state vector, its long-text performance comes closer to full attention.
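A simplified sketch of that update, assuming a per-step decay gate alpha and write strength beta; it captures the forget-then-correct idea, not the exact published Gated DeltaNet recurrence:

```python
import numpy as np

def gated_delta_step(S, k, v, alpha, beta):
    """One recurrent update of a fixed-size associative state matrix S.

    S: (d_v, d_k) state; k: (d_k,) key; v: (d_v,) value; alpha, beta: scalars in [0, 1].
    """
    S = alpha * S                                # gated forgetting of old associations
    predicted_v = S @ k                          # value currently associated with key k
    S = S + beta * np.outer(v - predicted_v, k)  # corrective (delta-rule) write
    return S
```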
Hybrid architectures such as Qwen3.5 do not fully abandon attention. Instead, they use a design with mostly linear-attention layers and a smaller number of full-attention layers. This preserves precise retrieval in critical layers while reducing total KVCache to roughly one-quarter of a pure attention model.
AI Visual Insight: This figure shows linear attention replacing the traditional sequential KV cache with a fixed hidden state. The core signal is that the cache structure changes from an array that grows with every token into a fixed-size state container, which fundamentally changes the complexity model.
# KVCache estimate for a hybrid architecture, reusing the per-token parameters
# defined in the first estimate above; only full-attention layers keep complete
# K/V tensors, while the remaining layers hold a fixed-size recurrent state
num_full_attn_layers = 10  # illustrative: a small minority of the total layer count
kv_bytes = 2 * num_full_attn_layers * num_kv_heads * head_dim * seq_len * batch_size * dtype_size
This shows that the key cost-saving mechanism in hybrid architectures is not compressing every layer, but ensuring that most layers do not generate a traditional KVCache at all.
Quantization, cross-layer sharing, and engineering constraints jointly determine the final gains
When changing the model architecture is impractical, the most mature option is still FP8 or FP4 quantization. FP8 is already close to a production standard and can usually cut KVCache in half while remaining compatible with mainstream inference frameworks.
More aggressive directions include vector quantization and CLA-style cross-layer sharing. Once per-element bit width approaches its practical limit, vector quantization moves to encoding entire vectors. Cross-layer sharing is based on the observation that adjacent layers often have highly similar KV representations, allowing some layers to reuse cached tensors directly.
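The sketch below shows how both levers enter the same estimate, using a hypothetical 80-layer GQA model with 8 KV heads and head dimension 128 at a 1M-token context; the figures are illustrative, not measurements from any of the models discussed here.

```python
# Cached elements for the hypothetical baseline at 1M tokens
base_elems = 2 * 80 * 8 * 128 * 1_000_000
configs = [
    ("BF16 baseline", 2.0, 1.0),
    ("FP8", 1.0, 1.0),                         # halving bytes per element halves the cache
    ("FP8 + pairwise CLA sharing", 1.0, 0.5),  # adjacent layer pairs reuse one cached K/V
]
for name, bytes_per_elem, layer_fraction in configs:
    gb = base_elems * bytes_per_elem * layer_fraction / 1e9
    print(f"{name}: {gb:.0f} GB per request")
```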
Results at 1M context already make the trend clear
At the same 1M-token context length, Qwen3-72B with GQA requires about 305GB of KVCache, DeepSeek-V3 with MLA requires about 65GB, Qwen3.5’s hybrid architecture needs about 28.6GB, and DeepSeek-V4 reduces that further to 7.4GB.
This is not just about saving memory. It determines whether a single GPU can support long context, whether concurrency can increase, whether agent sessions can remain persistent, and whether the inference system is commercially deployable.
AI Visual Insight: This chart compares total KVCache size across different models at 1M context. It directly shows the order-of-magnitude differences among GQA, MLA, hybrid architectures, and CSA/HCA, emphasizing the decisive impact of architecture design on deployment cost.
The most important trend today is mechanism fusion
The industry has not converged on a single path. Sparse attention is strong at selective retrieval over medium and long distances. Linear attention is strong at constant-size caching. Sliding windows are strong at high-fidelity local modeling. Quantization improves nearly every path.
The most realistic next step is not for one method to replace all others, but to combine multiple cache strategies within the same model: local fidelity, long-range compression, precise critical layers, and recurrent global state, each serving a different information scale.
FAQ: The three questions developers care about most
Why does KVCache become the first bottleneck in LLM inference?
Because it grows linearly with layer count, sequence length, KV head count, and precision. In long-context and multi-turn dialogue workloads, KVCache can easily consume more than 60% of inference memory, directly limiting concurrency and context length.
If I can choose only one safe optimization, what should I prioritize?
In practice, the usual priority is FP8 KVCache quantization plus GQA. Both have low implementation cost, strong framework support, and stable returns, making them the best starting point for practical memory reduction.
Between sparse attention and linear attention, which path has more future potential?
Both matter. Sparse attention stays closer to Transformer semantics and makes precise retrieval easier to preserve. Linear attention has a better chance of breaking the O(n) memory constraint. In the short to medium term, hybrid architectures are more likely to win than any single pure approach.
Core Summary
This article systematically maps the evolution of LLM KVCache optimization from GQA and MLA to sparse attention, linear attention, and quantization-based compression. It explains why memory usage at 1M context can drop from 305GB to 7.4GB and provides a practical framework for engineering tradeoff decisions.