the KV cache has emerged as a primary performance and cost bottleneck because its size grows rapidly with context length and batch size. Limited GPU memory forces frequent recomputation, cache eviction, or spilling to storage
This precisely quantifies why longer context windows are expensive beyond just model size: KV cache grows quadratically with context, and current GPU memory can't keep pace. Each eviction or recomputation directly inflates cost-per-token — making KV cache the hidden tax on long-context AI workloads.