Hypothesis

3 Matching Annotations

Jun 2026
xcena.com

Untitled document

1
1. fxp007 05 Jun 2026
  
  in Public
  
  the KV cache has emerged as a primary performance and cost bottleneck because its size grows rapidly with context length and batch size. Limited GPU memory forces frequent recomputation, cache eviction, or spilling to storage
  
  This explains why longer context windows are expensive beyond model size alone: the KV cache grows linearly with both context length and batch size, and GPU memory capacity has not kept pace. Each eviction or recomputation adds directly to cost per token, making the KV cache a hidden tax on long-context AI workloads.
  
  kv-cache llm-inference gpu-memory
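
To make the scaling concrete, here is a minimal sketch of the usual back-of-the-envelope KV cache size estimate. The model configuration (32 layers, 8 KV heads, head dimension 128, 2-byte elements) is a hypothetical example, not taken from the annotated page, and the function name `kv_cache_bytes` is illustrative.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim,
                   seq_len, batch_size, bytes_per_elem=2):
    """Estimate KV cache size in bytes.

    Each layer stores one key tensor and one value tensor (hence the
    factor 2), each shaped [batch, kv_heads, seq_len, head_dim].
    """
    return (2 * num_layers * num_kv_heads * head_dim
            * seq_len * batch_size * bytes_per_elem)


# Hypothetical model configuration (an assumption for illustration only).
cfg = dict(num_layers=32, num_kv_heads=8, head_dim=128)

for seq_len in (4096, 8192, 16384):
    size = kv_cache_bytes(seq_len=seq_len, batch_size=8, **cfg)
    print(f"{seq_len:>6} tokens x batch 8: {size / 2**30:.1f} GiB")
```

Under these assumptions, doubling either the context length or the batch size doubles the cache, which is the linear growth that runs into a fixed GPU memory budget.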

Tags

gpu-memory

kv-cache

llm-inference

Annotators

fxp007

URL

xcena.com/sdk_overview
Apr 2026
x.com

https://x.com/berryxia/status/2042017501253661059

1
1. fxp007 16 Apr 2026
  
  in Public
  
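  KV cache memory footprint reduced by 10.7x
  
  Surprisingly, the KV cache memory footprint dropped by a striking 10.7x, far more than a typical optimization delivers. The KV cache is a major consumer of memory in large-model inference, so a reduction of this size means the same hardware can handle longer contexts or run more model instances at the same time.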
  
  surprising memory-efficiency kv-cache-optimization
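
As a rough illustration of what a 10.7x reduction buys, the sketch below converts a fixed memory budget into a number of cacheable tokens. The 16 GiB budget and the 128 KiB-per-token baseline are hypothetical figures chosen for the example, not values from the annotated post, and the sketch assumes the reduction applies uniformly to every cached token.

```python
def max_cached_tokens(budget_bytes, bytes_per_token, reduction=1.0):
    """Tokens that fit in the budget when each token's KV entry
    shrinks by the given reduction factor."""
    return int(budget_bytes * reduction / bytes_per_token)


budget = 16 * 2**30       # hypothetical 16 GiB reserved for the KV cache
per_token = 128 * 2**10   # hypothetical 128 KiB of KV data per token

baseline = max_cached_tokens(budget, per_token)
reduced = max_cached_tokens(budget, per_token, reduction=10.7)
print(f"baseline: {baseline} tokens, after 10.7x reduction: {reduced} tokens")
```

The extra capacity can be spent on longer contexts, more concurrent sequences, or a mix of both.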

Tags

kv-cache-optimization

surprising

memory-efficiency

Annotators

fxp007

URL

x.com/berryxia/status/2042017501253661059
Aug 2025
arxiv.org

2410.07590v1.pdf

1
1. zhurovandrew 03 Aug 2025
  
  in Public
  
  the computation cost of KV caches is quadratic to the input sequence length
  
  KV cache compute cost
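
One common way to read the quadratic claim is that, when the keys and values for a prompt are computed under causal attention, every token attends to itself and to all earlier tokens. The sketch below counts those query-key pairs; it is an illustration of that reading, not the paper's own cost model, and the function name is hypothetical.

```python
def causal_attention_pairs(seq_len):
    """Number of (query, key) pairs under causal attention:
    token i attends to tokens 1..i, so the total is 1 + 2 + ... + n."""
    return seq_len * (seq_len + 1) // 2


for n in (1000, 2000, 4000):
    print(f"{n} tokens -> {causal_attention_pairs(n)} query-key pairs")
```

Doubling the sequence length roughly quadruples the pair count, while the memory needed to store the resulting cache only doubles.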

Tags

KV cache

compute cost

Annotators

zhurovandrew

URL

arxiv.org/pdf/2410.07590
