Hypothesis

3 Matching Annotations

Jun 2026
xcena.com

Untitled document

1
1. fxp007 05 Jun 2026
  
  in Public
  
  the KV cache has emerged as a primary performance and cost bottleneck because its size grows rapidly with context length and batch size. Limited GPU memory forces frequent recomputation, cache eviction, or spilling to storage
  
  This explains why longer context windows are expensive beyond model size alone: the KV cache grows linearly with both context length and batch size (it is attention compute, not the cache, that scales quadratically with context), and GPU memory capacity has not kept pace. Each eviction or recomputation adds to cost per token, which makes the KV cache a hidden tax on long-context AI workloads.
  
  kv-cache llm-inference gpu-memory
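
  A minimal sketch of the sizing arithmetic behind the note above, assuming a standard decoder-only transformer that stores one key and one value tensor per layer. The function name and the model shape (32 layers, 32 KV heads, head dimension 128, FP16) are hypothetical and are not taken from the annotated page.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim,
                   seq_len, batch_size, bytes_per_elem=2):
    """Approximate KV cache size in bytes.

    The leading factor 2 counts the key tensor and the value tensor
    stored per layer. bytes_per_elem=2 assumes FP16/BF16 storage.
    """
    return (2 * num_layers * num_kv_heads * head_dim
            * seq_len * batch_size * bytes_per_elem)


# Hypothetical model shape, for illustration only.
shape = dict(num_layers=32, num_kv_heads=32, head_dim=128)

for seq_len in (4096, 8192, 16384):
    size = kv_cache_bytes(seq_len=seq_len, batch_size=8, **shape)
    print(f"context={seq_len:6d}  batch=8  KV cache = {size / 2**30:.0f} GiB")
```

  Under these assumptions each token costs 0.5 MiB of cache, so doubling either the context length or the batch size doubles the total.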

Tags

gpu-memory

kv-cache

llm-inference

Annotators

fxp007

URL

xcena.com/sdk_overview
Nov 2021
lilianweng.github.io

How to Train Really Large Models on Many GPUs?

2
1. sherlockliao 09 Nov 2021
  
  in Public
  
  two major memory consumption of large model training: The majority is occupied by model states, including optimizer states (e.g. Adam momentums and variances), gradients and parameters. Mixed-precision training demands a lot of memory since the optimizer needs to keep a copy of FP32 parameters and other optimizer states, besides the FP16 version. The remaining is consumed by activations, temporary buffers and unusable fragmented memory (named residual states in the paper).
  
  What are the main sources of GPU memory consumption when training deep networks?
  
  GPU memory amp
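
  A rough sketch of the model-state accounting described in the highlighted passage, assuming mixed-precision training with Adam and the per-parameter byte counts commonly used in the ZeRO paper's analysis (FP16 parameters and gradients, plus FP32 master parameters, momentum, and variance). The names are hypothetical, and activations, temporary buffers, and fragmentation are not modeled.

```python
# Bytes held per model parameter under the stated assumptions.
BYTES_PER_PARAM = {
    "fp16_params": 2,         # working copy used in forward/backward
    "fp16_grads": 2,          # gradients in half precision
    "fp32_master_params": 4,  # full-precision copy kept by the optimizer
    "fp32_adam_momentum": 4,  # Adam first moment
    "fp32_adam_variance": 4,  # Adam second moment
}


def model_state_bytes(num_params):
    """Memory for model states only (no activations or buffers)."""
    return num_params * sum(BYTES_PER_PARAM.values())


num_params = 1_500_000_000  # hypothetical 1.5B-parameter model
total = model_state_bytes(num_params)
print(f"{total / 1e9:.0f} GB of model states "
      f"({sum(BYTES_PER_PARAM.values())} bytes per parameter)")
```

  Under these assumptions the optimizer-related FP32 state accounts for 12 of the 16 bytes per parameter, which is why mixed precision does not by itself reduce model-state memory.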
2. sherlockliao 09 Nov 2021
  
  in Public
  
  It partitions optimizer state, gradients and parameters across multiple data parallel processes via a dynamic communication schedule to minimize the communication volume.
  
  How does ZeRO-DP work?
  
  ZeRO GPU memory Data Parallel
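
  A hedged sketch of how the partitioning in the highlighted passage changes per-device model-state memory, assuming the same 2 + 2 + 12 bytes-per-parameter accounting as the ZeRO paper and an even split across data-parallel processes. The function name and the stage numbering (0 = no partitioning, 1 = optimizer state, 2 = plus gradients, 3 = plus parameters) are illustrative; the communication schedule itself is not modeled.

```python
def zero_dp_bytes_per_device(num_params, num_devices, stage):
    """Per-device model-state bytes for a given partitioning stage.

    stage 0: plain data parallelism, everything replicated
    stage 1: optimizer state partitioned
    stage 2: optimizer state and gradients partitioned
    stage 3: optimizer state, gradients, and parameters partitioned
    """
    params = 2 * num_params      # FP16 parameters
    grads = 2 * num_params       # FP16 gradients
    opt_state = 12 * num_params  # FP32 master params + Adam moments
    if stage >= 1:
        opt_state /= num_devices
    if stage >= 2:
        grads /= num_devices
    if stage >= 3:
        params /= num_devices
    return params + grads + opt_state


# Hypothetical setup: 7.5B parameters on 64 data-parallel devices.
for stage in range(4):
    gb = zero_dp_bytes_per_device(7_500_000_000, 64, stage) / 1e9
    print(f"stage {stage}: {gb:.1f} GB per device")
```

  With every state partitioned, per-device model-state memory falls in proportion to the number of data-parallel processes.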

Tags

GPU memory

Data Parallel

amp

ZeRO

Annotators

sherlockliao

URL

lilianweng.github.io/lil-log/2021/09/24/train-large-neural-networks.html
