1 Matching Annotations
  1. Jun 2026
    1. As GLM-5.2 extends the maximum context length from 200K to 1M tokens, coding workloads are expected to shift substantially toward longer prompts. This shifts the primary inference bottleneck from computation to KV-cache capacity, long-context kernel overhead, and CPU-side overhead.

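      Most people assume that computational complexity becomes the main bottleneck as context length grows, but the authors argue that the real bottlenecks are KV-cache capacity and CPU-side overhead. This challenges the AI field's consensus that 'computational complexity is the main bottleneck' and suggests that in long-context scenarios, memory management and system-level optimization may matter more than algorithmic optimization.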
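
To make the KV-cache capacity point concrete, here is a rough back-of-the-envelope sketch. It assumes plain multi-head or grouped-query attention with an uncompressed cache stored in 16-bit values. The layer count, KV-head count, and head dimension are hypothetical placeholders, not GLM-5.2's actual configuration, and `kv_cache_bytes` is a helper name invented for this illustration.

```python
def kv_cache_bytes(num_tokens, num_layers, num_kv_heads, head_dim, bytes_per_value=2):
    """Bytes needed to cache keys and values for one sequence.

    Assumes standard attention with an uncompressed cache; architectures
    that compress or sparsify the cache would need less.
    """
    # Factor 2: one key vector plus one value vector per token, head, and layer.
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_value * num_tokens


# Hypothetical model shape, chosen only for illustration (NOT GLM-5.2's real config).
LAYERS, KV_HEADS, HEAD_DIM = 64, 8, 128

for ctx in (200_000, 1_000_000):
    gib = kv_cache_bytes(ctx, LAYERS, KV_HEADS, HEAD_DIM) / 2**30
    print(f"{ctx:>9,} tokens -> {gib:.1f} GiB of KV cache per sequence")
```

Under these assumptions the cache grows linearly with context length, so moving from 200K to 1M tokens multiplies per-sequence cache memory by five. That memory, rather than arithmetic throughput, then limits how many long requests fit on an accelerator at once.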