3 Matching Annotations
  1. Apr 2026
    1. TPU 8i is designed with more memory bandwidth to serve the most latency-sensitive inference workloads, which is critical because interactions between agents at scale magnify even small inefficiencies.

      通常认为内存带宽是通用硬件的需求,但作者提出TPU 8i针对低延迟推理进行了优化,这与通用硬件设计追求平衡的常规做法不同。

    1. A 606 MiB model at ~49 tokens/s consumes ~30 GB/s of memory bandwidth, close to the c6i.2xlarge's DRAM limit. No amount of SIMD tricks will help when the CPU is stalled waiting for model weights to arrive from DRAM.

      这一数据揭示了现代CPU推理的关键瓶颈:内存带宽限制。代理最初尝试的SIMD微优化无法突破这一根本限制,这表明理解硬件特性和系统瓶颈对于有效优化至关重要。这一发现挑战了传统上认为计算是主要瓶颈的观念,强调了内存效率在AI推理中的核心地位。

  2. Apr 2022

    Tags

    Annotators