Hypothesis

CXL addresses this by introducing a shared memory pool that extends capacity beyond GPU memory and lets workers reuse each other's KV cache entries. Because CXL exposes memory through load/store semantics, KV data can be read in place without copying, which reduces recomputation, stabilizes latency, and lowers cost per token.

Zero-copy access via CXL's load/store semantics is the key architectural advantage over existing approaches such as CPU offloading or NVMe storage, which require serialization and DMA transfers. As a result, CXL memory behaves more like an extension of GPU memory than like a slower storage tier, which preserves latency-sensitive inference performance.
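
To make the zero-copy idea concrete, here is a minimal sketch of two workers sharing one byte-addressable pool and reading a KV block in place. It is an illustration under assumptions, not a CXL implementation: a memory-mapped temporary file stands in for a CXL region (on Linux such a region would typically be mapped from a device-dax node or appear as a NUMA node, which is an assumption about deployment), and BLOCK_BYTES, open_pool, block_view, and demo are hypothetical names. It models only CPU-side load/store access, not the GPU path.

```python
import mmap
import os
import tempfile

BLOCK_BYTES = 4096  # hypothetical size of one KV block


def open_pool(path, num_blocks):
    """Map a file as a shared, byte-addressable pool (stand-in for a CXL region)."""
    size = num_blocks * BLOCK_BYTES
    fd = os.open(path, os.O_RDWR | os.O_CREAT)
    try:
        os.ftruncate(fd, size)
        return mmap.mmap(fd, size)  # shared, read/write mapping by default
    finally:
        os.close(fd)  # the mapping keeps its own handle to the file


def block_view(pool, index):
    """Return a view of one block; slicing a memoryview does not copy bytes."""
    start = index * BLOCK_BYTES
    return memoryview(pool)[start:start + BLOCK_BYTES]


def demo():
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "kv_pool.bin")
        writer = open_pool(path, num_blocks=4)  # "worker A"
        reader = open_pool(path, num_blocks=4)  # "worker B"
        try:
            kv = bytes(range(256)) * (BLOCK_BYTES // 256)
            with block_view(writer, 2) as dst:
                dst[:] = kv  # plain store into the shared pool
            with block_view(reader, 2) as src:
                same = src == kv  # plain loads, compared in place
                first_last = (src[0], src[255])
        finally:
            writer.close()
            reader.close()
    return same, first_last


if __name__ == "__main__":
    print(demo())
```

The reader never deserializes or stages the block through an intermediate buffer; it indexes the mapped bytes directly, which is the property the hypothesis attributes to CXL. A CPU-offload or NVMe path would instead add an explicit read or transfer step before the data is usable.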

cxl zero-copy cost-per-token
