Hypothesis

Instead, with IndexShare, the KV cache of $h_5$ includes only $kv_{1:4}$, all computed from the hidden states of the target model. For training, we reuse both the KV cache and the top-k indices of the first MTP step.
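The reuse described above can be illustrated with a small sketch. It is not the authors' implementation: it assumes a top-k sparse attention in which an indexer score picks the cached positions, and every name in it (`topk_indices`, `sparse_attention`, `mtp_steps_with_index_share`, `index_scores`) is hypothetical. It shows only that the indices are computed once, at the first MTP step, and that every later step attends to the same target-model KV entries.

```python
import numpy as np

def topk_indices(index_scores, k):
    """Return the positions of the k highest-scoring cache entries, in ascending order."""
    k = min(k, index_scores.shape[0])
    return np.sort(np.argsort(-index_scores, kind="stable")[:k])

def sparse_attention(query, keys, values, indices):
    """Softmax attention restricted to the selected cache positions."""
    k_sel, v_sel = keys[indices], values[indices]
    logits = k_sel @ query / np.sqrt(query.shape[0])
    weights = np.exp(logits - logits.max())  # subtract the max for numerical stability
    weights /= weights.sum()
    return weights @ v_sel

def mtp_steps_with_index_share(queries, target_keys, target_values, index_scores, k):
    """Run each MTP step against one shared KV cache and one shared index set.

    Assumption: target_keys/target_values hold kv_{1:4} from the target model,
    and index_scores are the indexer scores available at the first MTP step.
    """
    shared = topk_indices(index_scores, k)  # computed once, reused by every step
    outputs = [sparse_attention(q, target_keys, target_values, shared) for q in queries]
    return shared, outputs
```

With four cached positions and `k = 2`, both MTP steps read the same two entries, so changing an unselected entry of the cache leaves every step's output unchanged.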

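Most people assume that in multi-step predictive decoding each step should compute its own KV cache independently to preserve complete information, but the authors argue that sharing indices eliminates the training-inference discrepancy and raises the acceptance rate. This counterintuitive view challenges established practice for model inference and suggests that, in some cases, restricting information flow can actually improve model performance.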

counterintuitive index-share inference-optimization

Tags

Annotators

URL