1 Matching Annotations
  1. Last 7 days
    1. With increasing levels of hardware failure, Decoupled DiLoCo continues to deliver a high level of 'goodput', or useful training, while that of other approaches nosedives.

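      Most people assume that hardware failures significantly reduce the efficiency and performance of distributed training. The authors argue instead that even under extremely high hardware failure rates, Decoupled DiLoCo still maintains 88% goodput, while conventional approaches plunge to 27%, which challenges the conventional understanding of fault tolerance.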