2 Matching Annotations
- Nov 2021
two major memory consumption of large model training: The majority is occupied by model states, including optimizer states (e.g. Adam momentums and variances), gradients and parameters. Mixed-precision training demands a lot of memory since the optimizer needs to keep a copy of FP32 parameters and other optimizer states, besides the FP16 version. The remaining is consumed by activations, temporary buffers and unusable fragmented memory (named residual states in the paper).
It partitions optimizer state, gradients and parameters across multiple data parallel processes via a dynamic communication schedule to minimize the communication volume.