4 Matching Annotations
  1. Jun 2022
    1. the high number of compute operations required can result in unrealistically long training times (e.g., training GPT-3 with 175 billion parameters [11] would require approximately 288 years with a single V100 NVIDIA GPU).

      GPT-3 training cost
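The 288-year figure can be sanity-checked with a back-of-the-envelope calculation. The sketch below is not taken from the quoted paper: it assumes the common rule of thumb that training compute is about 6 × parameters × tokens, and it assumes a training set of 300 billion tokens (neither number appears in the excerpt). The function names are hypothetical. It then derives the sustained single-GPU throughput that the 288-year claim would imply.

```python
SECONDS_PER_YEAR = 365.25 * 24 * 3600


def training_flops(n_params: float, n_tokens: float) -> float:
    """Approximate total training compute with the 6 * N * D rule of thumb (an assumption)."""
    return 6.0 * n_params * n_tokens


def implied_throughput(total_flops: float, years: float) -> float:
    """Sustained FLOP/s one device needs to finish total_flops in the given number of years."""
    return total_flops / (years * SECONDS_PER_YEAR)


# 175e9 parameters is from the excerpt; 300e9 tokens is an assumed training-set size.
gpt3_flops = training_flops(n_params=175e9, n_tokens=300e9)
sustained = implied_throughput(gpt3_flops, years=288)

print(f"total training compute ~ {gpt3_flops:.2e} FLOPs")
print(f"implied sustained throughput ~ {sustained / 1e12:.1f} TFLOP/s")
```

Under these assumptions the implied throughput comes out in the tens of TFLOP/s, which is a plausible sustained rate for one V100 using mixed precision, so the quoted estimate is at least internally consistent.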

    1. As a concrete measure, we suggest reporting the total number of floating point operations (FPO) required to generate a result. FPO provides an estimate to the amount of work performed by a computational process. It is computed analytically by defining a cost to two base operations, ADD and MUL. Based on these operations, the FPO cost of any machine learning abstract operation (e.g., a tanh operation, a matrix multiplication, a convolution operation, or the BERT model) can be computed as a recursive function of these two operations. FPO has been used in the past to quantify the energy footprint of a model [26, 42, 12, 41], but is not widely adopted in AI

      Introduction to FLOPs
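The recursive counting described in the excerpt can be illustrated with a small sketch. This is only one possible reading, not the paper's own implementation: the `FPO` class and the helper functions are hypothetical names, the cost of each ADD and MUL defaults to 1, and a matrix product is counted with the naive algorithm (each output entry is a dot product of length n, needing n MULs and n - 1 ADDs).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FPO:
    """Counts of the two base operations (hypothetical helper type)."""
    adds: int = 0
    muls: int = 0

    def __add__(self, other: "FPO") -> "FPO":
        return FPO(self.adds + other.adds, self.muls + other.muls)

    def scale(self, k: int) -> "FPO":
        """Cost of repeating this operation k times."""
        return FPO(self.adds * k, self.muls * k)

    def total(self, add_cost: int = 1, mul_cost: int = 1) -> int:
        """Weighted total; unit costs are an assumption."""
        return self.adds * add_cost + self.muls * mul_cost


def dot_product(n: int) -> FPO:
    # n multiplications, then n - 1 additions to sum the products.
    return FPO(adds=n - 1, muls=n)


def matmul(m: int, n: int, p: int) -> FPO:
    # (m x n) @ (n x p): one length-n dot product per output entry.
    return dot_product(n).scale(m * p)


def linear_layer(batch: int, d_in: int, d_out: int) -> FPO:
    # Built recursively from matmul, plus one ADD per output for the bias.
    return matmul(batch, d_in, d_out) + FPO(adds=batch * d_out)


cost = linear_layer(batch=32, d_in=768, d_out=3072)
print(cost, cost.total())
```

Larger operations (a convolution, a full model) would be composed the same way, by summing and scaling the costs of their sub-operations.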

    1. For example, NVIDIA estimated that 80–90% of the ML workload is inference processing [Leo19]. Similarly, Amazon Web Services claimed that 90% of the ML demand in the cloud is for inference [Bar19].

      Evidence for the overall energy consumption of inference

    1. A clear recent trend in the AI community is that models are getting significantly larger. It only took 3 months to shift the title of the largest model from BERT-Large to GPT-2 (Radford et al. 2019) in 2020 while the number of parameters of GPT-2 is around 5 times larger than that of BERT-Large. Moreover, GPT-2 further evolves into GPT-3 (Brown et al. 2020) with 175 Billion parameters. More recently, GLM (Du et al. 2021) has clinched the title with surprisingly 1.75 Trillion parameters. These large models consume more data and have better performance than their smaller counterparts

      The trend of AI models growing ever larger