The high number of compute operations required can result in unrealistically long training times (e.g., training GPT-3 with 175 billion parameters [11] would require approximately 288 years with a single V100 NVIDIA GPU).
GPT-3 training cost
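The 288-year figure can be sanity-checked with back-of-the-envelope arithmetic. The sketch below is an illustration under stated assumptions, not the cited paper's own calculation: it uses the common rule of thumb that training costs about 6 × parameters × tokens floating point operations, assumes roughly 300 billion training tokens, and solves for the sustained single-GPU throughput that 288 years would imply. All function names are hypothetical.

```python
SECONDS_PER_YEAR = 365.25 * 24 * 3600


def training_flops(n_params: float, n_tokens: float) -> float:
    """Approximate training FLOPs via the 6 * N * D rule of thumb (an assumption)."""
    return 6.0 * n_params * n_tokens


def training_years(total_flops: float, sustained_flops_per_sec: float) -> float:
    """Wall-clock years needed on one device at a given sustained throughput."""
    return total_flops / sustained_flops_per_sec / SECONDS_PER_YEAR


def implied_throughput(total_flops: float, years: float) -> float:
    """Sustained FLOP/s that would make the job take exactly `years` years."""
    return total_flops / (years * SECONDS_PER_YEAR)


if __name__ == "__main__":
    # Assumed inputs: 175e9 parameters (from the text), 300e9 tokens (assumption).
    flops = training_flops(175e9, 300e9)
    tput = implied_throughput(flops, 288)
    print(f"total training FLOPs: {flops:.3e}")
    print(f"implied sustained throughput: {tput / 1e12:.1f} TFLOP/s")
    print(f"round trip: {training_years(flops, tput):.1f} years")
```

Under these assumptions the implied sustained throughput is on the order of tens of TFLOP/s, which is at least plausible for a V100 running mixed-precision workloads. A different token count or cost model would shift the result.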
As a concrete measure, we suggest reporting the total number of floating point operations (FPO) required to generate a result. FPO provides an estimate of the amount of work performed by a computational process. It is computed analytically by assigning a cost to two base operations, ADD and MUL. Based on these operations, the FPO cost of any machine learning abstract operation (e.g., a tanh operation, a matrix multiplication, a convolution operation, or the BERT model) can be computed as a recursive function of these two operations. FPO has been used in the past to quantify the energy footprint of a model [26, 42, 12, 41], but is not widely adopted in AI.
Introduction to FLOPs
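The recursive definition above can be sketched in a few lines of code. This is a minimal illustration, not the paper's tooling: it assumes unit cost for ADD and MUL, and the class and function names (`FPO`, `dot_cost`, `matmul_cost`, `dense_cost`) are hypothetical. Each higher-level operation is costed purely in terms of the operations beneath it.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FPO:
    """Count of base operations; ADD and MUL are tracked separately."""
    add: int = 0
    mul: int = 0

    def __add__(self, other: "FPO") -> "FPO":
        return FPO(self.add + other.add, self.mul + other.mul)

    def scale(self, k: int) -> "FPO":
        """Cost of repeating this operation k times."""
        return FPO(self.add * k, self.mul * k)

    @property
    def total(self) -> int:
        return self.add + self.mul


ADD = FPO(add=1)  # assumed unit cost
MUL = FPO(mul=1)  # assumed unit cost


def dot_cost(n: int) -> FPO:
    """Dot product of two length-n vectors: n multiplies, n - 1 adds."""
    return MUL.scale(n) + ADD.scale(n - 1)


def matmul_cost(m: int, k: int, n: int) -> FPO:
    """(m x k) @ (k x n): one length-k dot product per output element."""
    return dot_cost(k).scale(m * n)


def dense_cost(batch: int, d_in: int, d_out: int) -> FPO:
    """Linear layer: a matrix multiplication plus one bias add per output."""
    return matmul_cost(batch, d_in, d_out) + ADD.scale(batch * d_out)


if __name__ == "__main__":
    # A two-layer MLP costed as the sum of its layers (activations omitted).
    mlp = dense_cost(32, 768, 3072) + dense_cost(32, 3072, 768)
    print(mlp, mlp.total)
```

Nonlinearities such as tanh are left out because their cost in ADD and MUL terms depends on the approximation chosen. They would slot in as another function returning an `FPO`.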
For example, NVIDIA estimated that 80–90% of the ML workload is inference processing [Leo19]. Similarly, Amazon Web Services claimed that 90% of the ML demand in the cloud is for inference [Bar19].
Evidence for the overall energy consumption of inference
A clear recent trend in the AI community is that models are getting significantly larger. It took only 3 months for the title of the largest model to shift from BERT-Large to GPT-2 (Radford et al. 2019) in 2019, and GPT-2 has around 5 times as many parameters as BERT-Large. GPT-2 then evolved into GPT-3 (Brown et al. 2020), with 175 billion parameters. More recently, GLM (Du et al. 2021) has clinched the title with a striking 1.75 trillion parameters. These large models consume more data and perform better than their smaller counterparts.
Trend of AI models continually growing larger