2021 is the year of the monster AI model


What does it mean for a model to be large? The size of a model (a trained neural network) is measured by the number of parameters it has. These are the values in the network that are adjusted again and again during training and are then used by the model to make predictions. Roughly speaking, the more parameters a model has, the more information it can absorb from its training data, and the more accurate its predictions about new data will be.
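The counting itself is simple arithmetic. As a minimal sketch (the layer sizes below are hypothetical, chosen only to illustrate the calculation), each dense layer of a feed-forward network contributes its weight matrix plus its biases:

```python
def count_parameters(sizes):
    """Count parameters of a feed-forward net: each dense layer has
    (inputs * outputs) weights plus one bias per output."""
    total = 0
    for n_in, n_out in zip(sizes, sizes[1:]):
        total += n_in * n_out + n_out
    return total

# Hypothetical tiny network: 512 inputs -> 256 hidden units -> 10 outputs
print(count_parameters([512, 256, 10]))  # 133898
```

Models like GPT-3 are built from the same kinds of building blocks; the difference is that their layer sizes and layer counts push the same sum into the hundreds of billions.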

GPT-3 has 175 billion parameters, 10 times as many as its predecessor GPT-2. But GPT-3 pales in comparison with the class of 2021. Jurassic-1, a commercial large language model launched by the US start-up AI21 Labs in September, edged out GPT-3 with 178 billion parameters. Gopher, a new model released by DeepMind in December, has 280 billion. Megatron-Turing NLG has 530 billion. Google's Switch Transformer and GLaM models have 1 trillion and 1.2 trillion parameters, respectively.

And the trend is not just in the US. This year the Chinese tech giant Huawei built a 200-billion-parameter language model called PanGu. Inspur, another Chinese firm, built Yuan 1.0, a model with 245 billion parameters. Baidu and Peng Cheng Laboratory, a research institute in Shenzhen, announced PCL-BAIDU Wenxin, a model with 280 billion parameters that Baidu is already using in a variety of applications, including internet search, news feeds, and smart speakers. And the Beijing Academy of Artificial Intelligence announced Wu Dao 2.0, which has 1.75 trillion parameters.

At the same time, South Korean Internet search company Naver announced a model called HyperCLOVA with 204 billion parameters.

Each of these is a remarkable feat of engineering. For a start, training a model with more than 100 billion parameters is a complex plumbing problem: hundreds of individual GPUs, the hardware of choice for training deep neural networks, must be connected and synchronized, and the training data must be split into chunks and distributed among them in the right order at the right time.
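The data-splitting step can be sketched in a few lines. This is a hedged illustration, not any lab's actual pipeline: the worker count and batch size are made up, and real systems layer on gradient all-reduce, fault tolerance, and model parallelism on top of this basic sharding idea.

```python
def shard_batches(samples, num_workers, batch_size):
    """Split samples into batches, then deal the batches out round-robin
    so every worker (GPU) sees its share in a consistent order."""
    batches = [samples[i:i + batch_size]
               for i in range(0, len(samples), batch_size)]
    return {w: batches[w::num_workers] for w in range(num_workers)}

# Toy run: 32 samples, 4 workers, batches of 4.
shards = shard_batches(list(range(32)), num_workers=4, batch_size=4)
print(shards[0])  # worker 0 gets batches 0 and 4: [[0, 1, 2, 3], [16, 17, 18, 19]]
```

After each training step, the workers must synchronize (typically by averaging their gradients) before the next round of batches is consumed, which is what makes keeping hundreds of GPUs in lockstep so hard.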

Large language models have become prestige projects that showcase a company's technical prowess. Yet few of these new models do much to move research forward beyond repeatedly demonstrating that scaling up produces good results.

There are a handful of innovations, though. Once trained, Google's Switch Transformer and GLaM use only a small fraction of their parameters to make each prediction, which saves computing power. PCL-BAIDU Wenxin combines a GPT-3-style model with a knowledge graph, a technique used in old-school symbolic AI to store facts. And alongside Gopher, DeepMind released RETRO, a language model with only 7 billion parameters that competes with models 25 times its size by cross-referencing a database of documents when it generates text. This makes RETRO cheaper to train than its giant rivals.
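The "fraction of parameters per prediction" idea is a routing trick: each input is sent to exactly one of several expert sub-networks, so only that expert's parameters do any work. The sketch below is a deliberately toy version of the concept, not Google's implementation; the experts are trivial functions and a simple modulus stands in for the learned gating network.

```python
NUM_EXPERTS = 4

# Stand-in experts: in a real model each would be a large feed-forward
# block with its own billions of parameters.
experts = [lambda x, k=k: x * (k + 1) for k in range(NUM_EXPERTS)]

def switch_layer(x):
    """Route the input to exactly one expert. A learned gate would score
    all experts and pick the top one; a modulus stands in for it here."""
    chosen = x % NUM_EXPERTS      # pick one expert per input
    return experts[chosen](x)     # only 1/NUM_EXPERTS of parameters active

print(switch_layer(7))  # routed to expert 3, which computes 7 * 4 = 28
```

The payoff is that total parameter count can grow with the number of experts while the compute per prediction stays roughly constant, which is how trillion-parameter models like Switch Transformer stay affordable to run.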

