This has been a year of colossal AI models.
When OpenAI released GPT-3 in June 2020, the neural network’s apparent grasp of English was remarkable. It could produce convincing sentences, converse with people, and even autocomplete code. GPT-3 was also massive, far larger than any neural network created before it. It ushered in a new era in AI, one in which bigger is better.
Despite GPT-3’s tendency to mimic the bias and toxicity inherent in the online text it was trained on, and despite the fact that teaching such a large model its tricks requires an unsustainable amount of computing power, we chose GPT-3 as one of our breakthrough technologies of 2020—for good and ill.
In 2021, however, the influence of GPT-3 became even more apparent. This year has seen a flood of huge AI models developed by a variety of tech companies and prominent AI laboratories, many of which are larger and more capable than GPT-3. How big can they grow and how much will it cost?
GPT-3 drew international notice not only for what it could achieve, but also for how it did it. The dramatic improvement in performance, particularly GPT-3’s ability to generalise across language tasks for which it had not been specifically trained, was due not to better algorithms (though it does rely heavily on a type of neural network called a transformer invented by Google in 2017), but to sheer size.
In a panel discussion at NeurIPS, a premier AI conference, Jared Kaplan, a researcher at OpenAI and one of the designers of GPT-3, remarked, “We thought we needed a fresh idea, but we got there purely through size.”
“We continue to witness hyperscaling of AI models resulting in higher performance, with seemingly no end in sight,” wrote a pair of Microsoft researchers in an October blog post introducing the company’s huge Megatron-Turing NLG model, developed in partnership with Nvidia.
What does it mean for a model to be large? The size of a model, a trained neural network, is measured by the number of parameters it has. These are the values in the network that are adjusted repeatedly during training and then used to make the model’s predictions. Roughly speaking, the more parameters a model has, the more information it can absorb from its training data and the more accurate its predictions about new data will be.
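To make the idea of a parameter count concrete, here is a minimal sketch in Python that counts the adjustable values in a small, fully connected network. The function name and the toy layer sizes are assumptions chosen for illustration; none of the models discussed here is this simple, and their parameter counts are spread across many more kinds of layers.

```python
def count_parameters(layer_sizes):
    """Count weights and biases in a fully connected network.

    layer_sizes lists the number of units in each layer, input first.
    Every pair of adjacent layers contributes one weight per connection
    plus one bias per unit in the receiving layer.
    """
    total = 0
    for n_in, n_out in zip(layer_sizes, layer_sizes[1:]):
        total += n_in * n_out  # weights connecting the two layers
        total += n_out         # one bias per receiving unit
    return total


# Hypothetical toy network: 784 inputs, 128 hidden units, 10 outputs.
toy_layers = [784, 128, 10]
toy_total = count_parameters(toy_layers)
print(f"Toy network parameters: {toy_total:,}")
```

Even this toy network has roughly a hundred thousand parameters. The models in this article have hundreds of billions.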
GPT-3 has 175 billion parameters, more than 100 times as many as its predecessor, GPT-2, which had 1.5 billion. But GPT-3 is dwarfed by the class of 2021. Jurassic-1, a commercially available large language model released by US startup AI21 Labs in September, edged out GPT-3 with 178 billion parameters. DeepMind’s Gopher model, unveiled in December, has 280 billion parameters. Megatron-Turing NLG has 530 billion. Google’s Switch-Transformer and GLaM models have 1 trillion and 1.2 trillion parameters, respectively.
The trend is not limited to the United States. This year, Huawei, a Chinese tech company, developed PanGu, a 200-billion-parameter language model. Another Chinese company, Inspur, created Yuan 1.0, a 245-billion-parameter model. Baidu and Peng Cheng Laboratory, a research centre in Shenzhen, announced PCL-BAIDU Wenxin, a model with 280 billion parameters that Baidu is already using in a range of applications, including internet search, news feeds, and smart speakers. And the Beijing Academy of AI announced Wu Dao 2.0, which has 1.75 trillion parameters.
Meanwhile, Naver, a South Korean internet search company, has announced the HyperCLOVA model, which has 204 billion parameters.
Every one of these is a significant engineering achievement. To begin with, training a model with more than 100 billion parameters is a complicated plumbing problem: hundreds of individual GPUs—the preferred hardware for training deep neural networks—must be connected and synchronised, and the training data must be split into chunks and distributed between them in the correct order and at the correct time.
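The data-distribution half of that plumbing problem can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions: the function name, the worker count, and the batch size are hypothetical, and a real training system must also synchronise the GPUs after every step, which is not shown here.

```python
def shard_batches(examples, num_workers, per_worker_batch):
    """Yield one training step at a time as a list of per-worker batches.

    The examples are consumed in their original order. Each step hands
    every worker its own contiguous chunk, so no example is seen twice.
    A trailing remainder too small to fill a whole step is dropped.
    """
    step_size = num_workers * per_worker_batch
    for start in range(0, len(examples) - step_size + 1, step_size):
        step = examples[start:start + step_size]
        yield [
            step[w * per_worker_batch:(w + 1) * per_worker_batch]
            for w in range(num_workers)
        ]


# Hypothetical setup: ten numbered examples, two workers, two examples each.
steps = list(shard_batches(list(range(10)), num_workers=2, per_worker_batch=2))
for step_index, batches in enumerate(steps):
    print(step_index, batches)
```

With hundreds of GPUs instead of two, the same bookkeeping has to hold across machines, which is where much of the engineering effort goes.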
Large language models have become prestige projects that showcase a company’s technical expertise. However, few of these new models do more than repeat the demonstration that scaling up yields good results.
There are a few noteworthy innovations. Once trained, Google’s Switch-Transformer and GLaM use only a fraction of their parameters to make each prediction, which saves computing power. PCL-BAIDU Wenxin combines a GPT-3-style model with a knowledge graph, a technique from old-school symbolic AI for storing facts. And alongside Gopher, DeepMind launched RETRO, a language model with only 7 billion parameters that competes with models 25 times its size by cross-referencing a database of documents when it generates text. This makes RETRO less expensive to train than its larger competitors.
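One common way to use only part of a model for each prediction is to split it into several "expert" sub-networks and let a small router pick one of them per input. The Python sketch below is a toy illustration of that routing idea, not Google's implementation: the experts, the router, and all function names are hypothetical stand-ins, with simple arithmetic in place of neural networks.

```python
def route_top1(router_scores):
    """Return the index of the highest-scoring expert."""
    return max(range(len(router_scores)), key=lambda i: router_scores[i])


def sparse_forward(x, experts, router):
    """Run only the expert chosen by the router; the others stay idle."""
    chosen = route_top1(router(x))
    return chosen, experts[chosen](x)


# Hypothetical experts: stand-ins for separate blocks of parameters.
experts = [
    lambda x: x * 2,
    lambda x: x + 10,
    lambda x: -x,
]


def router(x):
    """Hypothetical router: produces one score per expert for input x."""
    return [x, 5 - x, 0]


for value in (1, 4):
    index, output = sparse_forward(value, experts, router)
    print(f"input {value}: expert {index} produced {output}")
```

Here each input touches one expert out of three, so only about a third of the toy model's "parameters" do any work per prediction. The total parameter count can grow without the cost of each prediction growing with it.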