Joel Hestness, Research Scientist; Eugene Vecharynski, Machine Learning Tech Lead; Stewart Hall, Member of Technical Staff; Sean Lie, Chief Hardware Architect and Co-Founder | June 22, 2022
The last few years have shown unprecedented growth in the size and complexity of natural language processing (NLP) language models. The result is that the models are so large that they must be trained using hundreds or thousands of conventional processors in super-computer-style clusters. So complex are the models, and so difficult are the compute clusters to set up, that only a very small portion of the AI community can train them. Only a handful of companies around the world have this kind of capability.
We are announcing the largest models ever trained on a single device. Using the Cerebras Software Platform (CSoft), our customers can easily train state-of-the-art GPT language models (such as GPT-3[i] and GPT-J[ii]) with up to 20 billion parameters on a single CS-2 system. Running on a single CS-2, these models take minutes to set up and users can quickly move between models with just a few keystrokes. With clusters of GPUs, this takes months of engineering work.
At Cerebras, we are making the largest NLP models simple to use for the first time. For everyone.
Challenges of scaling ML across clusters of GPUs
To understand the simplicity and elegance of the Cerebras approach, it is important to spend a few minutes understanding the complexity of the conventional alternative.
Spreading the training of a large network over a cluster of processors is a difficult distributed computation problem. This applies to large neural networks, and to all other computational problems that don’t fit on a single processor. Breaking up the compute, memory, and communication, and then distributing them across hundreds or thousands of processors, is very challenging work. A further complication is that any allocation of compute, memory, and network over a processor cluster is bespoke: the partitioning of a neural network on a particular processor cluster is unique to that neural network and that cluster hardware. A different neural network would require a different partitioning on the same cluster. And even the same neural network would require a different partitioning on a different cluster. This is well known in distributed computing, but it is something we seem to have forgotten as AI has rapidly, and by necessity, overlapped with distributed computation.
Distributing a neural network over a specific cluster of processors is bespoke because there are unique characteristics of the neural network combined with unique characteristics of each of the processors in the cluster, as well as unique characteristics of the communication network tying the processors together (Figure 1). That is to say, the characteristics of the model (size, depth, parameters, communication structure), interact with the characteristics of each of the processors (compute performance, memory capacity, memory bandwidth) and interact with the characteristics of the communication network (topology, latency, bandwidth) to determine how to distribute the neural network over the cluster.
This is painstaking work, and often takes months. And this work is not ML! It’s distributed computing work. Interestingly, what is most challenging about very large ML is often not the ML. It’s the distributed computing.
Let’s examine this in more detail. Consider a simple neural network with four layers as shown in Figure 2. Each layer is denoted by a color. Each layer does calculations and then sends the results to the next layer, which in turn uses these results in its calculations.
If a neural network, or any computational problem, fits on a single processor, life is good. There is sufficient memory for all the parameters and there is sufficient compute on the processor for the largest layers to be calculated. All that needs to be done is to put the neural network onto the processor, feed it data, and get answers. These are the good times.
Data Parallel
If a neural network fits on a single processor, it’s easy to train. With multiple processors, it’s then possible to use data parallelism to reduce the time it takes to complete training.
For data parallel training, we duplicate the entire network on Processor 1 and Processor 2, then split the data in half, as shown in Figure 3. Then we send half the data to Processor 1 and half to Processor 2, and average the results. This is called data parallel because the data is split in half and run in parallel. If done well, training should take approximately half the time with two processors as it took with one.
Despite the simplicity, achieving ideal linear scaling with the data parallel approach is not trivial. This is especially evident when the number of processors is large or when they reside in different servers of a cluster. Additionally, data parallelism boosts the effective batch size of the training run, which, if too large, hinders the neural network training convergence. It also requires a careful reconsideration of hyperparameter and optimizer choices every time the cluster configuration is changed.
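To make the mechanics concrete, here is a minimal sketch of data-parallel training with two model replicas, written in PyTorch. It is illustrative only: real data-parallel training uses a distributed framework (such as PyTorch DistributedDataParallel) that overlaps the gradient all-reduce with the backward pass, but the averaging step it performs is the same.

```python
# Minimal data-parallel sketch: two replicas of the same model, each trained
# on half of the batch, with gradients averaged before a single weight update.
import copy
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(512, 512), nn.ReLU(), nn.Linear(512, 10))
replica = copy.deepcopy(model)                 # "Processor 2" holds a full copy
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

x = torch.randn(64, 512)                       # one global batch
y = torch.randint(0, 10, (64,))
x0, x1 = x.chunk(2)                            # split the data in half
y0, y1 = y.chunk(2)

replica.load_state_dict(model.state_dict())    # keep the copies in sync

loss_fn(model(x0), y0).backward()              # "Processor 1" computes gradients
loss_fn(replica(x1), y1).backward()            # "Processor 2" computes gradients

# Average the gradients across replicas (the all-reduce step), then update.
for p, q in zip(model.parameters(), replica.parameters()):
    p.grad = (p.grad + q.grad) / 2
optimizer.step()
optimizer.zero_grad()
```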
In practice, large natural language processing models don’t fit on single GPUs. The GPU does not have enough on-chip SRAM memory for the parameters, and doesn’t have enough compute for the entire model. (Note that even the fastest off-chip DRAM is far too slow to be useful for this workload.)
What now? There are two main choices. Either some of the layers can be placed on one GPU and other layers on different GPUs, or parts of a layer can be placed on one GPU and other parts of the layer on other GPUs, which is required if one layer is too large for a GPU. These are called Pipeline Model Parallel and Tensor Model Parallel, respectively.
Pipeline Model Parallel
For Pipeline Model Parallel, the problem is broken up by allocating some of the layers to Processor 1 and some of the layers to Processor 2, as shown in Figure 4. The difficult part of this form of parallelization is that the layers run in a pipeline, i.e. the results from a given layer must traverse from Processor 1 onto Processor 2 before Processor 2 can begin. This puts tremendous pressure on network bandwidth and latency.
Additionally, bear in mind that training neural networks requires back-propagation that uses the activations generated on the forward pass. Thus, when running as a pipeline, training requires additional activation memory to keep the pipeline full. In fact, the total activation memory grows as the square of the number of pipeline stages, O(N²), where N is the number of pipeline stages. The additional memory can be prohibitive and exceed the capacity of the individual processors, especially at higher degrees of parallelization. Furthermore, this additional memory requirement is not uniformly distributed, since the earlier stages in the pipeline need to store more activations than the later stages.
Therefore, to distribute a neural network in a pipeline model parallelization approach, the user must deeply understand the computation in each layer, the amount of I/O needed between layers, and the additional non-uniform memory requirements of the pipeline. They must then find the exact mapping considering the amount of GPU compute, the I/O bandwidth between GPUs within a server, the I/O bandwidth in the network topology between servers, and the amount of memory. This is a bespoke solution for the neural network and the cluster pair.
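As a rough illustration of the mechanics, here is a toy two-stage pipeline split in PyTorch (not a production pipeline implementation). The activation transfer at the stage boundary is exactly the inter-processor I/O that makes this approach so sensitive to bandwidth and latency.

```python
# Toy two-stage pipeline split: the first layers live on one device, the rest
# on another, and activations must cross the boundary on every forward pass.
# Devices default to CPU so the sketch runs anywhere; in a real pipeline the
# stages sit on different GPUs or servers, and micro-batches are interleaved
# to keep all stages busy (which is where the extra activation memory comes from).
import torch
import torch.nn as nn

dev0 = torch.device("cpu")    # stage 1 (e.g., GPU 0)
dev1 = torch.device("cpu")    # stage 2 (e.g., GPU 1)

stage1 = nn.Sequential(nn.Linear(512, 512), nn.ReLU()).to(dev0)
stage2 = nn.Sequential(nn.Linear(512, 512), nn.ReLU(), nn.Linear(512, 10)).to(dev1)

x = torch.randn(32, 512, device=dev0)

h = stage1(x)        # compute on stage 1
h = h.to(dev1)       # activation transfer across the processor boundary
out = stage2(h)      # stage 2 cannot start until the transfer completes
```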
In summary, pipeline model parallelism is brutally complex.
It gets worse.
Tensor Model Parallel
What if even a single layer doesn’t fit on a graphics processor? Then, tensor model parallel must be used. Here, an individual layer is split across multiple processors. So, part of layer 1 is placed on Processor 1 and part of the layer 1 is placed on Processor 2, as shown in Figure 5. This adds yet another dimension of complexity, pressure on bandwidth and I/O, and it must be done manually by the user. This takes weeks or months of painstaking instrumentation of each layer, processor, and network performance. The speedup gained from this work will also be limited by fundamental properties of the model and cluster. Since tensor model parallelism requires frequent communication, at the layer level, latency of these communication operations between processors becomes a bottleneck when scaling beyond a single GPU server. That’s why, in practice, the degree of parallelism afforded by this method is usually limited to 4 or 8, which is the number of GPUs that fit in a single server. The network protocols needed to communicate between servers are just too slow.
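To see what this means mechanically, here is a toy PyTorch sketch of splitting one large linear layer into two shards (illustrative only; real implementations such as Megatron-style tensor parallelism place the shards on different GPUs and use collective operations instead of a local concatenation).

```python
# Toy tensor-parallel split of a single layer: one large linear layer is split
# column-wise into two shards, each shard computes a partial output, and the
# results are combined. In a real system this combine step is a collective
# communication that happens at every layer of every forward and backward pass.
import torch
import torch.nn as nn

full = nn.Linear(1024, 4096, bias=False)        # the layer that is "too large"

# Shard the weight matrix along the output dimension.
shard_a = nn.Linear(1024, 2048, bias=False)     # would live on processor 1
shard_b = nn.Linear(1024, 2048, bias=False)     # would live on processor 2
with torch.no_grad():
    shard_a.weight.copy_(full.weight[:2048])
    shard_b.weight.copy_(full.weight[2048:])

x = torch.randn(8, 1024)
out = torch.cat([shard_a(x), shard_b(x)], dim=-1)   # the communication step

assert torch.allclose(out, full(x), atol=1e-5)      # same result as the unsplit layer
```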
Hybrid Parallelism
Due to the size and hardware limitations, to train the largest neural networks on GPU clusters, all three techniques must be used: data parallel, pipeline model parallel, and tensor model parallel. We examined some of the largest networks ever published, looking at the number of GPUs necessary to run the neural network and the different types of parallelization necessary to train a variety of models (Figure 6).
In the “small” networks (up to 1.2B parameters), the model could be trained with data parallelism alone. Even in these simpler cases, the degree of parallelism is quite high at 64 GPUs. Since this involves distributing training across multiple physical servers, it requires complicated software to coordinate the work. To avoid a performance bottleneck, a high-performance interconnect network and very efficient reduction primitives are needed for the communication phase when gradients are aggregated after each step. There is also a limit to the batch size, especially for these smaller models, which caps the degree of data parallelism that can be leveraged without running into this communication bottleneck.
At 2.5B parameters, data parallelism alone reaches its limits, so the researchers began to split their models using tensor model parallelism. 128 GPUs were used, with 64-way data parallelism coming from splitting the batch and another factor of 2 coming from tensor model parallelism.
By 20B parameters, all three types of parallelization are necessary. This is precisely why so few companies can train very large models, and why doing so is considered noteworthy and publication-worthy.
To summarize, large NLP neural networks don’t fit on a single graphics processor. As a result, it takes months of painstaking work to measure the compute in each layer, measure the bandwidth and latency between processors, decide how to distribute the work, where to send each part of it, where to gather the intermediate results, and where to send those results for the next group of processors. This process must be done layer by layer for neural networks with billions of parameters. And when this process is complete, the configuration is unique to the specific neural network-cluster pair in question. It’s not portable. None of the work applies to different clusters, and none of the work applies to different neural networks.
This is the pain of large-scale ML training on GPU clusters.
Scaling Impacts of Extreme Parallelism
In addition to the engineering work required to distribute training over a cluster, there are fundamental limitations which result in diminishing returns. Many of these become clear when looking at the quantitative impact of scaling to more GPUs on compute, memory, and bandwidth requirements.
Bandwidth becomes a limiting factor in large clusters where the reduced compute performed by each processor is dwarfed by the time required to communicate. This is seen in tensor model parallelism, where the latency of communication between all processors is incurred at each layer boundary, and in data parallelism, where the critical batch size is reached and the time spent synchronizing weights begins to dominate. Compounding the problem is the fact that communication tends to become more costly as the number of processors grows. See the Appendix for more details, including the scaling equations that illustrate these effects.
These effects compound to result in sub-linear speedup scaling on GPUs, as shown in Figure 7. The figure shows TPU and GPU scaling on the MLPerf-Transformer benchmark.[iii] Ideally, as you increase the number of processors, the performance should increase by the same amount; that’s linear scaling. However, due to the various limitations discussed above, it’s common to see highly sub-linear scaling. In the MLPerf-Transformer example, increasing the number of processors by 64x (1024 chips on the x-axis) increases the TPU performance by a factor of only approximately 12x. That’s over 5x short of linear scaling. Not only is distributed training on GPUs extremely complex, the resulting speedup scaling is often highly inefficient.
ML Impacts of Extreme Parallelism
Even after navigating the painful process of parallelizing a large ML model across many devices, the resulting training run can still make suboptimal use of compute from the ML perspective. The goal is not simply to achieve higher training throughput. What matters most is that the throughput translates into reaching the target neural network accuracy efficiently, in the least amount of time. Due to the complexities of distributed training on GPUs, this is often not achieved in practice.
For example, Hoffmann et al. at DeepMind recently showed that OpenAI’s original approach for training GPT-3 was inefficient[iv]. They showed that by using slightly smaller models and training over more samples with longer learning rate schedules, they were able to achieve higher accuracy than GPT-3 models with the same compute budget. These slightly-smaller-but-still-very-large models are harder to train efficiently across many devices, because there is less model-level parallelism available.
Further, we can’t just use more data parallelism. The original GPT-3 training runs used a batch size of 1536 sequences spread across 256 workers, for just 6 sequences per worker. If OpenAI had used smaller per-worker batch sizes, they probably would have seen decreased utilization per worker, again because the amount of parallelism available in 6 sequences is small. If they wanted to use more devices through data parallelism, they would need to increase the total batch size, keeping the per-worker batch size the same. However, OpenAI’s own prior work shows that at a certain batch size—the “critical” batch size—increasing the total batch size further will lead to diminishing improvement in the gradient’s information[v]. As a result, extra data parallelism and compute will not lead to further increases in model accuracy.
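The arithmetic behind this constraint is simple enough to spell out. The 1,536-sequence batch and 256 workers come from the GPT-3 paper; the scaled-up numbers below are purely illustrative.

```python
# Per-worker work in the original GPT-3 data-parallel setup.
total_batch = 1536                     # sequences per training step
workers = 256
per_worker = total_batch // workers    # 6 sequences per worker

# Doubling the number of workers while keeping the per-worker batch fixed
# doubles the total batch size; once that exceeds the critical batch size,
# the extra samples add little additional gradient information per step.
more_workers = 512
scaled_total_batch = more_workers * per_worker   # 3072 sequences per step
print(per_worker, scaled_total_batch)
```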
By simplifying the parallelism structure required to train large LMs, the Cerebras architecture enables more optimal ML training configurations that lead to higher model accuracy in the same compute budget.
The Cerebras advantage
Cerebras avoids all the complexity described above. We can easily fit the largest NLP networks on a single Wafer-Scale Engine processor within a CS-2 system. As a result, we can train GPT models up to 20B parameters on a single CS-2. That includes popular variants such as GPT-3 XL 1.3B, GPT-J 6B, GPT-3 13B, and GPT-NeoX 20B. This eliminates months of engineering work and brings access to these models to companies without huge teams of distributed compute engineers. This is a huge milestone for the AI community.
We can do this for two reasons. First, we have a very big chip! Second, we have an architecture that allows us to disaggregate memory from compute, allowing us to scale model memory independent of compute. This enables us to support models with arbitrarily large parameter counts on a single system.
World’s most powerful processor chip
First, our Wafer-Scale Engine is physically 56 times larger than the largest GPU. We have 123 times more cores, 1,000 times more on-chip memory, 12,000 times more memory bandwidth, and 45,000 times more fabric (Figure 8). These are the resources that enable us to fit the largest layers of the largest neural networks onto a single wafer. In fact, on the Wafer-Scale Engine, we can fit layers 1000 times larger than the largest layer in the largest NLP network!
Weight streaming architecture
Second, our weight streaming architecture enables us to disaggregate memory for parameters from compute. Storage for model parameters is in the separate MemoryX unit while all the compute is in the CS-2 system, as shown in Figure 9. The MemoryX unit provides persistent storage for the parameters and optimizer states. It accepts weight gradients and uses stored optimizer parameters to compute weight updates between training iterations. On each training iteration, a batch of samples from the dataset is received by the CS-2, then each layer’s activations are computed in sequence as weights are streamed to the CS-2 from the MemoryX unit. Activations computed during the forward pass and intermediate activation gradients computed during the backward pass are stored in wafer memory on the CS-2. During the backward pass, the stored activation and activation gradient tensors are used to compute weight gradients, which are sent back to the MemoryX unit.
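The loop below is a conceptual, CPU-only simulation of that flow written in PyTorch. The names (`MemoryXSim`, `stream_weights`, and so on) are illustrative stand-ins rather than Cerebras APIs, and PyTorch autograd stands in for the second weight stream that happens during the real backward pass. The point is simply that parameters and optimizer state live off the compute device and are streamed in one layer at a time, while activations stay resident on it.

```python
# Conceptual simulation of weight streaming: parameters and optimizer state
# live in a separate "MemoryX" object and are streamed to the compute device
# one layer at a time; weight gradients flow back and the update is applied
# off-device. Class and method names are illustrative, not Cerebras APIs.
import torch
import torch.nn.functional as F

wafer = torch.device("cpu")                     # stand-in for the CS-2 wafer

class MemoryXSim:
    """Holds parameters and applies weight updates outside the compute device."""
    def __init__(self, layer_sizes, lr=1e-3):
        self.weights = [torch.randn(n_out, n_in) * 0.02 for n_in, n_out in layer_sizes]
        self.grads = [None] * len(self.weights)
        self.lr = lr

    def stream_weights(self, i):                # weights flow to the wafer
        return self.weights[i].detach().clone().to(wafer).requires_grad_(True)

    def receive_weight_grad(self, i, grad):     # weight gradients flow back
        self.grads[i] = grad.to("cpu")

    def apply_weight_update(self):              # the update happens in MemoryX
        for i, g in enumerate(self.grads):
            self.weights[i] -= self.lr * g

memx = MemoryXSim([(512, 512), (512, 512), (512, 10)])
x = torch.randn(32, 512, device=wafer)
y = torch.randint(0, 10, (32,), device=wafer)

# Forward pass: stream each layer's weights in turn; activations stay on-wafer.
acts, streamed = [x], []
for i in range(3):
    w = memx.stream_weights(i)
    streamed.append(w)
    h = acts[-1] @ w.T
    acts.append(torch.relu(h) if i < 2 else h)

loss = F.cross_entropy(acts[-1], y)

# Backward pass: weight gradients are sent back to MemoryX, which applies the
# update -- so the compute device never stores the full parameter set.
loss.backward()
for i, w in enumerate(streamed):
    memx.receive_weight_grad(i, w.grad)
memx.apply_weight_update()
```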
Each component of the solution is optimized for its specific role. The CS-2 is optimized for extreme floating-point compute throughput and memory bandwidth, which is ideal for processing linear algebra operations used in training. The MemoryX unit is optimized for high capacity which can be scaled separately and independently from the number of CS-2s being used. This enables a single CS-2 to support any size neural network.
This is in stark contrast to GPUs, in which memory and compute are locked together. If you have a GPU with 80GB of memory and your neural network needs 82GB, you must buy more GPUs to get more memory and take on the complexity of distributing across multiple GPUs. Nothing can be done with the single GPU. This becomes a challenge when the parameters take hundreds or thousands of gigabytes of memory, forcing you to purchase vastly more GPUs than you want or need, and to deal with the complexity of distributing over them all. In contrast, the Cerebras weight streaming architecture enables the amount of memory to be scaled up independently, since the memory is disaggregated into the external MemoryX unit. More details on our weight streaming architecture can be found in this whitepaper.
To summarize, at Cerebras we can fit the largest neural networks, and the largest layers of the largest neural networks, on the Wafer-Scale Engine. Because the Wafer-Scale Engine is the largest chip ever built, it has vastly more computational resources than any GPU. And we can fit a network with any number of parameters on the Wafer-Scale Engine because we disaggregated memory from compute, allowing memory capacity to grow independently.
Extremely easy to scale
The result is not only that the largest NLP models can be run on a single system, but that they can be set up quickly and easily debugged without worrying about complex hybrid parallelism. Running a 20B parameter model is exactly the same as running a 1B parameter model, just a change of a few numbers in a configuration file. That’s it! This is not possible on GPUs, even with sophisticated distributed software frameworks, because of the fundamental challenges of distributing the work across the small devices. For a walkthrough of training and reconfiguring large models, see our blog post here.
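As a rough illustration of what “a few numbers in a configuration file” amounts to, switching between two of the model sizes mentioned above mostly means changing the standard GPT shape parameters. The snippet below uses a plain Python dict with hypothetical key names, not the actual CSoft configuration schema; the values are the approximate published architectures of GPT-J 6B and GPT-NeoX 20B.

```python
# Hypothetical shape settings only -- not the actual CSoft config format.
gpt_j_6b = {
    "hidden_size": 4096,
    "num_hidden_layers": 28,
    "num_heads": 16,
    "max_sequence_length": 2048,
}

gpt_neox_20b = {
    "hidden_size": 6144,
    "num_hidden_layers": 44,
    "num_heads": 64,
    "max_sequence_length": 2048,
}
```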
Summary
We see no end in sight for the growth of NLP models. The complexity of delivering the compute to train these models is enormous if done with traditional GPUs. As a result, very few companies are able to train these models, and many of the models have only ever been trained on a single dataset because the process of re-training is intractable.
Our customers can now train multi-billion parameter NLP models on a single system. These large models, rather than taking months to deploy, now take minutes. Moving among different model sizes only takes a few keystrokes and there is no complex parallelism. They use a fraction of the power and space. They are accessible to organizations who don’t have large distributed systems engineering teams.
We’re making these large NLP models, and the ability to train them, available to everyone.
Appendix
The table in Figure 10 gives the scaling equations for the memory and communication requirements of the three parallelism approaches. From the equations, it becomes evident which terms limit each approach for large models, as highlighted in the last column.
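For readers without access to the figure, the qualitative form of these relations is roughly as follows. This is a hedged summary of the standard analysis in simplified notation, not a reproduction of the exact entries in Figure 10.

```latex
% Representative scaling relations (illustrative; simplified notation).
% P: parameters, N: pipeline stages, L: layers, A: activation size per layer.

% Data parallel: each worker holds all P parameters and all-reduces a full set
% of P gradients every step, so communication per worker grows with model size.
\mathrm{Comm}_{\text{data}} \propto P \quad \text{per worker per step}

% Pipeline parallel: activations are buffered to keep the pipeline full, so
% total activation memory grows quadratically in the number of stages.
\mathrm{Mem}_{\text{pipe}} \propto N^{2} \cdot A

% Tensor parallel: shards exchange partial results at every layer, so the
% (latency-bound) communication is incurred O(L) times per step.
\mathrm{Comm}_{\text{tensor}} \propto L \cdot A \quad \text{per step}
```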
References
[i] Brown et al., Language Models are Few-Shot Learners. ArXiv. July 2020. https://arxiv.org/abs/2005.14165
[ii] Komatsuzaki et al., GPT-J-6B: 6B JAX-Based Transformer https://arankomatsuzaki.wordpress.com/2021/06/04/gpt-j/
[iii] Tim Rogers and Mahmoud Khairy, “An Academic’s Attempt to Clear the Fog of the Machine Learning Accelerator War”, ACM SIGARCH, 2021 https://www.sigarch.org/an-academics-attempt-to-clear-the-fog-of-the-machine-learning-accelerator-war/
[iv] Hoffmann et al., Training Compute-Optimal Large Language Models. ArXiv. March 2022. https://arxiv.org/abs/2203.15556
[v] McCandlish et al., An Empirical Model of Large Batch Training. ArXiv. December 2018. https://arxiv.org/abs/1812.06162