Sean Lie, Co-Founder and Chief Hardware Architect | August 24, 2021
Today at the Hot Chips conference, we proudly unveiled the world’s first multi-million core AI cluster architecture! Our unique technology handles neural networks with up to an astonishing 120 trillion parameters. That’s approximately the number of synapses in the human brain! Today, to run models at only a fraction of that size, clusters of graphics processors consume acres of space, megawatts of power, and require dedicated teams to operate. We can fix that.
To unlock the potential of extreme-scale models, we realized a new approach is needed. One that addresses the challenge of scaling massive memory, compute, and communication all at once. Such a solution must be able to support the largest models of today, and scale to empower the models of tomorrow.
Building on the architectural foundations of the second-generation Cerebras Wafer-Scale Engine (WSE-2) which is at the heart of our CS-2 system, we set out to solve this challenge as we usually like to do – from the ground up, addressing the most fundamental challenges using a holistic, systems approach for extreme-scale.
We decided to take what has traditionally been the complex, intertwined problems of distributing memory, compute, and communication, and synchronizing all of them at the same time – and disaggregate them. The reason we can do this is that neural networks use memory differently for different components of model computation. We can design a purpose-built solution for each type of memory and each type of compute that the neural network needs, and as a result, untangle them and greatly simplify the scaling problem.
We call this new execution mode “weight streaming”. This mode unlocks unique flexibility, allowing independent scaling of the model size and the training speed. A single CS-2 system can support models up to 120 trillion parameters, and to speed up training, we can cluster up to 192 systems with near-linear performance scaling.
In this mode, we store the model weights in a new memory extension technology called MemoryX, and stream the weights onto the CS-2 systems – as needed – to compute each layer of the network, one layer at a time. On the backward pass, the gradients are streamed in the reverse direction back to the MemoryX where the weight update is performed in time to be used for the next iteration of training. In this topology, we also introduce an interconnect fabric technology called SwarmX which allows us to scale the number of CS-2 systems near-linearly for extreme-scale models.
In addition to scaling capacity and performance, our architecture uniquely enables vast acceleration for sparse neural networks. The AI community is actively creating new sparse models that can achieve the same accuracy with less compute. Such techniques are critical to reaching extreme scale practically, but traditional architectures cannot accelerate these sparse networks. The Cerebras hardware, on the other hand, uses fine-grained dataflow scheduling to trigger computation only for useful work. This lets us save power and achieve a 10x speedup from weight sparsity.
For researchers, this architecture is seamless: users simply compile the neural network mapping for a single CS-2 system, and the Cerebras software takes care of execution as you scale, eliminating the traditional distributed AI intricacies of memory partitioning, coordination, and synchronization across thousands of small devices.
These innovations introduced at Hot Chips continue to push the boundaries of what’s possible in AI, unlocking the incredible potential of extreme-scale models!
Join us on this journey! Click here to connect with our team.
Interested in more details about each component of this architecture? Read on!
Cerebras Weight Streaming: Disaggregating Memory and Compute
The Cerebras CS-2 system is powered by the Cerebras Wafer-Scale Engine (WSE-2), the largest chip ever made and the fastest AI processor. Purpose-built for AI work, the 7nm-based WSE-2 delivers a massive leap forward in AI compute. The WSE-2 takes up an entire silicon wafer and houses an astonishing 2.6 trillion transistors and 850,000 AI-optimized cores. By comparison, the largest graphics processing unit (GPU) has “only” 54 billion transistors, 2.55 trillion fewer than the WSE-2! The WSE-2 also has 123x more cores and 1,000x more high-performance on-chip memory than competing GPUs.
Cerebras’ weight streaming execution mode builds on the foundation of the massive size of the WSE-2. It is a new execution mode where compute and parameter storage are fully disaggregated from each other. A small parameter store can be linked with many CS-2 systems housing tens of millions of cores, or up to 2.4 petabytes of storage – enough for 120 trillion parameter models – can be allocated to a single CS-2 system.
In the weight streaming mode, the model weights are held in an off-chip storage device called MemoryX, which we’ll talk more about in a moment. The weights are streamed onto the chip where they are used to compute each layer of the neural network. On the backward pass of the neural network training, gradients are streamed out of the chip, back to the central store where they are used to update the weights.
This weight streaming technique works particularly well on the Cerebras architecture because of the WSE-2’s size. Unlike GPUs, where small on-chip memories and limited compute resources require large models to be partitioned across multiple chips, the WSE-2 can fit and execute even extremely large layers without the blocking or partitioning traditionally used to break them down. Because every layer of the model fits on-chip without partitioning, each CS-2 system can be given the same workload mapping for a neural network and perform the same computations for each layer, independently of all other CS-2 systems in the cluster. For users, this simplicity means they can scale a model from a single CS-2 system to a cluster of arbitrary size without any software changes. Not only are we transforming the scale of models that are possible, we’re transforming the DevOps experience as well.
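To make the execution flow concrete, here is a minimal Python sketch of one weight-streaming training step on a toy multilayer network. All of the names (MemoryXStore, stream_weights, and so on) are hypothetical placeholders rather than Cerebras APIs; the point is simply that weights flow onto the wafer layer by layer on the forward pass, and gradients flow back out to be applied off-chip.

```python
# Illustrative sketch of one weight-streaming training step on a toy MLP.
# All names (MemoryXStore, stream_weights, ...) are hypothetical placeholders,
# not Cerebras APIs; the arithmetic is deliberately simplified.
import numpy as np

class MemoryXStore:
    """Holds every layer's weights off-chip and performs the weight updates."""
    def __init__(self, layer_shapes, lr=1e-3):
        self.weights = [np.random.randn(*s).astype(np.float32) * 0.01
                        for s in layer_shapes]
        self.lr = lr

    def stream_weights(self, i):
        # Weights are streamed onto the wafer one layer at a time.
        return self.weights[i]

    def apply_gradient(self, i, grad):
        # Gradients stream back; the update happens off-chip, in MemoryX.
        self.weights[i] -= self.lr * grad

def training_step(store, x, target):
    # Forward pass: activations stay on the wafer, weights arrive layer by layer.
    acts = [x]
    for i in range(len(store.weights)):
        acts.append(np.maximum(acts[-1] @ store.stream_weights(i), 0.0))  # ReLU layer

    # Backward pass: gradients stream out in reverse layer order.
    delta = acts[-1] - target                       # d(loss)/d(output), squared error
    for i in reversed(range(len(store.weights))):
        delta = delta * (acts[i + 1] > 0)           # back through the ReLU
        grad_w = acts[i].T @ delta                  # this layer's weight gradient
        delta = delta @ store.stream_weights(i).T   # propagate to the previous layer
        store.apply_gradient(i, grad_w)             # update performed in MemoryX

store = MemoryXStore([(64, 64), (64, 64), (64, 10)])
training_step(store, np.random.randn(8, 64).astype(np.float32),
              np.random.randn(8, 10).astype(np.float32))
```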
Cerebras MemoryX Technology: Enabling Hundred-Trillion Parameter Models
Over the past three years, the largest AI models have grown in parameter count by three orders of magnitude, with the largest models now using 1 trillion parameters. A human-brain-scale model will employ a hundred trillion parameters, requiring approximately 2 petabytes of memory to store.
Cerebras MemoryX is the technology behind the central weight storage that enables model parameters to be stored off-chip and efficiently streamed to the CS-2 system, achieving performance as if they were on-chip. It contains both the storage for the weights and the intelligence to precisely schedule and perform weight updates to prevent dependency bottlenecks. The MemoryX architecture is elastic and designed to enable configurations ranging from 4TB to 2.4PB, supporting parameter sizes from 200 billion to 120 trillion.
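As a rough sanity check on those capacity figures, the stated configurations work out to about 20 bytes of storage per parameter, which is consistent with keeping full-precision weights plus optimizer state (for example, FP32 weights, gradients, and Adam’s two moment vectors) rather than weights alone. That per-parameter figure is an assumption on our part, not a published breakdown, but the arithmetic lines up:

```python
# Back-of-the-envelope MemoryX sizing. BYTES_PER_PARAM = 20 is an assumption
# (roughly FP32 weights + gradients + Adam moments, plus overhead), chosen
# because it reproduces the stated 4 TB and 2.4 PB endpoints.
BYTES_PER_PARAM = 20

for params in (200e9, 1e12, 100e12, 120e12):
    terabytes = params * BYTES_PER_PARAM / 1e12
    print(f"{params / 1e12:6.1f}T parameters -> {terabytes:8,.0f} TB")

#    0.2T parameters ->        4 TB   (smallest MemoryX configuration)
#  100.0T parameters ->    2,000 TB   (~2 PB, "human-brain-scale")
#  120.0T parameters ->    2,400 TB   (2.4 PB, largest configuration)
```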
Cerebras SwarmX Technology: Providing Bigger, More Efficient Clusters
The Cerebras SwarmX technology extends the boundary of AI clusters by expanding Cerebras’ on-chip fabric to off-chip. Historically, bigger AI clusters have come with a significant performance and power penalty. In compute terms, performance scales sub-linearly while power and cost scale super-linearly. As more GPUs are added to a cluster, each contributes less and less to solving the problem.
The SwarmX fabric is specially designed for weight streaming to enable efficient parallel training across CS-2 systems. Sitting between MemoryX and CS-2 systems, the SwarmX fabric broadcasts weights to, and reduces gradients from, all CS-2 systems. This makes the SwarmX fabric an active participant in the training process. The SwarmX fabric uses a tree topology to enable modular and low-overhead scaling.
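A minimal software sketch of those broadcast and reduce roles might look like the following (the names are purely illustrative, not a Cerebras interface): the same streamed weights fan out to every CS-2, each system computes gradients on its own slice of the batch, and gradients are summed pairwise up a tree so that MemoryX receives a single reduced gradient no matter how many systems participate.

```python
# Minimal sketch of the broadcast/reduce role SwarmX plays between MemoryX
# and a set of CS-2 systems. Function names are illustrative only.
import numpy as np

def swarmx_broadcast(layer_weights, num_systems):
    """Fan the same layer weights out to every CS-2 system in the cluster."""
    return [layer_weights for _ in range(num_systems)]

def swarmx_reduce(per_system_grads):
    """Sum gradients pairwise up a tree; MemoryX sees one reduced gradient."""
    grads = list(per_system_grads)
    while len(grads) > 1:                       # tree reduction, level by level
        grads = [grads[i] + grads[i + 1] if i + 1 < len(grads) else grads[i]
                 for i in range(0, len(grads), 2)]
    return grads[0]

# Each CS-2 computes gradients on its own shard of the batch (data parallelism),
# using an identical copy of the streamed weights.
w = np.random.randn(64, 64)
replicas = swarmx_broadcast(w, num_systems=4)
local_grads = [np.random.randn(*r.shape) for r in replicas]  # stand-in gradients
global_grad = swarmx_reduce(local_grads)        # what MemoryX uses for the update
```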
The Cerebras SwarmX fabric enables clusters to achieve near-linear performance scaling, meaning that 10 CS-2 systems, for example, are expected to achieve the same solution 10x faster than a single CS-2 system. The SwarmX fabric scales independently of MemoryX resources – a single MemoryX unit can be used to target any number of CS-2 systems. The SwarmX fabric is designed to scale from 2 CS-2 systems to up to 192 systems and, since each CS-2 system has 850,000 AI-optimized cores, will enable clusters of up to 163 million AI-optimized cores!
Cerebras Sparsity: Smarter Math for Reduced Time-to-Answer
The AI community has created many different algorithms to reduce the amount of computational work to reach a solution by introducing sparsity, but these algorithms cannot be accelerated on traditional architectures. The Cerebras WSE-2 handles sparsity at the silicon level, thus enabling customers to take advantage of these new algorithms, and reduce time-to-answer.
With sparsity, the premise is simple: multiplying by zero is a bad idea, especially when it consumes time and power. And yet, GPUs multiply by zero routinely. As the AI community grapples with the exponentially increasing cost to train large models, the use of sparsity and other algorithmic techniques to reduce the compute FLOPs required to train a model to state-of-the-art accuracy is increasingly important.
The Cerebras WSE-2 is based on a fine-grained dataflow architecture. Its 850,000 AI-optimized compute cores are capable of individually ignoring zeros regardless of the pattern in which they arrive. This selectable, automatic sparsity harvesting is something no traditional architecture is capable of. The dataflow scheduling and tremendous memory bandwidth unique to the Cerebras architecture enable this type of fine-grained processing to accelerate all forms of sparse neural networks, even fully unstructured weight sparsity.
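As a toy software illustration of the “never multiply by zero” principle, consider an unstructured sparse matrix-vector product in which work is triggered only for nonzero weights; the multiply-accumulate count then scales with the number of nonzeros rather than with the dense layer size. This is plain Python just to show the arithmetic being skipped; on the WSE-2 the zero-skipping is done in hardware by the dataflow scheduler.

```python
# Toy illustration of skipping multiplies by zero in an unstructured sparse
# matrix-vector product. Plain Python for clarity; on the WSE-2 this
# zero-skipping happens in hardware via dataflow scheduling.
import numpy as np

def sparse_matvec(weights, x):
    """Only nonzero weights trigger a multiply-accumulate."""
    out = np.zeros(weights.shape[0])
    rows, cols = np.nonzero(weights)            # dataflow: work only where w != 0
    for r, c in zip(rows, cols):
        out[r] += weights[r, c] * x[c]
    return out

# 90% unstructured weight sparsity -> roughly 10x fewer multiply-accumulates.
rng = np.random.default_rng(0)
w = rng.standard_normal((512, 512))
w[rng.random(w.shape) < 0.9] = 0.0
x = rng.standard_normal(512)

dense_macs = w.size
sparse_macs = np.count_nonzero(w)
print(f"MACs: dense {dense_macs}, sparse {sparse_macs} "
      f"({dense_macs / sparse_macs:.1f}x fewer)")
assert np.allclose(sparse_matvec(w, x), w @ x)
```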
Push Button Configuration of Massive AI Clusters
Large clusters have historically been plagued by setup and configuration challenges, often taking months of preparation before they are ready to run real applications. Preparing and optimizing a neural network to run on large clusters of GPUs takes yet more time. Achieving reasonable utilization on a conventional cluster takes painful, manual work from researchers, who typically need to partition the model, spreading it across many tiny compute units; manage both data-parallel and model-parallel partitions; manage memory size and memory bandwidth constraints; and deal with synchronization overheads. And this work may need to be repeated for each network and each framework. In short, the cost to experiment is high.
By bringing together weight streaming, MemoryX and SwarmX technologies, Cerebras makes the process of large cluster building push-button simple. Cerebras’ approach is not to hide distribution complexity by papering over it with software. Cerebras has instead developed a fundamentally different architecture which removes the scaling complexity altogether. Because of the size of the WSE-2, there is no need to partition the layers of a neural network across multiple CS-2 systems – even the layers of multi-trillion parameter models can be mapped to a single CS-2 system.
Unlike in GPU clusters where each graphics processor holds a different part of the neural network, each CS-2 system in a Cerebras cluster will have the same software configuration. Adding another CS-2 system changes almost nothing in the execution of the work, so running a neural network on dozens of CS-2 systems will look the same to a researcher as running on a single system. Setting up a cluster will be as easy as compiling a workload for a single machine and applying that same mapping to all the machines in the desired cluster size.
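As a purely hypothetical illustration (none of these functions are real Cerebras tooling), that workflow amounts to compiling one per-system mapping and reusing it verbatim; the only thing that changes as the cluster grows is a count.

```python
# Hypothetical illustration only -- not actual Cerebras tooling or APIs.
# The point: the per-system mapping is compiled once and reused unchanged;
# scaling out changes a single count, not the model code or its partitioning.

def compile_for_single_cs2(model_description):
    """Stand-in for compiling a network so every layer fits on one WSE-2."""
    return {"mapping": f"layer-by-layer mapping of {model_description}"}

def launch_cluster(compiled_mapping, num_cs2_systems):
    """Every system receives the identical mapping; SwarmX handles the rest."""
    return [{"system": i, **compiled_mapping} for i in range(num_cs2_systems)]

mapping = compile_for_single_cs2("gpt-style transformer")
run_small = launch_cluster(mapping, num_cs2_systems=1)    # single CS-2
run_large = launch_cluster(mapping, num_cs2_systems=32)   # same mapping, bigger cluster
```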
Cerebras weight streaming technology enables users to run neural network applications on massive clusters of CS-2 systems with the programming ease of a single system.
We started Cerebras in 2016 with a mission to build the biggest, most audacious, most powerful AI accelerator in the world. Then we doubled its performance. Now, even before you’ve all had a chance to absorb this amazing device, we’re doing it again, showing a practical path to neural networks of almost unimaginable potential. I can’t wait to see what our customers will do with these new capabilities. It’s going to be amazing.
Join us on this journey! Click here to connect with our team.
Recommended further reading: Weight Streaming whitepaper
A deeper dive into the technology of weight streaming, including a survey of existing approaches used to scale training to clusters of compute units and an exploration of each approach’s limitations in the face of giant models.