Dhiraj Mallick, VP Engineering & Business Development | August 16, 2021
AI accelerator chips have made machine learning a reality in nearly every industry. With the unprecedented pace of growth in compute demand, model size and data volume, the need for higher-performance, more efficient solutions is growing rapidly. With Moore’s Law no longer keeping up with this demand, AI accelerators must innovate at the system and algorithmic levels to satisfy the anticipated needs of AI workloads over the next several years.
Cerebras Systems has built the fastest AI accelerator in the industry, based on the largest processor ever made, and it is easy to use. The system is built around a 7nm device that contains 850,000 specialized AI compute cores on a single wafer-scale chip. This single-wafer compute engine is known as the Wafer Scale Engine 2 (WSE-2).
Cluster-Scale In a Single AI Chip
The WSE-2 is by far the largest silicon product available, with a total silicon area of 46,225mm². It uses the largest square of silicon that can be cut from a 300mm-diameter wafer. That square contains 84 die of 550mm² each, stitched together using proprietary layers of interconnect to form a continuous compute fabric. By building this interconnect on a single piece of silicon, we connect the equivalent of 84 die while significantly lowering the communication overhead and the number of physical connections within the system.
By connecting all 850,000 AI cores in this manner, users of our system (the CS-2) get an unprecedented 220Pb/s of aggregate fabric bandwidth. Our proprietary interconnect and on-silicon wires lower the communication overhead and deliver significantly better performance per watt than moving large AI workloads between discrete chips. The wafer-scale approach also provides 40GB of on-“chip” (wafer) memory, allowing intermediate results that would normally be stored off chip to be kept local, reducing access time.
Designed with Sparsity In Mind
Deep neural network computations often contain a high proportion of zeros, which creates an opportunity to reduce the number of computations. Multiplying any number by zero always yields zero, and adding zero to the accumulated result has no effect, so a multiply-accumulate operation can be skipped entirely if either of its operands is zero. Tensors containing many zeros are referred to as sparse tensors. The WSE-2 is designed to harvest the sparsity in sparse tensors and vectors. In comparison, traditional GPU architectures perform the unnecessary computations, wasting power and computational performance.
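As a rough illustration (plain Python, not Cerebras code), the sketch below shows why a zero operand lets a multiply-accumulate be skipped without changing the result:

```python
import numpy as np

def dense_dot(a, b):
    """Reference dense dot product: every multiply-accumulate is executed,
    even when one operand is zero."""
    acc = 0.0
    for x, y in zip(a, b):
        acc += x * y
    return acc

def sparsity_aware_dot(a, b):
    """Skip the multiply-accumulate whenever the activation operand is zero;
    the result is unchanged because x * y == 0 and acc + 0 == acc."""
    acc = 0.0
    macs = 0
    for x, y in zip(a, b):
        if x == 0.0:
            continue
        acc += x * y
        macs += 1
    return acc, macs

activations = np.array([0.0, 1.5, 0.0, 0.0, 2.0, 0.0, 0.0, 3.0])  # sparse input
weights = np.arange(8, dtype=float) * 0.1
value, macs = sparsity_aware_dot(activations, weights)
assert np.isclose(value, dense_dot(activations, weights))
print(f"result={value:.2f}, MACs executed: {macs} of {len(activations)}")  # 3 of 8
```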
The WSE-2 harvests sparsity by taking advantage of Cerebras’ dataflow architecture and fine-grained compute engine. Compute cores communicate back and forth with their neighbors. A core sending data filters out any zero values that would otherwise have been passed to its neighbor. Under the dataflow protocol, the receiving core never performs these unnecessary calculations; it simply skips forward to the next useful computation.
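A loose software analogy of this sender-side filtering, under the simplifying assumption that data travels as (index, value) pairs, might look like the following; it is not the actual hardware protocol:

```python
def send_nonzero(activations):
    """Sender-side filter: forward only (index, value) pairs for non-zero data,
    loosely analogous to a core filtering zeros before passing them on."""
    for i, v in enumerate(activations):
        if v != 0.0:
            yield i, v

def receive_and_accumulate(messages, weights):
    """The receiver does work only for values that actually arrive;
    zeros were never sent, so no computation is triggered for them."""
    acc = 0.0
    for i, v in messages:
        acc += v * weights[i]
    return acc

activations = [0.0, 0.7, 0.0, 0.0, 1.2, 0.0]
weights = [0.3, -0.5, 0.8, 0.1, 0.9, -0.2]
print(receive_and_accumulate(send_nonzero(activations), weights))
# Only 2 of the 6 possible multiply-accumulates are performed.
```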
Harvesting sparsity saves power and significantly improves performance. Operations such as ReLU in the forward pass and max pooling in the backward pass of training naturally introduce sparsity, and small weights that are close to zero can be rounded to zero without loss of accuracy. By appropriate use of such functions, Cerebras math kernels can exploit the WSE-2’s sparsity-harvesting features.
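The short sketch below (a hedged illustration using NumPy, with an arbitrary pruning threshold rather than any Cerebras-specific scheme) shows how ReLU and small-weight rounding produce zeros that can be harvested:

```python
import numpy as np

rng = np.random.default_rng(0)

# ReLU in the forward pass: roughly half of zero-mean pre-activations
# become exact zeros, which a sparsity-harvesting engine can skip.
pre_activations = rng.standard_normal(10_000)
activations = np.maximum(pre_activations, 0.0)
activation_sparsity = np.mean(activations == 0.0)

# Rounding small-magnitude weights to zero (a simple threshold chosen here
# for illustration) adds further sparsity.
weights = rng.standard_normal(10_000) * 0.1
threshold = 0.05
pruned_weights = np.where(np.abs(weights) < threshold, 0.0, weights)
weight_sparsity = np.mean(pruned_weights == 0.0)

print(f"activation sparsity after ReLU: {activation_sparsity:.0%}")   # ~50%
print(f"weight sparsity after thresholding: {weight_sparsity:.0%}")   # ~38%
```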
Pushing the Limits of AI On-Chip Memory
The WSE-2’s 40 gigabytes of on-chip memory are divided into 48kB sub-arrays, one for each of the 850,000 compute cores. This local storage is sufficient to hold reusable activations, weights, intermediate results and program code. The total memory bandwidth on the WSE-2 is 20 Petabytes/sec, orders of magnitude more than could be achieved with typical off-chip memory architectures. This close coupling of memory and compute keeps data as local as possible to the computing engine, driving up utilization and performance, and significantly reduces the latency overhead of moving data between caches and off-chip memories.
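A quick back-of-the-envelope check of those figures (the exact unit convention for the 48kB sub-arrays is an assumption on our part; either way it lands near 40GB):

```python
# Sanity check on the per-core memory adding up to the quoted on-chip total.
cores = 850_000
binary_estimate = cores * 48 * 1024      # 48 KiB per core
decimal_estimate = cores * 48 * 1000     # 48 kB per core

print(f"binary-unit estimate:  {binary_estimate / 1e9:.1f} GB")   # ≈ 41.8 GB
print(f"decimal-unit estimate: {decimal_estimate / 1e9:.1f} GB")  # ≈ 40.8 GB
```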
The power that would otherwise be spent moving data on and off chip is also saved, so on-chip memory contributes significantly to the performance-per-watt advantage of the WSE-2. Our massive on-wafer memory bandwidth also enables full performance at all BLAS levels. While GPUs are typically used for matrix-matrix (Level 3) operations, our engine is also optimized for matrix-vector (Level 2) and vector-vector (Level 1) operations. This gives us a significant performance advantage in both training and real-time inference.
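To see why memory bandwidth matters so much for the lower BLAS levels, here is a rough, illustrative estimate of arithmetic intensity for matrix-matrix versus matrix-vector work; the operand size and fp16 assumption are ours, not a Cerebras specification:

```python
# Rough arithmetic-intensity estimate (FLOPs per byte of operand data) for
# two BLAS levels, assuming fp16 operands (2 bytes each). Illustrative only;
# real kernels, caching and data reuse differ.
BYTES_PER_ELEMENT = 2
n = 4096

# Level 3, matrix-matrix (GEMM, n x n): ~2n^3 FLOPs over ~3n^2 operands.
gemm_intensity = (2 * n**3) / (3 * n**2 * BYTES_PER_ELEMENT)

# Level 2, matrix-vector (GEMV): ~2n^2 FLOPs over ~(n^2 + 2n) operands.
gemv_intensity = (2 * n**2) / ((n**2 + 2 * n) * BYTES_PER_ELEMENT)

print(f"GEMM: ~{gemm_intensity:,.0f} FLOPs/byte  (compute-bound)")
print(f"GEMV: ~{gemv_intensity:.2f} FLOPs/byte     (memory-bandwidth-bound)")
```

At roughly one FLOP per byte, matrix-vector and vector-vector work is limited by memory bandwidth rather than compute on conventional architectures, which is why the 20 Petabytes/sec of on-wafer memory bandwidth translates into full performance at those levels.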
Built for Faster Time-to-Solution
The fabric between compute cores is uniform across the entire 46,225mm² of the WSE-2. Each core has links to its north, east, south and west neighbors, and the fabric is continuous across die boundaries. This uniformity is important for software: unlike with traditional AI chips, kernel programmers and data scientists do not need to consider where on the chip their code will be placed. Because fabric bandwidth is uniform between all compute cores, user code does not have to be optimized for its placement on the chip, which significantly shortens the user’s time-to-solution. The aggregate fabric bandwidth of 220Pb/s is orders of magnitude larger than would be achievable with off-chip interfaces. For comparison, it is equivalent to over 2 million 100Gb Ethernet links.
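The Ethernet comparison is simple to verify:

```python
# Sanity check on the fabric-bandwidth comparison quoted above.
fabric_bandwidth_gbps = 220e6      # 220 Pb/s expressed in Gb/s
link_gbps = 100                    # a single 100 Gb Ethernet link

print(f"≈ {fabric_bandwidth_gbps / link_gbps / 1e6:.1f} million links")  # ≈ 2.2 million
```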
On-wafer wires are significantly more power- and latency-efficient than high-speed external interfaces. The massive on-wafer bandwidth lets us map a single problem to the full wafer, including physically mapping a single layer across multiple die and mapping multiple layers across the entire wafer. The Cerebras architecture can achieve very high utilization even on large matrices, and sustained throughput does not drop off with increasing model size, as it does on today’s machines. Our architecture allows us to almost perfectly overlap compute and communication, and we are far less susceptible to data-movement overheads.
The Cerebras Advantage
In summary, Cerebras’ WSE-2 delivers unprecedented levels of computation, memory and interconnect bandwidth on a single, wafer-scale piece of silicon. Sparsity harvesting further maximizes its computational capability. The outcome is enormous performance in an integrated chip without bottlenecks, in which every node is programmable and independent of the others. With this revolutionary approach to AI, you get to reduce the cost of curiosity.
The net result of our innovation to date is unmatched utilization, performance levels and scaling properties that were previously unthinkable. And we’re just getting started — we have an exciting roadmap of Wafer Scale Engines that will deliver even more improvements over our market-leading WSE-2.
Interested to learn more? Sign up for a demo!
At the Hot Chips 33 conference, our co-founder, Sean Lie, unveiled our exciting new weight streaming technology, which extends the Cerebras architecture to extreme-scale AI models. Learn more here.