Introduction
Scientific computing, particularly in fields like seismic imaging, weather forecasting, and computational fluid dynamics, relies heavily on stencil computations. While stencil computations are conceptually simple, achieving high performance across different hardware architectures is challenging. Researchers at Rice University (R. Sai and J. Mellor-Crummey) and TotalEnergies (M. Araya-Polo) addressed these challenges by developing StencilPy, a portable, high-performance code generator for stencil computations on current CPU, GPU, and wafer-scale platforms.
This research report highlights StencilPy’s performance across various architectures, including AMD CPUs, Nvidia GPUs, and the Cerebras CS-2 system. StencilPy reduces the amount of code required on a Cerebras CS-2 for stencil computations by a factor of 7, achieving impressive results with a high-level, easy-to-use, domain-specific interface.
Their results show that the Cerebras CS-2 delivers unprecedented speedups for stencil computations compared to other leading hardware platforms. Specifically, the Cerebras CS-2 was:
- 95x faster than CUDA on an NVIDIA H100
- 292x faster than HIP on an AMD MI210
- 570x faster than OpenMP on an AMD Genoa
The Unique Architecture of Cerebras’s Wafer-Scale Systems
The Cerebras CS-2 and its successor, the CS-3, are designed for high-performance computing tasks. Each is built around a Wafer-Scale Engine (WSE), the largest commercially available processor in existence. Using wafer-scale integration, the Cerebras WSE packs an entire wafer's worth of transistors into a single, massive processing chip. The WSE is a wafer-scale MIMD machine: an SRAM-based, near-memory architecture with 900,000 AI cores, 44 gigabytes of on-chip memory, and 21 petabytes per second of memory bandwidth. These computational resources are organized as a fully distributed array of cores, each with its own dedicated memory. The massive memory bandwidth created by this design, in combination with the dataflow architecture, offers key advantages for scientific computing, especially stencil computations.
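Because each core owns a dedicated slice of memory, a simulation grid must be partitioned into per-core tiles before any stencil sweep can run. A minimal sketch of such an even block decomposition (`tile_ranges` is a hypothetical helper for illustration, not part of the Cerebras SDK):

```python
def tile_ranges(n, p):
    """Split n grid points as evenly as possible across p processing
    elements; returns a list of (start, stop) half-open index ranges."""
    base, extra = divmod(n, p)
    ranges, start = [], 0
    for i in range(p):
        # The first `extra` PEs take one additional point each.
        stop = start + base + (1 if i < extra else 0)
        ranges.append((start, stop))
        start = stop
    return ranges

# On a 2D fabric of PEs, each PE owns one tile per fabric dimension,
# e.g.: tiles_x = tile_ranges(nx, px); tiles_y = tile_ranges(ny, py)
```

Neighboring tiles then exchange halo values over the on-wafer interconnect during each sweep.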
Rice University and TotalEnergies' experimental work was performed on the Cerebras CS-2, though the work applies equally well to the CS-3. The CS-3 is based on the third-generation wafer-scale processor, the WSE-3, which increases performance and provides additional features.
The WSE’s enormous on-chip memory eliminates the frequent off-chip data transfers that bottleneck traditional architectures. All data is stored in fast on-wafer SRAM immediately adjacent to the computing cores (memory accesses take a single clock cycle), eliminating the need for a cache hierarchy. The WSE’s high-bandwidth interconnect moves data efficiently between computing cores, which is crucial for stencil computation performance. The WSE-3’s architecture allows for massive parallelism with up to 900,000 AI-optimized computing cores.
This integration of memory, interconnect, and processing on a single silicon wafer minimizes latency and maximizes throughput. It is particularly well-suited for computational tasks that make constant parallel references to all elements of a large data set. By placing memory on the same silicon as processing, the WSE provides the low-latency, extreme bandwidth data access that stencil computations need, without requiring large amounts of data reuse to reach maximum efficiency—a key advantage over traditional memory hierarchies.
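This bandwidth argument can be made concrete with a roofline-style estimate: a stencil sweep performs only a handful of FLOPs per byte it streams, so on most machines attainable throughput is capped by memory bandwidth rather than compute peak. A sketch with illustrative placeholder figures (none of these are measured numbers for any machine discussed here):

```python
# Roofline-style estimate of why stencil sweeps are bandwidth-bound.
# All figures below are illustrative placeholders, not measured numbers.

FLOPS_PER_POINT = 2 * 25   # ~one multiply-add per tap of a 25-point stencil
BYTES_PER_POINT = 2 * 4    # read + write one fp32 value per point, assuming
                           # perfect on-chip reuse of neighboring values

INTENSITY = FLOPS_PER_POINT / BYTES_PER_POINT   # FLOPs per byte moved

def attainable_flops(peak_flops, peak_bandwidth):
    """Attainable FLOP/s under the simple roofline model: the lesser of
    the compute peak and (bandwidth * arithmetic intensity)."""
    return min(peak_flops, INTENSITY * peak_bandwidth)
```

With an intensity of only a few FLOPs per byte, a machine whose bandwidth scales with its compute, as the WSE's does, keeps the stencil on the compute-bound side of the roofline far longer than a cache-hierarchy design can.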
StencilPy and CSL: Optimizing Software for Cerebras Hardware
To fully leverage the WSE’s hardware capabilities, Cerebras developed the Cerebras Software Language (CSL). CSL allows users to extract maximum performance from the hardware by providing programmatic access to the WSE’s unique features. StencilPy builds on top of CSL, offering a high-level, domain-specific language for defining stencil computations. This integration allows developers to write stencil computations in a concise, high-level syntax, which StencilPy then translates into highly optimized CSL code for execution on the WSE.
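As an illustration of the kind of high-level specification involved (a toy sketch, not StencilPy's actual API), a stencil can be captured once as offset/coefficient pairs and handed to a backend; a NumPy backend stands in here for the CSL code generator:

```python
import numpy as np

def make_stencil(taps):
    """taps: dict mapping a 2D offset (di, dj) to its coefficient.
    Returns a function that applies the stencil to the interior of a
    2D array, leaving a halo of the stencil radius untouched."""
    r = max(max(abs(di), abs(dj)) for di, dj in taps)

    def apply(u):
        out = np.zeros_like(u)
        ni, nj = u.shape
        # Shifted interior view of u for a given offset.
        view = lambda di, dj: u[r + di:ni - r + di, r + dj:nj - r + dj]
        out[r:-r, r:-r] = sum(c * view(di, dj)
                              for (di, dj), c in taps.items())
        return out

    return apply

# Classic 5-point Laplacian, declared as data rather than loops:
laplacian5 = make_stencil({(0, 0): -4.0, (1, 0): 1.0, (-1, 0): 1.0,
                           (0, 1): 1.0, (0, -1): 1.0})
```

The point of the DSL approach is that the declaration (`taps`) stays the same while the backend swaps: StencilPy lowers an analogous description to optimized CSL for the WSE instead of to NumPy slices.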
Stencil point index pattern used in DFIR and CSL code generation.
StencilPy introduces a new data-flow intermediate representation (DFIR). Its CSL backend generates router configurations, handles communication switching and scheduling, and produces vectorized code for a variety of 3D stencils.
The framework’s modular design enables easy configuration and extension, allowing researchers to adapt it to various stencil computation patterns and optimize performance across different hardware platforms.
The star-shaped 25-point stencil used in StencilPy's performance evaluation on the CS-2
The StencilPy framework’s evaluation focused on a 25-point star-shaped stencil, used to solve the hyperbolic wave equation that governs seismic acoustic wave propagation. This test case is significant for scientific computing because it represents a high-order stencil, involving a larger number of neighboring points in its computation than lower-order stencils. High-order stencils are widely used in applications such as seismic imaging and weather forecasting, where they provide more accurate simulations by approximating the underlying differential equations more precisely.
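Concretely, the 25-point star stencil couples each grid point to four neighbors in each direction along all three axes (a radius-4, 8th-order scheme: 3 axes × 8 neighbors + 1 center = 25 points). A minimal NumPy sketch of one application of such a stencil, using the standard 8th-order central-difference coefficients (an illustration, not the paper's implementation):

```python
import numpy as np

# Standard 8th-order central-difference coefficients for d2/dx2.
C = [-205.0 / 72.0, 8.0 / 5.0, -1.0 / 5.0, 8.0 / 315.0, -1.0 / 560.0]
R = 4  # stencil radius

def star25(u):
    """One application of the 25-point star stencil to the interior of a
    3D array; a halo of width R is left untouched (zero in the output)."""
    out = np.zeros_like(u)
    core = (slice(R, -R),) * 3
    acc = 3 * C[0] * u[core]          # center tap, counted once per axis
    for axis in range(3):
        for k in range(1, R + 1):
            lo = [slice(R, -R)] * 3   # interior shifted by -k along axis
            hi = [slice(R, -R)] * 3   # interior shifted by +k along axis
            lo[axis] = slice(R - k, u.shape[axis] - R - k)
            hi[axis] = slice(R + k, u.shape[axis] - R + k)
            acc = acc + C[k] * (u[tuple(lo)] + u[tuple(hi)])
    out[core] = acc
    return out
```

Each output point touches 25 inputs spread along three axes, which is exactly what makes high-order stencils communication-intensive on distributed hardware.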
Efficiently handling such a stencil requires significant computational power and memory bandwidth, where the WSE’s unique architecture provides a substantial advantage. The evaluation showed that the CSL backend on the Cerebras WSE achieved a 95x speedup compared to CUDA on the NVIDIA H100 GPU, a 292x speedup over HIP on the AMD MI210 GPU, and a 570x speedup over OpenMP on the AMD Genoa CPU. These performance gains demonstrate the WSE’s exceptional capability in handling complex stencil computations.
Key Differentiators of Cerebras for Scientific Computing
Several key factors contribute to Cerebras’ superior performance in scientific computing. The WSE’s massive parallelism, combined with its ability to maintain uniform computation patterns across its vast array of computing cores, makes it ideally suited for stencil computations. The system efficiently handles high-order stencils, which are particularly challenging for traditional architectures due to their communication intensity. Moreover, the WSE demonstrates excellent scalability for large problem sizes, a critical factor in real-world scientific simulations.
Additionally, StencilPy reduced the amount of code required for stencil computations on the WSE by a factor of 7 compared to hand-optimized implementations. This reduction in code complexity improves developer productivity, enhances code maintainability, and reduces the potential for errors.
In Summary
There are appealing directions for further optimizing and expanding the StencilPy framework on the Cerebras platform. The CSL backend can be optimized further, yielding improved performance. Ongoing work to expand support for larger data grids on the WSE can open new possibilities for large-scale scientific simulations.
The Cerebras WSE architecture, powering Cerebras CS-2 and CS-3 systems, coupled with the StencilPy framework, promises a breakthrough in scientific computing. Its unparalleled performance in stencil computations, combined with reduced code complexity and excellent scalability, positions the WSE as a game-changing platform for researchers and developers in fields requiring high-performance scientific computing. As research and development in this area continue, we can expect to see even more groundbreaking applications of this technology.
To learn more about StencilPy, please read the paper: https://arxiv.org/pdf/2309.04671
To access our SDK, please sign up here: https://cerebras.ai/developers/sdk-request/
To read about our recent SDK research: https://cerebras.ai/blog/supercharge-your-hpc-research-with-the-cerebras-sdk