Matthias Cremon, PhD, Member of Technical Staff | April 13, 2022
Stencil algorithms are at the core of many High-Performance Computing (HPC) applications. They are used to solve Partial Differential Equations (PDEs) in fields such as fluid mechanics, weather forecasting, and seismic imaging.
One of the main characteristics of stencil algorithms, especially the high-order schemes discussed later in this post, is that the computation touches every value stored in memory but performs only a handful of arithmetic operations on each one. In other words, each operation requires a lot of input data yet spends very little time computing the result. Additionally, stencil algorithms access input data in a nearest-neighbor pattern that does not readily translate into reads of large contiguous chunks of memory from DRAM.
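To make that argument concrete, here is a minimal sketch in C of a one-dimensional eighth-order stencil (the coefficients `c` stand in for generic finite-difference weights; this is an illustration, not the paper's CSL code):

```c
#include <stddef.h>

/* A minimal 1D eighth-order stencil: each output point reads its four
 * neighbors on each side plus itself (9 values) and performs 13
 * floating-point operations (5 multiplies, 8 adds). Even with perfect
 * cache reuse, only about 8 bytes of memory traffic per point (one
 * float read, one written) back those 13 FLOPs -- an arithmetic
 * intensity well under 2 FLOP/byte, far too low to keep a modern
 * processor's arithmetic units busy. */
void stencil_1d(const float *in, float *out, size_t n, const float c[5])
{
    for (size_t i = 4; i + 4 < n; i++) {
        out[i] = c[0] * in[i]
               + c[1] * (in[i - 1] + in[i + 1])
               + c[2] * (in[i - 2] + in[i + 2])
               + c[3] * (in[i - 3] + in[i + 3])
               + c[4] * (in[i - 4] + in[i + 4]);
    }
}
```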
Such problems are known to be poorly suited to hierarchical memory architectures, a category that includes central processing units (CPUs), graphics processing units (GPUs), and, by extension, clusters built from them. Those architectures excel at computation-heavy workloads, such as dense linear algebra or graphics rendering, where a relatively large number of floating-point operations is performed for each element of data read.
Stencil algorithms, on the other hand, tend to be less computation intensive and are often memory-bound on traditional architectures: the maximum performance is limited by the speed at which data can be transferred and accessed from memory. A few consequences of being memory-bound are:
- Increasing the clock speed of the processing unit will not yield any improvement.
- Scaling issues arise when attempting to solve the problem by adding more computing power (see next paragraph).
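Both points can be made precise with a simple bound (an illustrative model, not taken from the paper). If a kernel moves b bytes and performs f floating-point operations per grid point over N points, then on a device with memory bandwidth B and peak compute P the runtime satisfies

```latex
T \;\ge\; \max\!\left(\frac{N b}{B},\ \frac{N f}{P}\right),
\qquad
\text{memory-bound} \iff \frac{f}{b} < \frac{P}{B}.
```

Whenever the arithmetic intensity f/b sits below the machine balance P/B, the memory term dominates: raising the clock speed increases P but leaves the runtime unchanged.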
The efficiency of data transfers between processing units is therefore the most important factor in the performance of memory-bound applications. Computational power is routinely increased by coupling multiple devices together over an interconnect, but moving data from device to device introduces a new bottleneck: the interconnect is far slower than each device's local memory bandwidth.
However, there is another solution: embrace the data-transfer requirements, and write the algorithm in a way that can take full advantage of the bandwidth of Cerebras’ hardware.
The work done by Mathias Jacquelin from our Kernel team, in collaboration with TotalEnergies’ Mauricio Araya-Polo and Jie Meng, recently shared on arXiv, presents a novel way to implement a stencil algorithm on the Cerebras CS-2 System, which is powered by our Wafer-Scale Engine (WSE), packing 850,000 cores onto a single piece of silicon. The algorithm was written in the Cerebras Software Language (CSL), which is part of the Cerebras Software Development Kit. The extremely large memory bandwidth – a total of 20 petabytes/second – of the WSE, paired with highly efficient neighbor-to-neighbor communication and a clever implementation of the algorithm, combine to produce impressive results.
The test problem is a published benchmark case (Minimod) designed by TotalEnergies to evaluate the performance of new hardware solutions. The subject of this work is the acoustic isotropic kernel in a constant-density domain. The governing equation is discretized with a finite-difference (FD) scheme and solved using a 25-point stencil: every point in the discretized space reads from its four nearest neighbors in each direction along all three dimensions (24 neighbors plus the point itself).
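To illustrate what one such update looks like (a sketch only; the coefficients and data layout are placeholders, not the Minimod reference implementation), here is the interior of the domain in C:

```c
#include <stddef.h>

/* One interior update of an 8th-order (25-point) 3D stencil: four
 * neighbors on each side along x, y, and z, plus the center point.
 * u is the current wavefield, stored as a flat nx*ny*nz array. The
 * coefficient arrays cx, cy, cz (5 entries each) are placeholders
 * for the finite-difference weights. Callers must keep (i, j, k) at
 * least 4 points away from every domain boundary. */
static inline float apply_stencil(const float *u,
                                  size_t i, size_t j, size_t k,
                                  size_t ny, size_t nz,
                                  const float cx[5], const float cy[5],
                                  const float cz[5])
{
    const size_t sx = ny * nz, sy = nz, sz = 1;  /* strides per axis */
    const size_t c = i * sx + j * sy + k;        /* flattened index  */
    float v = (cx[0] + cy[0] + cz[0]) * u[c];    /* center point     */
    for (size_t r = 1; r <= 4; r++) {
        v += cx[r] * (u[c - r * sx] + u[c + r * sx]);  /* x neighbors */
        v += cy[r] * (u[c - r * sy] + u[c + r * sy]);  /* y neighbors */
        v += cz[r] * (u[c - r * sz] + u[c + r * sz]);  /* z neighbors */
    }
    return v;
}
```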
The Wafer-Scale Engine (WSE) powering the CS-2 can be seen, for the purposes of this post, as an on-wafer, fully distributed-memory machine. The approach relies on tailor-designed, localized broadcast patterns that concurrently send, receive, and compute, all at the hardware level. Moving data between neighboring Processing Elements (PEs) can then be done extremely efficiently. The 3D domain is mapped onto the 2D grid of PEs by simply collapsing the third dimension.
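One way to picture the resulting data layout (a sketch with hypothetical names and sizes, not the CSL source): each PE owns the z-column of the grid point it is assigned, so the z part of the stencil stays in local memory and only the x/y parts cross the fabric.

```c
#define NZ 1000  /* depth of the local z-column; illustrative size */

/* Each PE at fabric coordinates (px, py) owns the entire z-column of
 * the wavefield at grid point (x, y) = (px, py). The z-direction part
 * of the stencil is a purely local loop over u[0..NZ-1], while the
 * x/y parts need only ghost columns from the four nearest fabric
 * neighbors on each side, delivered by the localized broadcasts. */
typedef struct {
    float u[NZ];          /* this PE's slice of the wavefield        */
    float halo_x[8][NZ];  /* ghost columns: 4 PEs left, 4 PEs right  */
    float halo_y[8][NZ];  /* ghost columns: 4 PEs below, 4 PEs above */
} pe_state;
```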
The comparisons described in the paper are done at the accelerator level (i.e., one CS-2 versus one A100), ignoring any communication with the host. The A100s in TotalEnergies' cluster have 40 GB of on-device RAM. The first test is a weak scaling study: the problem size per processing element is held fixed, so the number of processing elements grows along with the overall problem size. The test problem runs for 1,000 time steps.
| nx | ny | nz | WSE-2 Throughput (Gcell/s) | WSE-2 Time (s) | A100 Time (s) |
|----|----|----|----------------------------|----------------|---------------|
| 200 | 200 | 1000 | 534 | 0.075 | 0.79 |
| 400 | 400 | 1000 | 2,098 | 0.076 | 3.58 |
| 600 | 600 | 1000 | 4,732 | 0.076 | 8.00 |
| 755 | 900 | 1000 | 8,922 | 0.076 | 15.51 |
(nx, ny, and nz are the number of cells in the x, y, and z directions.)
As can be seen, the time taken by the WSE-2 is essentially constant regardless of the size of the problem, the clear signature of compute-bound behavior. For the largest size shown here, the WSE-2 outperforms the A100 by more than 200x. The weak scaling efficiency of the WSE-2 is virtually perfect, greater than 98% for all sizes. To seasoned HPC practitioners, both of these results are astonishing.
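That efficiency can be read straight off the table. Taking the smallest configuration as the baseline and using the common definition of weak scaling efficiency as the ratio of baseline runtime to scaled runtime:

```latex
\eta_{\text{weak}}
\;=\; \frac{T_{200 \times 200 \times 1000}}{T_{755 \times 900 \times 1000}}
\;=\; \frac{0.075\ \mathrm{s}}{0.076\ \mathrm{s}}
\;\approx\; 98.7\%.
```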
A roofline model analysis is also carried out and confirms that the implementation is indeed compute-bound. The total throughput of the WSE-2 reaches 503 TFLOPS, a remarkable figure for a single device.
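In the roofline model, attainable performance is capped by the lesser of peak compute and the product of arithmetic intensity I and memory bandwidth B:

```latex
P_{\text{attainable}} \;=\; \min\!\left(P_{\text{peak}},\; I \cdot B\right).
```

A back-of-the-envelope reading of the figures quoted above: with B = 20 PB/s, sustaining 503 TFLOPS requires an intensity of only about 503 x 10^12 / (20 x 10^15) ≈ 0.025 FLOP/byte, so even a low-intensity stencil lands on the compute-bound side of the WSE-2's roofline.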
The conclusions from this work are very promising for HPC applications on the WSE-2. The authors are now pursuing more complex applications, both further stencil-based methods and hybrids of stencil computation with Machine Learning (ML), especially given the WSE-2's already proven capability for ML workloads. We can't wait to report on those results.
Learn more
TotalEnergies and Cerebras: Accelerating into a Multi-Energy Future (blog post)
Powering Extreme-Scale HPC with Cerebras Wafer-Scale Accelerators (white paper)
The Cerebras Software Development Kit: A Technical Overview (white paper)
Massively scalable stencil algorithm (Mathias Jacquelin, Mauricio Araya-Polo, Jie Meng, Submitted to SuperComputing 2022)
To schedule a demo, click here.