Cerebras Systems has partnered with the National Energy Technology Laboratory (NETL) since 2019 to push the boundaries of physical simulation. NETL has previously published work on the WFA, a high-level Python interface for programming the Cerebras Wafer-Scale Engine (WSE). Last year, with the Pittsburgh Supercomputing Center (PSC), NETL ran near real-time computational fluid dynamics (CFD) simulations on PSC’s CS-2 powered Neocortex system. In their latest work, NETL has investigated the potential of Cerebras’s wafer-scale architecture for materials modeling, using the venerable Ising model. Using the CS-2, their work achieves an incredible 88x performance over a highly optimized CUDA code running on an NVIDIA H100.
The Ising Model
The Ising model is a fundamental tool in physics for studying natural phenomena known as phase transitions (a state change from liquid to solid, or the alignment of magnetic dipoles in a material, for instance). In particular, the Ising model allows researchers to study how ferromagnetism, the property which gives materials such as iron permanent magnetism, emerges.
The Ising model consists of a simple material on a lattice, in which each site on that lattice can take one of two “spins”: +1 or -1. One can think of these spins as corresponding to magnetic dipole moments–a material in which spins are aligned will display ferromagnetic behavior.
The system evolves over time, as the spin of one grid point can influence the spin of its neighbors. The energy of the system is given by:
where i and j are sites, Σ<i,j> is a sum over all pairs of nearest neighbors, σi is the spin at site i taking on the value either -1 or +1, and J is a coupling strength.
Notice that if J > 0, then H will be minimized when neighbors have the same spin, i.e., the neighbors both have spin +1 or both have spin -1. In that case, σi σj will be positive, and thus H will be more negative. If J < 0, then H will be minimized when neighbors have different spins.
As the system evolves over time to minimize its energy, structure may emerge in the global system. For instance, in the J > 0 case, the spins, which start out with random orientations, will try to increasingly become aligned in one direction, and the system will begin to exhibit ferromagnetic behavior. However, thermodynamic fluctuations can also cause spins to flip anti-parallel and increase system energy. The rate of these thermodynamic fluctuations is a function of the system’s temperature. If the temperature is low, then these anti-parallel flips will be rare, and the system will exhibit ferromagnetic behavior. If the temperature is high, then these anti-parallel flips are common, and the system remains disordered.
The temperature that sits at the boundary of ferromagnetic and disordered behavior is called the critical temperature, or Tc. This is the temperature at which the phase transition occurs: as the temperature of a system is lowered, the system will begin exhibiting ferromagnetic behavior as it passes this temperature. One can think of this critical temperature like the freezing point of water: as the temperature of water decreases, the water undergoes a phase transition from liquid to solid as it passes 0°C.
A Monte Carlo algorithm is typically used to evolve the material. Given an initial configuration of spins, a site is chosen at random. If flipping the spin of that site decreases the material’s overall energy, then it is flipped. If flipping that spin does not decrease the energy, then it is flipped based on a probability criterion rand()<e–βΔΕ ,where ΔΕ is the change in energy of the system from the flip, and β is 1/(kBT), where kB is Boltzmann’s constant and T is the temperature.
The model also typically has periodic boundary conditions: a site on the bottom of the lattice will treat the site at the top of its column as its southern neighbor, and so on.
The Cerebras WSE
The Cerebras WSE is a wafer-parallel compute accelerator, containing hundreds of thousands of independent processing elements (PEs). The WSE-2 chip has 850,000 cores, and the WSE-3 chip features a massive 900,000 PEs. The PEs are interconnected by communication links into a two-dimensional rectangular mesh on one single silicon wafer. Each PE has its own 48 kB of memory, containing both instructions and data. 32-bit messages, called wavelets, can be sent to or received by neighboring PEs in just a couple clock cycles. Wavelets travel the fabric along a virtual channel, called a color. For each color used for incoming wavelets, a chunk of executable code known as a task may be activated by its arrival. Thus, the PE has dataflow characteristics: a program’s state is determined by the flow of data.
The extremely low latency fabric connecting the PEs has a massive aggregate peak fabric bandwidth of around 220 Petabits per second. Additionally, the 48 kB of memory owned by each PE is accessible in a single cycle. More specifically, each PE can read 128 bits from and write 64 bits to memory each cycle, resulting in an aggregate peak memory bandwidth of more than 20 Petabytes per second. Wafer-scale computing means not only unprecedented scaling, but unprecedented memory bandwidth, making more algorithms compute-bound than ever. The single-cycle memory accesses and ultra-fast fabric presents a totally new paradigm for HPC applications.
Implementing the Ising Model
A common strategy for implementing the Ising model is the checkerboard decomposition. We can consider the lattice of spins as a checkerboard, in which neighboring spins are assigned a different color. We can use this checkerboard pattern to determine which sites we choose to update during a step. We update all the sites on one color, and then update all the sites on the other color. This produces a very natural way to parallelize updates. We can update all the red sites in parallel, based only on the values of the blue sites. Similarly, we can update all the blue sites in parallel, based only on the values of the red sites.
The team at NETL implemented the Ising model for the WSE-2 using the WFA toolset. To efficiently utilize the massive WSE, the team at NETL implemented the checkerboard decomposition of the Ising model on a single column of PEs. One axis of spins in the 2D Ising model was distributed along this column, while the other axis was held in the PE’s memory. Because each simulation uses only a single column of PEs, many simulations can be run simultaneously on different columns. A single 16-bit word holds 16 spins. Additionally, the implementation folds the domain across this column, so that the boundaries of the lattice are only one PE away: this allows the periodic boundary conditions to be computed without communicating across the entire length of the wafer.
Ising model layout on the WSE. One axis of spins is distributed along a column of PEs, while the other axis is stored in the PE’s memory. Different simulations can be run in parallel on different columns of PEs. The number in each box refers to the spin index. Red spins are updated in parallel based on the values of the blue spins, and vice versa.
NETL’s Results
To compare performance between the WSE and NVIDIA GPUs, the team considered two metrics. The first, flip rate, measures the number of spin flips achieved on device per unit of time. The second, productivity, measures the cumulative device time per Monte Carlo iteration to run a fixed number of simulations.
The WSE running 754 simultaneous simulations achieved a maximum flip rate of 61,854 flips/ns. This compared to 420.8 flips/ns on a V100 and 880.6 flips/ns on an H100 running highly optimized CUDA code, speedups of 148x and 70x, respectively. Comparing productivity, WSE achieved a device time per iteration 88x faster than the H100 when running 754 total simulations on each device.
(a) Flip rate in flips/ns measured on the WSE and on V100, A100, and H100 GPUs. The fold number for the WSE simulations refers to how many times the domain is folded, a strategy to keep the periodic boundaries on neighboring PEs. (b) Productivity metric, defined as the cumulative device time per iteration to run Nsim simulations, as a function of number of spins in the simulation.
Considering these results, NETL researcher Dirk van Essendelft stated, “We were pleasantly surprised by the 88x productivity gain we measured in this work because this problem is exceptionally low in data intensity relative to what we have studied in the past. The low data intensity makes it an ideal algorithm for high performance on traditional architectures like GPUs and CPUs as optimized codes can make use of caching effectively. Because of this, it is also reasonable to believe that this will be amongst the lowest figures of merit achieved in lattice/grid based HPC ports. This gives us great courage to continue pushing the bounds of what is possible in WSE programming for statistical physics problems.”
These results highlight the incredible potential of the Cerebras architecture for materials modeling and physical simulation. In the future, the team at NETL plans to extend this work to the 3D Ising model and simulation of quantum materials.
Learn More
To read more, check out the preprint on arXiv.
To learn about more record-breaking work in materials modeling, check out our blog post on implementing the embedded atom method for molecular dynamics with Sandia National Laboratory.
To learn about other recent HPC research with Cerebras, and information on our publicly available SDK, check out this blog post.
Related Posts
August 28, 2024
Integrating LLMs and Software Engineering for Self-Refining Copy Creation
Discover how to build an AI agent that generates marketing copy efficiently…
August 28, 2024
ReadAgent: Bringing Gist Memory to AI
Learn how gist memory improves long context handling for large language models.…