Update February 2023:
The National Energy Technology Laboratory, Pittsburgh Supercomputing Center and Cerebras just announced some exciting (and beautiful) results that extend this work to a complete computational fluid dynamics (CFD) application.
The video above shows a high-resolution simulation of Rayleigh-Bénard convection, which occurs when a fluid layer is heated from the bottom and cooled from the top. These thermally driven fluid flows are all around us – from windy days, to lake effect snowstorms, to magma currents in the earth’s mantle and plasma movement in the sun.
As the narrator says, it’s not just the visual beauty of the simulation that’s important: it’s the speed at which we’re able to calculate it. For the first time, using our Wafer-Scale Engine, NETL is able to compute on a grid of nearly 200 million cells in near real time.
By transforming the speed of CFD, which has always been a slow, offline task, we can open up a whole raft of new, real-time use cases for this and many other core HPC applications. To quote the video again:
“More compute power, more experiments, better science!”
In concert with Cerebras, researchers at the National Energy Technology Lab (NETL) have just posted a paper to arXiv reporting on a simple Python API that will enable wafer-scale processing for much of computational science, achieving gains in performance and usability that cannot be obtained on conventional computers and supercomputers. The potential is there to change the way computers are used in engineering in a positive and fundamental way. We therefore titled the paper Disruptive Changes in Field Equation Modeling: A Simple Interface for Wafer Scale Engines. (Read the press release.)
What NETL and Cerebras had already done: performance
Let me back up a tad. In 2020, our work together produced a remarkable result. On the Cerebras Wafer-Scale Engine (the WSE), we were able to solve a key component of NETL’s computational models roughly 200 times faster than the Joule supercomputer could, and with more than a thousand times better energy efficiency.
The methods and the reasons for that performance advantage were explained in a paper at the Supercomputing conference, SC20. Briefly, we showed that the nearly one million processing elements (PEs) on the WSE could be effectively used to solve problems with billion-node meshes for modeling field equations, at very high utilization and efficiency. This is a very fine computational grain, of about one thousand mesh points per PE. Conventional clusters need much coarser-grained parallel work because, while they have multi-teraFLOP processors, the sluggish network connecting them forces them to distribute the field in large chunks.
The result is that the WSE can solve problems extremely fast. Fast enough to allow real-time, high-fidelity models of engineered systems of interest. It’s a rare example of successful “strong scaling”, the use of parallelism to reduce the solve time for a fixed-size problem. Conventional HPC clusters “weak scale” well, which lets them solve bigger problems, but it does not let them solve problems faster. This is a problem, as we shall see.
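To see why strong scaling stalls on a cluster, consider a toy timing model. This is my illustration, not from the paper: the per-cell flop count, flop rate, and message latency below are made-up but plausible round numbers.

# Toy timing model: per-step time for an n-cell mesh split across p processors.
def step_time(n_cells, p, flops_per_cell=10.0, flop_rate=1e12, latency=2e-6):
    compute = (n_cells / p) * flops_per_cell / flop_rate  # seconds of arithmetic
    return compute + latency  # plus at least one halo-exchange message

n = 1e9  # a billion-cell mesh
for p in (100, 10_000, 1_000_000):
    print(f"p={p:>9}: {step_time(n, p) * 1e6:8.2f} us/step")
# p=      100:   102.00 us/step
# p=    10000:     3.00 us/step   <- already latency-dominated
# p=  1000000:     2.01 us/step   <- 100x more hardware, barely faster

The compute term shrinks with p, the latency term does not, and the speedup saturates. Single-cycle on-wafer communication shrinks that latency term by orders of magnitude, which is what lets strong scaling keep working on the WSE.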
And now we can do it in Python
Okay, so what’s in our new paper? It’s about how we program the wafer for work in scientific computing. For AI work, we have long allowed users to describe a network in the high-level frameworks PyTorch and TensorFlow, but that hasn’t been true for other use cases. To take on other work, new low-level code has had to be written. This low-level code is local, i.e., a copy runs on each PE. It sends and receives messages through the on-wafer network. And currently it uses a language that exposes details of the architecture of the WSE and its communication mechanisms. Low-level code takes time to learn, and developing applications with it can be a painstaking exercise. This has all changed with the new approach that our paper documents: a domain-specific, high-level programmer’s tool set called the WSE Field-equation API, or WFA.
The WFA: What is it and how is it used?
The leader of the NETL effort is Dr. Dirk Van Essendelft. Dirk recognized that he could build something quite easy to use that would allow an engineer to program computations on large meshes, with local communication between mesh points. These mesh computations are the way we solve the field equations of mathematical physics, which are the differential equations that model engineered or natural systems. Building, running, and interpreting the results of such models is a mainstream activity at NETL and many other places.
Dirk defined and coded (in our low-level language) a small set of communication and computation primitive operations that work across the whole wafer and operate on whole, wafer-resident data arrays. They can be invoked from Python. This, the WFA, gives someone needing to build a new model a programming toolkit that is easy to think about, code, and use. It closely resembles how a programmer in Python, or NumPy, or MATLAB would think and program. The WFA is essentially a new language specifically intended for solving field equations, which is the domain that NETL works in.
Furthermore, the NETL team created a very clever implementation. They developed a way to program in that style, compile the code, and download a representation of the compiled code to the wafer. Once on the wafer, the compiled code can be interpreted by a single control PE. The control PE then broadcasts low-level commands to a whole array of worker PEs, which in turn use their stored low-level code to perform their parts, compute and communicate, of the whole array operation. (The on-wafer network is very good at broadcast, by the way, with nanosecond latency between adjacent PEs.) And the implementation is almost as fast as what one can create by writing low-level code directly. This, in itself, is an impressive result. High-level languages tend to sacrifice efficiency in the name of programming convenience.
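To make the control-PE pattern concrete, here is a toy sketch in ordinary Python. This is my own illustration, not NETL’s implementation; the opcode names and classes are invented for the example.

import numpy as np

class WorkerPE:
    # Each worker holds its local tile of a global array and knows how to
    # perform its part of each whole-array command.
    def __init__(self, tile):
        self.tile = tile
    def execute(self, opcode, args):
        if opcode == "scale":
            self.tile *= args["factor"]
        elif opcode == "add_const":
            self.tile += args["value"]

def control_pe(program, workers):
    # Interpret the compiled command stream, broadcasting each command to
    # every worker. On the wafer the broadcast is done in hardware.
    for opcode, args in program:
        for w in workers:
            w.execute(opcode, args)

workers = [WorkerPE(np.zeros((10, 10))) for _ in range(4)]
control_pe([("add_const", {"value": 500.0}), ("scale", {"factor": 0.5})], workers)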
Here’s a sample of what a WFA application looks like at the Python level.
from WSE_FE.WSE_Interface import WSE_Interface
from WSE_FE.WSE_Array import WSE_Array
from WSE_FE.WSE_Loops import WSE_For_Loop
import numpy as np
# Instantiate the WSE Interface
Wse = WSE_Interface()
# Define constants
c = 0.1
center = 1.0 - 6.0 * c
# Create the initial temperature field and BCs
T_init = np.ones((102, 102, 102))*500.0
T_init[1:-1, 1:-1, 0] = 300.0
T_init[1:-1, 1:-1, -1] = 400.0
# Instantiate the WSE Array objects needed
T_n = WSE_Array(name='T_n', initData=T_init)
# Loop over time
with WSE_For_Loop('time_loop', 40000):
    T_n[1:-1, 0, 0] = center * T_n[1:-1, 0, 0] \
        + c * (T_n[2:, 0, 0] + T_n[:-2, 0, 0]
               + T_n[1:-1, 1, 0] + T_n[1:-1, 0, -1]
               + T_n[1:-1, -1, 0] + T_n[1:-1, 0, 1])
Wse.make_WSE(answer=T_n)
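For comparison, and assuming I am reading the sample correctly as an explicit 6-point heat-diffusion stencil with fixed boundary temperatures, here is roughly the same computation in plain NumPy. It runs, though slowly; the point is the one-to-one correspondence of the array expressions, not the speed.

import numpy as np

c = 0.1
center = 1.0 - 6.0 * c

T = np.ones((102, 102, 102)) * 500.0
T[1:-1, 1:-1, 0] = 300.0    # cold face
T[1:-1, 1:-1, -1] = 400.0   # warm face

for _ in range(40000):
    # Jacobi-style update: NumPy evaluates the whole right-hand side first.
    T[1:-1, 1:-1, 1:-1] = center * T[1:-1, 1:-1, 1:-1] + c * (
        T[2:, 1:-1, 1:-1] + T[:-2, 1:-1, 1:-1]
        + T[1:-1, 2:, 1:-1] + T[1:-1, :-2, 1:-1]
        + T[1:-1, 1:-1, 2:] + T[1:-1, 1:-1, :-2]
    )

The difference is that the WFA version compiles to commands interpreted on the wafer, with each PE holding a small tile of T and exchanging only tile edges with its neighbors.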
The WFA neatly solves another problem: code size. The wafer has a large amount of memory overall (about 40 GB), but each PE is small, with 48 kilobytes of memory. And a PE’s code resides in that memory, sharing the space with data. Yet many applications are large. The NETL approach fixes this by dedicating some PEs to code storage. The downloaded code is distributed across tens or even hundreds of these dedicated PEs, out of the nearly one million on the wafer. They then serve as a code store, supplying code to the control PE on demand.
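Here is an illustrative sketch of the code-store idea. The 48 kB PE memory is from the text; the fraction of it reserved for code and the program size are my assumptions for the example.

PE_MEMORY_BYTES = 48 * 1024
CODE_BUDGET = 32 * 1024   # assumed share of a PE's memory set aside for code

def distribute(code: bytes, budget: int = CODE_BUDGET):
    # Split the compiled command stream into per-PE chunks; each chunk lives
    # on one code-store PE and is paged to the control PE on demand.
    return [code[i:i + budget] for i in range(0, len(code), budget)]

store = distribute(bytes(5_000_000))  # a hypothetical 5 MB compiled program
print(f"{len(store)} code-store PEs hold the program")  # 153 of ~1,000,000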
What does this mean for computational science?
We have demonstrated, both through the NETL/Cerebras collaboration and a TotalEnergies/Cerebras effort that will be reported on at SC22, that on-wafer field equation solutions can be two orders of magnitude faster and three orders of magnitude more energy efficient than on conventional compute clusters. These remarkable results are all due to the trifecta of great memory bandwidth, great interconnect bandwidth, and amazing (one clock cycle) communication latency and injection rate for small messages. We have shown that through wafer-scale computing, and in no other way today, engineers can strong scale and achieve a dramatic reduction in the time to find answers for problems of sizes relevant in engineering practice.
That speed can lead to major breakthroughs. It can allow a “digital twin” to run alongside, or even ahead of, an industrial process. Driven by measured inputs, the twin simulates the equipment and can help control and optimize it in real time, improving efficiency and ensuring safety in cases of unusual behavior.
It can allow designers to explore in great detail the options they have in designing a system: a turbine blade, an airfoil, a reactor chamber. The designer can rapidly simulate the operation of that system across a great many possibilities and use these simulations to find optimum designs. Today, because simulation is so slow, only a tiny fraction of the ideal number of simulations can be performed.
The use of simulations to make predictions and drive policy is key in climate modeling, safety certification, and other areas. In addition to the simulation itself, scientists now engage in “uncertainty quantification”, UQ for short, which measures how much we can trust the simulation. It puts an error bar around the computed results. A standard approach is to run many simulations, making small changes to inputs or other factors, and measure the distribution of computed results. Again, as in the case of design optimization, a great many simulations must be done. So fast solution is the key to routine use of UQ for critical computational models.
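A minimal Monte Carlo UQ loop looks like the sketch below. The model here is a stand-in for a real field-equation simulation, and the input distribution is invented for the example; only the pattern (perturb, rerun, measure the spread) is the point.

import numpy as np

rng = np.random.default_rng(0)

def simulate(conductivity):
    # Stand-in for a full simulation; returns a scalar quantity of interest.
    return 350.0 + 120.0 * conductivity

# Perturb the uncertain input, rerun the model many times, report the spread.
samples = [simulate(k) for k in rng.normal(1.0, 0.05, size=1000)]
mean, std = np.mean(samples), np.std(samples)
print(f"prediction: {mean:.1f} +/- {2 * std:.1f}")

Each of those thousand runs is a full simulation, which is exactly why fast solution is the prerequisite for routine UQ.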
These and other uses drive the need for fast, real-time, or even faster-than-real-time simulation. Strong scaling on the WSE is the way to get this today.
The performance achieved, and why a wafer can do this
Cerebras has developed wafer-scale processors, packaged them into systems, and equipped the systems with application development software. We’ve aimed those systems, with great success, at the most difficult problems at the frontier of artificial intelligence: training the largest deep neural networks, networks like GPT-3. These successes are born of the unique characteristics of wafers in general, of the Cerebras Wafer-Scale Engine in particular, and of the WSE’s unique architecture, which has been optimized for high-performance compute on the sort of regular, data-intensive work (rather than control-intensive work) that characterizes neural networks as well as many areas of computational science.
What characteristics are these? Wafers excel at data communication, and that matters because moving data across space – from PE to memory, PE to network, PE to other PE – is where most of the energy goes in computing. Power is speed times energy per operation, and power supply (and cooling, which is the inevitable consequence of power supply) is limited by the physics and engineering of compute hardware. So lowering energy allows for more speed at constant power.
Energy: Diving deeper
Energy in data movement is measured in picojoules per bit moved. A picojoule is one trillionth of a joule, and one kWh (about a dime’s worth of electricity) is 3.6 million joules. So we are talking about a very, very small amount of energy.
So why worry? Because we have to move a lot, and I mean a lot, of bits to do computing.
On the WSE we are moving a hundred quadrillion (that’s ten to the seventeenth power) bits per second. If you can cut the energy to move a bit from one picojoule down to one-tenth of a picojoule, you can do ten times the work with the same power. And that is why a big wafer, sixty times the size of a large CPU or GPU, is so powerful. We put all the processing, memory, and network needed for the performance of a cluster of tens or hundreds of CPU- or GPU-based nodes onto one wafer and keep all that communication internal to it.
We don’t need to drive signals over macroscopic wires (it takes a lot of energy to be heard at the far end of such a thing) or optical cables (it still takes energy to drive the high-speed electro-optical conversions at the endpoints). So we can cut the energy per bit down far below one picojoule, and we do that in a system requiring tens of kilowatts. If the Cerebras system were saddled with conventional picojoule-per-bit signaling, it would use ten times as much energy, putting it in a regime where the power and cooling infrastructure, the system size, and the annual cost to operate would all grow substantially.
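The arithmetic is simple enough to do in a few lines, using the figures above:

bits_per_second = 1e17  # a hundred quadrillion bits per second moved on-wafer
picojoule = 1e-12       # joules

for energy_per_bit_pj in (1.0, 0.1):
    watts = bits_per_second * energy_per_bit_pj * picojoule
    print(f"{energy_per_bit_pj} pJ/bit -> {watts / 1e3:.0f} kW")
# 1.0 pJ/bit -> 100 kW   (conventional off-chip signaling)
# 0.1 pJ/bit -> 10 kW    (on-wafer signaling: tens of kilowatts)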
But it’s more than money. In addition to the cost, data movement off chip is very slow. We measure data movement capability in terms of both bandwidth (how many bits can you move per second) and latency (how long does it take for the data to reach its destination). For our applications, bandwidth is the key. Off-chip bandwidth is constrained by power, but it is also constrained by the number of “pins”, the wires that carry signals into and out of a chip. Despite years of development that have increased the pin count substantially, the pin count relative to the amount of compute hardware on the chip has dropped, and the relative bandwidth, in bits you can move per operation you can perform internally, has gotten steadily smaller. This limits the rate at which processors can access memory. The complex cache hierarchies we see today were invented to cope with the low bandwidth to memory. And this low bandwidth also limits the speed of processor-to-processor message communication. Applications have to cope, with significant difficulty, with these “walls”, the memory wall and the message-passing wall. The problems are fundamental; they force large problem granularity and weak scaling; and progress in strong scaling has stalled.
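A rough calculation shows the squeeze. The numbers below are round, assumed figures for a modern accelerator, not any particular product’s specification:

flop_rate = 2e13  # ~20 TFLOP/s of peak arithmetic, assumed
mem_bw = 2e12     # ~2 TB/s of off-chip memory bandwidth, assumed
print(f"{mem_bw / flop_rate:.2f} bytes per FLOP")  # 0.10

A memory-bound kernel like a stencil sweep needs on the order of a byte per FLOP even with good reuse, so a processor with a tenth of that starves unless a cache can supply the rest; and these workloads, as the next section notes, give caches little reuse to work with.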
What’s on the wafer? Why this matters.
The architecture, what we put on the wafer, is the key to exploiting the wafer’s advantages and achieving performance. Neural networks and much of computational science require simple, repetitive calculation over very big arrays of data. The compute can happen in parallel, working on the elements of these arrays all at the same time. To make a dramatic speedup, you need to exploit that parallelism and provide a very large number of parallel processing units. These units do not need to be huge; each of them can work with small subarrays of the whole data arrays. But they do need high memory bandwidth. Caches are of little value to these workloads, which touch all of the array data once before they touch any of it again; infrequent data reuse is anathema to the performance of a cache hierarchy. And these calculations need to move data between processors, often, at high bandwidth and low latency, in parallel. They need what the wafer offers. So Cerebras architected a “compute fabric” for the wafer. The fabric has
- Nearly one million identical PEs
- A messaging system linking every PE to its own router and linking the routers into a mesh, with single-cycle communication latency and one 32-bit word per cycle bandwidth
- Communication operations (move a scalar or a vector and use either as an operand) built into the instruction set and implemented entirely in hardware
- Strictly local memory, without cache, in SRAM, with two 32-bit word reads and one write per cycle and single cycle latency, at every PE
These features eliminate the memory wall for on-wafer data, and they allow extremely low latency and high bandwidth communication. The parallelism, at the million PE level, can be exploited without resort to enormous problem sizes (e.g. enormous matrices or meshes), as is done on conventional clusters of CPU and GPU hardware.
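Those per-PE figures multiply into an enormous aggregate. A rough estimate, where the PE count and per-cycle word counts come from the list above and the roughly 1 GHz clock rate is my order-of-magnitude assumption:

pes = 1e6                # nearly one million PEs
bytes_per_cycle = 3 * 4  # two 32-bit reads plus one 32-bit write per cycle
clock_hz = 1e9           # assumed order-of-magnitude clock rate
print(f"~{pes * bytes_per_cycle * clock_hz / 1e15:.0f} PB/s aggregate SRAM bandwidth")
# ~12 PB/s

No off-wafer memory system comes anywhere near that, which is why the memory wall disappears for on-wafer data.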
If you’d like a finer-grained look at our architecture, I recommend this blog by my colleague Sean Lie.
In conclusion
For the range of important problems that fit into our Wafer-Scale Engine, we’ve shown speedups that make possible revolutionary use cases that are intractable on any other available hardware platform. The new development, the WFA, now offers an easy and simple way to write code for the WSE, with modest and affordable programming effort.
Rob Schreiber, Distinguished Engineer | November 10, 2022