Overcoming Timescale Limitations
In collaboration with researchers from Sandia, Lawrence Livermore, and Los Alamos National Laboratories, Cerebras established a new benchmark for simulating materials using molecular dynamics (MD). The Cerebras Wafer-Scale Engine (WSE) runs 800,000-atom simulations 180 times faster than Frontier, the world’s fastest exascale supercomputer.
This was accomplished using the scientifically relevant embedded atom method (EAM) multi-body potential function. Such speed-ups have never been observed on general-purpose processing cores.
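For readers unfamiliar with EAM, the total energy combines a pairwise term with an embedding energy that depends on the local electron density each atom sits in. Below is a minimal Python sketch of that functional form; the `pair_phi`, `density_rho`, and `embed_F` functions are toy placeholders for illustration, not the tungsten potential used in this work, and the O(N²) loop stands in for the neighbor-list machinery a real MD code would use.

```python
import numpy as np

def eam_energy(positions, pair_phi, density_rho, embed_F, cutoff):
    """Total EAM energy: sum_i F(rho_i) + 1/2 sum_{i != j} phi(r_ij).

    positions : (N, 3) array of atom coordinates
    pair_phi, density_rho, embed_F : callables for the pairwise term,
        the electron-density contribution, and the embedding function
    cutoff : interaction cutoff radius
    """
    n = len(positions)
    rho_bar = np.zeros(n)          # accumulated host electron density per atom
    pair_sum = 0.0
    for i in range(n):
        for j in range(i + 1, n):
            r = np.linalg.norm(positions[i] - positions[j])
            if r < cutoff:
                pair_sum += pair_phi(r)      # counted once per pair
                rho_bar[i] += density_rho(r)
                rho_bar[j] += density_rho(r)
    return pair_sum + sum(embed_F(rho) for rho in rho_bar)

# Toy placeholder functions -- NOT the tungsten potential used in the paper.
phi = lambda r: (1.0 / r) ** 6      # repulsive pair term
rho = lambda r: np.exp(-r)          # density contribution
F   = lambda d: -np.sqrt(d)         # embedding energy

atoms = np.random.rand(10, 3) * 5.0  # 10 atoms in a 5x5x5 box
print(eam_energy(atoms, phi, rho, F, cutoff=2.5))
```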
According to our colleagues at the national labs, simulations with 100,000 to 1,000,000 atoms are the most critical for scientific work. This scale is sufficient to reproduce relevant phenomena and represent materials large enough for direct observation with electron microscopy.
Scientific Impact
The WSE’s acceleration qualifies it as a new kind of scientific instrument, allowing scientists to predict the future faster. Existing supercomputers can simulate huge numbers of atoms, but slowly, with each timestep taking about a millisecond. The WSE completes a new simulation step every few microseconds.
Atomic simulations investigate materials at incredibly small scales. Typical atomic vibrations have amplitudes around a picometer and periods around ten femtoseconds, necessitating femtosecond-scale timesteps for accurate resolution.
However, femtosecond timesteps challenge the modeling of useful durations. With a millisecond per step, a month-long simulation produces only a few microseconds of material behavior. Simulating a millisecond of material time would take 14 years on a supercomputer. The WSE produces a year’s worth of exascale results every two days, making millisecond-scale simulations feasible.
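To make that gap concrete, here is the back-of-the-envelope arithmetic in a short Python sketch. The ~2 fs timestep, ~1 ms-per-step exascale rate, and 180x WSE speedup are round, assumed numbers drawn from the figures above, so the outputs land in the same ballpark as the quoted figures rather than matching them exactly.

```python
SECONDS_PER_MONTH = 30 * 24 * 3600
SECONDS_PER_YEAR  = 365 * 24 * 3600
SECONDS_PER_DAY   = 24 * 3600

DT = 2e-15                   # assumed ~2 fs MD timestep; the exact value varies by potential
EXA_STEP = 1e-3              # ~1 ms of wall clock per step on a conventional supercomputer
WSE_STEP = EXA_STEP / 180    # ~180x faster per step on the WSE (a few microseconds)

# (a) Material time covered by a month-long run at 1 ms/step
print(f"1 month at 1 ms/step covers ~{SECONDS_PER_MONTH / EXA_STEP * DT * 1e6:.0f} us of material time")

# (b) Wall-clock time needed to reach 1 ms of material time
steps = 1e-3 / DT
print(f"1 ms of material time: ~{steps * EXA_STEP / SECONDS_PER_YEAR:.0f} years at 1 ms/step")
print(f"1 ms of material time: ~{steps * WSE_STEP / SECONDS_PER_DAY:.0f} days on the WSE")
```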
Many important phenomena only emerge at long (100+ microsecond) timescales:
- Annealing of radiation damage in nuclear reactors
- Thermally activated catalytic reactions
- Near-equilibrium phase nucleation
- Protein folding
- Grain boundary evolution
A better scientific understanding of these processes will have wide-ranging implications for engineering and technology. Our demonstrations focused on grain boundary problems because of their relevance to the success of nuclear fusion.
Grain boundaries, ubiquitous in metals, are regions where atomic crystal lattices of different orientations meet. They profoundly affect material properties like strength, heat tolerance, and corrosion resistance. The slow atomic processes in grain boundaries limit scientific understanding at accessible simulation timescales. In Tokamak fusion reactors, grain boundary evolution in the tungsten plasma-facing components constrains technological progress. Tungsten’s high melting point makes it ideal, but its brittleness, related to grain boundary behavior, severely limits machinability and durability.
Computational Challenges
Parallel computation runtime depends on the processing work per core. Using more cores and assigning less work to each, known as strong scaling, can accelerate an algorithm.
Our work pushed this to the limit by assigning a single atom to each core, stressing communication subsystems in four ways:
- Communication bandwidth: Cores must access data from other cores as atoms interact with nearby atoms. With less work per core, more data must come from the network. At one atom per core, 100% of the force computation data arrives via the network.
- Communication latency: Increased simulation frequency proportionally increases remote data access frequency. Fixed latencies consume more of the runtime, eventually limiting speed.
- Communication granularity: An atom’s relevant information is just its 3D position, a 12-byte tuple of three single-precision floats (see the packing sketch after this list). Core-to-core messaging must efficiently send such small messages.
- Multicast communication: Gathering atom data requires more than point-to-point communication. As each atom’s position is distinct, cores must gather data from distinct, overlapping neighborhoods without congestion-induced latency or point-to-point amplification.
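As a concrete illustration of the granularity point, here is a minimal sketch of packing one atom’s position into the 12-byte message that crosses the fabric each step, assuming three single-precision floats; the little-endian layout is an illustrative choice, not the actual wire format.

```python
import struct

def pack_position(x: float, y: float, z: float) -> bytes:
    """Pack a 3D position into 12 bytes: three little-endian float32 values."""
    return struct.pack("<3f", x, y, z)

def unpack_position(msg: bytes) -> tuple:
    """Inverse of pack_position."""
    return struct.unpack("<3f", msg)

msg = pack_position(1.25, -0.5, 3.0)
print(len(msg))              # 12 bytes -- far below the ~32 KB packets cluster networks prefer
print(unpack_position(msg))  # (1.25, -0.5, 3.0)
```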
Traditional supercomputers amplify these challenges because they use small chips connected by chip-to-chip packet-switched networks. These networks have base latencies above 2 μs, are optimized for large (~32 KB) packets, and have tail latencies around 1 ms. On Frontier’s GPUs, these overheads slow the simulation once there are fewer than about 200 atoms per core.
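A toy performance model makes the latency point visible. In the sketch below, the per-atom compute cost is an illustrative assumption and only the 2 μs base latency comes from the figures above; the crossover it shows near a few hundred atoms per core mirrors the scaling stall described for Frontier.

```python
COMPUTE_PER_ATOM = 5e-9   # assumed per-atom force cost per step (illustrative, not measured)
FIXED_LATENCY = 2e-6      # ~2 us base latency of a packet-switched cluster network

def step_time(atoms_per_core):
    """Toy per-step model: compute shrinks with atoms/core, the fixed latency does not."""
    return atoms_per_core * COMPUTE_PER_ATOM + FIXED_LATENCY

for n in (100_000, 10_000, 1_000, 200, 10, 1):
    t = step_time(n)
    print(f"{n:>7} atoms/core: {t * 1e6:8.2f} us/step "
          f"({FIXED_LATENCY / t:6.1%} spent in fixed latency)")
```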
WSE Architecture
The WSE processor supports strong scaling with a communication subsystem that provides:
- Fabric bandwidth matching floating-point datapath bandwidth
- Single-cycle core-to-core message latency
- Native 32-bit message granularity
- Dataflow marshalling for orchestrating intricate communication patterns
Our implementation centers on a novel “neighborhood multicast” pattern. Each of the 850,000 cores is at the center of its rectangular neighborhood. Per timestep, cores simultaneously multicast their particle data throughout their neighborhood, first horizontally, then vertically.
Cores use the dataflow router to interleave data in a laminar flow. The fabric’s support for fine-grain laminar interleaving avoids congestion. For single-direction transmission, a core multicasts its 12 bytes to downstream cores, then commands the routers to shift the multicast pattern one core right. After several repetitions, all cores have transmitted to their neighborhood.
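Here is a minimal sketch, in plain Python rather than the CSL dataflow code that actually runs on the WSE, of what the row-then-column neighborhood gather computes: after a horizontal pass within radius R and a vertical pass over the horizontally gathered data, every core holds the positions of all atoms in its (2R+1)-by-(2R+1) rectangular neighborhood. The grid size and radius below are illustrative.

```python
import numpy as np

H, W, R = 8, 8, 2            # illustrative core grid and neighborhood radius
rng = np.random.default_rng(0)
positions = rng.random((H, W, 3)).astype(np.float32)   # one atom (x, y, z) per core

# Phase 1 (horizontal): each core collects positions from cores within R columns.
row_gathered = [[[] for _ in range(W)] for _ in range(H)]
for i in range(H):
    for j in range(W):
        for dj in range(-R, R + 1):
            if 0 <= j + dj < W:
                row_gathered[i][j].append(positions[i, j + dj])

# Phase 2 (vertical): each core collects the row-gathered data from cores within R rows.
neighborhood = [[[] for _ in range(W)] for _ in range(H)]
for i in range(H):
    for j in range(W):
        for di in range(-R, R + 1):
            if 0 <= i + di < H:
                neighborhood[i][j].extend(row_gathered[i + di][j])

# An interior core now holds all (2R+1)^2 atoms of its rectangular neighborhood.
print(len(neighborhood[4][4]))   # 25 for R = 2
```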
The horizontal phase uses four threads per core (3.2M threads for 800K atoms)—two for transmission (left/right) and two for reception (left/right)—in just six single-instruction statements. The vertical phase repeats this orthogonally.
See our forthcoming molecular dynamics GitHub repository for more details.
Next Steps
This is our initial implementation. We plan to add more potentials, pursue further acceleration, leverage multi-wafer clusters, and explore hybrid simulation+AI models. Follow our progress and see our arXiv paper for more information.