Researchers from Rice University and TotalEnergies have introduced a matrix-free, dataflow-based finite volume solver for the Cerebras Wafer-Scale Engine (WSE) that achieves a 210x speedup over an NVIDIA H100 GPU. Solvers of this type are crucial for fast, accurate numerical simulations, which are essential tools for designing geological carbon capture and storage (CCS) projects that store CO2 securely underground and thereby play a vital role in mitigating climate change.

Study Overview

The research specifically compared the performance of the Cerebras WSE against traditional GPU-based systems in conducting these complex simulations. These simulations involve solving large, intricate linear systems derived from partial differential equations that govern subsurface fluid flow. Accurately modeling the injection and movement of CO2 in underground formations is vital for ensuring the long-term stability and safety of storage sites. The study aimed to evaluate how well the Cerebras WSE could handle these demanding tasks compared to the latest NVIDIA H100 GPU.

Key Findings

The results revealed that the WSE’s dataflow architecture, which eliminates the memory latency and bandwidth bottlenecks commonly seen in GPUs, allowed it to achieve a 210x performance boost over an H100 GPU. Unlike GPUs, which rely on a hierarchical memory structure that can slow down processing, the WSE’s architecture lets its 850,000 processing elements work independently with direct access to local memory, significantly increasing computational throughput.

Matrix-Free Algorithm and Dataflow Architecture

The researchers introduced a matrix-free approach to solving finite volume-based linear systems on a dataflow architecture, implemented with the Cerebras SDK. This approach eliminates the need to store large Jacobian matrices, reducing memory requirements and improving computational speed. The matrix-free method computes each entry of the Jacobian-vector product on the fly, which aligns well with the Cerebras architecture’s strengths: localized communication and single-level memory.
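The matrix-free idea can be sketched in a few lines of Python (a simplified illustration, not the authors’ Cerebras SDK implementation): a 5-point finite-volume stencil is applied on the fly inside a textbook conjugate-gradient loop, so the solver only ever needs the operator’s action, never an assembled matrix. The grid size and the Laplacian-like operator here are illustrative assumptions, not taken from the paper.

```python
import numpy as np

n = 64  # cells per side of an illustrative 2D grid (hypothetical size)

def apply_operator(p_flat):
    """Matrix-free action of a 5-point finite-volume stencil: each entry
    of A @ p is recomputed from neighbor values, so no matrix is stored."""
    p = p_flat.reshape(n, n)
    ap = 4.0 * p
    ap[1:, :] -= p[:-1, :]   # flux from the neighbor above
    ap[:-1, :] -= p[1:, :]   # below
    ap[:, 1:] -= p[:, :-1]   # left
    ap[:, :-1] -= p[:, 1:]   # right
    return ap.ravel()

def conjugate_gradient(matvec, b, tol=1e-8, maxiter=5000):
    """Textbook CG: needs only the operator's action, never the matrix."""
    x = np.zeros_like(b)
    r = b - matvec(x)
    d = r.copy()
    rs = r @ r
    for _ in range(maxiter):
        ad = matvec(d)
        alpha = rs / (d @ ad)
        x += alpha * d
        r -= alpha * ad
        rs_new = r @ r
        if np.sqrt(rs_new) < tol:
            break
        d = r + (rs_new / rs) * d
        rs = rs_new
    return x

b = np.ones(n * n)
x = conjugate_gradient(apply_operator, b)
residual = np.linalg.norm(b - apply_operator(x))
```

On the WSE, each processing element would own a small tile of the grid and evaluate its stencil entries locally, which is why this formulation maps so naturally onto the hardware.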

Performance on Cerebras CS-2

The implementation on the Cerebras CS-2 system demonstrated substantial performance gains:

    • Speedup: The matrix-free FV solver achieved a speedup of up to 210 times compared to a reference implementation on NVIDIA H100 GPUs, thanks to the efficient data communication and computation strategies enabled by the Cerebras architecture.
    • Performance: The CS-2 system reached up to 1.217 PFlops on a single node, showcasing its efficiency in handling large-scale simulations.
    • Scalability: The system maintained strong scalability across increasing problem sizes, a crucial factor for large-scale CCS simulations.

Algorithmic Enhancements

Several algorithmic enhancements contributed to these performance improvements:

    • Asynchronous Communications: Overlapping data movement with computations minimized communication overheads.
    • Vectorization: Utilization of vectorized instructions on the Cerebras architecture’s processing elements maximized throughput.
    • Memory Optimization: Efficient use of limited local memory on each processing element allowed for larger simulations and reduced memory footprint.
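As a rough illustration of the first enhancement, the sketch below (hypothetical Python, not WSE code) overlaps a stand-in halo exchange with interior stencil work using a background thread. On the actual hardware this overlap is achieved through asynchronous dataflow between neighboring PEs rather than threads; the placeholder functions here are invented for illustration.

```python
import numpy as np
from concurrent.futures import ThreadPoolExecutor

def exchange_halos(field):
    """Stand-in for a neighbor-to-neighbor halo exchange (hypothetical;
    on the WSE this is asynchronous dataflow between adjacent PEs)."""
    # Periodic wrap serves as a placeholder for receiving neighbor rows.
    return np.roll(field, 1, axis=0), np.roll(field, -1, axis=0)

def stencil_interior(field):
    """Interior update that depends on no halo data, so it can run
    while the exchange is still in flight."""
    return field[1:-1, 1:-1] * 0.5  # placeholder computation

field = np.random.rand(128, 128)
with ThreadPoolExecutor(max_workers=1) as pool:
    # Launch the halo exchange in the background...
    halo_future = pool.submit(exchange_halos, field)
    # ...and overlap it with interior work that needs no halos.
    interior = stencil_interior(field)
    up, down = halo_future.result()  # join before touching boundary cells
```

The boundary cells are updated only after the exchange completes, so correctness is preserved while the communication cost is hidden behind the interior computation.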

Looking Ahead

Beyond raw processing power, the research highlighted the WSE’s capability to manage the intricate data communication required by CCS simulations efficiently. Traditional GPUs often struggle with the overhead of moving data across cores, leading to performance bottlenecks. In contrast, the Cerebras WSE excels by leveraging localized communication between PEs, which not only speeds up computations but also ensures that the system can maintain high performance as simulation sizes grow.

Looking forward, the researchers suggest further exploration of dataflow architecture’s potential in other complex high-performance computing (HPC) tasks, such as those involving arbitrary mesh topologies. They also propose refining data broadcasting strategies to support a broader range of finite-volume applications. These steps could further establish the Cerebras WSE as a leading solution for high-performance simulations, particularly in areas critical to addressing climate change.

Conclusion

This study underscores the significance of exploring alternative computing architectures like the dataflow model used by Cerebras, especially as the demand for large-scale, high-speed simulations continues to grow in fields like climate science and energy. The findings suggest that the Cerebras WSE could play a pivotal role in advancing CCS simulation capabilities, potentially leading to more effective and timely solutions for climate change mitigation.


Learn more here: https://arxiv.org/pdf/2408.03452

Learn about TotalEnergies’ recent work on StencilPy for the WSE here: https://cerebras.ai/blog/new-tool-generates-stencil-codes-two-orders-of-magnitude-faster-on-cerebras-wse-than-on-gpus

Sign up here to access our SDK: https://cerebras.ai/developers/sdk-request/

To read about our recent SDK research: https://cerebras.ai/blog/supercharge-your-hpc-research-with-the-cerebras-sdk