Today Cerebras is introducing the CS-3, our third-generation wafer-scale AI accelerator, purpose-built to train the most advanced AI models. With over 4 trillion transistors – 57x more than the largest GPU – the CS-3 is 2x faster than its predecessor and sets records in training large language and multi-modal models. The CS-3 is built to scale: using our next-generation SwarmX interconnect, up to 2048 CS-3 systems can be linked together to build hyperscale AI supercomputers delivering up to a quarter of a zettaflop (10^21 FLOPS). The CS-3 can be configured with up to 1,200 terabytes of external memory, allowing a single system to train models of up to 24 trillion parameters and paving the way for ML researchers to build models 10x larger than GPT-4 and Claude. The CS-3 is shipping to customers today. Condor Galaxy 3, the first CS-3-powered AI supercomputer built in collaboration with our partner G42, will be operational in Q2 2024.
Since the advent of the microprocessor in 1971, the semiconductor industry has abided by Moore's Law: every processor – CPU, GPU, or ASIC – has followed the trend of doubling transistor count approximately every two years. The introduction of the Wafer Scale Engine by Cerebras in 2019 shattered this fifty-year industry norm and created a new class of processors for AI and HPC workloads. The CS-3 is our third-generation wafer-scale accelerator, continuing a scaling trajectory that to this day has not been replicated by any other technology company.
CS-3: Twice the speed at the same power & cost
The Cerebras CS-3 was designed to accelerate the latest large AI models. Each CS-3 core has an 8-wide FP16 SIMD unit, a 2x increase over the CS-2. We've also boosted performance for non-linear arithmetic operations and increased memory capacity and bandwidth per core. In real-world testing with Llama 2, Falcon 40B, MPT-30B, and multi-modal models, we measured up to 2x tokens per second versus the CS-2. While new GPUs more than double in power and cost from generation to generation, the CS-3 doubles performance with no increase in power or cost, greatly improving total cost of ownership.
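To make the arithmetic concrete, here is a rough sketch of how SIMD width drives peak FP16 throughput. The core counts are the published WSE-2 and WSE-3 figures; the clock rate and one-FMA-per-lane-per-cycle assumptions are ours for illustration, not official CS-3 specifications.

```python
# Back-of-the-envelope: peak FP16 throughput scales linearly with SIMD width.
def peak_fp16_flops(num_cores: int, simd_width: int, clock_hz: float,
                    flops_per_lane: int = 2) -> float:
    """Peak FLOPs/s = cores * SIMD lanes * FLOPs per lane * clock.

    flops_per_lane=2 assumes one fused multiply-add (FMA) per lane per cycle.
    """
    return num_cores * simd_width * flops_per_lane * clock_hz

# Illustrative clock rate; core counts are the published WSE-2/WSE-3 figures.
cs2_like = peak_fp16_flops(num_cores=850_000, simd_width=4, clock_hz=1.1e9)
cs3_like = peak_fp16_flops(num_cores=900_000, simd_width=8, clock_hz=1.1e9)
print(f"SIMD-driven speedup: {cs3_like / cs2_like:.2f}x")  # ~2.1x
```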
Scalability
Large language models such as GPT-4 and Gemini are growing in size by 10x per year. To keep up with ever-escalating compute and memory requirements, we've dramatically increased the scalability of our clusters. While the CS-2 supported clusters of up to 192 systems, the CS-3 supports clusters of up to 2048 systems – a more than 10x improvement. A full cluster of 2048 CS-3s delivers 256 exaflops of AI compute and can train Llama2-70B from scratch in less than a day. By comparison, Llama2-70B took approximately a month to train on Meta's GPU cluster. In addition, thanks to Cerebras's unique Weight Streaming architecture, the entire cluster looks and programs like a single chip, greatly simplifying the daunting task of distributed computing.
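To see how these headline numbers fit together, here is a quick back-of-the-envelope check. The standard ~6 × parameters × tokens training-FLOPs estimate and the 30% sustained utilization are our assumptions, not measured figures.

```python
# Sanity-check the cluster math quoted above.
cluster_flops = 256e18               # 256 exaflops of peak AI compute
per_system = cluster_flops / 2048    # implies 125 petaflops per CS-3

# Rough training cost of Llama2-70B via the common ~6*N*D estimate.
params, tokens = 70e9, 2e12          # Llama 2 was trained on ~2T tokens
train_flops = 6 * params * tokens    # ~8.4e23 FLOPs

utilization = 0.30                   # assumed sustained fraction of peak
hours = train_flops / (cluster_flops * utilization) / 3600
print(f"~{hours:.1f} hours")         # ~3 hours, comfortably under a day
```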
Unlike GPUs, Cerebras Wafer Scale Clusters decouple compute and memory, allowing us to easily scale up memory capacity in our MemoryX units. Cerebras CS-2 clusters supported 1.5TB and 12TB MemoryX units. With the CS-3, we are dramatically expanding the MemoryX lineup to include 24TB and 36TB SKUs for enterprise customers, and 120TB and 1,200TB options for hyperscalers. The 1,200TB configuration is capable of storing models with 24 trillion parameters – paving the way for next-generation models an order of magnitude larger than GPT-4 and Gemini.
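The 24-trillion-parameter figure follows from simple bytes-per-parameter arithmetic. The per-parameter budget sketched below is an assumption on our part; Cerebras has not published a breakdown.

```python
# 1,200 TB spread over 24 trillion parameters = 50 bytes per parameter.
memoryx_bytes = 1_200e12
params = 24e12
print(memoryx_bytes / params)  # 50.0 bytes/param

# A plausible (assumed) mixed-precision Adam budget per parameter:
#   4 B FP32 master weight + 4 B + 4 B Adam moments
#   + 2 B FP16 weight + 2 B FP16 gradient = 16 B,
# leaving ample headroom for checkpoints and other bookkeeping.
```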
A single CS-3 can be paired with a single 1,200TB MemoryX unit, meaning one CS-3 rack can store more model parameters than a 10,000-node GPU cluster. This lets a single ML engineer develop and debug trillion-parameter models on one machine – an unheard-of feat in GPU land.
Condor Galaxy 3: Exascale Performance, Single-Device Simplicity
The first AI supercomputer built with the CS-3 is Condor Galaxy 3 (CG-3), the third supercomputer built in collaboration between G42 and Cerebras. Powered by 64 CS-3 systems, the 8 exaflop CG-3 doubles the compute capacity of CG-2 with no increase in footprint or power. Unlike GPU clusters with tens of thousands of chips and complex memory hierarchies, CG-3 presents itself to the ML developer as a single processor with a single unified memory. It is the only AI supercomputer that looks and programs like a single device. CG-3 is being built in Dallas, Texas, and comes online in Q2 2024.
Cerebras + Qualcomm for 10x Cheaper Inference
Cerebras has partnered with Qualcomm to develop a joint AI platform for training and inference. Models trained on the CS-3 using our unique architectural features, such as unstructured sparsity, can be accelerated on Qualcomm Cloud AI 100 Ultra inference accelerators, delivering up to 10x higher LLM inference throughput in aggregate. For additional details, see the blog on our collaboration.
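For readers unfamiliar with the term, unstructured sparsity zeroes individual weights anywhere in a tensor, rather than in fixed blocks or patterns. The sketch below shows simple magnitude pruning for illustration only; it is not Cerebras's actual sparse training method.

```python
import numpy as np

def magnitude_prune(weights: np.ndarray, sparsity: float) -> np.ndarray:
    """Zero out the smallest-magnitude weights, regardless of position.

    Unlike structured schemes (e.g. 2:4 blocks), surviving weights may
    land anywhere in the tensor -- that is what 'unstructured' means.
    """
    k = int(weights.size * sparsity)
    if k == 0:
        return weights.copy()
    # The k-th smallest absolute value becomes the pruning threshold.
    threshold = np.partition(np.abs(weights).ravel(), k - 1)[k - 1]
    return np.where(np.abs(weights) <= threshold, 0.0, weights)

w = np.random.randn(1024, 1024).astype(np.float32)
w_sparse = magnitude_prune(w, sparsity=0.7)
print(f"Sparsity achieved: {(w_sparse == 0).mean():.1%}")  # ~70.0%
```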
The Cerebras CS-3 sets a new benchmark in large-scale AI performance. By providing exascale performance in a single logical device, CS-3-based clusters provide the simplest and fastest way to build next-generation AI models. The CS-3 is shipping today to select customers. To try the CS-3 via the Cerebras cloud, please reach out to our customer team.