November 12, 2024
Empirical Upper Bounds for Unstructured Sparsity in Compute-Efficient Language Modeling
November 1, 2024
Normalization Layer Per-Example Gradients are Sufficient to Predict Gradient Noise Scale in Transformers
Per-example gradient norms are a vital ingredient for estimating gradient noise scale (GNS) with minimal variance. Observing the tensor contractions required to compute them, we propose a method with minimal FLOPs in 3D or greater tensor regimes by simultaneously computing the norms while computing the parameter gradients. Using this method we are able to observe the GNS of different layers at higher accuracy than previously possible. We find that the total GNS of contemporary transformer…
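A minimal sketch of the standard Gram-matrix trick for per-example gradient norms in the sequence (3D-tensor) setting, computed from a linear layer's saved activations and output gradients; the paper's fused-with-backprop formulation is more FLOP-efficient, and the variable names here are illustrative only:

```python
import numpy as np

# Per-example squared gradient norms for a linear layer Y = X @ W.
# X:  (batch, seq, d_in)  saved activations
# dY: (batch, seq, d_out) output gradients
# The per-example weight gradient is G_b = X_b^T dY_b, and
# ||G_b||_F^2 = sum_{t,s} (x_t . x_s)(dy_t . dy_s), so the norms can be formed
# from two small Gram matrices without ever materializing G_b.
def per_example_sq_grad_norms(X, dY):
    gram_x = np.einsum('bti,bsi->bts', X, X)      # (batch, seq, seq)
    gram_dy = np.einsum('bto,bso->bts', dY, dY)   # (batch, seq, seq)
    return np.einsum('bts,bts->b', gram_x, gram_dy)
```

Squared gradient norms measured at two batch sizes are the quantities a McCandlish-style gradient noise scale estimator consumes.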
October 31, 2024
Sparse maximal update parameterization: A holistic approach to sparse training dynamics
Several challenges make it difficult for sparse neural networks to compete with dense models. First, setting a large fraction of weights to zero impairs forward and gradient signal propagation. Second, sparse studies often need to test multiple sparsity levels, while also introducing new hyperparameters (HPs), leading to prohibitive tuning costs.
October 13, 2024
Self-Data Distillation for Recovering Quality in Pruned Large Language Models
Large language models have driven significant progress in natural language processing, but their deployment requires substantial compute and memory resources. As models scale, compression techniques become essential for balancing model quality with computational efficiency. Structured pruning, which removes less critical components of the model, is a promising strategy for reducing complexity. However, one-shot pruning often results in significant quality degradation, particularly in tasks…
September 4, 2024
Bilingual Adaptation of Monolingual Foundation Models
We present an efficient method for adapting a monolingual Large Language Model (LLM) to another language, addressing challenges of catastrophic forgetting and tokenizer limitations. We focus this study on adapting Llama 2 to Arabic. Our two-stage approach begins with expanding the vocabulary and training only the embeddings matrix, followed by full model continual pre-training on a bilingual corpus. By continually pre-training on a mix of Arabic and English corpora, the model retains its…
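A minimal sketch of stage one (vocabulary expansion with embedding-only training), assuming a Hugging Face-style model and tokenizer; the exact initialization of new rows and which tensors are unfrozen are illustrative choices, not necessarily the paper's:

```python
def prepare_stage_one(model, tokenizer, new_tokens):
    # Stage 1 (sketch): add new-language tokens, grow the embedding table,
    # and train only the embedding matrices; everything else stays frozen.
    # resize_token_embeddings also resizes an untied output head (e.g. Llama's lm_head).
    tokenizer.add_tokens(new_tokens)
    model.resize_token_embeddings(len(tokenizer))

    for p in model.parameters():
        p.requires_grad = False
    model.get_input_embeddings().weight.requires_grad = True
    if model.get_output_embeddings() is not None:
        model.get_output_embeddings().weight.requires_grad = True
    return model, tokenizer
```

Stage two would then unfreeze all parameters and continue pre-training on the bilingual corpus.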
July 2, 2024
Bilingual Adaptation of Monolingual Foundation Models
We present an efficient method for adapting a monolingual Large Language Model (LLM) to another language, addressing challenges of catastrophic forgetting and tokenizer limitations.
May 20, 2024
MediSwift: Efficient Sparse Pre-trained Biomedical Language Models
Large language models (LLMs) are typically trained on general source data for various domains, but a recent surge in domain-specific LLMs has shown their potential to outperform general-purpose models in domain-specific tasks (e.g., biomedicine).
May 15, 2024
Breaking the Molecular Dynamics Timescale Barrier Using a Wafer-Scale System
Molecular dynamics (MD) simulations have transformed our understanding of the nanoscale, driving breakthroughs in materials science, computational chemistry, and several other fields, including biophysics and drug design.
May 15, 2024
Enabling High-Sparsity Foundational Llama Models with Efficient Pretraining and Deployment
Large language models (LLMs) have revolutionized Natural Language Processing (NLP), but their size creates computational bottlenecks. We introduce a novel approach to create accurate, sparse foundational versions of performant LLMs that achieve full accuracy recovery for fine-tuning tasks at up to 70% sparsity.
November 30, 2023
Efficient and Approximate Per-Example Gradient Norms for Gradient Noise Scale
November 13, 2023
Efficient Algorithms for Monte Carlo Particle Transport on AI Accelerator Hardware
The recent trend toward deep learning has led to the development of a variety of highly innovative AI accelerator architectures. One such architecture, the Cerebras Wafer-Scale Engine 2 (WSE-2), features 40 GB of on-chip SRAM, making it a potentially attractive platform for latency- or bandwidth-bound HPC simulation workloads.
November 8, 2023
Position Interpolation Improves ALiBi Extrapolation
Linear position interpolation helps pre-trained models using rotary position embeddings (RoPE) to extrapolate to longer sequence lengths. We propose using linear position interpolation to extend the extrapolation range of models using Attention with Linear Biases (ALiBi). We find position interpolation significantly improves extrapolation capability on upstream language modelling and downstream summarization and retrieval tasks.
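A rough sketch of the idea, assuming the standard ALiBi recipe (per-head geometric slopes, bias proportional to query-key distance) with distances compressed by a train-length/eval-length factor; the exact scaling used in the paper may differ:

```python
import torch

def alibi_bias(seq_len, num_heads, train_len=None):
    # Standard ALiBi: per-head slopes 2^(-8h/num_heads), bias = -slope * (i - j) for j <= i.
    slopes = torch.tensor([2.0 ** (-8.0 * (h + 1) / num_heads) for h in range(num_heads)])
    pos = torch.arange(seq_len)
    dist = (pos[:, None] - pos[None, :]).clamp(min=0).float()   # causal query-key distance
    # Linear position interpolation (sketch): when evaluating beyond the training
    # length, compress distances back into the trained range.
    if train_len is not None and seq_len > train_len:
        dist = dist * (train_len / seq_len)
    return -slopes[:, None, None] * dist[None, :, :]             # (num_heads, seq_len, seq_len)
```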
September 26, 2023
Scaling the “Memory Wall” for Multi-Dimensional Seismic Processing with Algebraic Compression on Cerebras CS-2 Systems
We exploit the high memory bandwidth of AI-customized Cerebras CS-2 systems for seismic processing. By leveraging low-rank matrix approximation, we fit memory-hungry seismic applications onto memory-austere SRAM wafer-scale hardware, thus addressing a challenge arising in many wave-equation-based algorithms that rely on Multi-Dimensional Convolution (MDC) operators.
September 22, 2023
BTLM-3B-8K: 7B Parameter Performance in a 3B Parameter Model
We introduce the Bittensor Language Model, called "BTLM-3B-8K", a new state-of-the-art 3 billion parameter open-source language model. BTLM-3B-8K was trained on 627B tokens from the SlimPajama dataset with a mixture of 2,048 and 8,192 context lengths. BTLM-3B-8K outperforms all existing 3B parameter models by 2-5.5% across downstream tasks. BTLM-3B-8K is even competitive with some 7B parameter models. Additionally, BTLM-3B-8K provides excellent long context performance, outperforming MPT-7B-8K…
August 31, 2023
Jais and Jais-chat: Arabic-Centric Foundation and Instruction-Tuned Open Generative Large Language Models
We introduce Jais and Jais-chat, new state-of-the-art Arabic-centric foundation and instruction-tuned open generative large language models (LLMs). The models are based on the GPT-3 decoder-only architecture and are pretrained on a mixture of Arabic and English texts, including source code in various programming languages. With 13 billion parameters, they demonstrate better knowledge and reasoning capabilities in Arabic than any existing open Arabic and multilingual models by a sizable margin,…
May 22, 2023
Cerebras Architecture Deep Dive: First Look Inside the Hardware/Software Co-Design for Deep Learning
IEEE Micro Volume 43, Issue 3, focuses on papers from last year's Hot Chips 34 conference. This article describes the Cerebras architecture and how it was designed for this purpose from the ground up, as a wafer-sized chip, to enable emerging extreme-scale ML models. It uses fine-grained dataflow compute cores to accelerate unstructured sparsity, distributed static random-access memory for full memory bandwidth to the data paths, and a specially designed on-chip and off-chip…
April 7, 2023
Cerebras-GPT: Open Compute-Optimal Language Models Trained on the Cerebras Wafer-Scale Cluster
We introduce Cerebras-GPT, a family of open compute-optimal language models scaled from 111M to 13B parameters. We train Cerebras-GPT models on the Eleuther Pile dataset following DeepMind Chinchilla scaling rules for efficient pre-training (highest accuracy for a given compute budget). We characterize the predictable power-law scaling and compare Cerebras-GPT with other publicly-available models to show all Cerebras-GPT models have state-of-the-art training efficiency on both pre-training and…
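For concreteness, the Chinchilla prescription amounts to roughly 20 training tokens per parameter, with training compute approximated as C ≈ 6·N·D FLOPs; a toy calculation using those rule-of-thumb values (not the exact Cerebras-GPT token counts):

```python
def chinchilla_budget(n_params, tokens_per_param=20.0):
    # Chinchilla rule of thumb: ~20 training tokens per parameter,
    # with training compute approximated by C ~= 6 * N * D FLOPs.
    n_tokens = tokens_per_param * n_params
    flops = 6.0 * n_params * n_tokens
    return n_tokens, flops

# e.g. a 13B-parameter model, the largest in the family
tokens, flops = chinchilla_budget(13e9)
print(f"{tokens / 1e9:.0f}B tokens, {flops:.2e} FLOPs")  # -> 260B tokens, 2.03e+22 FLOPs
```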
March 22, 2023
Sparse-IFT: Sparse Iso-FLOP Transformations for Maximizing Training Efficiency
Replacing dense layers with Sparse-IFT leads to significant improvements across computer vision (CV) and natural language processing (NLP) tasks, including ResNet-18 on ImageNet (+3.5%) and GPT-3 Small on WikiText-103 (-0.4 PPL), both matching larger dense model variants with 2x or more FLOPs. To the best of our knowledge, this is the first work to demonstrate the use of sparsity for improving accuracy of dense models via a simple-to-use set of sparse transformations.
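One way to read "Iso-FLOP" concretely is trading density for width so that the non-zero compute stays fixed; a sketch of that arithmetic for a width-scaling transformation (an assumption kept deliberately simple, not the full family described in the paper):

```python
import math

def sparse_wide_width(dense_width, sparsity):
    # Iso-FLOP width scaling (sketch): widen a layer so that the number of
    # non-zero multiply-accumulates matches the dense baseline.
    # For a square d x d layer, FLOPs scale with (k * d)^2 * (1 - sparsity),
    # so k = 1 / sqrt(1 - sparsity) keeps FLOPs constant.
    k = 1.0 / math.sqrt(1.0 - sparsity)
    return int(round(k * dense_width))

# e.g. a 768-wide layer at 75% sparsity widens to ~1536 at unchanged FLOPs
print(sparse_wide_width(768, 0.75))  # -> 1536
```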
March 21, 2023
SPDF: Sparse Pre-training and Dense Fine-tuning for Large Language Models
Presented at the ICLR 2023 Workshop on Sparsity in Neural Networks. In this work, we show the benefits of using unstructured weight sparsity to train only a subset of weights during pre-training (Sparse Pre-training) and then recover the representational capacity by allowing the zeroed weights to learn (Dense Fine-tuning). We demonstrate that we can induce up to 75% sparsity into a 1.3B parameter GPT-3 XL model resulting in a 2.5x reduction in pre-training FLOPs, without a significant loss in…
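A minimal sketch of the sparse-pretrain / dense-finetune recipe using a static unstructured magnitude mask; the paper's actual mask construction and schedule may differ:

```python
import torch

def build_sparsity_masks(model, sparsity=0.75):
    # Sketch: zero the smallest-magnitude entries of each weight matrix and keep
    # the resulting masks so they can be re-applied after every optimizer step.
    masks = {}
    for name, p in model.named_parameters():
        if p.dim() < 2:                            # weight matrices only
            continue
        k = int(sparsity * p.numel())
        if k == 0:
            continue
        threshold = p.abs().flatten().kthvalue(k).values
        masks[name] = (p.abs() > threshold).float()
        p.data *= masks[name]
    return masks

# Sparse pre-training: after each optimizer.step(), re-apply p.data *= masks[name].
# Dense fine-tuning: stop applying the masks so the zeroed weights can learn again.
```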
January 20, 2023
Wafer-Scale Fast Fourier Transforms
We have implemented fast Fourier transforms for one, two, and three-dimensional arrays on the Cerebras CS-2, a system whose memory and processing elements reside on a single silicon wafer. The wafer-scale engine (WSE) encompasses a two-dimensional mesh of roughly 850,000 processing elements (PEs) with fast local memory and equally fast nearest-neighbor interconnections.
November 23, 2022
GenSLMs: Genome-scale language models reveal SARS-CoV-2 evolutionary dynamics
Our work seeks to transform how new and emergent variants of pandemic-causing viruses, especially SARS-CoV-2, are identified and classified. By adapting large language models (LLMs) for genomic data, we build genome-scale language models (GenSLMs) which can learn the evolutionary landscape of SARS-CoV-2 genomes. By pretraining on over 110 million prokaryotic gene sequences, and then finetuning a SARS-CoV-2-specific model on 1.5 million genomes, we show that GenSLM can accurately and rapidly…
September 28, 2022
Disruptive Changes in Field Equation Modeling: A Simple Interface for Wafer Scale Engines
We present a high-level and accessible Application Programming Interface (API) for the solution of field equations on the Cerebras Systems Wafer-Scale Engine (WSE) with over two orders of magnitude performance gain relative to traditional distributed computing approaches. The domain-specific API is called the WSE Field-equation API (WFA). The WFA outperforms OpenFOAM on NETL's Joule 2.0 supercomputer by over two orders of magnitude in time to solution. While this performance is consistent with…
August 26, 2022
TensorFlow as a DSL for stencil-based computation on the Cerebras Wafer-Scale Engine
The Cerebras Wafer Scale Engine (WSE) is an accelerator that combines hundreds of thousands of AI-cores onto a single chip. Whilst this technology has been designed for machine learning workloads, the significant amount of available raw compute means that it is also a very interesting potential target for accelerating traditional HPC computational codes. Many of these algorithms are stencil-based, where update operations involve contributions from neighbouring elements, and in this paper we…
June 28, 2022
RevBiFPN: The Fully Reversible Bidirectional Feature Pyramid Network
This work introduces the RevSilo, the first reversible module for bidirectional multi-scale feature fusion. Like other reversible methods, RevSilo eliminates the need to store hidden activations by recomputing them. Existing reversible methods, however, do not apply to multi-scale feature fusion and are therefore not applicable to a large class of networks. Bidirectional multi-scale feature fusion promotes local and global coherence and has become a de facto design principle for networks…
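The underlying mechanism is the standard additive-coupling reversible block: the inputs can be reconstructed exactly from the outputs, so hidden activations are recomputed in the backward pass instead of stored. A generic single-scale sketch (RevSilo itself extends this to bidirectional multi-scale fusion):

```python
import torch.nn as nn

class ReversibleBlock(nn.Module):
    # Additive-coupling reversible block: forward is invertible, so activations
    # need not be stored and can be recomputed during backpropagation.
    def __init__(self, f: nn.Module, g: nn.Module):
        super().__init__()
        self.f, self.g = f, g

    def forward(self, x1, x2):
        y1 = x1 + self.f(x2)
        y2 = x2 + self.g(y1)
        return y1, y2

    def inverse(self, y1, y2):
        x2 = y2 - self.g(y1)
        x1 = y1 - self.f(x2)
        return x1, x2
```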
April 22, 2022
A Templated C++ Interface for ISL
Polyhedral libraries typically support only a very limited collection of types for representing objects, corresponding to broad mathematical classes such as sets, binary relations and functions.
April 7, 2022
Massively scalable stencil algorithm
Stencil computations lie at the heart of many scientific and industrial applications. Unfortunately, stencil algorithms perform poorly on machines with a cache-based memory hierarchy, due to low reuse of memory accesses. This work shows that, for stencil computation, a novel algorithm leveraging a localized communication strategy effectively exploits the Cerebras WSE-2, which has no cache hierarchy. This study focuses on a 25-point stencil finite-difference method for the 3D wave equation, a…
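For reference, a plain NumPy sketch of the kind of kernel being mapped: a 25-point (8th-order) star-stencil Laplacian inside a leapfrog update of the 3D wave equation. The coefficients are the standard 8th-order central-difference weights; the data layout and halo exchange on the WSE-2 are of course entirely different:

```python
import numpy as np

# 8th-order central-difference weights for d2/dx2 (center plus 4 neighbors per side).
C = [-205.0 / 72.0, 8.0 / 5.0, -1.0 / 5.0, 8.0 / 315.0, -1.0 / 560.0]

def laplacian_25pt(u, h):
    # 25-point star stencil: center point plus 4 points along each of +-x, +-y, +-z.
    lap = 3 * C[0] * u[4:-4, 4:-4, 4:-4]
    for k in range(1, 5):
        lap += C[k] * (u[4 + k:u.shape[0] - 4 + k, 4:-4, 4:-4] + u[4 - k:-4 - k, 4:-4, 4:-4])
        lap += C[k] * (u[4:-4, 4 + k:u.shape[1] - 4 + k, 4:-4] + u[4:-4, 4 - k:-4 - k, 4:-4])
        lap += C[k] * (u[4:-4, 4:-4, 4 + k:u.shape[2] - 4 + k] + u[4:-4, 4:-4, 4 - k:-4 - k])
    return lap / h ** 2

def leapfrog_step(u_prev, u_curr, c, dt, h):
    # Second-order-in-time update of the acoustic wave equation on the interior.
    u_next = u_curr.copy()
    u_next[4:-4, 4:-4, 4:-4] = (2 * u_curr[4:-4, 4:-4, 4:-4] - u_prev[4:-4, 4:-4, 4:-4]
                                + (c * dt) ** 2 * laplacian_25pt(u_curr, h))
    return u_next
```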
January 26, 2022
Epigenomic language models powered by Cerebras
Large scale self-supervised pre-training of Transformer language models has advanced the field of Natural Language Processing and shown promise in cross-application to the biological `languages' of proteins and DNA. Learning effective representations of DNA sequences using large genomic sequence corpuses may accelerate the development of models of gene regulation and function through transfer learning. However, to accurately model cell type-specific gene regulation and function, it is necessary…
January 1, 2022
BraggNN: fast X-ray Bragg peak analysis using deep learning
We propose BraggNN, a deep learning-based method to accelerate the most computation-intensive part of polycrystal diffraction data analysis (diffraction signal characterization). The application of BraggNN to real experimental data demonstrates that it can deliver consistent (sometimes even slightly better) results compared with the conventional method while running hundreds of times faster.
November 19, 2021
Microprocessor at 50. The Path to Successful Wafer-Scale Integration: The Cerebras Story
IEEE Micro Volume 41, Issue 6, took a look back at the first 50 years of the microprocessor, and forward to what's next. It featured this article by Gary Lauterbach, Co-Founder and Chief Technology Officer of Cerebras Systems, which explores the manifold innovations behind the Cerebras Wafer-Scale Engine, the first commercially successful product to reap the performance and efficiency benefits of wafer-scale integration.
November 17, 2021
Intelligent Resolution: Integrating Cryo-EM with AI-driven Multi-resolution Simulations to Observe the SARS-CoV-2 Replication-Transcription Machinery in Action
The severe acute respiratory syndrome coronavirus-2 (SARS-CoV-2) replication transcription complex (RTC) is a multi-domain protein responsible for replicating and transcribing the viral mRNA inside a human cell. Attacking RTC function with pharmaceutical compounds is a pathway to treating COVID-19. Conventional tools, e.g., cryo-electron microscopy and all-atom molecular dynamics (AAMD), do not provide sufficiently high resolution or timescale to capture important dynamics of this molecular…
November 1, 2021
The Path to Successful Wafer-Scale Integration: The Cerebras Story
There has been an impressive increase in single-chip processing power since the Intel 4004 was launched in 1971. This is usually attributed to Moore's law, but there are additional factors to consider. In understanding the components of prior improvements, we can gain insight into the potential for future improvements and potential limits to scaling.
July 5, 2021
Stream-AI-MD: streaming AI-driven adaptive molecular simulations for heterogeneous computing platforms
Emerging hardware tailored for artificial intelligence (AI) and machine learning (ML) methods provide novel means to couple them with traditional high performance computing (HPC) workflows involving molecular dynamics (MD) simulations. We propose Stream-AI-MD, a novel instance of applying deep learning methods to drive adaptive MD simulation campaigns in a streaming manner.
March 5, 2021
Memory Efficient 3D U-Net with Reversible Mobile Inverted Bottlenecks for Brain Tumor Segmentation
We propose combining memory-saving techniques with traditional U-Net architectures to increase the complexity of the models on the Brain Tumor Segmentation (BraTS) challenge. The BraTS challenge consists of a 3D segmentation of a 240 × 240 × 155 × 4 input image into a set of tumor classes.
March 1, 2021
Pipelined Backpropagation at Scale: Training Large Models without Batches
New hardware can substantially increase the speed and efficiency of deep neural network training. To guide the development of future hardware architectures, it is pertinent to explore the hardware and machine learning properties of alternative training algorithms.
February 3, 2021
System Integration of Neocortex, a Unique, Scalable AI Platform
The Pittsburgh Supercomputing Center, in partnership with Cerebras Systems and Hewlett Packard Enterprise, has deployed Neocortex, an innovative computing platform that accelerates scientific discovery by vastly shortening the time required for deep learning training and fosters greater integration of deep AI models with scientific workflows.
October 22, 2020
Fast Stencil-Code Computation on a Wafer-Scale Processor
The performance of CPU-based and GPU-based systems is often low for PDE codes, where large, sparse, and often structured systems of linear equations must be solved. Iterative solvers are limited by data movement, both between caches and memory and between nodes.
October 7, 2020
Fast Stencil-Code Computation on a Wafer-Scale Processor
The performance of CPU-based and GPU-based systems is often low for PDE codes, where large, sparse, and often structured systems of linear equations must be solved. Iterative solvers are limited by data movement, both between caches and memory and between nodes. Here we describe the solution of such systems of equations on the Cerebras Systems CS-1, a wafer-scale processor that has the memory bandwidth and communication latency to perform well. We achieve 0.86 PFLOPS on a single wafer-scale…
July 7, 2020
The curious case of developmental BERTology: On sparsity, transfer learning, generalization and the brain
In this essay, we explore a point of intersection between deep learning and neuroscience, through the lens of large language models, transfer learning and network compression.
February 22, 2020
Generating SIMD Instructions for Cerebras CS-1 using Polyhedral Compilation Techniques
The Cerebras CS-1 is a computing system based on a wafer-scale processor having nearly 400,000 compute cores. It is intended for training of and inference on deep neural networks.
February 20, 2020
A Templated C++ Interface for ISL
Polyhedral libraries typically support only a very limited collection of types for representing objects, corresponding to broad mathematical classes such as sets, binary relations and functions. Software built on top of these libraries, on the other hand, needs to deal with a plethora of different kinds of objects such as instance sets, access relations and dependence relations. Conceptually, these different kinds of objects can only be combined in very specific ways, but they are all mapped to…
November 29, 2019
Online Normalization for Training Neural Networks
May 15, 2019
Online Normalization for Training Neural Networks, NeurIPS 2019
Online Normalization is a new technique for normalizing the hidden activations of a neural network. Like Batch Normalization, it normalizes the sample dimension. While Online Normalization does not use batches, it is as accurate as Batch Normalization.
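A bare-bones sketch of the forward pass under the stated idea (normalize each incoming example with exponentially decayed running statistics rather than batch statistics); the full method also corrects the backward pass, which is omitted here:

```python
import numpy as np

class OnlineNorm1d:
    # Forward-pass sketch only: each example is normalized using running
    # statistics maintained with exponential decay, so no batch is needed.
    def __init__(self, num_features, decay=0.999, eps=1e-5):
        self.mu = np.zeros(num_features)
        self.var = np.ones(num_features)
        self.decay, self.eps = decay, eps

    def __call__(self, x):                      # x: (num_features,) one example
        y = (x - self.mu) / np.sqrt(self.var + self.eps)
        self.var = self.decay * self.var + (1 - self.decay) * (x - self.mu) ** 2
        self.mu = self.decay * self.mu + (1 - self.decay) * x
        return y
```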