Publications

May 02, 2025

Don't be lazy: CompleteP enables compute-efficient deep transformers

arXiv, 2025

Nolan Dey*, Bin Claire Zhang*, Lorenzo Noci, Mufan Li, Blake Bordelon, Shane Bergsma, Cengiz Pehlevan, Boris Hanin, Joel Hestness

[arXiv]

February 21, 2025

Straight to Zero: Why Linearly Decaying the Learning Rate to Zero Works Best for LLMs

ICLR, 2025

Shane Bergsma, Nolan Dey, Gurpreet Gosal, Gavia Gray, Daria Soboleva, Joel Hestness

[OpenReview] [arXiv]

November 18, 2024

Empirical Upper Bounds for Unstructured Sparsity in Compute-Efficient Language Modeling

Machine Learning and Compression NeurIPS Workshop, 2024

Esha Singh, Shane Bergsma, Nolan Dey, Joel Hestness, Gavia Gray

[OpenReview] [Poster]

November 01, 2024

Normalization Layer Per-Example Gradients are Sufficient to Predict Gradient Noise Scale in Transformers

Gavia Gray, Aman Tiwari, Shane Bergsma, Joel Hestness

[arXiv]

October 31, 2024

Sparse maximal update parameterization: A holistic approach to sparse training dynamics

NeurIPS, 2024

Nolan Dey, Shane Bergsma, Joel Hestness

[arXiv]

October 13, 2024

Self-Data Distillation for Recovering Quality in Pruned Large Language Models

Vithursan Thangarasa, Ganesh Venkatesh, Mike Lasby, Nish Sinnadurai, Sean Lie

[arXiv]

September 04, 2024

Bilingual Adaptation of Monolingual Foundation Models

FM-Wild ICML Workshop, 2024

Gurpreet Gosal, Yishi Xu, Gokul Ramakrishnan, Rituraj Joshi, Avraham Sheinin, Zhiming (Charles) Chen, Biswajit Mishra, Natalia Vassilieva, Joel Hestness, Neha Sengupta, Sunil Kumar Sahu, Bokang Jia, Onkar Pandit, Satheesh Katipomu, Samta Kamboj, Samujjwal Ghosh, Rahul Pal, Parvez Mullah, Soundar Doraiswamy, Mohamed El Karim Chami, Preslav Nakov

[arXiv] [OpenReview]

June 01, 2024

The practitioner's guide to the maximal update parameterization

Blog & open-source code, 2024

Nolan Dey, Quentin Anthony, Joel Hestness

[Cerebras Blog] [Eleuther AI Blog] [nanoGPT-mup Code]

May 20, 2024

MediSwift: Efficient Sparse Pre-trained Biomedical Language Models

Vithursan Thangarasa, Mahmoud Salem, Shreyas Saxena, Kevin Leong, Joel Hestness, Sean Lie

[arXiv]

May 15, 2024

Breaking the Molecular Dynamics Timescale Barrier Using a Wafer-Scale System

Kylee Santos, Stan Moore, Tomas Oppelstrup, Amirali Sharifian, Ilya Sharapov, Aidan Thompson, Delyan Z Kalchev, Danny Perez, Robert Schreiber, Scott Pakin, Edgar A Leon, James H Laros III, Michael James, Sivasankaran Rajamanickam

[arXiv]

May 15, 2024

Enabling High-Sparsity Foundational Llama Models with Efficient Pretraining and Deployment

Abhinav Agarwalla, Abhay Gupta, Alexandre Marques, Shubhra Pandit, Michael Goin, Eldar Kurtic, Kevin Leong, Tuan Nguyen, Mahmoud Salem, Dan Alistarh, Sean Lie, Mark Kurtz

[arXiv]

November 30, 2023

Efficient and Approximate Per-Example Gradient Norms for Gradient Noise Scale

Gavia Gray, Anshul Samar, Joel Hestness

[OpenReview]

November 13, 2023

Efficient Algorithms for Monte Carlo Particle Transport on AI Accelerator Hardware

John Tramm, Bryce Allen, Kazutomo Yoshii, Andrew Siegel, Leighton Wilson

[arXiv]

November 08, 2023

Position Interpolation Improves ALiBi Extrapolation

Faisal Al-Khateeb, Nolan Dey, Daria Soboleva, Joel Hestness

[arXiv]

September 26, 2023

Scaling the “Memory Wall” for Multi-Dimensional Seismic Processing with Algebraic Compression on Cerebras CS-2 Systems

Hatem Ltaief, Yuxi Hong, Leighton Wilson, Mathias Jacquelin, Matteo Ravasi, David Keyes

[Read the Paper]

September 22, 2023

BTLM-3B-8K: 7B Parameter Performance in a 3B Parameter Model

Efficient Natural Language and Speech Processing NeurIPS Workshop, 2023

Nolan Dey*, Daria Soboleva*, Faisal Al-Khateeb, Ribhu Pathria, Hemant Khachane, Shaheer Muhammad, Zhiming (Charles) Chen, Bowen Yang, Siyun Li, Abhay Gupta, Shreyas Saxena, Robert Myers, Jacob Robert Steeves, Marvin Tom, Joel Hestness

[arXiv] [Workshop Paper] [Blog] [Hugging Face] [1.08M downloads and 10th most popular text generation model in first month]

August 31, 2023

Jais and Jais-chat: Arabic-Centric Foundation and Instruction-Tuned Open Generative Large Language Models

Neha Sengupta, Sunil Kumar Sahu, Bokang Jia, Satheesh Katipomu, Haonan Li, Fajri Koto, William Marshall, Gurpreet Gosal, Cynthia Liu, Zhiming Chen, Osama Mohammed Afzal, Samta Kamboj, Onkar Pandit, Rahul Pal, Lalit Pradhan, Zain Muhammad Mujahid, Massa Baali, Xudong Han, Sondos Mahmoud Bsharat, Alham Fikri Aji, Zhiqiang Shen, Zhengzhong Liu, Natalia Vassilieva, Joel Hestness, Andy Hock, Andrew Feldman, Jonathan Lee, Andrew Jackson, Hector Xuguang Ren, Preslav Nakov, Timothy Baldwin, Eric Xing

[arXiv]

May 22, 2023

Cerebras Architecture Deep Dive: First Look Inside the Hardware/Software Co-Design for Deep Learning

April 07, 2023

Cerebras-GPT: Open Compute-Optimal Language Models Trained on the Cerebras Wafer-Scale Cluster

March 22, 2023

Sparse-IFT: Sparse Iso-FLOP Transformations for Maximizing Training Efficiency

Vithursan Thangarasa, Shreyas Saxena, Abhay Gupta, Sean Lie

[arXiv]

March 21, 2023

SPDF: Sparse Pre-training and Dense Fine-tuning for Large Language Models

January 20, 2023

Wafer-Scale Fast Fourier Transforms

November 23, 2022

GenSLMs: Genome-scale language models reveal SARS-CoV-2 evolutionary dynamics

September 28, 2022

Disruptive Changes in Field Equation Modeling: A Simple Interface for Wafer Scale Engines

Mino Woo, Terry Jordan, Robert Schreiber, Ilya Sharapov, Shaheer Muhammad, Abhishek Koneru, Michael James, Dirk Van Essendelft

[arXiv]

