event
Cerebras at NeurIPS 23
Booth #1121
Sun, Dec 10, 2023 – Sat, Dec 16, 2023
The Thirty-Seventh Annual Conference on Neural Information Processing Systems (NeurIPS 2023) is an interdisciplinary conference that brings together researchers in machine learning, neuroscience, statistics, optimization, computer vision, natural language processing, life sciences, natural sciences, social sciences, and other adjacent fields.
BTLM-3B-8K at ENLSP Workshop
Our paper “BTLM-3B-8K: 7B Parameter Performance in a 3B Parameter Model” has been accepted at the Efficient Natural Language and Speech Processing (ENLSP-III) workshop. This paper provides detailed descriptions of our learnings from training state-of-the-art LLMs, including our experience with the following techniques:
- Rotary and ALiBi position embeddings
- Swish-gated linear unit (SwiGLU); see the sketch after this list
- Overtraining with many tokens per parameter
- Learning rate decay ratio
- Maximal update parameterization (muP)
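To make one of the items above concrete, here is a minimal PyTorch sketch of a SwiGLU feed-forward block. This is an illustration, not the exact BTLM-3B-8K implementation: the dimension names, the absence of biases, and the class name are our own choices.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SwiGLU(nn.Module):
    """Swish-gated linear unit feed-forward block: SiLU(x W_gate) * (x W_up), then W_down."""

    def __init__(self, d_model: int, d_ff: int):
        super().__init__()
        self.w_gate = nn.Linear(d_model, d_ff, bias=False)
        self.w_up = nn.Linear(d_model, d_ff, bias=False)
        self.w_down = nn.Linear(d_ff, d_model, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # SiLU (a.k.a. Swish) gate applied elementwise to the up projection
        return self.w_down(F.silu(self.w_gate(x)) * self.w_up(x))
```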
The Efficient Natural Language and Speech Processing (ENLSP-III) workshop focuses on the future of large language models and their emerging applications across domains such as natural language, speech processing, and biological sequences, and on how to make these models more efficient in terms of data, model size, training, and inference for both real-world applications and academic research.
Learn more about the workshop here: https://neurips2023-enlsp.github.io/
Gradient Noise Scale at WANT Workshop
Our paper “Efficient and Approximate Per-Example Gradient Norms for Gradient Noise Scale” has been accepted by the Workshop on Advancing Neural Network Training (WANT).
The gradient noise scale is valuable to compute because it suggests a compute-efficient batch size for training a deep learning model. However, it can be awkward or expensive to compute, depending on the approach taken, because small-batch gradient norm estimates are difficult to obtain. In this paper, we present a Scaled Output Gradient Noise Scale (SOGNS) that is generally applicable at negligible cost and provides additional feedback to the practitioner during training.
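For context, the gradient noise scale is conventionally estimated from gradient norms measured at two batch sizes (the estimator of McCandlish et al., 2018, sketched below); obtaining the small-batch norm cheaply is exactly the difficulty SOGNS addresses. The function and variable names here are illustrative, not taken from the paper.

```python
def simple_noise_scale(g_sq_small: float, g_sq_big: float,
                       b_small: int, b_big: int) -> float:
    """Unbiased estimate of the simple gradient noise scale, B_simple = tr(Sigma) / |G|^2,
    from squared gradient norms measured at two batch sizes (McCandlish et al., 2018)."""
    # E[|G_B|^2] = |G|^2 + tr(Sigma) / B, so measurements at two batch sizes
    # let us separate the true gradient norm from the noise term.
    g_sq = (b_big * g_sq_big - b_small * g_sq_small) / (b_big - b_small)
    trace_sigma = (g_sq_small - g_sq_big) / (1.0 / b_small - 1.0 / b_big)
    return trace_sigma / g_sq
```

In practice both intermediate estimates are noisy, so they are typically smoothed (for example with exponential moving averages) before taking the ratio.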
The WANT workshop will provide researchers with the tools necessary to train neural networks at scale. It will provide an interactive platform for researchers and practitioners to delve into the latest advancements in neural network training.
Learn more about this workshop here: https://want-ai-hpc.github.io/
Sparse Iso-FLOP Transformations at WANT Workshop
Our paper “Sparse Iso-FLOP Transformations for Maximizing Training Efficiency” has been accepted by the Workshop on Advancing Neural Network Training (WANT).
Researchers at Cerebras developed a simple-to-use framework called Sparse-IFT that can enhance the accuracy of deep neural networks without increasing FLOPs at training or inference time. Rather than trading accuracy for efficiency, this approach uses weight sparsity to boost accuracy, making training more efficient. We show significant improvements on popular computer vision and natural language processing tasks without changing any of the standard training hyperparameters besides the sparsity level.
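As a rough sketch of the iso-FLOP bookkeeping behind this idea, consider the simplest case of a sparse widened linear layer (our reading, not the paper's full framework): widening a layer by a factor k multiplies its FLOPs by k², so keeping only a (1 − s) fraction of the weights non-zero leaves FLOPs unchanged when k = 1/√(1 − s).

```python
import math

def sparse_wide_factor(sparsity: float) -> float:
    """Width multiplier k that keeps a linear layer's FLOPs constant when a
    fraction `sparsity` of its weights is zeroed: k^2 * (1 - sparsity) = 1."""
    return 1.0 / math.sqrt(1.0 - sparsity)

for s in (0.50, 0.75, 0.90):
    print(f"sparsity {s:.0%}: widen by {sparse_wide_factor(s):.2f}x at equal FLOPs")
# sparsity 50%: widen by 1.41x at equal FLOPs
# sparsity 75%: widen by 2.00x at equal FLOPs
# sparsity 90%: widen by 3.16x at equal FLOPs
```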
The WANT workshop will provide researchers with the tools necessary to train neural networks at scale. It will provide an interactive platform for researchers and practitioners to delve into the latest advancements in neural network training.
Learn more about this workshop here: https://want-ai-hpc.github.io/
The 1st Multilingual Model Workshop, hosted by Cerebras
Data, models, and methods to train models beyond English
Large language models serve a variety of generative AI use cases. However, they are primarily trained on English datasets, which lack the cultural context needed to serve a global audience adequately.
This workshop aims to bring together researchers working on building non-English, bi-lingual, and multi-lingual language models. We are interested in all aspects, implications, and challenges of building non-English and multilingual models. This includes the following:
- Dataset sourcing and cleaning
- Best practices for mixing different languages and datasets
- Pre-training and continuous training recipes in data-constrained environments with low-resource languages
- Instruction tuning without instruction datasets available in target languages
- Benchmarking and evaluation of these models in a world where most public, commonly used benchmarks are in English
- Alignment with target cultural aspects
As a group, we will share our mistakes and learnings, best practices, and aspirations. We aim to bring together experts in the field, engage in a meaningful dialogue, and foster solutions that promote equity and inclusivity in the AI landscape.
BTLM-3B-8K: 7B Parameter Performance in a 3B Parameter Model
Nolan Dey*, Daria Soboleva*, Faisal Al-Khateeb, Bowen Yang, Ribhu Pathria, Hemant Khachane, Shaheer Muhammad, Zhiming (Charles) Chen, Robert Myers, Jacob Robert Steeves, Natalia Vassilieva, Marvin Tom, Joel Hestness
We study recent techniques targeted to improve the parameter efficiency and modeling quality of large language models (LLMs). We experiment with recently proposed training approaches, such as overtraining for a large number of tokens per parameter on a high-quality dataset, carefully tuning hyperparameters with maximal update parameterization (µP), and adjusting learning rate and batch size. We also test recent state-of-the-art model features, namely, rotary and ALiBi position embeddings, and the Swish-gated linear unit (SwiGLU). We find a pretraining recipe that improves over Cerebras-GPT µP validation loss by 12.7% for the same parameter budget. With this recipe, we train the state-of-the-art 3B parameter foundation model, called the Bittensor Language Model (“BTLM-3B-8K”), which is sized to deploy easily on memory- or compute-constrained devices. Over a broad set of downstream tasks, BTLM beats all other 3B foundation models by 2-5.5%, making it competitive with some 7B parameter models that are 2.5× larger. BTLM-3B-8K is available under an Apache 2.0 license on Hugging Face: https://huggingface.co/cerebras/btlm-3b-8k-base.
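For readers unfamiliar with µP, the sketch below shows one common form of its width-based scaling rules under Adam, roughly as used in the Cerebras-GPT line of work: hyperparameters are tuned on a narrow base model and transferred to a wider one via the width multiplier. The specific multipliers used for BTLM-3B-8K come from the paper itself, so treat this as an illustrative approximation with hypothetical names.

```python
def mup_scaled_hparams(width: int, base_width: int,
                       base_lr: float, base_init_std: float) -> dict:
    """Illustrative µP scaling under Adam, parameterized by the width multiplier m."""
    m = width / base_width
    return {
        # embedding / input weights: learning rate and init std kept at base values
        "embedding": {"lr": base_lr, "init_std": base_init_std},
        # hidden (matrix-like) weights: Adam LR scaled by 1/m, init variance by 1/m
        "hidden": {"lr": base_lr / m, "init_std": base_init_std / m ** 0.5},
        # readout: output logits get an extra 1/m multiplier before the softmax
        "output_logit_multiplier": 1.0 / m,
    }
```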
Position Interpolation Improves ALiBi Extrapolation
Faisal Al-Khateeb, Nolan Dey, Daria Soboleva, Joel Hestness
Linear position interpolation helps pre-trained models using rotary position embeddings (RoPE) to extrapolate to longer sequence lengths. We propose using linear position interpolation to extend the extrapolation range of models using Attention with Linear Biases (ALiBi). We find position interpolation significantly improves extrapolation capability on upstream language modelling and downstream summarization and retrieval tasks.
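To make this concrete, here is a minimal sketch of a causal ALiBi bias with a linear position-interpolation factor applied to the relative distances. The slope recipe assumes a power-of-two head count, and the choice of scale (for example, training length divided by evaluation length) is illustrative rather than taken from the paper.

```python
import torch

def alibi_bias(seq_len: int, num_heads: int, scale: float = 1.0) -> torch.Tensor:
    """Causal ALiBi bias of shape (num_heads, seq_len, seq_len).

    scale < 1 applies linear position interpolation: relative distances are
    shrunk (e.g. scale = train_len / eval_len) so biases at a longer
    evaluation length stay within the range seen during training."""
    # Standard ALiBi head slopes for a power-of-two number of heads:
    # a geometric sequence starting at 2^(-8/num_heads).
    slopes = torch.tensor([2.0 ** (-8.0 * (h + 1) / num_heads) for h in range(num_heads)])
    pos = torch.arange(seq_len)
    rel = (pos[None, :] - pos[:, None]).clamp(max=0).float()  # (j - i) <= 0 on the causal side
    return slopes[:, None, None] * (rel[None, :, :] * scale)  # add to attention logits pre-softmax
```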
Jais and Jais-chat: Arabic-Centric Foundation and Instruction-Tuned Open Generative Large Language Models
Neha Sengupta, Sunil Kumar Sahu, Bokang Jia, Satheesh Katipomu, Haonan Li, Fajri Koto, William Marshall, Gurpreet Gosal, Cynthia Liu, Zhiming Chen, Osama Mohammed Afzal, Samta Kamboj, Onkar Pandit, Rahul Pal, Lalit Pradhan, Zain Muhammad Mujahid, Massa Baali, Xudong Han, Sondos Mahmoud Bsharat, Alham Fikri Aji, Zhiqiang Shen, Zhengzhong Liu, Natalia Vassilieva, Joel Hestness, Andy Hock, Andrew Feldman, Jonathan Lee, Andrew Jackson, Hector Xuguang Ren, Preslav Nakov, Timothy Baldwin, Eric Xing
We introduce Jais and Jais-chat, new state-of-the-art Arabic-centric foundation and instruction-tuned open generative large language models (LLMs). The models are based on the GPT-3 decoder-only architecture and are pretrained on a mixture of Arabic and English texts, including source code in various programming languages. With 13 billion parameters, they demonstrate better knowledge and reasoning capabilities in Arabic than any existing open Arabic and multilingual models by a sizable margin, based on extensive evaluation. Moreover, the models are competitive in English compared to English-centric open models of similar size, despite being trained on much less English data. We provide a detailed description of the training, the tuning, the safety alignment, and the evaluation of the models. We release two open versions of the model, the foundation Jais model and an instruction-tuned Jais-chat variant, with the aim of promoting research on Arabic LLMs.
Efficient and Approximate Per-Example Gradient Norms for Gradient Noise Scale
The gradient noise scale is valuable to compute because it suggests a compute-efficient batch size for training a deep learning model. However, it can be awkward or expensive to compute, depending on the approach taken, because small-batch gradient norm estimates are difficult to obtain. “Efficient” per-example gradient norms provide accurate small-batch gradient norms but are inefficient in transformer or convolutional models. By assuming activations are normally distributed, we compute an approximate per-example gradient norm that tracks the true per-example gradient norm in practical settings. Using this approximation, we construct a Scaled Output Gradient Noise Scale (SOGNS) that is generally applicable at negligible cost and provides additional feedback to the practitioner during training.
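For background, the “efficient” per-example gradient norm trick referenced above exploits the fact that, for a linear layer, each example's weight gradient is an outer product of the layer input and the output gradient. Below is a minimal PyTorch sketch of that identity, assuming 2-D inputs and a loss that sums over examples; the function name and example usage are ours, not from the paper.

```python
import torch
import torch.nn as nn

def per_example_weight_grad_sq_norm(layer: nn.Linear, x: torch.Tensor, loss_fn) -> torch.Tensor:
    """Per-example squared gradient norm of layer.weight via the identity
    ||g_i a_i^T||_F^2 = ||a_i||^2 * ||g_i||^2, where a_i is the layer input and
    g_i the gradient of the loss w.r.t. the layer output (Goodfellow, 2015).
    Assumes x has shape (batch, d_in) and loss_fn sums per-example losses."""
    out = layer(x)           # a_i = x[i]
    out.retain_grad()        # keep the gradient of this non-leaf tensor
    loss_fn(out).backward()  # g_i = d loss / d out[i]
    return x.pow(2).sum(dim=1) * out.grad.pow(2).sum(dim=1)

# Example: per-example norms for a random batch under a sum-of-squares loss.
layer = nn.Linear(16, 4)
x = torch.randn(8, 16)
norms = per_example_weight_grad_sq_norm(layer, x, lambda out: out.pow(2).sum())
```

Roughly speaking, for sequence or convolutional inputs the per-example gradient becomes a sum of outer products over positions, so this simple product no longer applies directly; that is the inefficiency the paper's normally-distributed-activation approximation is designed to sidestep.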
Sparse Iso-FLOP Transformations for Maximizing Training Efficiency
Vithursan Thangarasa*, Shreyas Saxena*, Abhay Gupta, Sean Lie
Recent works have explored the use of weight sparsity to improve the training efficiency (test accuracy w.r.t. training FLOPs) of deep neural networks (DNNs). These works aim to reduce training FLOPs, but training with sparse weights often leads to accuracy loss or requires longer training schedules, making the resulting training efficiency less clear. In contrast, we focus on using sparsity to increase accuracy while using the same FLOPs as the dense model and show training efficiency gains through higher accuracy. In this work, we introduce Sparse-IFT, a family of Sparse Iso-FLOP Transformations which are used as drop-in replacements for dense layers to improve their representational capacity and FLOP efficiency. Each transformation is parameterized by a single hyperparameter (sparsity level) and provides a larger search space to find optimal sparse masks. Without changing any training hyperparameters, replacing dense layers with Sparse-IFT leads to significant improvements across computer vision (CV) and natural language processing (NLP) tasks, including ResNet-18 on ImageNet (+3.5%) and GPT-3 Small on WikiText-103 (-0.4 PPL), both matching larger dense model variants that use 2x or more FLOPs. To our knowledge, this is the first work to demonstrate the use of sparsity for improving the accuracy of dense models via a simple-to-use set of sparse transformations. Code is available at: https://github.com/CerebrasResearch/Sparse-IFT.
Blog
Context is Everything: Why Maximum Sequence Length Matters
GPU-Impossible™ sequence lengths on Cerebras systems may enable breakthroughs in Natural Language Understanding, drug discovery and genomics.
Blog
Cerebras Sets Record for Largest AI Models Ever Trained on Single Device
Our customers can easily train and reconfigure GPT-3 and GPT-J language models with up to 20 billion parameters on a single CS-2 system.
Blog
TotalEnergies and Cerebras Create Massively Scalable Stencil Algorithm
TotalEnergies used the Cerebras CS-2 system to turn a problem long considered memory-bound into a compute-bound one. On a benchmark case inspired by a seismic kernel used to image the Earth, the CS-2 delivered more than 200x performance compared to an NVIDIA® A100 GPU.