Description
Large language models serve a variety of generative AI use cases. However, they are trained primarily on English datasets and lack the cultural knowledge needed to adequately reach a global audience.
This workshop aims to bring together researchers working on building non-English, bilingual, and multilingual language models. We are interested in all aspects, implications, and challenges of building non-English and multilingual models, including the following:
- Dataset sourcing and cleaning
- Best practices for mixing different languages and datasets
- Pre-training and continued-training recipes in data-constrained environments with low-resource languages
- Instruction tuning without instruction datasets available in target languages
- Benchmarking and evaluation of these models in a world where most public and commonly used benchmarks are in English
- Alignment with target cultural aspects
As a group, we will share our mistakes and learnings, best practices, and aspirations. We aim to bring together experts in the field, engage in a meaningful dialogue, and foster solutions that promote equity and inclusivity in the AI landscape.
Session I
Neha Sengupta
Principal Applied Scientist, G42
Developing Arabic-centric bilingual LLMs
Abstract: This talk will present an overview of our experience training Jais and Jais-chat, a family of Arabic-centric bilingual LLMs. The Jais and Jais-chat models were trained by Core42, Cerebras, and MBZUAI in collaboration on the Condor Galaxy supercomputer. At 30 billion parameters, Jais and Jais-chat are the world’s largest and best-performing Arabic-centric open LLMs.
We will begin by discussing the motivating factors and primary challenges of training Jais, including those of Arabic data collection and processing. We will dive into the supervised fine-tuning data and methodology for building Jais-chat, a bilingual Arabic-centric chat model. We will also discuss our ongoing work and preliminary results on aligning Jais-chat with human preferences. Finally, we will conclude with ways to access Jais and the roadmap ahead.
Joel Hestness
Principal Research Scientist, Cerebras Systems
Pretraining the Jais Bilingual Arabic-English Language Models
Abstract: In this talk, we will present details of the pretraining process for Jais models, a series of bilingual Arabic-English language models. Jais models are the current best Arabic models, and they show English capability competitive with state-of-the-art models, even when trained with fewer English tokens. We have scaled Jais models to 13B and 30B parameters.
We will discuss the pretraining techniques we combined to achieve state-of-the-art Arabic capabilities. Our vocabulary selection process ensures the model has balanced capability in both Arabic and English. We will describe our use of Maximal Update Parameterization, which simplifies hyperparameter selection and leads to predictable model scaling. Our scaling-law tests show we can mix Arabic and English in a 1:2 ratio and achieve near-perfect scaling in both languages.
Jais models are developed through a collaboration between Core42’s Inception, the Mohamed bin Zayed University of Artificial Intelligence (MBZUAI), and Cerebras Systems.
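To make the 1:2 Arabic-to-English data mixing mentioned above concrete, here is a minimal sketch of weighted interleaving of two batch streams. The function name, data interfaces, and sampling mechanics are illustrative assumptions, not the actual Jais training pipeline.

```python
# Minimal sketch: interleave two tokenized batch streams at a fixed 1:2 Arabic:English
# ratio (in expectation). Illustrative only -- not the actual Jais data loader.
import random
from typing import Iterable, Iterator, Tuple


def mix_streams(arabic_batches: Iterable, english_batches: Iterable,
                weights=(1, 2), seed: int = 0) -> Iterator[Tuple[str, object]]:
    """Yield (language, batch) pairs, drawing Arabic and English batches with the given weights."""
    rng = random.Random(seed)
    streams = {"ar": iter(arabic_batches), "en": iter(english_batches)}
    langs = ["ar", "en"]
    while True:
        lang = rng.choices(langs, weights=weights)[0]
        try:
            yield lang, next(streams[lang])
        except StopIteration:
            return  # stop once either corpus is exhausted


# Usage (with any iterables of pre-tokenized batches):
# for lang, batch in mix_streams(arabic_loader, english_loader):
#     train_step(batch)
```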
9:50am - 10:10am
Guardrails and Evaluation of the Jais Bilingual Arabic-English Language Models
Abstract: We will discuss the guardrails we developed for the Jais bilingual Arabic-English language models. This was achieved using (i) meticulous data cleansing, (ii) instruction-tuning for safety, (iii) safety prompt engineering, and (iv) developing keyword lists and classifiers to run at inference time. For (ii), we developed a risk taxonomy and examples that cover 5 risk areas, 12 harm types, and 61 specific harm categories.
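As a rough illustration of item (iv), inference-time keyword lists and classifiers can be combined into a single output filter. The sketch below is generic: the keyword list, classifier, and threshold are placeholders, not the ones used for Jais.

```python
# Generic sketch of inference-time guardrails: a keyword blocklist plus a small safety
# classifier applied to prompts and completions. Placeholders throughout -- not the Jais setup.
from transformers import pipeline

BLOCKED_KEYWORDS = {"blocked_term_1", "blocked_term_2"}                    # placeholder keyword list
safety_clf = pipeline("text-classification", model="unitary/toxic-bert")  # stand-in classifier


def is_allowed(text: str, threshold: float = 0.5) -> bool:
    """Reject text that matches the keyword list or that the classifier flags as toxic."""
    lowered = text.lower()
    if any(kw in lowered for kw in BLOCKED_KEYWORDS):
        return False
    result = safety_clf(text, truncation=True)[0]
    return not (result["label"] == "toxic" and result["score"] >= threshold)


def guarded_generate(generate_fn, prompt: str) -> str:
    """Refuse unsafe prompts and filter unsafe completions."""
    if not is_allowed(prompt):
        return "Sorry, I can't help with that."
    reply = generate_fn(prompt)
    return reply if is_allowed(reply) else "Sorry, I can't help with that."
```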
We will further discuss evaluation. In addition to perplexity, we used downstream evaluation for both Arabic and English, covering world knowledge, commonsense reasoning, and misinformation & bias. For evaluating the model in Arabic, we used pre-existing datasets such as EXAMS, which contains Arabic matriculation questions; we curated our own dataset covering Arabic literature; and we translated English evaluation datasets into Arabic (manually for MMLU, and automatically, using an in-house system, for the other English datasets). We further performed generation evaluation using GPT-4 as a judge. Whenever feasible, we also performed human evaluation, which was the most important input and informed many key decisions about building the model.
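For the downstream multiple-choice portion of such an evaluation, a common recipe is to score each answer option by its log-likelihood under the model. The minimal sketch below illustrates that recipe with a Hugging Face causal LM; the model name is a stand-in, and this is not the actual Jais evaluation harness.

```python
# Sketch: zero-shot multiple-choice evaluation by option log-likelihood.
# The model name is a stand-in; in practice a bilingual model would be plugged in.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()


@torch.no_grad()
def option_logprob(question: str, option: str) -> float:
    """Sum of log-probabilities the model assigns to the option tokens, given the question."""
    prompt_len = tokenizer(question, return_tensors="pt").input_ids.shape[1]
    full_ids = tokenizer(question + " " + option, return_tensors="pt").input_ids
    logits = model(full_ids).logits
    logprobs = torch.log_softmax(logits[0, :-1], dim=-1)                      # predict tokens 1..N-1
    token_scores = logprobs[torch.arange(full_ids.shape[1] - 1), full_ids[0, 1:]]
    return token_scores[prompt_len - 1:].sum().item()                         # keep only option tokens


def pick_answer(question: str, options: list) -> int:
    """Index of the option with the highest log-probability."""
    return max(range(len(options)), key=lambda i: option_logprob(question, options[i]))
```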
Preslav Nakov
Professor at MBZUAI (Mohamed bin Zayed University of Artificial Intelligence)
10:10am - 10:30am
Overview of Japanese Efforts to Train LLMs
Abstract: Like other countries, companies and government agencies in Japan have been investing heavily in LLMs. This talk will give an overview of recent efforts to pre-train LLMs in Japan in both academia and industry. Details of the corpora, models, and pre-training configurations will be highlighted for some of the largest runs, with up to 175B parameters. The first half of the talk will cover the institutions involved, funding, and computational resources. The second half will focus on the technical details of pre-training and the issues we are facing.
Rio Yokota
Professor, Tokyo Institute of Technology
10:30am - 11:00am
30-minute Break
Session II
11:00am - 11:30am
GPT-SW3: The first large-scale generative language model for Swedish and the Nordic Languages
Abstract: This talk gives an overview of the process of building GPT-SW3, a multilingual LLM trained on 6 languages. We cover data ingestion and preprocessing, training and evaluation of the model as well as future plans. The focus of the presentation will be on multilingual aspects and challenges.
Felix Stollenwerk
Senior Research Scientist, AI Sweden
11:30am - 12:00pm
Evaluating Language Adaptation Techniques for Mid-Resource Languages
Abstract: Large language models have provided convincing evidence of their significant capabilities, both in downstream tasks and in real-world scenarios. However, low- and mid-resource languages lack the resources needed to train such models from scratch and often have to rely on multilingual models, despite being underrepresented in the training data. This talk describes experiments showing that, for mid-resource languages, continued pre-training with vocabulary adaptation is a better alternative, yielding a significant improvement over training from randomly initialized weights. We perform a comprehensive evaluation to assess the effectiveness of different techniques for continued pre-training from an existing model. Vocabulary adaptation proves to be the most effective technique in terms of both performance gains and training efficiency.
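A minimal sketch of the vocabulary-adaptation step before continued pre-training, using the Hugging Face transformers API: new target-language tokens are added to the tokenizer, the embedding matrix is resized, and the new rows are initialized from the sub-word pieces of the original tokenizer. The base model, token list, and mean-initialization heuristic are illustrative assumptions, not necessarily the exact setup described in the talk.

```python
# Sketch: vocabulary adaptation of an existing model before continued pre-training.
# Model name, new tokens, and the mean-initialization heuristic are illustrative.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

base_model = "gpt2"                                        # stand-in for the source model
old_tokenizer = AutoTokenizer.from_pretrained(base_model)  # keep an unmodified copy for piece lookup
tokenizer = AutoTokenizer.from_pretrained(base_model)
model = AutoModelForCausalLM.from_pretrained(base_model)

# Target-language tokens (toy examples); in practice these come from a tokenizer
# trained on the target-language corpus.
new_tokens = ["llengua", "paraula", "recerca"]

tokenizer.add_tokens(new_tokens)
model.resize_token_embeddings(len(tokenizer))              # new rows start randomly initialized

# Initialize each new embedding as the mean of the pieces the old tokenizer used for it,
# instead of leaving it random (a common heuristic, not necessarily the authors' method).
with torch.no_grad():
    emb = model.get_input_embeddings().weight
    for tok in new_tokens:
        piece_ids = old_tokenizer(" " + tok, add_special_tokens=False).input_ids
        emb[tokenizer.convert_tokens_to_ids(tok)] = emb[piece_ids].mean(dim=0)

# ...then continue pre-training on the target-language corpus as usual.
```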
Irene Baucells de la Peña
Research Engineer, Barcelona Supercomputing Center
Joan Llop Palao
Research Engineer, Barcelona Supercomputing Center
12:00pm - 12:30pm
Accelerating Multilingual AI Progress with Aya
Abstract: The Aya project is an ongoing collaborative open-science endeavor aimed at building a multilingual language model via instruction tuning, harnessing the collective wisdom and contributions of people worldwide. The project was initiated by Cohere For AI as a multi-institutional collaboration with the help of a community of researchers, engineers, linguists, social scientists, and lifelong learners from over 100 countries around the world. In the Aya project, we want to improve available multilingual generative models and accelerate progress for languages across the world. In this talk, I will give a brief overview of Aya, present the work that has been done, and discuss the outcomes expected from the project.
Ahmet Ustun
Research Scientist, Cohere For AI (Aya)
12:30pm - 1:30pm
1 hour Lunch Break
Session III
1:30pm - 2:00pm
Continued Pre-training of LLMs
Abstract: Large language models (LLMs) are routinely pre-trained on billions of tokens, only to restart the process over again once new data becomes available. A much cheaper and more efficient solution would be to enable the continual pre-training of these models, i.e. updating pre-trained models with new data instead of re-training them from scratch. However, the distribution shift induced by novel data typically results in degraded performance on past data. Our vision is to develop methods that can enable efficiently updating pretrained models with new knowledge while preventing forgetting of past knowledge. Taking a step towards efficient continual pre-training, we examine the effect of different warm-up strategies and replay when continuing to pre-train models on new data and new languages.
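To make the two ingredients concrete, the sketch below shows one way to re-warm the learning rate and to replay a small fraction of previously seen data while training on the new corpus. The schedule shape, replay fraction, and data interfaces are illustrative assumptions, not the exact recipe studied in the talk.

```python
# Sketch of continual pre-training ingredients: learning-rate re-warming and data replay.
# Hyperparameters and data interfaces are illustrative.
import math
import random


def lr_at_step(step: int, base_lr: float = 3e-4, min_lr: float = 3e-5,
               warmup_steps: int = 1_000, total_steps: int = 100_000) -> float:
    """Linear re-warmup from 0 to base_lr, then cosine decay down to min_lr."""
    if step < warmup_steps:
        return base_lr * step / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return min_lr + 0.5 * (base_lr - min_lr) * (1 + math.cos(math.pi * progress))


def mixed_batches(new_data, old_data, replay_fraction: float = 0.05, seed: int = 0):
    """Iterate over the new corpus, occasionally yielding a replayed batch of old data."""
    rng = random.Random(seed)
    old_iter = iter(old_data)  # assumed to be a re-shuffled, effectively endless stream
    for batch in new_data:
        if rng.random() < replay_fraction:
            yield next(old_iter)   # replay previously seen data to reduce forgetting
        yield batch
```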
Kshitij Gupta
Graduate Student, MILA
Ayush Kaushal
Graduate Student, University of Texas at Austin
2:00pm - 2:30pm
Training an instruction-tuned and aligned LLM for Indian languages
Abstract: This talk will cover challenges in creating high-quality pre-training, fine-tuning, and preference-optimization datasets. We will share our experiences training the models and discuss benchmarks and accuracy evaluation. We will also announce the availability of the models at a hosted inference endpoint and discuss how they bring a significant efficiency gain for Indian languages over existing models (including GPT-x).
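Such efficiency gains are often reported in terms of tokenizer fertility (tokens per word), since Indic scripts tend to be heavily fragmented by English-centric tokenizers; that interpretation of "efficiency" is an assumption here, not a claim from the talk. A minimal sketch of the comparison, with placeholder tokenizers and text:

```python
# Sketch: compare how many tokens different tokenizers need for the same Hindi sentence.
# Tokenizer names and the sample sentence are placeholders.
from transformers import AutoTokenizer

sample = "यह एक छोटा हिंदी वाक्य है।"  # "This is a short Hindi sentence."
n_words = len(sample.split())

for name in ["gpt2", "xlm-roberta-base"]:   # stand-ins for "existing models"
    tok = AutoTokenizer.from_pretrained(name)
    n_tokens = len(tok(sample, add_special_tokens=False).input_ids)
    print(f"{name}: {n_tokens} tokens, fertility = {n_tokens / n_words:.2f} tokens/word")
```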
Pratyush Kumar
Co-Founder, Sarvam.ai
2:30pm - 3:00pm
Towards LLM Generative Diversity: Highly Nuanced and Context-Aware Data Generation Approaches
Abstract: Large language models (LLMs) such as GPT-3.5 have shown their prowess in generating human-like text based on extensive training on vast corpora of data. Understanding ambiguity, particularly in the presence of cultural and linguistic diversity, remains a formidable task: the same word or phrase can hold distinct meanings depending on the underlying socio-linguistic context and cultural underpinnings.
My presentation will discuss our work in Nigeria on evaluating the ability of large language models such as GPT-3.5 to capture the deep contextual meaning of cultural nuances, with a view to establishing the need to address current gaps and support greater local use of LLMs in health, financial inclusion, agriculture, and education in emerging markets.
Olubayo Adekanmbi
Founder/CEO, Data Science Nigeria
Closing Remarks
3:00pm - 3:10pm
Closing remarks and conclusion to the Cerebras workshop.
Natalia Vassilieva
Senior Director of Product Management, Cerebras Systems
7:00pm - 11:00pm
Open Source AI Party!
Come celebrate with MILA, Cerebras, Together, and others!