At NeurIPS 2023, we hosted our inaugural workshop on multilingual models, bringing together experts and enthusiasts to exchange insights and share progress in the field.
Research experts from around the world joined to speak on key areas of multilingual model development, including 1) dataset sourcing, 2) mixing languages in datasets, 3) pre-training in data-constrained environments, 4) instruction tuning without instruction datasets in target languages, and 5) benchmarking in a predominantly English-centric field. The workshop also emphasized the importance of cultural alignment in model development.
This blog post provides an overview of the workshop and links to each of the talks, grouped by the key topics in multilingual model development.
Dataset Gathering, Sourcing, and Cleaning Methodology
Neha Sengupta from G42 shed light on the challenges of gathering sufficient Arabic data for their large-scale models. She detailed the rigorous process of dataset compilation and cleaning, underscoring the disparity between the abundance of English data and the relative scarcity of data in languages such as Arabic. Olubayo Adekanmbi highlighted the importance of culturally rich datasets, particularly in the African context, emphasizing the need for data that captures the nuances of local languages and cultures. Pratyush Kumar from Sarvam.ai covered the challenges of creating high-quality pre-training, fine-tuning, and preference optimization datasets for Indian languages. Sampo Pyysalo from TurkuNLP explored the challenges of curating datasets for Finnish, a language spoken by fewer than 0.1% of the world's population.
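None of these talks are tied to a public code release in this post, but the general shape of such a cleaning pipeline is easy to sketch. Below is a minimal, hypothetical Python example of the kind of script filtering, quality heuristics, and exact deduplication a low-resource pretraining corpus might pass through; the Arabic-script heuristic, the thresholds, and the rules themselves are illustrative assumptions, not the speakers' actual pipelines.

```python
import hashlib
import re
import unicodedata

def is_mostly_arabic(text: str, threshold: float = 0.5) -> bool:
    """Heuristic script filter: keep documents whose alphabetic characters are
    mostly Arabic. The threshold is an illustrative choice."""
    letters = [c for c in text if c.isalpha()]
    if not letters:
        return False
    arabic = sum(1 for c in letters if "ARABIC" in unicodedata.name(c, ""))
    return arabic / len(letters) >= threshold

def basic_quality_filter(text: str) -> bool:
    """Drop very short documents and pages dominated by repeated boilerplate lines."""
    if len(text.split()) < 50:  # minimum-length rule (illustrative)
        return False
    lines = [l.strip() for l in text.splitlines() if l.strip()]
    if lines and len(set(lines)) / len(lines) < 0.3:  # highly repetitive pages
        return False
    return True

def clean_corpus(docs):
    """Normalize, filter, and exact-deduplicate an iterable of raw documents."""
    seen_hashes = set()
    for doc in docs:
        text = unicodedata.normalize("NFKC", doc)
        text = re.sub(r"\s+", " ", text).strip()  # collapse whitespace
        if not (is_mostly_arabic(text) and basic_quality_filter(text)):
            continue
        digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
        if digest in seen_hashes:  # exact dedup via content hash
            continue
        seen_hashes.add(digest)
        yield text
```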
Best Practices for Mixing Different Languages and Datasets
Joel Hestness from Cerebras delved into the intricacies of mixing datasets across languages, sharing insights on balancing Arabic and English data. Irene and Joan from the Barcelona Supercomputing Center focused on the challenges of developing models for Spanish and Catalan, highlighting the importance of considering cultural nuances in data sourcing. Finally, Felix Stollenwerk of AI Sweden provided a deep dive into the tokenization process of a multilingual LLM trained on six Nordic languages.
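The exact mixing recipes were not spelled out in this recap, but a common way to balance a high-resource and a low-resource language is temperature-scaled sampling over per-language token counts. The short Python sketch below illustrates that idea under that assumption; the token counts and temperature values are made up for illustration and are not the settings used in any of the models discussed.

```python
def mixture_weights(token_counts: dict[str, float], temperature: float = 0.5) -> dict[str, float]:
    """Temperature-scaled sampling weights over per-language token counts.
    temperature=1.0 samples proportionally to corpus size; smaller values
    upsample the smaller languages at the expense of the larger ones."""
    scaled = {lang: count ** temperature for lang, count in token_counts.items()}
    total = sum(scaled.values())
    return {lang: weight / total for lang, weight in scaled.items()}

# Made-up token counts (in billions of tokens), purely for illustration.
counts = {"en": 300.0, "ar": 30.0}
print(mixture_weights(counts, temperature=1.0))  # proportional: English dominates
print(mixture_weights(counts, temperature=0.5))  # Arabic is upsampled
```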
Pre-training and Continued Training Recipes
Kshitij Gupta from MILA and Ayush Kaushal from the University of Texas at Austin discussed the continued training of Indian language models, focusing on efficiently integrating new data into existing models without causing catastrophic forgetting. They highlighted the need for a lifelong learning setup in which language models adapt as new data becomes available. Joel Hestness also touched on pre-training methods, discussing the significance of Maximal Update Parameterization for stable training as models scale up. Rio Yokota from the Tokyo Institute of Technology shared technical details and challenges from pre-training Japanese LLMs with up to 175B parameters. Finally, Sampo Pyysalo described two approaches to pretraining an LLM for a low-resource language: 1) training seven monolingual models from scratch (186M to 13B parameters), and 2) continuing the pretraining of the multilingual BLOOM model on a mix of its original training data and Finnish.
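The replay idea behind the second of Pyysalo's approaches, continuing pretraining on a mix of the original corpus and the new language, can be sketched in a few lines. The data loader below interleaves two document streams at a fixed replay ratio as a simple guard against catastrophic forgetting; the 30/70 split and the stand-in corpora are illustrative assumptions, not the actual FinGPT configuration.

```python
import random

def mixed_stream(original_docs, new_lang_docs, replay_ratio: float = 0.3, seed: int = 0):
    """Yield training documents drawn from two streams: with probability
    `replay_ratio` replay a document from the original pretraining corpus,
    otherwise take one from the new-language corpus. Replaying original data
    is a simple guard against catastrophic forgetting."""
    rng = random.Random(seed)
    original = iter(original_docs)
    new_lang = iter(new_lang_docs)
    while True:
        source = original if rng.random() < replay_ratio else new_lang
        try:
            yield next(source)
        except StopIteration:
            return  # stop once either stream is exhausted (illustrative choice)

# Toy usage with stand-in corpora.
english_like = (f"original doc {i}" for i in range(10))
finnish_like = (f"finnish doc {i}" for i in range(10))
for doc in mixed_stream(english_like, finnish_like, replay_ratio=0.3):
    print(doc)
```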
Instruction Tuning Without Instruction Datasets
Preslav Nakov and Ahmet Üstün explored the complexities of instruction tuning in the absence of instruction datasets in target languages. Nakov described machine-translating English datasets to create Arabic evaluation datasets, while Üstün highlighted the community-driven effort to collect multilingual data, improving the quality and representation of underrepresented languages.
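The translation tooling behind this approach is not included in this post, but the core idea, running each field of an English example through a machine-translation system while keeping the pairing intact, is straightforward to sketch. In the hypothetical Python example below, `translate` is a placeholder for whatever MT model or API is actually used; the field names follow a common instruction-dataset layout and are an assumption on our part.

```python
def translate(text: str, target_lang: str) -> str:
    """Hypothetical placeholder for a machine-translation call.
    In practice this would wrap an MT model or API of your choice."""
    return f"[{target_lang}] {text}"  # stand-in so the sketch runs end to end

def translate_instruction_example(example: dict, target_lang: str = "ar") -> dict:
    """Translate the instruction/input/output fields of one English example,
    preserving the structure so the result can be used for tuning or evaluation."""
    return {
        field: translate(example[field], target_lang) if example[field] else ""
        for field in ("instruction", "input", "output")
    }

english_example = {
    "instruction": "Summarize the following paragraph.",
    "input": "Large language models are trained on web-scale text...",
    "output": "LLMs learn from very large text corpora.",
}
print(translate_instruction_example(english_example))
```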
Benchmarking and Evaluation
Benchmarking and evaluation of multilingual models were key topics addressed by Preslav Nakov and Neha Sengupta. Nakov discussed evaluating bilingual models on knowledge and reasoning using Arabic versions of datasets such as EXAMS and MMLU. Sengupta shared her team's approach to comparing the Jais-chat model's performance in Arabic against other models, demonstrating its competitive edge.
Safety Mechanisms in LLMs and Ethical Considerations
The workshop also highlighted the importance of safety and ethical considerations in AI development. Preslav Nakov elaborated on various safety mechanisms implemented at different stages of model development, including data cleaning and prompt engineering. He stressed the need for a comprehensive taxonomy of potential risks to address harmful content and behaviors effectively.
Alignment with Target Cultural Aspects
Finally, the significance of aligning models with cultural aspects was a focal point in the talks by Olubayo Adekanmbi and Ahmet Üstün. Adekanmbi emphasized the need for AI models to be culturally sensitive and representative, especially in the African context, while Üstün discussed the role of community-based data collection in capturing linguistic diversity globally.
The workshop served as a testament to the burgeoning interest and advancements in multilingual AI. It provided a platform for experts to share their invaluable experiences and foster a collaborative approach to tackling the challenges of developing multilingual models.
Neha Sengupta
Principal Applied Scientist, G42
Developing Arabic-centric Bilingual LLMs
Joel Hestness, PhD.
Principal Research Scientist, Cerebras Systems
Pretraining the Jais Bilingual Arabic-English Language Models
Preslav Nakov
Professor, MBZUAI (Mohamed bin Zayed University of Artificial Intelligence)
Guardrails and Evaluation of the Jais Arabic-English LLMs
Rio Yokota
Professor, Tokyo Institute of Technology
Overview of Japanese Efforts to Train LLMs
Felix Stollenwerk
Senior Research Scientist, AI Sweden
GPT-SW3: An LLM for Swedish and Nordic Languages
Irene Baucells de la Pena
Research Engineer, Barcelona Supercomputing Center
Joan Llop Palao
Research Engineer, Barcelona Supercomputing Center
Evaluating Language Adaptation Techniques for Mid-Resource Languages
Ahmet Üstün
Research Scientist, Cohere AYA
Accelerating Multilingual AI Progress with AYA
Sampo Pyysalo
Senior Researcher, TurkuNLP
FinGPT: Large Generative Models for a Small Language
Kshitij Gupta
Graduate Student, MILA
Ayush Kaushal
Graduate Student, University of Texas at Austin
Continued Pre-training of LLMs
Pratyush Kumar
Co-Founder, Sarvam.ai
Training an Instruction-Tuned and Aligned LLM for Indian Languages
Olubayo Adekanmbi
Founder/CEO, Data Science Nigeria
Highly-nuanced and Context-aware Data Generation approaches