At NeurIPS 2023, we hosted our inaugural workshop on multilingual models, bringing together experts and enthusiasts to exchange insights and share progress in the field.
Research experts from around the world joined to speak on key areas of multilingual model development, including 1) dataset sourcing, 2) mixing languages in datasets, 3) pre-training in data-constrained environments, 4) instruction tuning without instruction datasets in target languages, and 5) benchmarking in a predominantly English-centric field. The workshop also emphasized the importance of cultural alignment in model development.
This blog post provides an overview of the workshop and links to each of the talks, grouped by the key topics in multilingual model development.
Dataset Gathering, Sourcing, and Cleaning Methodology
Neha Sengupta from G42 shed light on the challenges of gathering sufficient Arabic data for their large-scale models. She detailed the rigorous process of dataset compilation and cleaning, underscoring the disparity between the abundance of English data and the relative scarcity of data in languages such as Arabic. Olubayo Adekanmbi highlighted the importance of culturally rich datasets, particularly in the African context, emphasizing the need for data that captures the nuances of local languages and cultures. Pratyush Kumar from Sarvam.ai covered the challenges of creating high-quality pre-training, fine-tuning, and preference optimization datasets for Indian languages. Sampo Pyysalo from TurkuNLP explored the challenges of curating datasets for Finnish, a language spoken by fewer than 0.1% of the world's population.
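None of these talks are tied to a public code release in this post, but the general shape of such a cleaning pipeline is easy to sketch. Below is a minimal, hypothetical Python example of the kind of script filtering, quality heuristics, and exact deduplication a low-resource pretraining corpus might pass through; the Arabic-script heuristic, the thresholds, and the rules themselves are illustrative assumptions, not the speakers' actual pipelines.

```python
import hashlib
import re
import unicodedata

def is_mostly_arabic(text: str, threshold: float = 0.5) -> bool:
    """Heuristic script filter: keep documents whose alphabetic characters are
    mostly Arabic. The threshold is an illustrative choice."""
    letters = [c for c in text if c.isalpha()]
    if not letters:
        return False
    arabic = sum(1 for c in letters if "ARABIC" in unicodedata.name(c, ""))
    return arabic / len(letters) >= threshold

def basic_quality_filter(text: str) -> bool:
    """Drop very short documents and pages dominated by repeated boilerplate lines."""
    if len(text.split()) < 50:  # minimum-length rule (illustrative)
        return False
    lines = [l.strip() for l in text.splitlines() if l.strip()]
    if lines and len(set(lines)) / len(lines) < 0.3:  # highly repetitive pages
        return False
    return True

def clean_corpus(docs):
    """Normalize, filter, and exact-deduplicate an iterable of raw documents."""
    seen_hashes = set()
    for doc in docs:
        text = unicodedata.normalize("NFKC", doc)
        text = re.sub(r"\s+", " ", text).strip()  # collapse whitespace
        if not (is_mostly_arabic(text) and basic_quality_filter(text)):
            continue
        digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
        if digest in seen_hashes:  # exact dedup via content hash
            continue
        seen_hashes.add(digest)
        yield text
```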
Best Practices for Mixing Different Languages and Datasets
Joel Hestness from Cerebras delved into the intricacies of mixing datasets across languages, sharing insights on balancing Arabic and English data. Irene and Joan from the Barcelona Supercomputing Center focused on the challenges of developing models for Spanish and Catalan, highlighting the importance of considering cultural nuances in data sourcing. Finally, Felix Stollenwerk of AI Sweden provided a deep dive into the tokenization process of a multilingual LLM trained on six Nordic languages.
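The exact mixing recipes were not spelled out in this recap, but a common way to balance a high-resource and a low-resource language is temperature-scaled sampling over per-language token counts. The short Python sketch below illustrates that idea under that assumption; the token counts and temperature values are made up for illustration and are not the settings used in any of the models discussed.

```python
def mixture_weights(token_counts: dict[str, float], temperature: float = 0.5) -> dict[str, float]:
    """Temperature-scaled sampling weights over per-language token counts.
    temperature=1.0 samples proportionally to corpus size; smaller values
    upsample the smaller languages at the expense of the larger ones."""
    scaled = {lang: count ** temperature for lang, count in token_counts.items()}
    total = sum(scaled.values())
    return {lang: weight / total for lang, weight in scaled.items()}

# Made-up token counts (in billions of tokens), purely for illustration.
counts = {"en": 300.0, "ar": 30.0}
print(mixture_weights(counts, temperature=1.0))  # proportional: English dominates
print(mixture_weights(counts, temperature=0.5))  # Arabic is upsampled
```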
Pre-training and Continued Training Recipes
Kshitij Gupta from MILA and Ayush Kaushal from the University of Texas at Austin discussed the continued training of Indian language models, focusing on efficiently integrating new data into existing models without causing catastrophic forgetting. They highlighted the need for a lifelong learning setup in which language models adapt as new data becomes available. Joel Hestness also touched on pre-training methods, discussing the significance of Maximal Update Parameterization for stable training as models scale up. Rio Yokota from the Tokyo Institute of Technology shared technical details and challenges from pre-training Japanese LLMs with up to 175B parameters. Finally, Sampo Pyysalo described two approaches to pretraining an LLM for a low-resource language: 1) training seven monolingual models from scratch (186M to 13B parameters), and 2) continuing the pretraining of the multilingual BLOOM model on a mix of its original training data and Finnish.
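The replay idea behind the second of Pyysalo's approaches, continuing pretraining on a mix of the original corpus and the new language, can be sketched in a few lines. The data loader below interleaves two document streams at a fixed replay ratio as a simple guard against catastrophic forgetting; the 30/70 split and the stand-in corpora are illustrative assumptions, not the actual FinGPT configuration.

```python
import random

def mixed_stream(original_docs, new_lang_docs, replay_ratio: float = 0.3, seed: int = 0):
    """Yield training documents drawn from two streams: with probability
    `replay_ratio` replay a document from the original pretraining corpus,
    otherwise take one from the new-language corpus. Replaying original data
    is a simple guard against catastrophic forgetting."""
    rng = random.Random(seed)
    original = iter(original_docs)
    new_lang = iter(new_lang_docs)
    while True:
        source = original if rng.random() < replay_ratio else new_lang
        try:
            yield next(source)
        except StopIteration:
            return  # stop once either stream is exhausted (illustrative choice)

# Toy usage with stand-in corpora.
english_like = (f"original doc {i}" for i in range(10))
finnish_like = (f"finnish doc {i}" for i in range(10))
for doc in mixed_stream(english_like, finnish_like, replay_ratio=0.3):
    print(doc)
```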
Instruction Tuning Without Instruction Datasets
Preslav Nakov and Ahmet Üstün explored the complexities of instruction tuning in the absence of instruction datasets in target languages. Nakov described machine-translating English datasets to create Arabic evaluation datasets, while Üstün highlighted the community-driven effort to collect multilingual data, improving the quality and representation of underrepresented languages.
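The translation tooling behind this approach is not included in this post, but the core idea, running each field of an English example through a machine-translation system while keeping the pairing intact, is straightforward to sketch. In the hypothetical Python example below, `translate` is a placeholder for whatever MT model or API is actually used; the field names follow a common instruction-dataset layout and are an assumption on our part.

```python
def translate(text: str, target_lang: str) -> str:
    """Hypothetical placeholder for a machine-translation call.
    In practice this would wrap an MT model or API of your choice."""
    return f"[{target_lang}] {text}"  # stand-in so the sketch runs end to end

def translate_instruction_example(example: dict, target_lang: str = "ar") -> dict:
    """Translate the instruction/input/output fields of one English example,
    preserving the structure so the result can be used for tuning or evaluation."""
    return {
        field: translate(example[field], target_lang) if example[field] else ""
        for field in ("instruction", "input", "output")
    }

english_example = {
    "instruction": "Summarize the following paragraph.",
    "input": "Large language models are trained on web-scale text...",
    "output": "LLMs learn from very large text corpora.",
}
print(translate_instruction_example(english_example))
```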
Benchmarking and Evaluation
Benchmarking and evaluation of multilingual models were key topics addressed by Preslav Nakov and Neha Sengupta. Nakov discussed evaluating bilingual models on knowledge and reasoning using Arabic versions of datasets such as EXAMS and MMLU. Sengupta shared her team's approach to comparing the Jais-chat model's performance in Arabic against other models, demonstrating its competitive edge.
Safety Mechanisms in LLMs and Ethical Considerations
The workshop also highlighted the importance of safety and ethical considerations in AI development. Preslav Nakov elaborated on various safety mechanisms implemented at different stages of model development, including data cleaning and prompt engineering. He stressed the need for a comprehensive taxonomy of potential risks to address harmful content and behaviors effectively.
Alignment with Target Cultural Aspects
Finally, the significance of aligning models with cultural aspects was a focal point in the talks by Olubayo Adekanmbi and Ahmet Üstün. Adekanmbi emphasized the need for AI models to be culturally sensitive and representative, especially in the African context, while Üstün discussed the role of community-based data collection in capturing linguistic diversity globally.
The workshop served as a testament to the burgeoning interest and advancements in multilingual AI. It provided a platform for experts to share their invaluable experiences and foster a collaborative approach to tackling the challenges of developing multilingual models.
Neha Sengupta
Principal Applied Scientist, G42
Developing Arabic-centric Bilingual LLMs
Joel Hestness, PhD.
Principal Research Scientist, Cerebras Systems
Pretraining the Jais Bilingual Arabic-English Language Models
Preslav Nakov
Professor, MBZUAI (Mohamed bin Zayed University of Artificial Intelligence)
Guardrails and Evaluation of the Jais Arabic-English LLMs
Rio Yokota
Professor, Tokyo Institute of Technology
Overview of Japanese Efforts to Train LLMs
Felix Stollenwerk
Senior Research Scientist, AI Sweden
GPT-SW3: An LLM for Swedish and Nordic Languages
Irene Baucells de la Pena
Research Engineer, Barcelona Supercomputing Center
Joan Llop Palao
Research Engineer, Barcelona Supercomputing Center
Evaluating Language Adaptation Techniques for Mid-Resource Languages
Ahmet Üstün
Research Scientist, Cohere AYA
Accelerating Multilingual AI Progress with AYA
Sampo Pyysalo
Senior Researcher, TurkuNLP
FinGPT: Large Generative Models for a Small Language
Kshitij Gupta
Graduate Student, MILA
Ayush Kaushal
Graduate Student, University of Texas at Austin
Continued Pre-training of LLMs
Pratyush Kumar
Co-Founder, Sarvam.ai
Training an Instruction-Tuned and Aligned LLM for Indian Languages
Olubayo Adekanmbi
Founder/CEO, Data Science Nigeria
Highly-nuanced and Context-aware Data Generation approaches