Abstract
Foundation models can provide general-purpose responses across a wide range of tasks. However, obtaining higher-quality outputs for specific tasks requires further training. Our research indicates that a smaller foundation model fine-tuned on a domain-specific dataset and task can outperform a larger out-of-the-box foundation model. We show that a GPT-NeoX 1.4B model fine-tuned for 2,000 training steps performs as well as the out-of-the-box GPT-J 6B model. Additionally, we show how users can easily fine-tune models using Launchpad, the simplified command-line interface of our cloud-enabled Cerebras AI Model Studio.
Introduction to Training Foundation Models
Foundation models, such as BERT, CLIP, and the GPT family, including ChatGPT and GPT-4, are revolutionizing the way we approach machine learning. Trained using generic data, these models achieve great generalizability across a wide range of tasks. For some users, using a foundation model as-is, or out-of-the-box, will be sufficient – this is called Zero-Shot Learning (ZSL). These users are interested in a task that the model is already capable of doing and are satisfied with the quality of the output.
However, users who need either a task the model does not already support or higher-quality output on an already supported task will need to further train a foundation model. For example, a user may want to adapt the foundation model so that it can detect fake news. Or perhaps a user may want to train a model from scratch using protein data or legal text. In each of these use cases, the model does not fully understand the specialized language of the domain, and more performance can be gained by further training the model on a domain-specific dataset that reflects these differences in language.
Performance gains from fine-tuning
At Cerebras, we wanted to test whether fine-tuning a model on a domain-specific dataset could achieve better performance than an out-of-the-box model. We conducted several experiments in which we fine-tuned various GPT models on different datasets and compared them against out-of-the-box GPT models. For each experiment, we saved checkpoints along the way and computed the following evaluation metrics: training loss, evaluation loss, accuracy, and perplexity. Following standard practice, the evaluation runs used an "eval" dataset held out from the training data. Please see the definitions below for each of the evaluation metrics.
- Training loss – The difference between the predicted outcome of a given model and the corresponding true output (also known as the target or label) for a specific set of training data. Training loss measures how well the model fits the training data.
- Evaluation loss – The difference between the predicted outcome of a given model and the actual outcome for a separate validation dataset that is not used during the training process. The purpose of evaluation loss is to provide an estimate of the model’s generalization performance on new, unseen data. A lower evaluation loss typically indicates better generalization performance.
- Accuracy – The proportion of correctly predicted tokens over the total number of tokens. Accuracy measures how often the model's highest-probability prediction for the next token matches the actual token.
- Perplexity – The average number of candidate words that could follow a given sequence of words. A lower perplexity indicates that the model is better at predicting the next word in a sequence.
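These metrics are closely related: for a language model, perplexity is the exponential of the cross-entropy evaluation loss, which you can verify against the tables below (for example, exp(1.92) ≈ 6.84). As a minimal sketch of how such token-level metrics can be computed, the PyTorch snippet below uses hypothetical logits and targets; it is an illustration, not the evaluation code used inside the Cerebras AI Model Studio.

```python
# Illustrative only: computing token-level loss, accuracy, and perplexity
# from model logits. Tensors and shapes are made up for the example.
import torch
import torch.nn.functional as F

def eval_metrics(logits: torch.Tensor, targets: torch.Tensor):
    """logits: [batch, seq_len, vocab_size]; targets: [batch, seq_len] token ids."""
    vocab_size = logits.size(-1)
    # Cross-entropy loss averaged over every token position.
    loss = F.cross_entropy(logits.reshape(-1, vocab_size), targets.reshape(-1))
    # Perplexity is the exponential of the mean cross-entropy loss.
    perplexity = torch.exp(loss)
    # Accuracy: fraction of positions where the most likely token matches the target.
    accuracy = (logits.argmax(dim=-1) == targets).float().mean()
    return loss.item(), accuracy.item(), perplexity.item()

# Toy example with random data.
logits = torch.randn(2, 8, 50257)          # batch of 2, sequence length 8
targets = torch.randint(0, 50257, (2, 8))  # "true" next-token ids
print(eval_metrics(logits, targets))
```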
Fine-tuning with TRC2 Corpus
We began our experiments by fine-tuning GPT-J 6B on the TRC2 Corpus dataset, which contains 1,800,370 news stories from Reuters. We fine-tuned GPT-J 6B on the TRC2 Corpus for a total of 12,000 steps in order to compare it to the out-of-the-box (step zero) GPT-J 6B model. Note that step zero is equivalent to Zero-Shot Learning (ZSL), i.e., using a foundation model out of the box without any fine-tuning. Figure 1 shows the evaluation plots for each metric. All three metrics keep improving, which indicates that the model continues to generalize and improve.
Figure 1. Eval metric plots for GPT-J 6B on TRC2 dataset.
In our next experiment, we fine-tuned the GPT-NeoX 1.4B model on the TRC2 Corpus dataset for a total of 3,000 steps and saved a checkpoint every 1,000 steps in order to compare it to the out-of-the-box (step zero) GPT-J 6B model.
| Model | Step | Accuracy | Loss | Perplexity |
|---|---|---|---|---|
| GPT-J 6B | 0 (Original) | 0.625 | 1.92 | 6.84 |
| GPT-NeoX 1.4B | 0 (Original) | 0.439 | 2.89 | 18.00 |
| GPT-NeoX 1.4B | 1000 | 0.570 | 2.13 | 8.44 |
| GPT-NeoX 1.4B | 2000 | 0.602 | 1.88 | 6.56 |
| GPT-NeoX 1.4B | 3000 | 0.621 | 1.76 | 5.84 |
Table 1. GPT-NeoX 1.4B evaluation metrics at different checkpoints compared to zero-shot GPT-J 6B.
As shown in Table 1 and Figure 2, a GPT-NeoX 1.4B model fine-tuned on the TRC2 corpus for about 2,000 training steps performs nearly as well as the zero-shot (out-of-the-box) GPT-J 6B model. Continuing to train GPT-NeoX 1.4B improves it further relative to the out-of-the-box GPT-J 6B: by 3,000 steps its loss and perplexity are already better. This indicates that a user can fine-tune a smaller model on a domain-specific dataset and outperform an out-of-the-box larger model. Users who elect to train smaller models will require fewer resources when deploying for inference, which will save them significant money over the lifetime of their generative AI application.
It is worth noting that the numbers in Table 1 were achieved with very little hyperparameter tuning; more tuning and iteration could yield even better results. With fine-tuning of its own, GPT-J 6B would likely outperform GPT-NeoX 1.4B, but that would require more compute for both training and inference. The right trade-off depends on the goal of the application.
Figure 2. After 3,000 fine-tuning steps, the smaller model accuracy approaches that of the much larger model. Loss and perplexity are better after only 2,000 fine-tuning steps.
Fine-tuning with Curation Corpus
Another dataset we used for fine-tuning with the Cerebras AI Model Studio is Curation Corpus, a collection of 40,000 professionally written summaries of news articles.
We fine-tuned the GPT-NeoX 1.4B foundation model using Curation Corpus for 1,000 steps.
| Step | Accuracy | Loss | Perplexity |
|---|---|---|---|
| 0 | 0.50 | 2.78 | 16.0 |
| 1000 | 0.52 | 2.17 | 8.8 |
Table 2. GPT-NeoX 1.4B evaluation metrics at different checkpoints on Curation Corpus.
Fine-tuning with BigPatent
Another summarization dataset that uses a different domain is BigPatent, which consists of 1.3 million records of U.S. patent documents with human-written, abstractive summaries. Due to the legal nature of the original patent text, patent summarization is a challenging task.
We fine-tuned a GPT-J 6B foundation model using the BigPatent dataset for 7,000 steps.
| Step | Accuracy | Loss | Perplexity |
|---|---|---|---|
| 0 | 0.55 | 1.99 | 7.28 |
| 7000 | 0.60 | 1.72 | 5.59 |
Table 3. GPT-J 6B evaluation metrics at different checkpoints on BigPatent.
The results in Table 2 and Table 3 both show significant improvement compared to Zero-Shot Learning.
Fine-Tuning vs Training from Scratch
Users have two options when creating domain-specific models: fine-tuning an existing generic foundation model, or training a new model from scratch. As discussed above, fine-tuning refers to taking a pre-trained model and adapting it to a new task or dataset by further training on the new data, usually with smaller learning rates and fewer training epochs. Training from scratch means training a new model, which has not previously been trained on any dataset, on a specific task or dataset. Fine-tuning and training from scratch have different requirements, as shown in Table 4.
| | Fine-Tuning | Training from Scratch |
|---|---|---|
| Task Requirements | Task must be similar to the tasks learned by the foundation model. | No task requirement, as the model has not been exposed to any data yet. |
| Data Requirements | Dataset does not need to be large, but should be similar to the original dataset used for pre-training. | Dataset should be sufficiently large to avoid over-fitting and achieve good performance. |
| Compute Requirements | Faster and requires fewer computational resources than training from scratch, since the model's initial parameters have already been learned. | Computationally expensive and time-consuming, especially for complex models with many parameters. |
| Model Performance | Fine-tuning leads to better performance than an out-of-the-box foundation model when the pre-trained model is relevant to the new task and the new dataset is similar to the original dataset. | Training from scratch can outperform fine-tuning when (1) the pre-trained models are not relevant to the new task, (2) the new dataset is significantly different from the original dataset, or (3) the dataset is sufficiently large. |
Table 4. Fine-Tuning versus Training from scratch.
To summarize, fine-tuning is used to adapt pre-trained models to new tasks or datasets, while training from scratch is the process of training new models with no prior knowledge of any dataset. Fine-tuning is usually faster and requires fewer computational resources, but training from scratch can be more powerful when pre-trained models are not relevant, or the new dataset is significantly different from the original dataset.
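For concreteness, here is a minimal sketch of the two starting points using generic Hugging Face Transformers APIs. The EleutherAI/pythia-1.4b checkpoint and the learning rates are arbitrary illustrative choices, and this is not the Cerebras AI Model Studio workflow itself.

```python
# Illustrative contrast between fine-tuning and training from scratch.
import torch
from transformers import AutoConfig, AutoModelForCausalLM

# Fine-tuning: start from pre-trained weights and continue training on the
# new, domain-specific data (typically with a smaller learning rate).
finetune_model = AutoModelForCausalLM.from_pretrained("EleutherAI/pythia-1.4b")
finetune_opt = torch.optim.AdamW(finetune_model.parameters(), lr=1e-5)

# Training from scratch: same architecture, randomly initialized weights.
# This generally needs a much larger dataset and far more compute.
config = AutoConfig.from_pretrained("EleutherAI/pythia-1.4b")
scratch_model = AutoModelForCausalLM.from_config(config)
scratch_opt = torch.optim.AdamW(scratch_model.parameters(), lr=2e-4)
```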
Training with the Cerebras AI Model Studio Launchpad
Users that are interested in fine-tuning or training from scratch can do so with the Cerebras AI Model Studio Launchpad.
For the initial set of supported models (Table 5), we chose a diverse set of NLP models to cover a variety of options for your fine-tuning tasks. These include foundation models from EleutherAI, Salesforce, and our own Cerebras-GPT family, with more models coming soon. All of these models are compatible with the Hugging Face checkpoint format.
Note that different model sizes provide a tradeoff between compute and target accuracy. Smaller foundation models require less compute and train faster than larger models. In addition, smaller models are easier to deploy.
Another difference is the data the foundation models were trained on. For example, the Salesforce CodeGen-Multi series of models is fine-tuned on a dataset spanning multiple programming languages, while the Salesforce CodeGen-Mono series is further fine-tuned on Python code. Therefore, depending on your task, some models might provide better results than others.
Our internally trained Cerebras-GPT models were trained on the Pile dataset in accordance with Chinchilla scaling laws. You can explore the models in more detail in our Model Zoo repository, and learn more about the training methodology in our paper, "Cerebras-GPT: Open Compute-Optimal Language Models Trained on the Cerebras Wafer-Scale Cluster".
| Model | Description |
|---|---|
| EleutherAI GPT-J 6B | |
| EleutherAI GPT-NeoX | We support all checkpoint variants in Hugging Face from EleutherAI NeoX: EleutherAI GPT-NeoX 20B; EleutherAI Pythia: 70M, 160M, 410M, 1B, 1.4B, 2.8B, 6.9B and 12B. |
| Salesforce CodeGen | We support all checkpoint variants of Salesforce CodeGen: CodeGen-NL: 350M, 2.7B, 6.1B and 16.1B; CodeGen-Multi: 350M, 2.7B, 6.1B and 16.1B; CodeGen-Mono: 350M, 2.7B, 6.1B and 16.1B. |
| Cerebras-GPT | We support the following variants of Cerebras-GPT: 111M, 256M, 590M, 1.3B, 2.7B, 6.7B and 13B. |
Table 5. List of foundation models supported by AI Model Studio.
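Because these checkpoints follow the Hugging Face format, they can also be inspected with the standard Hugging Face API. Below is a minimal sketch assuming the cerebras/Cerebras-GPT-1.3B model card; it shows generic checkpoint loading and generation, not the Cerebras AI Model Studio training path.

```python
# Illustrative: load a supported checkpoint via Hugging Face and generate text.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "cerebras/Cerebras-GPT-1.3B"  # any Table 5 model with an HF card works similarly
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

inputs = tokenizer("Generative AI is", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=20, do_sample=False)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```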
We offer these models for fine-tuning at competitive prices and with 8x faster training speeds than traditional cloud. Cerebras delivers these performance gains at lower cost through the Cerebras Wafer-Scale Cluster. And with the Cerebras AI Model Studio, your data is always secure and remains yours; you own your ML methods, and your trained weights are yours to keep. Not only that, but our staff of experts, who produced this research, is on hand to help you optimize your fine-tuning or training-from-scratch experiment.
Quick Guide to Fine-Tuning with the Cerebras AI Model Studio Launchpad
The easiest way for users to set up and run their experiments is to use Launchpad, the Cerebras AI Model Studio's simplified command-line interface. Launchpad removes the overhead of setting up the environment, preparing the model code and checkpoints, and running large foundation models at scale and with high performance. It is designed to let users focus on their experiments from the start, so they can just dive in. Follow these simple steps to train your first model:
1. Log in to the user node (`ssh user@ip-addr`)
2. Copy over your dataset in our recommended format, following our data loader guide (a generic data-preparation sketch follows these steps)
3. Enter Launchpad
```
Welcome to Launchpad. Type help or ? to list commands.
> help

Documented commands (type help <topic>):
========================================
add   exit        help     list  start   stop
eval  experiment  history  run   status  view

>
```
4. Enter `list` to see which model was selected, the available datasets and checkpoints, and the hyperparameters one can change in the model.
```
> list
Model name: GPT-3-1.3B
Available datasets:
- pile_2048
Available checkpoints:
- ID: 4, Timestamp: 2023-03-29 00:12:25.398312, global_step: 0
- ID: 5, Timestamp: 2023-03-29 00:31:36.099650, global_step: 100
- ID: 9, Timestamp: 2023-03-29 13:36:47.741818, global_step: 10100
Hyperparameters:
  model:
    dropout:
      constraints:
      - Number must be in range [0.0, 1.0]
      - Type of the value must be one of [float]
      default: 0.0
      description: Dropout rate to use.
      required: false
  optimizer:
    Refer to documentation for a full list of available optimizers and
    learning rate schedulers.
```
5. Enter the `add dataset` command to add the custom dataset you have already copied over.
```
> add dataset -h
usage: add dataset [-h] --name NAME --paths PATHS [PATHS ...]

Add a new dataset to registry of available datasets.

optional arguments:
  -h, --help            show this help message and exit
  --name NAME           Unique name of the dataset
  --paths PATHS [PATHS ...]
                        List of data directories for this dataset.

> add dataset --name example_dataset --paths <path_to_dataset_dir>
```
6. Enter the `experiment` command to add an experiment with the hyperparameters of your choice.
```
> experiment -h
usage: experiment [-h] {add,view,delete} ...

Manage experiments to run.

positional arguments:
  {add,view,delete}
    add              Add a new experiment
    view             View existing experiments
    delete           Delete an experiment

optional arguments:
  -h, --help         show this help message and exit
>
```
7. The `experiment add` command will open the configuration file in the vim editor. The configuration file follows YAML syntax. Here you can change the model and hyperparameters, including the number of steps to train for.
8. Enter the `run` command to start training.
```
> run -h
usage: run [-h] [-n N]

optional arguments:
  -h, --help  show this help message and exit
  -n N        Run the last `n` experiments. If not provided, it will run the
              last available experiment.
>
```
9. Enter the `status` command to see the list of jobs that have been started and their current status. This command also provides the TensorBoard link for viewing progress and results. The user can also run `status view --job_id <job_id>` to get more details about a specific job, such as losses, the hyperparameters used, and the checkpoints generated by the run.
```
> status
Tensorboard at: http://sc-r10ra10-s15.cerebrassc.local:43900/
+---------+--------+-------------+-----------+----------------+
| Job ID  | Model  | Start Time  | End Time  | Latest Status  |
+---------+--------+-------------+-----------+----------------+

> status view -h
usage: status view [-h] --job_id JOB_ID
                   [--summaries | --hyperparams | --checkpoints]

optional arguments:
  -h, --help       show this help message and exit
  --job_id JOB_ID  Filter by the given job id
  --summaries      Display loss summaries
  --hyperparams    Display hyperparameters used in this job
  --checkpoints    Display all checkpoints collected from this run
```
The `status` command can also be used to cancel a running job.
```
> status cancel -h
usage: status cancel [-h] --job_id JOB_ID

optional arguments:
  -h, --help       show this help message and exit
  --job_id JOB_ID  Filter by the given job id
```
10. To exit Launchpad, enter the `exit` command.
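As a companion to step 2, the snippet below is a generic, heavily simplified sketch of the kind of preparation involved in getting raw text into fixed-length token sequences. The GPT-2 tokenizer, the function name, and the toy inputs are illustrative assumptions; the actual recommended on-disk format is defined by our data loader guide.

```python
# Purely illustrative data preparation: tokenize documents, concatenate them,
# and split the stream into fixed-length samples. Not the Cerebras format.
import numpy as np
from transformers import AutoTokenizer

MAX_SEQ_LEN = 2048  # e.g. matching the pile_2048 dataset naming shown above

tokenizer = AutoTokenizer.from_pretrained("gpt2")  # illustrative tokenizer choice

def pack_documents(texts, max_seq_len=MAX_SEQ_LEN):
    """Return an array of shape [num_samples, max_seq_len] of token ids."""
    ids = []
    for text in texts:
        ids.extend(tokenizer(text)["input_ids"])
        ids.append(tokenizer.eos_token_id)  # separator between documents
    n_full = len(ids) // max_seq_len       # drop the trailing partial sample
    return np.array(ids[: n_full * max_seq_len], dtype=np.int32).reshape(n_full, max_seq_len)

samples = pack_documents(["First news article ...", "Second news article ..."])
print(samples.shape)  # (num_full_samples, 2048); the tiny toy inputs here yield 0 full samples
```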
Get Started
Fine-tuning with Cerebras AI Model Studio is easy and effective for building high-quality models. Contact Cerebras by emailing us at developer@cerebras.ai or by filling out this form. Please let us know if you are interested in a model that is not listed.
Authors:
Emad Barsoum, Senior Director of AI
Vishal Subbiah, ML Software Architect
Udai Mody, Product Marketing and Partnerships
April 18, 2023
Resources
- Cerebras-GPT: A Family of Open, Compute-efficient, Large Language Models (blog)
- Cerebras-GPT: Open Compute-Optimal Language Models Trained on the Cerebras Wafer-Scale Cluster (arXiv paper)
- Cerebras AI Model Studio (Cerebras Cloud)
- Cerebras Model Zoo (GitHub)
- Cerebras-GPT model cards (Hugging Face)
- Cerebras Discord server (Discord)
- Cerebras Community Forum (Discourse)