We are excited to announce the release of Cerebras DocChat, our first iteration of models designed for document-based conversational question answering. This series includes two models: Cerebras Llama3-DocChat, a large language model (LLM), and Cerebras Dragon-DocChat, a multi-turn retriever model. These models were not only developed by leveraging our deep ML expertise, but were also trained with remarkable speed: using a single Cerebras System, Llama3-DocChat was trained in just a few hours, while Dragon-DocChat was fine-tuned in just a few minutes (yes, you read that correctly).

Cerebras Llama3-DocChat was built on top of Llama 3 base using insights from the latest research on document-based Q&A, most notably Nvidia’s ChatQA model series. As part of this work, we leveraged our experience in LLM model training and dataset curation to incorporate key elements of ChatQA’s training approach. Additionally, we employed synthetic data generation to address limitations that couldn’t be fully resolved with the available real data.

Similarly, Cerebras Dragon-DocChat was built on top of the Dragon+ model and trained on ChatQA's conversational Q&A dataset. By fine-tuning with a contrastive loss and hard negatives, we achieve absolute improvements in top-1 recall of 8.9% over Dragon+ and 3.5% over ChatQA Dragon-Multiturn.
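
For readers curious what that objective looks like in practice, here is a minimal sketch of a contrastive loss with in-batch and hard negatives, assuming a bi-encoder retriever that produces normalized embeddings. The tensor shapes, temperature, and function names are illustrative rather than our exact training configuration.

```python
import torch
import torch.nn.functional as F

def contrastive_loss(query_emb, pos_emb, hard_neg_emb, temperature=0.05):
    """InfoNCE-style loss: each query is scored against its gold passage,
    its own mined hard negatives, and in-batch negatives (the other queries'
    gold passages).

    query_emb:    [B, D] query embeddings
    pos_emb:      [B, D] embeddings of the gold (positive) passages
    hard_neg_emb: [B, N, D] embeddings of N hard negatives per query
    """
    B, N, _ = hard_neg_emb.shape
    # Scores against every positive in the batch -> in-batch negatives come for free.
    pos_scores = query_emb @ pos_emb.T                                  # [B, B]
    # Scores against this query's own hard negatives.
    neg_scores = torch.einsum("bd,bnd->bn", query_emb, hard_neg_emb)    # [B, N]
    logits = torch.cat([pos_scores, neg_scores], dim=1) / temperature   # [B, B+N]
    # The correct passage for query i sits in column i of the positive block.
    labels = torch.arange(B, device=query_emb.device)
    return F.cross_entropy(logits, labels)

# Toy usage with random embeddings.
q = F.normalize(torch.randn(4, 768), dim=-1)
p = F.normalize(torch.randn(4, 768), dim=-1)
n = F.normalize(torch.randn(4, 2, 768), dim=-1)
print(contrastive_loss(q, p, n))
```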

Open Source Commitment

In line with our commitment to open source, we are releasing not only the model weights but also the complete training recipes and associated datasets. This transparency allows the AI community to replicate, build upon, and innovate with our work. See below for links.

Benchmarks

The DocChat models have been evaluated across a variety of benchmarks and achieve top-of-the-line performance for their model sizes.

| ChatRAG Benchmark | Llama3 Instruct 8B | Command-R-Plus | Nvidia Llama3-ChatQA 1.5 8B | GPT-4-Turbo-2024-04-09 | Cerebras Llama3-DocChat 1.0 8B |
|---|---|---|---|---|---|
| Doc2Dial | 31.33 | 33.51 | 39.33 | 35.35 | 39.19 |
| QuAC | 32.64 | 34.16 | 39.73 | 40.1 | 36 |
| QReCC | 43.4 | 49.77 | 49.03 | 51.46 | 50.27 |
| CoQA | 73.25 | 69.71 | 76.46 | 77.73 | 79.56 |
| DoQA | 30.34 | 40.67 | 49.6 | 41.6 | 48.77 |
| ConvFinQA | 53.15 | 71.21 | 78.46 | 84.16 | 80.13 |
| SQA | 36.6 | 74.07 | 73.28 | 79.98 | 74.19 |
| TopiOCQA | 34.64 | 53.77 | 49.96 | 48.32 | 52.13 |
| HybriDial* | 40.77 | 46.7 | 65.76 | 47.86 | 64 |
| INSCIT | 32.09 | 35.76 | 30.01 | 33.75 | 32.88 |
| Average (all) | 40.82 | 50.93 | 55.17 | 54.03 | 55.71 |
| Average (excluding HybriDial) | 40.83 | 51.4 | 53.99 | 54.72 | 54.79 |

| Eleuther Eval Harness | Llama3 Instruct 8B | Nvidia Llama3-ChatQA 1.5 8B | Cerebras Llama3-DocChat 1.0 8B |
|---|---|---|---|
| hellaswag | 57.68 | 61.37 | 61.68 |
| winogrande | 71.98 | 73.95 | 74.11 |
| truthfulqa_mc1 | 36.23 | 28.52 | 29.25 |
| truthfulqa_mc2 | 51.65 | 43.56 | 45.14 |
| mmlu | 63.84 | 60.68 | 62.86 |
| gsm8k | 76.12 | 13.72 | 55.57 |
| arc_easy | 81.61 | 80.56 | 82.03 |
| arc_challenge | 52.99 | 51.02 | 53.92 |
| Average | 61.51 | 51.67 | 58.07 |
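
The language-model scores above come from the Eleuther Eval Harness. The snippet below is a rough sketch of how one might run a comparable evaluation with the lm-evaluation-harness Python API (v0.4+); the checkpoint id, dtype, batch size, and default few-shot settings are assumptions for illustration and may differ from the exact configuration behind the table.

```python
# Illustrative only: running the Eleuther Eval Harness tasks from the table above.
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",
    # Assumed Hugging Face repo id; substitute the checkpoint you want to evaluate.
    model_args="pretrained=cerebras/Llama3-DocChat-1.0-8B,dtype=bfloat16",
    tasks=["hellaswag", "winogrande", "truthfulqa_mc1", "truthfulqa_mc2",
           "mmlu", "gsm8k", "arc_easy", "arc_challenge"],
    batch_size=8,
)

for task, metrics in results["results"].items():
    print(task, metrics)
```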

| Benchmark | Metric | Facebook Dragon+ | Nvidia Dragon Multiturn | Cerebras Dragon-DocChat |
|---|---|---|---|---|
| Doc2Dial | Recall@1 | 43.95 | 50.11 | 51.54 |
| | Recall@5 | 77.61 | 83.85 | 83.12 |
| | Recall@20 | 92.05 | 95.33 | 95.25 |
| QuAC | Recall@1 | 62.09 | 60.02 | 61.30 |
| | Recall@5 | 86.01 | 86.51 | 87.69 |
| | Recall@20 | 96.48 | 96.6 | 97.25 |
| QReCC | Recall@1 | 49 | 49.43 | 55.41 |
| | Recall@5 | 85.14 | 86.6 | 90.11 |
| | Recall@20 | 97.21 | 98.28 | 98.39 |
| INSCIT* | Recall@1 | 11.134 | 18.35 | 21.65 |
| | Recall@5 | 29.27 | 48.45 | 50.72 |
| | Recall@20 | 49.07 | 66.19 | 72.78 |
| TopiOCQA* | Recall@1 | 29.19 | 31.34 | 38.19 |
| | Recall@5 | 62.53 | 65.79 | 72.47 |
| | Recall@20 | 83.69 | 84.37 | 87.23 |
| Average** | Average top-1 | 49.37 | 54.76 | 58.29 |
| | Average top-5 | 76.30 | 81.50 | 84.19 |

*Evaluated on a subset of the Wikipedia corpus that was available to us. All models use the same evaluation strategy to ensure apples-to-apples comparisons.

** We follow the same convention as ChatQA, where the top-5 and top-20 scores of TopiOCQA and INSCIT are compared against the top-1 and top-5 scores, respectively, of the other datasets, in order to match differences in average context length.
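
For clarity, Recall@k here is the fraction of queries for which at least one gold passage appears among the top-k retrieved passages. The small helper below is an illustrative way to compute it, not our evaluation code.

```python
def recall_at_k(ranked_ids, gold_ids, k):
    """Fraction of queries with at least one gold passage in the top-k results.

    ranked_ids: list of ranked passage-id lists, one per query (best first)
    gold_ids:   list of sets of gold passage ids, one per query
    """
    hits = sum(1 for ranked, gold in zip(ranked_ids, gold_ids)
               if gold & set(ranked[:k]))
    return hits / len(ranked_ids)

# Toy example: the first query hits at rank 1, the second misses the top-1.
ranked = [["p3", "p7", "p1"], ["p9", "p2", "p5"]]
gold = [{"p3"}, {"p2"}]
print(recall_at_k(ranked, gold, 1))  # 0.5
print(recall_at_k(ranked, gold, 5))  # 1.0
```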

Getting Started with DocChat 

Follow the links below to start using DocChat today! 
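
Once you have the weights, a minimal chat loop with Hugging Face transformers looks roughly like the sketch below. The repo id, prompt layout, and document string are assumptions for illustration; check the model card for the exact identifiers and recommended prompt format.

```python
# A minimal sketch of chatting with Llama3-DocChat via Hugging Face transformers.
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model_id = "cerebras/Llama3-DocChat-1.0-8B"  # assumed Hugging Face repo id
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

document = "Cerebras DocChat is a model series for document-based conversational Q&A."
messages = [
    {"role": "system", "content": f"Answer questions using only this document:\n{document}"},
    {"role": "user", "content": "What is DocChat designed for?"},
]

inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)
output = model.generate(inputs, max_new_tokens=128)
print(tokenizer.decode(output[0, inputs.shape[-1]:], skip_special_tokens=True))
```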

The Recipe 

While the ChatQA series provided a valuable foundation, we identified several gaps in their released datasets and training recipes. We crafted our final recipe by combining insights from analyzing their model with our own experience in LLM training and dataset curation. Notably, we addressed challenges in handling unanswerable questions, arithmetic performance, and entity extraction:

    • Handling Unanswerable Questions: In our initial attempts, the model struggled with unanswerable questions (i.e., responding “I can’t answer …”). The ChatQA paper notes that training on the synthetic conversational Q&A dataset yields worse performance on unanswerable benchmarks than training on their human-annotated dataset (which we don’t have access to); even so, the gap we observed in initial experiments was substantial, requiring an alternative approach. Through experimentation, we found that upsampling the samples corresponding to unanswerable questions improves performance (see the data-mixing sketch after this list). Despite the improvement, the performance delta to SOTA on the QuAC and DoQA benchmarks (which test handling of unanswerable questions) shows that there’s still room for further improvement.
    • Arithmetic Performance: We observed that initial iterations of our model frequently made errors on arithmetic tasks (such as ConvFinQA) because it was trained to produce a final equation in a single shot. Inspired by Chain of Thought (CoT) prompting, we synthetically generated a CoT variant of the TAT-QA arithmetic dataset using Llama 3 70B Instruct, which teaches the model to explicitly reason through its response before providing a final answer (a generation sketch follows this list). We found that this addition led to a substantial boost in accuracy (+10 points on ConvFinQA). We also mix in a small amount of the NuminaMath-CoT dataset, a collection of math problems with CoT solutions ranging from high-school to olympiad difficulty, to improve our model’s GSM8K score.
    • Entity Extraction: Our baseline performed poorly on entity-extraction-style tasks such as SQA, due to having very few high-quality samples that demonstrate the intended behavior. We address this by mixing in a subset of SKGInstruct, an instruction tuning dataset constructed from a mix of structured knowledge grounding datasets.
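
To make the data-mixing ideas above concrete, here is a minimal sketch of upsampling unanswerable samples and blending in an SKGInstruct slice. The filter heuristic, repeat factor, and mixing fraction are placeholders, not the actual weights used to train DocChat.

```python
# Illustrative data-mixing sketch: upsample unanswerable samples and mix in a
# small slice of an instruction dataset for entity/table extraction.
import random

def is_unanswerable(sample):
    # Placeholder heuristic: the target response declines to answer.
    return sample["answer"].lower().startswith("sorry, i cannot find the answer")

def build_mixture(conv_qa, skg_instruct, unanswerable_repeat=4, skg_fraction=0.1):
    mixture = []
    for sample in conv_qa:
        # Repeat unanswerable samples so the model sees them more often.
        copies = unanswerable_repeat if is_unanswerable(sample) else 1
        mixture.extend([sample] * copies)
    # Add a small fraction of structured-knowledge instruction data.
    k = int(len(mixture) * skg_fraction)
    mixture.extend(random.sample(skg_instruct, min(k, len(skg_instruct))))
    random.shuffle(mixture)
    return mixture
```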
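
And here is a rough sketch of the kind of CoT data-generation loop described in the arithmetic bullet, assuming an OpenAI-compatible endpoint serving Llama 3 70B Instruct. The prompt wording, field names, and consistency filter are illustrative, not our exact pipeline.

```python
# Illustrative CoT data generation for an arithmetic QA sample.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")  # assumed local server

PROMPT = (
    "Given the table/passage below and a question, write a short step-by-step "
    "reasoning that ends with 'Final answer: <value>'.\n\n"
    "Context:\n{context}\n\nQuestion: {question}\nGold answer: {answer}"
)

def generate_cot(sample):
    response = client.chat.completions.create(
        model="meta-llama/Meta-Llama-3-70B-Instruct",
        messages=[{"role": "user", "content": PROMPT.format(**sample)}],
        temperature=0.2,
    )
    cot = (response.choices[0].message.content or "").strip()
    # Simple consistency filter: keep generations whose final line repeats the gold answer.
    return cot if cot and str(sample["answer"]) in cot.splitlines()[-1] else None
```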

What’s Next? 

Our model series demonstrates strong conversational Q&A and multi-turn retrieval capabilities, and we can’t wait to see what the community can achieve with it. We invite you to explore, experiment, and improve upon our work. We’re also exploring many exciting directions that will build on this model series as a foundation, such as long context support, improved math and reasoning, and larger model sizes. Stay tuned for more updates and improvements. 🚀

Contributors

Rohan Deshpande, Michael Wang, Ganesh Venkatesh