Aug 28 2024

Llama3.1 Model Quality Evaluation: Cerebras, Groq, SambaNova, Together, and Fireworks - Cerebras

Introduction

At Cerebras, we are redefining AI inference by delivering unparalleled speed, quality, and efficiency. Our new inference solution sets an industry benchmark, delivering 1800+ tokens per second for the Llama3.1-8B model and 450+ tokens per second for the Llama3.1-70B model—20x faster than GPU-based inference solutions on leading hyperscale clouds, as benchmarked by Artificial Analysis [1]. Our performance does not come at the expense of accuracy or reliability, enabling our clients to achieve exceptional results in their applications, from code generation to complex conversational AI.

In Generative AI, where rapid deployment of high-quality models is critical, Cerebras stands out by offering unmatched speed and quality. This blog discusses the top-tier results achieved by our Llama-3.1 offerings on a wide range of industry-standard benchmarks. Cerebras is setting a new standard in large-scale AI deployments by combining performance, quality, and cost-efficiency. To learn more on the technical details of how Cerebras Inference achieves record-breaking AI speeds and cost efficiency, check out our latest blog post here.

Not All Llama3.1 Models Are Created Equal

In the world of AI inference services, it is easy to assume that all models labeled “Llama3.1-70B”, for example, are identical in performance and quality. However, the reality is far more complex. While the name might be the same, the quality of these models can vary significantly depending on the techniques being used during their deployment.

When using an inference service, the model often appears as a black box. Users trust that what is inside matches their expectations, but not all Llama3.1 models are created equal. Techniques such as quantization, which reduce the precision of model weights and/or activations to improve performance, can degrade the quality of the model’s outputs as shown across recent studies [2, 3, 4]. Ensuring these techniques do not harm the model requires a great deal of care and extensive testing.

Deep learning and large language models (LLMs) involve nuanced numerics, where correctness is statistical rather than absolute. Subtle bugs in these numerics can easily slip by unnoticed, leading to degraded model accuracy. Deploying a model, therefore, involves significant effort to identify and rectify these potential issues before they impact users.

This complexity and potential for unseen issues are why it is crucial not to take model quality for granted. The only reliable way to ensure a model meets the highest standards is through a comprehensive array of evaluation benchmarks. Relying on just a few benchmarks is not enough, as models can excel in some tasks while faltering in others due to the intricacies mentioned above.

Moreover, conducting these evaluations properly requires careful and detailed attention. To truly compare models fairly, you must run the exact same evaluation on each model, under the same conditions. Simply comparing published benchmark numbers from different sources will not provide an accurate picture. An apples-to-apples comparison is necessary to truly understand the differences in quality between models.

In this blog, we do exactly that. We conduct rigorous evaluations to demonstrate and prove the high quality of our inference solution.

Ensuring Quality

Ensuring the highest model quality is vital as we transition from research to production, as it can impact user experience and trust in AI solutions. This is especially important when models are being deployed at scale to support a wide range of customer user interactions. In these scenarios, models often play a crucial role in critical decision-making processes, where accuracy and reliability are imperative. We rigorously evaluate our models across general, multilingual, math and reasoning, coding, and multi-turn conversations to ensure that we are offering the highest quality. In addition, based on Artificial Analysis’ historical data on the output speed (i.e., tokens per second) over time, we have conducted a thorough comparison of quality against leading inference providers: Fireworks, Together, and Groq. To ensure consistency across all evaluations, we use the default OpenAI system prompt, “You are a helpful assistant”. This prevents variations that could result from using different system prompts. All results reported across these inference providers were collected on September 11, 2024, using their official Python SDKs. The only exception is SambaNova, which uses the OpenAI SDK as they do not have a publicly available version of their own.

General and Reasoning Evaluation

For our evaluation, we focused on Llama3.1-8B Instruct and Llama3.1-70B Instruct models [5], assessing them across benchmarks such as MMLU [6], MATH [7], GPQA [8], DROP [9], and MGSM [10]. These diverse tasks allowed us to comprehensively gauge quality across various domains, from general knowledge to complex mathematical reasoning. We conducted experiments with a temperature setting of 0.2 and a maximum token generation limit of 2048, ensuring deterministic outputs and fully exploring the models’ capacities.

To judge the quality on the MATH task, we utilized OpenAI’s GPT-4o [11] model as our equality checker, maintaining a consistent standard for evaluating mathematical reasoning. We use the simple-evals library [12], an open-source framework from OpenAI for evaluating LLMs. It is important to note that evaluations are sensitive to prompt structuring, with considerable variability in methods across recent studies [13, 14, 15]. While some use few-shot or role-playing prompts, we adhere to OpenAI’s GPT-4/4o evaluation practices, opting for a zero-shot with chain-of-thought approach. This method provides clear and simple instructions, better reflecting real-world usage, where users expect models to respond accurately without extensive prompting.

For reference, the results from other leading inference API providers, measured using our evaluation pipeline, align with the model quality assessments conducted by Artificial Analysis [16, 17]. On the zero-shot with chain-of-thought tasks (see Figure 1), Cerebras delivers strong quality with both the Llama3.1-8B and Llama3.1-70B Instruct models, setting a new industry standard. Both of Cerebras’ Llama3.1-8B and Llama3.1-70B models offer the highest accuracy on average across all five tasks in comparison to other leading inference API providers.

Code Evaluation

We evaluated our models using key benchmarks in the open-source EvalPlus v0.2.2 framework [18], which is specifically designed to assess coding quality. These include:

HumanEval (base) [19]: A widely-used benchmark for Python code generation, focusing on relatively straightforward, self-contained functions. These functions are designed to be representative of typical coding tasks and provide a solid baseline for assessing model quality.
MBPP (base) [19]: This version of MBPP focuses on a carefully curated selection of 378 well-formed problems from the original MBPP dataset [20], filtering out the initial 974 problems. This ensures that the evaluation is based on high-quality, representative coding challenges, providing an accurate reflection of the model’s real-world response quality.
HumanEval+ [19]: This enhanced version of the HumanEval benchmark introduces 80x more tests designed to reduce the likelihood of false positives. By expanding the test set, HumanEval+ provides a more robust measure of a model’s ability to generate correct code across a broader range of scenarios.
MBPP+ [19]: This is an enhanced version of the MBPP (base) dataset, featuring 399 carefully curated problems. It includes 35x more test cases, and fixes problems with incorrect implementations from the original dataset to provide a more rigorous and comprehensive evaluation.

Our evaluation suite focused on Llama 3.1’s ability to generate functionally correct code across these well-established benchmarks. We assess this using the standard pass@1 metric, which measures the accuracy of the first generated code output when run against a set of unit tests. To ensure a consistent and rigorous evaluation, we follow EvalPlus and employ greedy decoding during code generation. In all four tasks, we adhere to standard practices by setting a maximum token limit of 768 tokens per generation. For the coding benchmarks, we collected the results on other providers by benchmarking their models using the same evaluation framework to ensure a fair comparison.

The accompanying plot in Figure 2 illustrates our models’ accuracy across these benchmarks, highlighting its strengths. Notably, Cerebras’ Llama3.1-70B model consistently outperforms those from other leading inference API providers across all evaluated coding benchmarks. By providing the Llama3.1-8B and 70B Instruct models through Cerebras’ inference solution, we enable users to achieve top-tier results in code generation, setting new standards for accuracy, efficiency, and reliability.

Multi-Turn Conversation Evaluation

For the final evaluation, we focused on assessing the single and multi-turn conversational capabilities of the LLama3.1-8B and LLama3.1-70B models offered through our inference solution. The evaluation was conducted using MT-Bench [21], a rigorous multi-turn benchmark that specifically tests the capacity of LLMs to maintain coherent, informative, and engaging conversations over multiple interactions. This benchmark is particularly effective in evaluating how well LLMs can follow instructions and handle complex, dynamic user interactions. On MT-Bench, we also collected the results on other providers by benchmarking their models using the same evaluation framework.

For this evaluation, we configured the models following the standard temperature setup in MT-Bench, tailored to different categories. Here’s a breakdown of the configuration:

Temperature Configuration:
- Writing & Roleplay: 0.7
- Extraction, Math, Coding, Reasoning: 0.0
- STEM & Humanities: 0.1
Maximum Generation Tokens: 1024
Evaluation Method: We use GPT-4o, specifically gpt-4o-2024-05-13, as an oracle LLM to serve as the judge, ensuring an unbiased and robust comparison. The results were averaged across 5 random seeds to guarantee consistency and reliability.

In Figure 3, the LLama3.1-70B model demonstrates exceptional quality, achieving a leading average score across both single-turn and multi-turn conversations. This high score indicates that the model not only excels at maintaining the flow of conversation but also effectively follows complex instructions across various contexts. These results highlight the strength of our inference solution in providing state-of-the-art conversational AI capabilities, making it a powerful tool for a wide range of applications, from writing assistance to complex reasoning tasks.

Summary of Evaluation

Our evaluation of the Llama3.1 models across multiple benchmarks highlights Cerebras’ inference performance in both accuracy and efficiency. LLama-3.1 8B and 70B deployed on Cerebras outperformed or matched leading providers in rigorous tests that covered general knowledge, multilingual reasoning, math reasoning, coding, and conversation tasks.

Notably, Cerebras’ Llama3.1-70B model excelled in 9 out of 10 benchmarks across general reasoning, coding, and multi-turn conversations. Our LLama3.1-70B model consistently delivered top-tier quality in general reasoning tasks, including MMLU, MATH, GPQA, DROP, and MGSM. On coding tasks, the model showed significant improvements in benchmarks like HumanEval and MBPP, demonstrating proficiency in generating correct code on the first attempt. Additionally, the model performed exceptionally well in multi-turn conversation tasks, maintaining coherent and informative interactions across various contexts, further showcasing the strength of Cerebras’ inference solution.

These results are made possible by the Cerebras CS-3 system and its industry-leading AI processor — the Wafer Scale Engine 3 (WSE-3). The WSE’s massive size and on-chip memory enable us to run models in 16-bit precision with minimal data movement, reducing latency and allowing real-time processing of large models like Llama3.1. The Cerebras WSE architecture enables instant inference while maintaining high accuracy, making our inference solution a top choice for developers and researchers seeking cutting-edge performance and model quality.

Table 1: Comparison of Llama3.1-70B model quality across leading API providers

Table 2: Comparison of Llama3.1-8B model quality across leading API providers.

Conclusion

By focusing on rigorous evaluations, we have ensured that our models not only perform well in controlled experimental conditions but also excel in real-world applications where quality matters most. The unmatched speed of 1,800 tokens per second for the Llama3.1-8B model and 450 tokens per second for the Llama3.1-70B model, combined with our emphasis on quality, positions Cerebras as a leader in large-scale AI deployments.

As we continue to push the boundaries of AI, our inference solution is a testament to our dedication to delivering the highest quality at the fastest speeds, empowering our clients to achieve exceptional results across a wide range of applications. Cerebras is not just offering another tool; we are offering a new standard in AI inference, one that redefines performance, quality, and cost-efficiency in the industry.

Author’s Note – September 12, 2024: Updated Evaluations Reflecting Recent Runs and Top Inference API Providers

The results in the original blog post were collected on August 13, 2024, from Fireworks, Together, and Groq. With the release of our new inference performance updates, we found it essential to re-run the evaluations to verify the continued quality of our models. To maintain fairness and accuracy, we conducted the same evaluations again on September 11, 2024, including the latest leading inference API providers, using Artificial Analysis’ historical data to ensure consistency in output speed comparisons.

Authors

Vithursan Thangarasa, Alex Tsaptsinos, Ian Milton, Ganesh Venkatesh, Valavan Manohararajah, Nish Sinnadurai, Sean Lie

Citation

To cite our blog please use:

References

[1] https://artificialanalysis.ai/

[2] Dutta, Abhinav, et al. Accuracy is Not All You Need, arXiv preprint arXiv: 2407.09141 (2024).

[3] Marchisio, Kelly, et al. How Does Quantization Affect Multilingual LLMs?, arXiv preprint arXiv: 2407.03211 (2024).

[4] Yuan, Jiayi, et al. KV Cache Compression, But What Must We Give in Return? A Comprehensive Benchmark of Long Context Capable Approaches. arXiv preprint arXiv: 2407.01527 (2024).

[5] Dubey, Abhimanyu, et al. The Llama 3 Herd of Models, arXiv preprint arXiv:2407.21783 (2024).

[6] Hendrycks, Dan, et al. Measuring Massive Multitask Language Understanding, In ICLR (2021).

[7] Hendrycks, Dan, et al. Measuring Mathematical Problem Solving With the MATH Dataset, In NeurIPS (2021).

[8] Rein, David, et al. GPQA: A Graduate-Level Google-Proof Q&A Benchmark, arxiv preprint arXiv: 2311.12022 (2023).

[9] Dua, Dheeru, et al. DROP: A Reading Comprehension Benchmark Requiring Discrete Reasoning Over Paragraphs, In ACL (2019).

[10] Shi, Freda, et al. Language Models are Multilingual Chain-of-Thought Reasoners, In ICLR (2023).

[11] Zhao, Tony, et al. Calibrate before use: Improving few-shot performance of language models. In ICML (2021).

[12] https://github.com/openai/simple-evals/

[13] Sclar, Melanie, et al. Quantifying Language Models’ Sensitivity to Spurious Features in Prompt Design or: How I learned to start worrying about prompt formatting. In ICLR (2024).

[14] Zhao, Zihao, et al. Calibrate before use: Improving few-shot performance of language models. In ICML (2021).

[15] Lu, Yao, et al. Fantastically Ordered Prompts and Where to Find Them: Overcoming Few-Shot Prompt Order Sensitivity. In ACL (2022).

[16] https://artificialanalysis.ai/models/llama-3-1-instruct-8b/providers

[17] https://artificialanalysis.ai/models/llama-3-1-instruct-70b/providers

[18] https://github.com/evalplus/evalplus

[19] Liu, Jiawei, et al. Is your code generated by chatgpt really correct? rigorous evaluation of large language models for code generation, In NeurIPS (2024).

[20] Austin, Jacob, et al. Program synthesis with large language models, arXiv preprint arXiv:2108.07732 (2021).

[21] Zheng, Lianmin, et al. Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena, In NeurIPS Datasets and Benchmarks Track (2023).