Cerebras is proud to introduce CePO (Cerebras Planning and Optimization), a framework that adds sophisticated reasoning capabilities to the Llama family of models. Through test-time computation techniques, we enable Llama to tackle complex reasoning tasks with unprecedented accuracy. While models like OpenAI o1 and Alibaba QwQ have shown how additional computation at inference time can dramatically improve problem-solving capabilities [1], we’re now bringing these advances to Llama – the world’s most popular open-source LLM.
By applying CePO to Llama 3.3-70B [3], we enable it to significantly surpass Llama-405B across challenging benchmarks in coding, math, and reasoning tasks. Running on Cerebras hardware, we achieve interactive performance of approximately 100 tokens/second – a first among test-time reasoning models.
Meta released Llama-3.3 70B with the goal of narrowing the accuracy gap with Llama-3.1 405B. By integrating CePO, we empower Llama-3.3 70B to surpass Llama-3.1 405B in accuracy, demonstrating that our approach provides greater value as the quality of LLMs improve. While CePO helped Llama-3.1 70B to nearly match the accuracy of Llama-3.1 405B on MMLU-Pro (Math), GPQA and CRUX, it propels Llama-3.3 70B to decisively surpass Llama-3.1 405B by several percentage points!
We also measured Llama 3.3 70B with CePO against GPT-4 Turbo and Sonnet 3.5. Thanks to CePO, Llama 3.3 70B reached comparable performance in CRUX, LiveCodeBench, GPQA and significantly outperformed in MATH.
How CePO Works
CePO is built based on three key insights:
1. Step-by-step reasoning: An LLM has a higher chance of successfully tackling complex questions by breaking them into simple steps that it can execute correctly with high confidence.
2. Comparison instead of Verification: Off-the-shelf LLMs struggle with self-refinement, showing both false positives and negatives. However, they excel at comparing solutions to identify inconsistencies and determine whether both solutions are correct or at least one is incorrect.
3. Intuitive Output Format: LLMs perform better with numerical ratings or structured outputs when the format is demonstrated multiple times and clearly emphasizes expectations. For example, using “Rating: <number>” instead of just requesting a numerical rating.
The CePO pipeline consists of four stages:
-
-
- The LLM produces a plan to solve a given problem step-by-step
- The LLM executes the plan multiple times to produce multiple responses
- The model analyzes responses to identify inconsistencies across executions, helping catch and correct mistakes
- Solutions are wrapped in a Best-of-N framework with structured confidence scoring
-
In general, CePO consumes 10-20x more inference tokens than one-shot approaches. However, when running on Cerebras hardware, we achieve equivalent performance of about 100 tokens/second – comparable to leading models like GPT-4 Turbo and Claude 3.5 Sonnet.
Real-World Examples
Strawberry Test v2.0
The strawberry test is a popular test to see if models memorize answers or evaluate based on first principles. Strawberry correctly spelt has 3 ‘r’s. When more ‘r’s are added, most models get it wrong since it simply remembers the answer as 3. With CePO, the model checks its own work and supplies the correct answer.
Russian Roulette with a Twist
This reasoning puzzle again tests a model’s ability to evaluate based on first principles instead of going with conventional wisdom/memorized answers. We modify the popular Russian Roulette problem as follows – instead of one bullet, we add five bullets to the gun, changing which option provides a higher chance of survival. Here again, the model tends to get it wrong in a single-shot setting but chooses correctly when it thinks through its solution.
Comparison with Popular Prompting Strategies
Many researchers have explored inference-time techniques, with tools like optillm [10] making these methods accessible to practitioners. We evaluated three popular approaches from the literature:
CoT with Reflection [11]
-
- Uses structured reasoning with three phases
- <thinking>: Initial problem-solving strategy
- <reflection>: Review and adjust the plan
- <output>: Final solution
Self-Consistency [12]
-
- Generates multiple solutions
- Groups similar answers together
- Selects final answer based on the most common solution pattern
Best-of-N [13, 14]
-
- Simply generates N different solutions
- Picks the best one
While these methods show promise on specific tasks, they lack consistency across different types of problems. Our evaluation reveals that their performance varies significantly between tasks, and in some cases, they perform worse than the base model with no augmentation. By combining multiple approaches, CePO achieves consistent improvement across evals without regressions.
Conclusion
CePO demonstrates how test-time computation can significantly enhance Llama’s reasoning capabilities. Our next steps focus on three key areas:
- Advanced prompting frameworks that leverage comparative reasoning
- Synthetic data optimized for inference-time computation
- Enhanced verification mechanisms for complex reasoning chains
We will be open sourcing CePO to support community-driven development of inference-time optimization techniques. For updates on the release and to try CePO, follow us on Twitter or join our Discord.
Author: Pawel Filipczuk, Vithursan Thangarasa, Eric Huang, Amaan Dhada, Michael Wang, Rohan Deshpande, Emma Call, Ganesh Venkatesh
@misc{cerebras2024ceporeasoning, author = {Filipczuk, Pawel and Thangarasa, Vithursan and Huang, Eric and Dhada, Amaan and Wang, Michael and Deshpande, Rohan and Call, Emma and Venkatesh, Ganesh}, title = {CePO: Empowering Llama with Reasoning using Test-Time Compute}, month = December, year = 2024, howpublished = {\url{https://cerebras.ai/blog/cepo}, }
References
[1] https://openai.com/index/learning-to-reason-with-llms/
[2] Dubey, Abhimanyu, et al. The Llama 3 Herd of Models, arXiv preprint arXiv:2407.21783 (2024).
[3] https://huggingface.co/meta-llama/Llama-3.3-70B-Instruct
[4] Gu, Alex, et al. CRUXEval: A Benchmark for Code Reasoning, Understanding and Execution, arXiv preprint arXiv:2401.03065 (2024).
[5] Jain, Naman, et al. LiveCodeBench: Holistic and Contamination Free Evaluation of Large Language Models for Code, arXiv preprint arXiv:2403.07974 (2024).
[6] https://huggingface.co/datasets/AI-MO/aimo-validation-math-level-5
[7] Wang, Yubo, et al. MMLU-Pro: A More Robust and Challenging Multi-Task Language Understanding Benchmark, arXiv preprint arXiv: 2406.01574 (2024).
[8] Rein, David, et al. GPQA: A Graduate-Level Google-Proof Q&A Benchmark, arXiv preprint arXiv: 2311.12022 (2023).
[9] https://openai.com/index/introducing-simpleqa/
[10] https://github.com/codelion/optillm
[11] Wei, Jason, et al. Chain-of-Thought Prompting Elicits Reasoning in Large Language Models, arXiv preprint arXiv:2201.11903 (2022).
[12] Wang, Xuezehi, et al. Self-Consistency Improves Chain of Thought Reasoning in Language Models, arXiv preprint arXiv:2203.11171 (2023).
[13] Stiennon, Nisan, et al. Learning to summarize from human feedback, arXiv preprint arXiv: 2009.01325 (2022).
[14] Nakano, Reiichiro, et al. WebGPT: Browser-assisted question-answering with human feedback, arXiv preprint arXiv:2112.09332 (2022).
[15] https://platform.openai.com/docs/models/gpt-4-and-gpt-4-turbo#gpt-4-turbo-andgpt-4
[16] https://www.anthropic.com/news/claude-3-5-sonnet