Llama 3.1 405B now runs at 969 tokens/s on Cerebras Inference

Frontier AI now runs at instant speed. Last week we ran a customer workload on Llama 3.1 405B at 969 tokens/s – a new record for Meta’s frontier model. Llama 3.1 405B on Cerebras is by far the…

Guii: An Interactive Coding Companion for Creative Frontend Development

Guii offers a development experience similar to drawing on a canvas. By adding Guii Devtools to your codebase, you can interact directly on a webpage—selecting visual elements like boxes,…

Building an AI-Powered Search Assistant for Zoom Team Chat

Imagine a workday where all the answers you need are just a message away. No more switching between apps, no more digging through files and folders, no more endless searches. Just ask, and the…

Cerebras Inference now 3x faster: Llama3.1-70B breaks 2,100 tokens/s

Today we’re announcing the biggest update to Cerebras Inference since launch. Cerebras Inference now runs Llama 3.1-70B at an astounding 2,100 tokens per second – a 3x performance boost over…

Simulating Human Behavior with Cerebras

LlamaSim is a multi-LLM framework that aims to simulate human behavior at scale. Given a specific environment (e.g., voters in Pennsylvania, students at CMU), we simulate how target…

The Practitioner’s Guide to the Maximal Update Parameterization

Maximal Update Parameterization (µP) offers significant advantages for neural network training, but its adoption has been limited due to the complexity of the underlying math and the…

Integrating LLMs and Software Engineering for Self-Refining Copy Creation

AI agents are among the most exciting advancements in the field of large language models (LLMs). By integrating agentic workflows, these models can now better handle planning, reasoning, and…

ReadAgent: Bringing Gist Memory to AI

Large Language Models (LLMs) exhibit remarkable abilities in understanding natural language, but they are not without limitations. One area where LLMs can struggle is in processing long text inputs,…

Llama3.1 Model Quality Evaluation: Cerebras, Groq, SambaNova, Together, and Fireworks

At Cerebras, we are redefining AI inference by delivering unparalleled speed, quality, and efficiency. Our new inference solution sets an industry benchmark, delivering 1,800+ tokens per…

Introducing Cerebras Inference: AI at Instant Speed

Today, we are announcing Cerebras Inference – the fastest AI inference solution in the world. Cerebras Inference delivers 1,800 tokens per second for Llama 3.1 8B and 450 tokens per second for…

Introducing DocChat: GPT-4 Level Conversational QA Trained In a Few Hours

We are excited to announce the release of Cerebras DocChat, our first iteration of models designed for document-based conversational question answering. This series includes two models: Cerebras…

Revolutionizing Life Science and Healthcare with Generative AI

Healthcare currently accounts for 17% of GDP in the United States, making it one of the country’s largest economic sectors and an industry with immense potential to transform the human…