Cerebras

Inference

The world's fastest inference, available today.

TRY CHAT

The World's

Fastest Inference

Cerebras Inference runs Llama 3.3 70B at 2,200 tokens/s and Llama 3.1 405B at 969 tokens/s, over 70x faster than GPU clouds. Get instant responses for code generation, summarization, and agentic tasks.
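As a back-of-envelope illustration of what these throughput figures mean in practice: the 2,200 tokens/s number and the 70x multiplier come from the claims above, while the 1,000-token completion is a hypothetical workload chosen for the example.

```python
# Back-of-envelope: time to generate a 1,000-token completion at the
# Llama 3.3 70B throughput quoted above vs. a GPU cloud running ~70x
# slower (the GPU rate is implied by the 70x claim, not measured here).
CEREBRAS_TOKENS_PER_SEC = 2_200                      # Llama 3.3 70B on Cerebras
GPU_TOKENS_PER_SEC = CEREBRAS_TOKENS_PER_SEC / 70    # implied ~31 tokens/s
COMPLETION_TOKENS = 1_000                            # hypothetical response size

cerebras_seconds = COMPLETION_TOKENS / CEREBRAS_TOKENS_PER_SEC
gpu_seconds = COMPLETION_TOKENS / GPU_TOKENS_PER_SEC

print(f"Cerebras: {cerebras_seconds:.2f}s, GPU cloud: {gpu_seconds:.1f}s")
# Cerebras: 0.45s, GPU cloud: 31.8s
```

At these rates, the same response takes under half a second instead of over half a minute, which is the difference between an interactive agent loop and a batch job.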

High Throughput,
Low Cost

Cerebras Inference supports hundreds of concurrent users, enabling high throughput at the lowest cost.

128K Context Length

Use up to 128K context on Cerebras Inference for the highest performance on long inputs.

Our Partners

Hundreds of billions
of tokens per day

Cerebras Inference is built to scale. Powered by data centers across the US, Cerebras Inference has capacity to serve hundreds of billions of tokens per day with leading accuracy and reliability.

Try the world's fastest inference

TRY CHAT

August 27, 2024

Introducing Cerebras Inference: AI at Instant Speed

Today, we are announcing Cerebras Inference – the fastest AI inference solution in the world. Cerebras Inference delivers 1,800 tokens per second for Llama 3.1 8B and 450 tokens per second for Llama 3.1 70B, 20x faster than NVIDIA GPU-based hyperscale clouds.

LEARN MORE