Cerebras

Inference

The world's fastest inference, available today.

TRY CHAT

The World's Fastest Inference

Cerebras Inference runs Llama 3.1 70B at 450 tokens/s, 20x faster than GPU-based hyperscale clouds. Get instant responses for code generation, summarization, and agentic tasks.
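As a rough illustration of what that throughput means for interactive latency, the sketch below compares generation time for a typical response. Only the 450 tokens/s figure comes from this page; the GPU baseline is derived from the "20x faster" claim, and the response length is a hypothetical example.

```python
# Back-of-envelope latency comparison based on the figures quoted above.
# 450 tokens/s is the stated Cerebras rate; the GPU baseline (~22.5 tokens/s)
# is only inferred from the "20x faster" claim.

CEREBRAS_TOKENS_PER_SEC = 450.0
GPU_TOKENS_PER_SEC = CEREBRAS_TOKENS_PER_SEC / 20.0

def generation_time(num_tokens: float, tokens_per_sec: float) -> float:
    """Seconds to generate num_tokens at a steady decode rate."""
    return num_tokens / tokens_per_sec

response_tokens = 900  # hypothetical: a two-screen code-generation answer
print(f"Cerebras:  {generation_time(response_tokens, CEREBRAS_TOKENS_PER_SEC):.1f}s")
print(f"GPU cloud: {generation_time(response_tokens, GPU_TOKENS_PER_SEC):.1f}s")
# → Cerebras:  2.0s
# → GPU cloud: 40.0s
```

At these rates, a response that streams in two seconds on Cerebras would take well over half a minute on the GPU baseline, which is the difference between an interactive tool and a batch job.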

High Throughput,
Low Cost

Cerebras Inference supports hundreds of concurrent users, enabling high throughput at the lowest cost.
Build with the fastest Llama 3.1 70B, starting at just 60¢ per million tokens.
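To make the pricing concrete, here is a minimal cost sketch at the quoted 60¢ per million tokens. The workload sizes are hypothetical examples, not Cerebras figures.

```python
# Back-of-envelope cost at the quoted 60¢ per million tokens.
# The daily workload below is a hypothetical example.

PRICE_PER_MILLION_TOKENS_USD = 0.60

def cost_usd(tokens: int) -> float:
    """Dollar cost of processing the given number of tokens."""
    return tokens / 1_000_000 * PRICE_PER_MILLION_TOKENS_USD

# A day of a hypothetical summarization service:
# 50,000 requests x 2,000 tokens each = 100M tokens.
daily_tokens = 50_000 * 2_000
print(f"${cost_usd(daily_tokens):.2f} per day")
# → $60.00 per day
```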

Full 16-Bit Precision

Cerebras Inference uses the reference weights of Llama 3.1 at 16-bit precision, ensuring the highest accuracy for your queries.

Unrivaled Price-Performance for AI Inference

Cerebras is the industry’s fastest inference API with the best price-performance, according to Artificial Analysis.
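The page references an inference API. As a sketch of how a request might be assembled, the snippet below builds a chat-completions payload in the common OpenAI-compatible shape; the endpoint URL and model identifier are illustrative assumptions, not details confirmed by this page — check the Cerebras API documentation for actual values.

```python
import json

# Hypothetical request to an OpenAI-compatible chat-completions endpoint.
# The URL and model id are illustrative assumptions, not confirmed here.
API_URL = "https://api.cerebras.ai/v1/chat/completions"

payload = {
    "model": "llama3.1-70b",  # assumed model identifier
    "messages": [
        {"role": "user", "content": "Summarize this paragraph in one sentence."}
    ],
    "max_tokens": 256,
    "stream": True,  # stream tokens as they are generated
}

body = json.dumps(payload)
print(len(body), "bytes of request body")
```

Streaming (`"stream": True`) is the natural fit for a high-tokens/s backend: tokens reach the user as they are generated rather than after the full completion finishes.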


Our Partners

Hundreds of billions
of tokens per day

Cerebras Inference is built to scale. Powered by data centers across the US, Cerebras Inference has capacity to serve hundreds of billions of tokens per day with leading accuracy and reliability.

Try the world's fastest inference


August 27, 2024

Introducing Cerebras Inference: AI at Instant Speed

Today, we are announcing Cerebras Inference, the fastest AI inference solution in the world. Cerebras Inference delivers 1,800 tokens per second for Llama 3.1 8B and 450 tokens per second for Llama 3.1 70B, 20x faster than NVIDIA GPU-based hyperscale clouds.

LEARN MORE