High Throughput, Low Cost
Cerebras Inference supports hundreds of concurrent users, enabling high throughput at the lowest cost.
128K Context Length
Use up to 128K context on Cerebras Inference for the highest performance on long inputs.
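For developers, this is straightforward to try: Cerebras Inference exposes an OpenAI-compatible API, so a long-context request can be sent with the standard OpenAI Python client. The sketch below is illustrative only; the base URL, model ID, and input file are assumptions to verify against the Cerebras docs.

```python
import os

from openai import OpenAI

# Cerebras exposes an OpenAI-compatible endpoint; the base URL and
# model ID below are assumptions -- check the Cerebras docs for the
# current values.
client = OpenAI(
    base_url="https://api.cerebras.ai/v1",
    api_key=os.environ["CEREBRAS_API_KEY"],
)

# Load a long document (up to the 128K-token context window) and ask
# for a summary in a single request. "long_report.txt" is a placeholder.
with open("long_report.txt") as f:
    document = f.read()

response = client.chat.completions.create(
    model="llama3.1-70b",  # assumed model ID
    messages=[
        {"role": "system", "content": "Summarize the user's document."},
        {"role": "user", "content": document},
    ],
)
print(response.choices[0].message.content)
```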
Our Partners
“Scaling inference is critical for accelerating AI and open source innovation. Thanks to the incredible work of the Cerebras team, Llama 3.1-405B is now the world’s fastest frontier model running at a rapid-fire pace of 969 tokens/sec. We are thrilled to support Cerebras’s latest breakthrough as they continue to push the boundaries in compute for AI.”
Ahmad Al-Dahle
VP of GenAI at Meta
"DeepLearning.AI has multiple agentic workflows that require prompting an LLM repeatedly to get a result. Cerebras has built an impressively fast inference capability which will be very helpful to such workloads."
Andrew Ng
Founder, DeepLearning.AI
“For traditional search engines, we know that lower latencies drive higher user engagement and that instant results have changed the way people interact with search and with the internet. At Perplexity, we believe ultra-fast inference speeds like what Cerebras is demonstrating can have a similar unlock for user interaction with the future of search - intelligent answer engines.”
Denis Yarats
CTO and co-founder, Perplexity
“With infrastructure, speed is paramount. The performance of Cerebras Inference supercharges Meter Command to generate custom software and take action, all at the speed and ease of searching on the web. This level of responsiveness helps our customers get the information they need, exactly when they need it in order to keep their teams online and productive.”
Anil Varanasi
CEO of Meter
“When building voice AI, inference is the slowest stage in your pipeline. With Cerebras Inference, it’s now the fastest. A full pass through a pipeline consisting of cloud-based speech-to-text, 70B-parameter inference using Cerebras Inference, and text-to-speech, runs faster than just inference alone on other providers. This is a game changer for developers building voice AI that can respond with human-level speed and accuracy.”
Russ d'Sa
CEO, LiveKit
“Our customers are blown away with the results! Time to completion on Cerebras is hands down faster than any other inference provider and I’m excited to see the production applications we’ll power via the Cerebras inference platform.”
Akash Sharma
“We migrated from a leading GPU solution to Cerebras and reduced our latency by 75%.”
Hassaan Raza
CEO, Tavus
“With Cerebras’ inference speed, GSK is developing innovative AI applications, such as intelligent research agents, that will fundamentally improve the productivity of our researchers and drug discovery process.”
Kim Branson
SVP of AI and ML, GSK
"For real-time voice interactions, every millisecond counts in creating a seamless, human-like experience. Cerebras’ fast inference capabilities empower us to deliver instant voice interactions to our customers, driving higher engagement and expected ROI"
Seth Siegel
CEO, Audivi AI
Hundreds of billions of tokens per day
Cerebras Inference is built to scale. Powered by data centers across the US, Cerebras Inference has the capacity to serve hundreds of billions of tokens per day with leading accuracy and reliability.
August 27, 2024
Introducing Cerebras Inference: AI at Instant Speed
Today, we are announcing Cerebras Inference – the fastest AI inference solution in the world. Cerebras Inference delivers 1,800 tokens per second for Llama 3.1 8B and 450 tokens per second for Llama 3.1 70B, 20x faster than NVIDIA GPU-based hyperscale clouds.
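As a rough way to sanity-check throughput figures like these yourself, you can stream a completion and count chunks per second. This is a minimal sketch assuming an OpenAI-compatible endpoint at api.cerebras.ai and a llama3.1-8b model ID (both assumptions, not confirmed by this page); chunk counts only approximate token counts.

```python
import os
import time

from openai import OpenAI

# Hypothetical throughput check against Cerebras's OpenAI-compatible
# endpoint; base URL and model ID are assumptions to verify in the docs.
client = OpenAI(
    base_url="https://api.cerebras.ai/v1",
    api_key=os.environ["CEREBRAS_API_KEY"],
)

start = time.perf_counter()
tokens = 0
stream = client.chat.completions.create(
    model="llama3.1-8b",  # assumed model ID for Llama 3.1 8B
    messages=[{"role": "user", "content": "Write a 500-word essay on fast inference."}],
    stream=True,
)
for chunk in stream:
    # Each streamed chunk typically carries roughly one token of text,
    # so counting chunks gives an approximate token rate.
    if chunk.choices and chunk.choices[0].delta.content:
        tokens += 1
elapsed = time.perf_counter() - start
print(f"~{tokens / elapsed:.0f} tokens/sec over {tokens} tokens")
```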