The Seventh Annual Conference on Machine Learning and Systems
Dates: May 13th – May 16th, 2024
Location: Santa Clara Convention Center, Santa Clara, CA
Technical Posters
Stop by our booth to discuss the following posters:
Sparse-IFT
Sparse-IFT (Sparse Iso-FLOP Transformations) improves training efficiency by using sparsity to boost accuracy while keeping training FLOPs equal to those of a dense model. It is a family of transforms that act as drop-in replacements for dense layers and apply across deep learning architectures. These transforms increase a layer's representational capacity and facilitate the discovery of optimal sparse sub-networks, with the sparsity level as a single hyper-parameter.
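A minimal sketch of the iso-FLOP idea, assuming a simple two-layer MLP block and fixed random masks (the actual Sparse-IFT transforms and mask schedules are richer than this): widening the hidden dimension by 1/(1-s) while pruning weights at sparsity s leaves the block's multiply-accumulate FLOPs unchanged relative to the dense baseline.

# Illustration of the iso-FLOP principle, not Cerebras's implementation.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SparseWideMLP(nn.Module):
    def __init__(self, d_model, d_hidden, sparsity):
        super().__init__()
        wide = int(d_hidden / (1.0 - sparsity))  # widen hidden dim by 1/(1-s)
        self.fc1 = nn.Linear(d_model, wide)
        self.fc2 = nn.Linear(wide, d_model)
        # Fixed random masks at sparsity s (a real system would evolve or
        # learn the mask; sparsity is the single extra hyper-parameter).
        self.register_buffer("m1", (torch.rand_like(self.fc1.weight) > sparsity).float())
        self.register_buffer("m2", (torch.rand_like(self.fc2.weight) > sparsity).float())

    def forward(self, x):
        x = torch.relu(F.linear(x, self.fc1.weight * self.m1, self.fc1.bias))
        return F.linear(x, self.fc2.weight * self.m2, self.fc2.bias)

# Iso-FLOP check: nonzero weight count matches the dense block's weight count.
dense = nn.Sequential(nn.Linear(512, 2048), nn.ReLU(), nn.Linear(2048, 512))
sparse = SparseWideMLP(512, 2048, sparsity=0.75)
print(sum(p.numel() for p in dense.parameters() if p.dim() == 2))  # 2,097,152
print(int(sparse.m1.sum() + sparse.m2.sum()))  # ~2,097,152, up to mask randomness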
Qualcomm Inference
Cerebras Systems and Qualcomm Technologies, Inc. are collaborating to accelerate LLM inference in the cloud by up to 10x through hardware-aware training. The Cerebras CS-3 and Wafer-Scale Cluster enable advanced training and fine-tuning, while the Qualcomm Cloud AI 100 optimizes inference using techniques such as unstructured sparsity, microscaling quantization, speculative sampling, network architecture search, and distillation.
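Of the techniques listed, speculative sampling is the easiest to sketch. Below is a toy greedy variant with stand-in functions for the draft and target models (production systems accept or reject draft tokens probabilistically): a cheap draft model proposes several tokens, and the expensive target model verifies them in a single pass, so agreement yields multiple tokens per target invocation.

# Toy greedy speculative decoding; the two "models" are hypothetical stand-ins.
def draft_next(tokens):
    # Cheap draft model: guesses the next token.
    return (tokens[-1] * 31 + 7) % 100

def target_next(tokens):
    # Expensive target model: the distribution we actually want to sample.
    return (tokens[-1] * 31 + 7) % 100 if tokens[-1] % 3 else (tokens[-1] + 1) % 100

def speculative_step(tokens, k=4):
    # 1) Draft model proposes k tokens autoregressively (cheap).
    proposal = list(tokens)
    for _ in range(k):
        proposal.append(draft_next(proposal))
    # 2) Target model scores all proposed positions (in practice, one parallel
    #    forward pass) and we keep the longest agreeing prefix.
    accepted = list(tokens)
    for i in range(len(tokens), len(proposal)):
        expected = target_next(accepted)
        if proposal[i] == expected:
            accepted.append(proposal[i])  # draft token verified
        else:
            accepted.append(expected)     # first mismatch: take target's token, stop
            break
    return accepted

seq = [42]
for _ in range(5):
    seq = speculative_step(seq)
print(seq)  # multiple tokens emitted per target-model "pass" when the draft agrees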
Neural Magic Sparsity
We present a novel approach that creates accurate, sparse foundational LLMs with full accuracy recovery on fine-tuning tasks at up to 70% sparsity. For Llama2-7B, we combine one-shot SparseGPT pruning with sparse re-training on a Cerebras CS-3 Wafer-Scale Cluster, demonstrating near-theoretical training acceleration. Fine-tuning on diverse tasks and deploying on CPUs with Neural Magic's DeepSparse library yields up to 3x faster inference with the 70%-sparse model, which quantization further boosts to an 8.6x speedup.
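The one-shot pruning step can be sketched with plain magnitude pruning as a stand-in for SparseGPT (SparseGPT additionally solves a layer-wise weight-reconstruction problem to preserve accuracy; the 70% target below mirrors the poster's setting):

# Toy one-shot pruning: zero the smallest-magnitude weights per tensor.
import torch

def prune_one_shot(weight: torch.Tensor, sparsity: float = 0.70) -> torch.Tensor:
    """Zero out the smallest-magnitude entries so `sparsity` fraction are zero."""
    k = int(weight.numel() * sparsity)
    threshold = weight.abs().flatten().kthvalue(k).values
    return weight * (weight.abs() > threshold)

w = torch.randn(4096, 4096)          # stand-in for one Llama2-7B weight matrix
w_sparse = prune_one_shot(w)
print(float((w_sparse == 0).float().mean()))  # ~0.70: ready for sparse re-training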
Cerebras CS-3
Introducing the CS-3, our third-generation wafer-scale AI accelerator, purpose-built to train the most advanced AI models. With over 4 trillion transistors – 57x more than the largest GPU – the CS-3 is 2x faster than its predecessor and sets records in training large language and multi-modal models. The CS-3 is built to scale: using our next-generation SwarmX interconnect, up to 2048 CS-3 systems can be linked together to build hyper-scale AI supercomputers delivering up to a quarter of a zettaFLOP (0.25 × 10^21 FLOPs). The CS-3 can be configured with up to 1,200 terabytes of external memory, allowing a single system to train models of up to 24 trillion parameters and paving the way for ML researchers to build models 10x larger than GPT-4 and Claude.
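As a back-of-envelope check on these scaling figures (illustrative arithmetic only; the per-system numbers are derived here, not quoted specs):

external_memory_bytes = 1_200e12       # 1,200 TB of external memory
params = 24e12                         # 24-trillion-parameter model
print(external_memory_bytes / params)  # -> 50.0 bytes per parameter, comfortably
                                       #    above the ~16 bytes/param needed for
                                       #    fp32 weights plus Adam optimizer state

cluster_flops = 0.25e21                # a quarter of a zettaFLOP
systems = 2048
print(cluster_flops / systems / 1e15)  # -> ~122 petaFLOPs per CS-3 system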