February 26, 2025
Straight to Zero: Why Linearly Decaying the Learning Rate to Zero Works Best for LLMs
November 18, 2024
Empirical Upper Bounds for Unstructured Sparsity in Compute-Efficient Language Modeling
November 01, 2024
Normalization Layer Per-Example Gradients are Sufficient to Predict Gradient Noise Scale in Transformers
October 31, 2024
Sparse maximal update parameterization: A holistic approach to sparse training dynamics
October 13, 2024
Self-Data Distillation for Recovering Quality in Pruned Large Language Models
September 04, 2024
Bilingual Adaptation of Monolingual Foundation Models
May 20, 2024
MediSwift: Efficient Sparse Pre-trained Biomedical Language Models
May 15, 2024
Breaking the Molecular Dynamics Timescale Barrier Using a Wafer-Scale System
May 15, 2024
Enabling High-Sparsity Foundational Llama Models with Efficient Pretraining and Deployment
November 30, 2023
Efficient and Approximate Per-Example Gradient Norms for Gradient Noise Scale
November 13, 2023