
Sep 23 2024

The Practitioner’s Guide to the Maximal Update Parameterization - Cerebras

Introduction

Maximal Update Parameterization (µP) offers significant advantages for neural network training, but its adoption has been limited due to the complexity of the underlying math and the challenges in implementation. This guide aims to lower those barriers by providing a clear and practical overview of µP. By using µP, you can achieve stable hyperparameters across model scales, reduce the need for costly tuning, and improve training stability at large scale. This guide will walk you through the core concepts and practical steps needed to implement µP effectively, enabling you to take full advantage of its benefits without the usual hurdles.

We provide a simple port of μP to the popular nanoGPT library at https://github.com/EleutherAI/nanoGPT-mup, and encourage readers to refer to this implementation throughout this blog.

Why you should use μP

First we will explain why you should be using μP. There are four main benefits compared to standard parameterization (SP) models.

1. Stable optimum HPs across scale (μTransfer)

In Figure 1, Yang et al. (2021) showed that when using the standard parameterization (SP), optimal HPs vary with model width. μP reparameterizes a network such that the optimal HPs remain stable.

As a result of the HP shift when training models with SP, prior works have found empirically that the best learning rate changes as model size increases. Figure 2 shows the tuned max learning rate plotted against model width for a range of popular SP-trained large language models (Brown et al., 2020; Touvron et al., 2023; Rae et al., 2021; Hoffmann et al., 2022; Smith et al., 2022). Here, the community has used very expensive manual tuning and testing to find that maximum learning rates roughly follow a trend (similar to the μP trend!). Interestingly, the larger scale models slightly diverge from the trend. This could be indicative of sub-optimal learning rate tuning due to the prohibitively expensive tuning cost and attempts to avoid instability. By adopting μP, one can automate much of the tuning required with SP models for free.

Since the first set of SP large language model (LLM) families, it has been commonplace to reuse the HPs of predecessors corresponding to the model size being trained. This approach inherits the poor tuning of larger scale models. Furthermore, this approach can’t be used for new architectures or optimizers, so researchers must take on the burden of manual tuning themselves. The prohibitive cost of tuning makes it artificially harder for new techniques to disrupt the existing training recipes (Parameterization Lottery).

2. Improved loss at large scale due to improved HP tuning

As model size grows it becomes more expensive to perform an extensive HP search, resulting in sub-optimally tuned large models. Yang et al. (2021) showed that by performing a 200-sample random HP search with a 40M parameter model, they could use the optimal HPs on a GPT-3 6.7B run and achieve performance comparable to GPT-3 13B (Brown et al., 2020). In other words, that roughly translates to a 2x compute savings to reach the same performance! Additionally, Dey et al. (2023b) performed training recipe ablations with a 111M parameter model, then transferred their findings to a 3B parameter model and achieved performance comparable to contemporary 7B parameter models, while using 3.3x fewer training FLOPs!

3. Stable training – significantly decreased danger of instability at large scale

LLM training is notoriously prone to instability (see OPT logbook for SP training challenges (Zhang et al., 2022b)). Instability can present itself in the form of NaN loss, loss spikes, and/or loss divergences. When encountering instability, simple workarounds include resuming training with a lower learning rate (Zhang et al., 2022a) and/or skipping the data batches around where the instability occurred (Chowdhery et al., 2022).

While adopting μP does not completely solve the problem of instability, it certainly does eliminate HP selection as a major source of instability. Practitioners will still need to be mindful of precision, numerical stability, hardware failures, outlier data, etc. Anecdotally, since adopting μP at Cerebras, we seldom encounter training instability.

4. More predictable scaling due to μTransfer

For projects involving large-scale training, it is useful to fit scaling laws and be able to accurately extrapolate the performance a model will achieve given a compute budget. Dey et al. (2023a) and Yao et al. (2024) showed that μP models can achieve much tighter scaling law fits than SP models due to having consistent training dynamics and HP tuning across model scales. More accurate model performance extrapolation at large scales can help projects more reliably achieve their performance targets.

μP enables better research

The benefits of μP add up to enable better research:

μP Alleviates the “Parameterization Lottery”. The techniques we develop are subject to the “Parameterization Lottery” where research ideas can win because they are suited to existing hyperparameters and not because the idea is superior to alternative research directions (analogous to the “Hardware Lottery”; Hooker, 2020). Standard Parameterization (SP) studies run the risk of inconclusive or unpublished negative results due to the confounding variable of HP tuning. Research using μP can more robustly compare baselines with new methods, because optimal HPs are stable across model widths.

Simple and effective large-scale training. Large-scale training runs using μP enjoy better and more predictable performance with less worry of instability wasting compute time. Furthermore, μTransfer allows HP tuning budgets to be mostly reallocated towards training something new instead.

A Simple Approach to the μP Math

At a high-level, training neural networks is similar to simulating a partial differential equation that is developing over time. We would like that “simulation” to proceed smoothly and quickly, without any instabilities. To achieve stable and compute-efficient training, we can enforce certain invariants that keep each layer stable. Here we discuss the basic building blocks for these invariants, and then how they fit into layers and full models.

Basic Building Block: Controlled Activation Magnitudes

For each function we apply to a set of activations, we would like to ensure that, in expectation, the function does not cause the distribution of those activations to scale (change magnitude) with any model architecture hyperparameters, such as hidden size. Let’s start with a simple example: a fully-connected layer where input activations x are multiplied by the weight matrix W to produce output activations y.

Figure 3 diagrams the matrix multiplication, where the vector $x$ is multiplied by the weight matrix $W$ to produce vector $y$. In the matrix multiply, $x$ is dot-product multiplied by each column of $W$, so in those dot-products, each element of $x$ is first multiplied by the corresponding element in the column from $W$, and then the resulting values are reduced along $W$’s column dimension.

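Suppose elements of $x \in \mathbb{R}^{d_{\text{in}}}$ are drawn from distribution $\mathcal{N}(0, \sigma_x^2)$, and we multiply by matrix $W \in \mathbb{R}^{d_{\text{in}} \times d_{\text{out}}}$ with elements drawn from $\mathcal{N}(0, \sigma_W^2)$. If all activations and weights are independent, then the resulting vector $y \in \mathbb{R}^{d_{\text{out}}}$ will have elements approximately drawn from $\mathcal{N}(0, d_{\text{in}} \sigma_W^2 \sigma_x^2)$ (for large $d_{\text{in}}$). If we choose $\sigma_W = 1/\sqrt{d_{\text{in}}}$, then $y$ will have scale that is independent of the width of the layer! This sort of analysis may look familiar because it is used in popular initialization schemes like Glorot & Bengio (2010); He et al. (2015).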
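The width-independence claim above can be checked numerically. The following is a minimal sketch, assuming unit-variance inputs and the $1/\sqrt{d_{\text{in}}}$ weight scale; the function name `output_std` and its arguments are illustrative and are not part of the nanoGPT-mup code.

```python
import math
import random
import statistics


def output_std(d_in, d_out=256, sigma_x=1.0, seed=0):
    """Empirical std of y = x @ W when W_kj ~ N(0, 1/d_in)."""
    rng = random.Random(seed)
    sigma_w = 1.0 / math.sqrt(d_in)  # width-aware init scale (assumed setup)
    x = [rng.gauss(0.0, sigma_x) for _ in range(d_in)]
    y = []
    for _ in range(d_out):
        # One column of W: the dot product reduces over the d_in inputs.
        y.append(sum(x_k * rng.gauss(0.0, sigma_w) for x_k in x))
    return statistics.pstdev(y)


if __name__ == "__main__":
    for d_in in (64, 256, 1024):
        # The printed std should stay near sigma_x regardless of d_in.
        print(d_in, round(output_std(d_in), 3))
```

Swapping `sigma_w` for a fixed constant makes the printed values grow roughly like $\sqrt{d_{\text{in}}}$, which is the width dependence the initialization is meant to remove.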

Abstracting this a bit… If you understand the simple example above, then you’re ready to abstract it toward controlling full training dynamics. The first thing to note is that if every operation in a model is controlled such that the outputs do not scale with model width, then we can change the model width without changing overall training dynamics. The “proof” of this is inductive: If a first operation controls its outputs to have consistent scale as its inputs, then when it passes its outputs to the next operation, that second operation will see well-controlled inputs, and so on. Thus, to achieve scalable training dynamics, it is sufficient to step through each operation in the model and verify that the scale of its output activations does not change with respect to changes in model width. In short: If we control the dynamics of each operation, we control the full model’s training dynamics.

Operations in a training step

The example above applies to activations. However, during training we also need to ensure the same controlled behavior for gradients and weight updates. Figure 4 diagrams these three components (the forward pass, backward pass, and weight update) for a single layer in a model, where $x \in \mathbb{R}^{d_{\text{in}}}$, $y \in \mathbb{R}^{d_{\text{out}}}$, and width multiplier $m_d = d_{\text{in}}/d_{\text{in,base}} = d_{\text{out}}/d_{\text{out,base}}$. Here $d_{\text{in,base}}$ and $d_{\text{out,base}}$ refer to the dimensions of the small “proxy model” whose HPs we would like to transfer to a large model.

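As we scale model width by multiplier $m_d$ in a linear layer (i.e., $F$ is a fully-connected layer), our aim is to control:

1. Forward pass: $y = F(x, W) = xW$
2. Backward pass: $\nabla_x \mathcal{L} = (\nabla_y \mathcal{L})\, W^\top$
3. Effect of weight update on activations: $\Delta y = \eta\, x\, \Delta W$ (with learning rate $\eta$)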

More formally, we want the norm of activations $\|y\|_F$, gradients $\|\nabla_x \mathcal{L}\|_F$, and the effect of the weight update on activations $\|\Delta y\|_F$ to each be invariant to width multiplier $m_d$. We can ensure this by controlling the mean and variance of each.

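To control the forward pass, we can return to our earlier example but rather than making the scale of $y$ invariant to width $d_{\text{in}}$, let’s make it invariant to the change in width $m_d = d_{\text{in}}/d_{\text{in,base}}$. Then we can write $y_j \sim \mathcal{N}(0, m_d\, d_{\text{in,base}}\, \sigma_W^2 \sigma_x^2)$ and we can choose $\sigma_W = \sigma_{\text{base}}/\sqrt{m_d}$ to ensure $y_j \sim \mathcal{N}(0, d_{\text{in,base}}\, \sigma_{\text{base}}^2 \sigma_x^2)$, which no longer depends on $m_d$. Phrasing things in terms of $m_d$ rather than $d_{\text{in}}$ allows us to mimic the training dynamics of some baseline model as we scale up.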

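Conveniently, the backward pass calculation is analogous to the forward pass, so the calculation of the gradient, $\nabla_x \mathcal{L}$, follows the same math as the forward pass (e.g., for the matmul from Figure 3). For the gradient to a matrix multiplication, the only difference from the forward pass is that the reduction dimension is the output dimension of the forward layer, $d_{\text{out}}$. We can make $\|\nabla_x \mathcal{L}\|_F$ invariant to $m_d$ by setting $\sigma_W = \sigma_{\text{base}}/\sqrt{m_d}$, which ensures the variance of each element of $\nabla_x \mathcal{L}$ is proportional to $d_{\text{out,base}}\, \sigma_{\text{base}}^2$ rather than to $d_{\text{out}}$. Typically when model width is scaled, each dimension of a hidden weight matrix is scaled equally: $m_d = d_{\text{in}}/d_{\text{in,base}} = d_{\text{out}}/d_{\text{out,base}}$. This assumption of equal scaling allows the same initialization to control both the forward and backward passes, even for a rectangular weight matrix.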

The last part of a layer that needs to be controlled is the weight update. The optimizer takes the gradient, the forward activations, and uses its internal state to calculate the weight update. The magnitude of the weight update is controlled by the learning rate, which we will use to ensure we maximally update the weights in expectation throughout training while maintaining stability. Calculating the correct learning rate for the weight update is a little trickier than the activation and gradient, because we need to estimate the scale of activations, $y$, on the next training step. Namely, we want to choose the learning rate $\eta$ on the first training step, so that the output activations $y_2$ on the second training step have well-controlled size. Once again, assuming $F$ is a simple matrix multiplication, and writing $\Delta W_1$ for the optimizer’s update before it is scaled by the learning rate:

$$y_2 = x_2 W_2 = x_2 (W_1 + \eta\, \Delta W_1) = x_2 W_1 + \eta\, x_2 \Delta W_1 \tag{Eqn. 1}$$

Since we have already controlled $x_2 W_1$ with the initialization above, we only need to consider the change due to the weight update; $\Delta y_2 = \eta\, x_2 \Delta W_1$ must scale independently of the model’s width. Here again, this calculation is structured analogously to the matrix multiply example in Figure 3. Unlike the simple example, however, the weight update $\Delta W_1$ and the forward activations $x_2$ on the second training step are no longer independent. They will have covariance, because $x_1$ and $x_2$ are drawn from the same distribution. Thus, the expectation of their dot-product is likely to be non-zero. In fact, by the Law of Large Numbers, this dot-product can be shown to grow proportionally to the change in width $m_d$. Thus, to control the weight update in expectation, we can set $\eta = \eta_{\text{base}}/m_d$ [1]. This derivation applies to both Stochastic Gradient Descent (SGD) and Adam optimizers, but note that accounting for optimizer transformations can be tricky, so we spare the reader from the complexity here [2].
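To make the role of correlation concrete, the sketch below compares how a dot product over $d$ elements grows when the two vectors are independent versus fully correlated. Using identical vectors is a deliberately extreme stand-in for the covariance between the weight update and the next step's activations; the helper `mean_abs_dot` is hypothetical and only for illustration.

```python
import random


def mean_abs_dot(d, correlated, trials=200, seed=0):
    """Average |u . v| over random draws of length-d Gaussian vectors."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        u = [rng.gauss(0.0, 1.0) for _ in range(d)]
        # Correlated case: reuse u itself (an idealized, extreme assumption).
        v = u if correlated else [rng.gauss(0.0, 1.0) for _ in range(d)]
        total += abs(sum(a * b for a, b in zip(u, v)))
    return total / trials


if __name__ == "__main__":
    for d in (64, 256, 1024):
        print(d, mean_abs_dot(d, correlated=False), mean_abs_dot(d, correlated=True))
```

In this toy setting the independent dot product grows roughly like $\sqrt{d}$ while the correlated one grows roughly like $d$, which is why a learning rate proportional to $1/m_d$ (rather than $1/\sqrt{m_d}$) is the one that keeps the update's effect on activations constant.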

Summary: For training, μP controls the forward and backward pass operations with weight initialization, and it controls the weight update using learning rate scaling.
For a more detailed derivation, refer to the Appendix.

Practitioner’s guide to μP

In this section we will explain how to implement, verify, and use μP for a transformer language model.

Implementation

The implementation is actually quite straightforward. Table 1 summarizes the necessary adjustments to implement μP for a Transformer with tied embeddings. It is common for research groups to work off of complex training codebases (e.g. Megatron-LM (Shoeybi et al., 2019), GPT-NeoX (Andonian et al., 2023), DeepSpeed (Rasley et al., 2020), timm (Wightman, 2019)) which makes it difficult to adopt the original μP library [3]. Internally, we found it simple to integrate μP into our existing code bases by making targeted changes to our code following Table 1. Here $m_d$ is the width multiplier and $d_{\text{head}}$ is the dimension of each attention head (typically 64 or 128). No additional corrections are needed for biases or layer-norm layers.

Table 1: Summary of SP and μP differences for a decoder-only transformer trained with Adam.

The learning rate and initialization variance of each hidden layer are scaled by $1/m_d$, as we covered in the previous section. The attention logits are scaled by $1/d_{\text{head}}$ instead of $1/\sqrt{d_{\text{head}}}$ to account for correlation between $Q$ and $K$ that emerges during training. To support tied embedding weights, the embedding initialization must be the same as the unembedding initialization. To ensure proper scales of activations, the output logit forward pass is scaled by $1/m_d$ because the dot product reduces along the $d$ hidden elements to produce a $v$-dimensional output, where $v$ is the vocabulary size. Finally, $\alpha_{\text{input}}$ and $\alpha_{\text{output}}$ are tunable scalars that can account for differences in embedding activation scales not proportional to $m_d$, such as a changing vocab size $v$.

To find the optimal HPs, one must tune $\sigma_{\text{base}}$, $\eta_{\text{base}}$, $\alpha_{\text{input}}$, and $\alpha_{\text{output}}$. One could also add tunable scalar base parameters anywhere else in the model, as long as they are fixed as $m_d$ varies.
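The adjustments described above can be collected into one small helper. This is a sketch under stated assumptions rather than the reference implementation: the names `MuPScaling` and `mup_scaling` are hypothetical, and only the rules spelled out in the text are encoded (hidden init variance and hidden learning rate scaled by $1/m_d$, output logits scaled by $1/m_d$ times a tunable multiplier, attention logits scaled by $1/d_{\text{head}}$).

```python
import math
from dataclasses import dataclass


@dataclass
class MuPScaling:
    hidden_init_std: float          # std for hidden weight matrices
    hidden_lr: float                # Adam learning rate for hidden weights
    output_logit_multiplier: float  # factor applied to the output logits
    attention_scale: float          # factor applied to the Q.K attention logits


def mup_scaling(d_model, d_base, d_head, sigma_base, eta_base, alpha_output=1.0):
    """Derive width-dependent settings from base HPs (illustrative helper)."""
    m_d = d_model / d_base  # width multiplier relative to the proxy model
    return MuPScaling(
        hidden_init_std=sigma_base / math.sqrt(m_d),  # variance scaled by 1/m_d
        hidden_lr=eta_base / m_d,
        output_logit_multiplier=alpha_output / m_d,
        attention_scale=1.0 / d_head,                 # instead of 1/sqrt(d_head)
    )


if __name__ == "__main__":
    # Example values are arbitrary and chosen only to show the arithmetic.
    print(mup_scaling(d_model=1024, d_base=256, d_head=64,
                      sigma_base=0.02, eta_base=6e-4))
```

Keeping the base HPs and the width multiplier as separate inputs mirrors the idea that the base values are tuned once on the proxy model and then held fixed as $m_d$ varies.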

To provide a concrete reference point, we also created a NanoGPT implementation which includes working examples of verifying and using μP: https://github.com/EleutherAI/nanoGPT-mup. This codebase produced each of the figures in this section.

Coordinate check test

The coordinate check test is a simple and cheap way to test your implementation and should be your first verification step.

As we explained in the previous section, the goal of μP is to ensure the magnitude of the distribution of all activations is independent of any change in model width. To achieve this, activations must be controlled at initialization and after every training step. The coordinate check test involves training models of different widths for 10 steps. During each training step, we record the average size of activations for each layer type.

Our NanoGPT reference implementation includes a working example of the coordinate check test [4] which produces all our coordinate check figures. In our coordinate check, we plot the mean absolute activation value, averaged across all layers of that type. This metric implicitly tests that both the mean and variance of activations are independent of change in model width. Note that typically the mean activation value is zero so one could simplify the y-axis further and only plot the variance of activations. Plotting the mean and variance separately could help debug more nuanced issues. We train a two-layer GPT-2 model for ten steps for several different widths and five seeds.
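The idea behind the test can be sketched on a single square linear layer, independent of any training framework. The toy below is an assumption-laden illustration, not the nanoGPT-mup test: it uses one input vector, a loss of $\tfrac{1}{2}\|y\|^2$, and a sign-based update (which is what bias-corrected Adam reduces to on its first step when $\epsilon$ is ignored). The function `coord_check` and its defaults are hypothetical.

```python
import math
import random


def sign(v):
    return (v > 0) - (v < 0)


def coord_check(d, d_base=64, sigma_base=0.1, eta_base=0.01, use_mup=True, seed=0):
    """Return (mean |y| at init, mean |change in y| after one sign update)."""
    rng = random.Random(seed)
    m_d = d / d_base
    sigma = sigma_base / math.sqrt(m_d) if use_mup else sigma_base
    eta = eta_base / m_d if use_mup else eta_base

    x = [rng.gauss(0.0, 1.0) for _ in range(d)]
    W = [[rng.gauss(0.0, sigma) for _ in range(d)] for _ in range(d)]  # W[k][j]

    def forward(weights):
        return [sum(x[k] * weights[k][j] for k in range(d)) for j in range(d)]

    y0 = forward(W)
    # With loss 0.5 * ||y||^2: dL/dy_j = y_j, so dL/dW_kj = x_k * y_j.
    W1 = [[W[k][j] - eta * sign(x[k] * y0[j]) for j in range(d)] for k in range(d)]
    y1 = forward(W1)

    init_size = sum(abs(v) for v in y0) / d
    update_size = sum(abs(a - b) for a, b in zip(y1, y0)) / d
    return init_size, update_size


if __name__ == "__main__":
    for d in (64, 128, 256):
        print(d, "SP", coord_check(d, use_mup=False), "muP", coord_check(d, use_mup=True))
```

In this toy, the SP columns should grow with width (roughly $\sqrt{d}$ at initialization and $d$ for the update), while the μP columns should stay roughly flat, which is the signature a passing coordinate check looks for.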

First we perform the coordinate check for an SP model. Figure 5 shows that at each training step, activation size increases proportionally to model width. This is the source of optimum HP shift and instability in SP models.

Next we modify our parameterization to include the μP adjustment for hidden weight initialization variance: $\sigma^2_{\mu P} = \sigma^2_{\text{base}} / m_d$. Figure 6 shows this adjustment controls the size of hidden activations at initialization, but after each weight update, activation size grows proportional to model width.

Figure 6: Coordinate check for SP with μP hidden init. var. ($\sigma^2_{\mu P} = \sigma^2_{\text{base}} / m_d$).

Next we modify our parameterization to include the μP adjustment for hidden learning rate: $\eta_{\mu P} = \eta_{\text{base}} / m_d$. Figure 7 shows these adjustments now ensure the size of hidden activations does not scale proportional to model width, but the output logit scale still grows.

Figure 7: Coordinate check for SP with μP hidden init. var. ($\sigma^2_{\mu P} = \sigma^2_{\text{base}} / m_d$) and μP hidden LR ($\eta_{\mu P} = \eta_{\text{base}} / m_d$).

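Next we modify our parameterization to include a partial μP adjustment for output logits: $y_{\text{logits}} = x W_{\text{emb}}^\top / \sqrt{m_d}$, where $W_{\text{emb}}$ is the tied embedding matrix. Figure 8 shows these adjustments control the output logit scale at initialization, but there is still growth after a few steps.

Figure 8: Coordinate check for SP with μP hidden init. var. ($\sigma^2_{\mu P} = \sigma^2_{\text{base}} / m_d$), μP hidden LR ($\eta_{\mu P} = \eta_{\text{base}} / m_d$), and a partial μP adjustment for output logits ($y_{\text{logits}} = x W_{\text{emb}}^\top / \sqrt{m_d}$).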

The $1/\sqrt{m_d}$ output logit multiplier is only suitable for the beginning of training where activations aren’t correlated with weights yet. During later training, activations will correlate with weights, so a $1/m_d$ output logit multiplier is required, and we use this multiplier throughout training. Next we modify our parameterization to include the full μP adjustment for output logits: $y_{\text{logits}} = x W_{\text{emb}}^\top / m_d$, where $W_{\text{emb}}$ is the tied embedding matrix. Figure 9 shows these adjustments now pass the coordinate check test: the size of activations does not scale proportional to model width!

Figure 9: Coordinate check for SP with μP hidden init. var. ($\sigma^2_{\mu P} = \sigma^2_{\text{base}} / m_d$), μP hidden LR ($\eta_{\mu P} = \eta_{\text{base}} / m_d$), and the μP adjustment for output logits ($y_{\text{logits}} = x W_{\text{emb}}^\top / m_d$).

Finally, there is one more modification prescribed by μP: scaling the attention logits as $Q^\top K / d_{\text{head}}$ rather than $Q^\top K / \sqrt{d_{\text{head}}}$. The reasoning for this change is similar to the output logits multiplier: the keys and queries in the model are likely to rotate to align later in training. We modify our parameterization to include this and show in Figure 10 that it has minimal effect on the coordinate check. This is because the attention logit adjustment is meant to counteract the correlation of $Q$ and $K$ that emerges later in training, beyond the ten steps the test covers.

Figure 10: Coordinate check for μP

μTransfer test

The μTransfer test examines whether optimum HPs are stable when model width is varied (Figure 1). Once your coordinate check tests are looking good, we recommend running a μTransfer test as a final integration test. Our NanoGPT reference implementation includes a working example of the μTransfer test, which produces Figures 11 and 12.

We test learning rate transfer on the openwebtext dataset. We again use two-layer GPT-2 models trained on 33M tokens with four different model widths and three seeds each using NVIDIA A100 GPU instances. Figure 11 shows the optimal learning rate remains stable as we vary model width for μP, unlike the SP models.

Figure 11: μTransfer learning rate test on 33M tokens from the openwebtext dataset.

We also include an even smaller scale test that can run on an Apple M1 Pro chip overnight. We train two-layer GPT-2 models for 1 epoch of the shakespeare_char dataset (1M tokens) with four different model widths and three seeds each. Figure 12 shows the optimal learning rate remains stable as we vary model width for μP, unlike the SP models.

Figure 12: μTransfer learning rate test on 1M tokens from the shakespeare_char dataset.

Transferring optimal HPs from a small scale to a large scale

Once you have validated your μP implementation through coordinate check and μTransfer tests, you are finally ready to use μP to improve large scale training runs. You can perform a random HP search over a small “proxy model”. Following Yang et al. (2021), we choose a hidden size of 256 to ensure a large-enough scale for the law of large numbers and central limit theorem to converge. We choose depth roughly equivalent to the large scale to mitigate the effect of depth shifting the optimum HPs (Yang et al., 2023). We train our small proxy model for 20 tokens per parameter (following Hoffmann et al. (2022)) and perform a random search over four HPs: base initialization standard deviation $\sigma_{\text{base}}$, base learning rate $\eta_{\text{base}}$, embedding multiplier $\alpha_{\text{input}}$, and output logit multiplier $\alpha_{\text{output}}$. Note that one could also define additional tunable scalar multiplier hyperparameters. We find that if the proxy model is trained with a batch size smaller than the critical batch size (McCandlish et al., 2018), learning rate transfer to a large model trained at or above the critical batch size will be sub-optimal. Therefore it is important to train your proxy model with a large enough batch size. Anecdotally, at Cerebras we have observed excellent transfer across datasets, echoing the dataset transfer results of Yang et al. (2021). Finally, we recommend re-tuning your HPs whenever you make a change to your model architecture (e.g. attention algorithm, nonlinearity, position embeddings, vocabulary size) or training procedure (e.g. learning rate schedule).
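A random search over the four base HPs might be set up along the following lines. This is only a sketch: the search ranges, the log-uniform sampling, and the names `SEARCH_SPACE` and `sample_configs` are assumptions made for illustration and are not taken from the post or from Yang et al. (2021).

```python
import math
import random

# Hypothetical search ranges, chosen only to illustrate the mechanics.
SEARCH_SPACE = {
    "sigma_base": (1e-2, 1e-1),
    "eta_base": (1e-4, 1e-1),
    "alpha_input": (1.0, 20.0),
    "alpha_output": (0.5, 10.0),
}


def sample_log_uniform(rng, low, high):
    """Draw a value whose logarithm is uniform on [log(low), log(high)]."""
    return math.exp(rng.uniform(math.log(low), math.log(high)))


def sample_configs(n, seed=0):
    """Return n random HP configurations for the proxy model."""
    rng = random.Random(seed)
    return [
        {name: sample_log_uniform(rng, low, high)
         for name, (low, high) in SEARCH_SPACE.items()}
        for _ in range(n)
    ]


if __name__ == "__main__":
    for config in sample_configs(3):
        print(config)
```

Each sampled configuration would be used to train the width-256 proxy model; the best-performing base values are then reused unchanged at the larger width, with only $m_d$ changing.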

Conclusion

We hope this post has convinced you that μP is worth implementing and reduced the barriers for you to adopt it! We believe wider adoption and study of μP can raise the bar for deep learning research by helping to alleviate the Parameterization Lottery.

©Copyright Cerebras, Eleuther.


Footnotes

[1] If $\mathbb{E}[x \Delta W] = 0$ and we instead had to control the variance, then $\eta = \eta_{\text{base}}/\sqrt{m_d}$ would be the appropriate scaling. See Section 2.4 of Everett et al. (2024) for a good discussion on this.
[2] Sometimes, SGD optimizers will formulate the weight update to divide the gradient by the hidden size. By dividing out hidden size here, the learning rate correction for SGD will not need to contain a hidden size term. In particular, Yang et al. (2021) use this formulation for their derivation.
[3] https://github.com/microsoft/mup
[4] https://github.com/EleutherAI/nanoGPT-mup

References

Alex Andonian, Quentin Anthony, Stella Biderman, Sid Black, Preetham Gali, Leo Gao, Eric Hallahan, Josh Levy-Kramer, Connor Leahy, Lucas Nestler, et al. GPT-NeoX: Large scale autoregressive language modeling in PyTorch. GitHub Repo, 9 2023. URL https://www.github.com/eleutherai/gpt-neox.

Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Language Models are Few-Shot Learners. In Advances in Neural Information Processing Systems, 2020.

Aakanksha Chowdhery, Sharan Narang, Jacob Devlin, Maarten Bosma, Gaurav Mishra, Adam Roberts, Paul Barham, Hyung Won Chung, Charles Sutton, Sebastian Gehrmann, et al. Palm: Scaling language modeling with pathways, 2022.

Nolan Dey, Gurpreet Gosal, Zhiming Chen, Hemant Khachane, William Marshall, Ribhu Pathria, Marvin Tom, and Joel Hestness. Cerebras-GPT: Open Compute-Optimal Language Models Trained on the Cerebras Wafer-Scale Cluster, 2023a. URL https://arxiv.org/abs/2304.03208.

Nolan Dey, Daria Soboleva, Faisal Al-Khateeb, Bowen Yang, Ribhu Pathria, Hemant Khachane, Shaheer Muhammad, Zhiming Chen, Robert Myers, et al. BTLM-3B-8K: 7B parameter performance in a 3B parameter model, 2023b. URL https://arxiv.org/abs/2309.11568.

Katie Everett, Lechao Xiao, Mitchell Wortsman, Alexander Alemi, Roman Novak, Peter Liu, Izzeddin Gur, Jascha Sohl-Dickstein, Leslie Kaelbling, Jaehoon Lee, and Jeffrey Pennington. Scaling exponents across parameterizations and optimizers, 07 2024. URL https://openreview.net/pdf/579c102a8c067102c85e27612c36d7a356ea9b0b.pdf.

Xavier Glorot and Yoshua Bengio. Understanding the difficulty of training deep feedforward neural networks. In Yee Whye Teh and Mike Titterington (eds.), Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, volume 9 of Proceedings of Machine Learning Research, pp. 249–256, Chia Laguna Resort, Sardinia, Italy, 13–15 May 2010. PMLR. URL https://proceedings.mlr.press/v9/glorot10a.html.

Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Delving deep into rectifiers: Surpassing human-level performance on imagenet classification. IEEE International Conference on Computer Vision (ICCV 2015), 1502, 02 2015. doi: 10.1109/ICCV.2015.123.

Jordan Hoffmann, Sebastian Borgeaud, Arthur Mensch, Elena Buchatskaya, Trevor Cai, Eliza Rutherford, Diego de las Casas, Lisa Anne Hendricks, Johannes Welbl, Aidan Clark, et al. An Empirical Analysis of Compute-optimal Large Language Model Training. In The Conference on Neural Information Processing Systems (NeurIPS), 2022.

Sara Hooker. The hardware lottery, 2020. URL https://arxiv.org/abs/2009.06489.

Sam McCandlish, Jared Kaplan, Dario Amodei, and OpenAI Dota Team. An empirical model of large-batch training, 2018. URL https://arxiv.org/abs/1812.06162.

Jack W. Rae, Sebastian Borgeaud, Trevor Cai, Katie Millican, Jordan Hoffmann, Francis Song, John Aslanides, Sarah Henderson, Roman Ring, Susannah Young, et al. Scaling Language Models: Methods, Analysis & Insights from Training Gopher, 2021.

Jeff Rasley, Samyam Rajbhandari, Olatunji Ruwase, and Yuxiong He. Deepspeed: System optimizations enable training deep learning models with over 100 billion parameters. In Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, KDD ’20, pp. 3505–3506, New York, NY, USA, 2020. Association for Computing Machinery. ISBN 9781450379984. doi: 10.1145/3394486.3406703. URL https://doi.org/10.1145/3394486.3406703.

Mohammad Shoeybi, Mostofa Patwary, Raul Puri, Patrick LeGresley, Jared Casper, and Bryan Catanzaro. Megatron-LM: Training Multi-Billion Parameter Language Models using GPU Model Parallelism. arXiv preprint arXiv:1909.08053, 2019.

Shaden Smith, Mostofa Patwary, Brandon Norick, Patrick LeGresley, Samyam Rajbhandari, Jared Casper, Zhun Liu, Shrimai Prabhumoye, George Zerveas, Vijay Korthikanti, et al. Using DeepSpeed and Megatron to Train Megatron-Turing NLG 530B, A Large-Scale Generative Language Model, 2022. URL https://arxiv.org/abs/2201.11990.

Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, et al. LLaMA: Open and Efficient Foundation Language Models, 2023. URL https://arxiv.org/abs/2302.13971.

Ross Wightman. Pytorch image models. https://github.com/rwightman/pytorch-image-models, 2019.

Greg Yang and Edward J. Hu. Tensor programs iv: Feature learning in infinite-width neural networks. In Marina Meila and Tong Zhang (eds.), Proceedings of the 38th International Conference on Machine Learning, volume 139 of Proceedings of Machine Learning Research, pp. 11727–11737. PMLR, 18–24 Jul 2021. URL https://proceedings.mlr.press/v139/yang21c.html.

Greg Yang, Edward Hu, Igor Babuschkin, Szymon Sidor, Xiaodong Liu, David Farhi, Nick Ryder, Jakub Pachocki, Weizhu Chen, and Jianfeng Gao. Tuning Large Neural Networks via Zero-Shot Hyperparameter Transfer. In Advances in Neural Information Processing Systems, 2021.

Greg Yang, Dingli Yu, Chen Zhu, and Soufiane Hayou. Tensor programs vi: Feature learning in infinite-depth neural networks, 2023. URL https://arxiv.org/abs/2310.02244.

Yiqun Yao, Siqi Fan, Xiusheng Huang, Xuezhi Fang, Xiang Li, Ziyi Ni, Xin Jiang, Xuying Meng, Peng Han, Shuo Shang, et al. nanolm: an affordable llm pre-training benchmark via accurate loss prediction across scales, 2024. URL https://arxiv.org/abs/2304.06875.

Susan Zhang, Stephen Roller, Naman Goyal, Mikel Artetxe, Moya Chen, Shuohui Chen, Christopher Dewan, Mona Diab, Xian Li, Xi Victoria Lin, et al. Opt: Open pre-trained transformer language models, 2022a.

Susan Zhang, Stephen Roller, Naman Goyal, and Sam Shleifer. Chronicles of opt development, 2022b. URL https://github.com/facebookresearch/metaseq/tree/main/projects/OPT/chronicles.

Appendix

A more thorough math explanation

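Throughout this section, we add a batch dimension to $x$ and $y$ such that $x \in \mathbb{R}^{b \times d_{\text{in}}}$ and $y \in \mathbb{R}^{b \times d_{\text{out}}}$, where $b$ is the batch size.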

Forward pass at initialization

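The first stage where we would like to control training dynamics is in the layer’s forward function. We can write the forward pass as:

$$y_{ij} = \sum_{k=1}^{d_{\text{in}}} x_{ik} W_{kj} \tag{Eqn. 2}$$

Our goal is to ensure $\|y\|_F$ is invariant to changes in width $m_d$. To achieve this we can ensure the expected mean and variance of elements of $y$ are invariant to $m_d$.

Mean: As expectation is linear and $x$ and $W$ are independent at initialization:

$$\mathbb{E}[y_{ij}] = \mathbb{E}\Big[\sum_{k=1}^{d_{\text{in}}} x_{ik} W_{kj}\Big] = \sum_{k=1}^{d_{\text{in}}} \mathbb{E}[x_{ik}]\, \mathbb{E}[W_{kj}] \tag{Eqn. 3}$$

Therefore, since at initialization $\mathbb{E}[W_{kj}] = 0$, we have $\mathbb{E}[y_{ij}] = 0$ and the mean is controlled.

Variance: As expectation is linear and each weight element is IID:

$$\text{Var}(y_{ij}) = \text{Var}\Big(\sum_{k=1}^{d_{\text{in}}} x_{ik} W_{kj}\Big) = \sum_{k=1}^{d_{\text{in}}} \text{Var}(x_{ik} W_{kj}) \tag{Eqn. 4}$$

Then, since $x$ and $W$ are independent at initialization:

$$\text{Var}(y_{ij}) = \sum_{k=1}^{d_{\text{in}}} \Big[ \big(\text{Var}(x_{ik}) + \mathbb{E}[x_{ik}]^2\big)\big(\text{Var}(W_{kj}) + \mathbb{E}[W_{kj}]^2\big) - \mathbb{E}[x_{ik}]^2\, \mathbb{E}[W_{kj}]^2 \Big] \tag{Eqn. 5}$$

Finally, since at initialization $\mathbb{E}[W_{kj}] = 0$ and redefining $\text{Var}(W_{kj}) = \sigma_W^2$ (and treating the elements of $x$ as identically distributed):

$$\text{Var}(y_{ij}) = d_{\text{in}}\, \sigma_W^2 \big(\text{Var}(x) + \mathbb{E}[x]^2\big) \tag{Eqn. 6}$$

Rewriting in terms of width multiplier $m_d = d_{\text{in}}/d_{\text{in,base}}$:

$$\text{Var}(y_{ij}) = m_d\, d_{\text{in,base}}\, \sigma_W^2 \big(\text{Var}(x) + \mathbb{E}[x]^2\big) \tag{Eqn. 7}$$

Solution: To ensure $\text{Var}(y_{ij})$ scales independently of $m_d$, we choose to set $\sigma_W^2 = \sigma_{\text{base}}^2 / m_d$. This ensures that $\|y\|_F$ is invariant to changes in width $m_d$.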
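As a quick numerical illustration of the forward-pass rule $\sigma_W^2 = \sigma_{\text{base}}^2 / m_d$ from the main text, the worked example below plugs in hypothetical values (they are not taken from the post) and checks that the product $d_{\text{in}}\, \sigma_W^2$, which sets the output variance, matches that of the base model.

```latex
% Hypothetical values, chosen only for illustration.
\begin{align*}
d_{\text{in,base}} &= 256, \qquad d_{\text{in}} = 4096, \qquad
m_d = \frac{4096}{256} = 16, \qquad \sigma_{\text{base}} = 0.02 \\
\sigma_W^2 &= \frac{\sigma_{\text{base}}^2}{m_d}
            = \frac{4 \times 10^{-4}}{16} = 2.5 \times 10^{-5},
\qquad \sigma_W = 0.005 \\
d_{\text{in}}\, \sigma_W^2 &= 4096 \times 2.5 \times 10^{-5} = 0.1024
            = 256 \times 4 \times 10^{-4}
            = d_{\text{in,base}}\, \sigma_{\text{base}}^2
\end{align*}
```

The wide layer and the base layer therefore produce outputs with the same variance for the same input statistics, which is the invariance the derivation is after.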

Backward gradient pass at initialization

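The next stage where we would like to control training dynamics is in the layer’s backward pass. We can rewrite the backward pass as:

$$[\nabla_x \mathcal{L}]_{ik} = \sum_{j=1}^{d_{\text{out}}} [\nabla_y \mathcal{L}]_{ij}\, W_{kj} \tag{Eqn. 8}$$

Our goal is to ensure $\|\nabla_x \mathcal{L}\|_F$ is invariant to changes in width $m_d$. To achieve this, we can ensure the expected mean and variance of elements of $\nabla_x \mathcal{L}$ are invariant to $m_d$.

Mean: As expectation is linear and $\nabla_y \mathcal{L}$ and $W$ are (roughly) independent at initialization:

$$\mathbb{E}\big[[\nabla_x \mathcal{L}]_{ik}\big] = \sum_{j=1}^{d_{\text{out}}} \mathbb{E}\big[[\nabla_y \mathcal{L}]_{ij}\big]\, \mathbb{E}[W_{kj}] \tag{Eqn. 9}$$

Therefore, since at initialization $\mathbb{E}[W_{kj}] = 0$, we have $\mathbb{E}\big[[\nabla_x \mathcal{L}]_{ik}\big] = 0$ and the mean is controlled.

Variance: As expectation is linear and each weight element is IID:

$$\text{Var}\big([\nabla_x \mathcal{L}]_{ik}\big) = \sum_{j=1}^{d_{\text{out}}} \text{Var}\big([\nabla_y \mathcal{L}]_{ij}\, W_{kj}\big) \tag{Eqn. 10}$$

From the backward pass mean derivation, we know $\mathbb{E}\big[[\nabla_x \mathcal{L}]_{ik}\big] = 0$. Then, similar to the forward pass variance derivation, we can simplify using the facts that at initialization, $\nabla_y \mathcal{L}$ and $W$ are (roughly) independent and $\mathbb{E}[W_{kj}] = 0$. Similarly we can also define $\text{Var}(W_{kj}) = \sigma_W^2$ and rewrite in terms of width multiplier $m_d = d_{\text{out}}/d_{\text{out,base}}$:

$$\text{Var}\big([\nabla_x \mathcal{L}]_{ik}\big) = m_d\, d_{\text{out,base}}\, \sigma_W^2 \big(\text{Var}(\nabla_y \mathcal{L}) + \mathbb{E}[\nabla_y \mathcal{L}]^2\big) \tag{Eqn. 11}$$

Solution: To ensure $\text{Var}\big([\nabla_x \mathcal{L}]_{ik}\big)$ scales independently of $m_d$, we choose to set $\sigma_W^2 = \sigma_{\text{base}}^2 / m_d$. This ensures that $\|\nabla_x \mathcal{L}\|_F$ is invariant to changes in width $m_d$. Typically when model width is scaled, each dimension of a hidden weight matrix is scaled equally: $m_d = d_{\text{in}}/d_{\text{in,base}} = d_{\text{out}}/d_{\text{out,base}}$. This assumption of equal scaling allows the same initialization to control both the forward and backward passes, even for a rectangular weight matrix.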

Effect of weight update on activations

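We desire that the Frobenius norm of the effect of the weight update on activations, $\|\Delta y\|_F$, is invariant to changes in width $m_d$. To achieve this we examine the expected size of each element. Here, $\eta$ is the learning rate and $\Delta W$ is the optimizer’s update before it is scaled by $\eta$.

$$\Delta y_{ij} = \eta \sum_{k=1}^{d_{\text{in}}} x_{ik}\, \Delta W_{kj} \tag{Eqn. 12}$$

Mean: As expectation is linear:

$$\mathbb{E}[\Delta y_{ij}] = \eta \sum_{k=1}^{d_{\text{in}}} \mathbb{E}[x_{ik}\, \Delta W_{kj}] \tag{Eqn. 13}$$

Since $\Delta W$ was derived from $x$, there is covariance between these variables and $\mathbb{E}[x_{ik}\, \Delta W_{kj}]$ is non-zero.

By the Law of Large Numbers:

$$\mathbb{E}[\Delta y_{ij}] \to \eta\, d_{\text{in}}\, \mathbb{E}[x_{ik}\, \Delta W_{kj}] \quad \text{as } d_{\text{in}} \to \infty \tag{Eqn. 14}$$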

SGD learning rate adjustment

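Following the formulation in Yang et al. (2021), in which the gradient is divided by the hidden size, SGD weight updates take the form (with $s$ indexing the $b$ examples in the batch):

$$\Delta W_{kj} = -\frac{1}{d_{\text{in}}} \sum_{s=1}^{b} x_{sk}\, [\nabla_y \mathcal{L}]_{sj} \tag{Eqn. 15}$$

so we can rewrite Equation 14 as:

$$\mathbb{E}[\Delta y_{ij}] \to -\eta\, \mathbb{E}\Big[x_{ik} \sum_{s=1}^{b} x_{sk}\, [\nabla_y \mathcal{L}]_{sj}\Big] \tag{Eqn. 16}$$

Solution: The factor of $d_{\text{in}}$ cancels, so to ensure $\Delta y_{ij}$ and $\|\Delta y\|_F$ are scale invariant to $m_d$, we choose $\eta = \eta_{\text{base}}$.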

Adam learning rate adjustment

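Following the formulation in Yang et al. (2021), Adam weight updates take the form (with $s$ indexing the $b$ examples in the batch and superscript $t$ denoting the training step):

$$\Delta W_{kj} = -\frac{\sum_{t=1}^{T} \gamma_t \sum_{s=1}^{b} x^{t}_{sk}\, [\nabla_y \mathcal{L}]^{t}_{sj}}{\sqrt{\sum_{t=1}^{T} \omega_t \Big(\sum_{s=1}^{b} x^{t}_{sk}\, [\nabla_y \mathcal{L}]^{t}_{sj}\Big)^2}} \tag{Eqn. 17}$$

where $T$ is the current training step and $\gamma_t, \omega_t$ are the moving average weights at each training step. We can rewrite Equation 14 as:

$$\mathbb{E}[\Delta y_{ij}] \to -\eta\, d_{\text{in}}\, \mathbb{E}\left[x_{ik}\, \frac{\sum_{t=1}^{T} \gamma_t \sum_{s=1}^{b} x^{t}_{sk}\, [\nabla_y \mathcal{L}]^{t}_{sj}}{\sqrt{\sum_{t=1}^{T} \omega_t \Big(\sum_{s=1}^{b} x^{t}_{sk}\, [\nabla_y \mathcal{L}]^{t}_{sj}\Big)^2}}\right] \tag{Eqn. 18}$$

Rewriting in terms of width multiplier $m_d = d_{\text{in}}/d_{\text{in,base}}$:

$$\mathbb{E}[\Delta y_{ij}] \to -\eta\, m_d\, d_{\text{in,base}}\, \mathbb{E}\left[x_{ik}\, \frac{\sum_{t=1}^{T} \gamma_t \sum_{s=1}^{b} x^{t}_{sk}\, [\nabla_y \mathcal{L}]^{t}_{sj}}{\sqrt{\sum_{t=1}^{T} \omega_t \Big(\sum_{s=1}^{b} x^{t}_{sk}\, [\nabla_y \mathcal{L}]^{t}_{sj}\Big)^2}}\right] \tag{Eqn. 19}$$

Solution: Because Adam normalizes the gradient, the expectation does not shrink with width, so to ensure $\Delta y_{ij}$ and $\|\Delta y\|_F$ are scale invariant to $m_d$, we choose $\eta = \eta_{\text{base}} / m_d$.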
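The reason Adam needs the explicit $1/m_d$ learning rate correction is that its update is normalized per weight, so its magnitude does not depend on the gradient's scale. The sketch below implements the standard bias-corrected Adam update direction for a single scalar weight to show this; the function name `adam_direction` is illustrative, and the default coefficients are the commonly used ones rather than values from the post.

```python
import math


def adam_direction(grads, beta1=0.9, beta2=0.999, eps=1e-8):
    """Bias-corrected Adam direction m_hat / (sqrt(v_hat) + eps) for one weight.

    `grads` is the non-empty history of gradients seen so far for that weight.
    """
    if not grads:
        raise ValueError("grads must contain at least one gradient")
    m = 0.0
    v = 0.0
    for g in grads:
        m = beta1 * m + (1 - beta1) * g      # first-moment moving average
        v = beta2 * v + (1 - beta2) * g * g  # second-moment moving average
    steps = len(grads)
    m_hat = m / (1 - beta1 ** steps)
    v_hat = v / (1 - beta2 ** steps)
    return m_hat / (math.sqrt(v_hat) + eps)


if __name__ == "__main__":
    # On the first step the direction is (almost exactly) the sign of the gradient.
    print(adam_direction([3.0]), adam_direction([-0.002]))
    # Rescaling every gradient by 100x leaves the direction essentially unchanged.
    print(adam_direction([0.5, -0.2, 0.1]), adam_direction([50.0, -20.0, 10.0]))
```

Since each weight's update has roughly unit magnitude regardless of width, summing $d_{\text{in}}$ correlated terms in $x\, \Delta W$ grows proportionally to $d_{\text{in}}$, and the learning rate has to absorb that factor.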