Goyalayus

Notes, essays, and fragments from the edge of understanding.

Ensuring A Stable Pre-Train

August 29, 2025

Original Substack post

There are two main reasons LLM training becomes unstable: overfitting, and gradients that explode or vanish.

Overfitting -> when the model starts memorizing the training data instead of learning patterns that generalize.

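Exploding Gradients -> backpropagation computes each gradient as a product of many factors (weights and activation derivatives), roughly one per layer. When many of those factors have magnitude > 1, the product grows exponentially with depth until it overflows to inf or NaN and the training crashes.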

Vanishing Gradients -> the mirror image of exploding gradients: when many of the factors have magnitude < 1, the product shrinks toward zero and the early layers stop learning.
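
To make the "product of many factors" idea concrete, here is a toy sketch in plain Python. It assumes every layer contributes the same scalar factor, which real networks do not do; the function name gradient_magnitude and the depth of 2000 are made up for illustration.

```python
def gradient_magnitude(factor: float, depth: int) -> float:
    """Multiply the same per-layer factor `depth` times, like a toy chain rule."""
    grad = 1.0
    for _ in range(depth):
        grad *= factor
    return grad


# Factors slightly above 1 overflow to infinity.
exploding = gradient_magnitude(1.5, 2000)

# Factors below 1 underflow to exactly zero.
vanishing = gradient_magnitude(0.5, 2000)

# Factors of exactly 1 keep the gradient where it started.
stable = gradient_magnitude(1.0, 2000)

# Once inf shows up, ordinary arithmetic starts producing NaN.
poisoned = exploding * vanishing

print(exploding, vanishing, stable, poisoned)
```

The point of the sketch is only the trend: the fixes below all try to keep the per-layer factors close to 1.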

Surprisingly, the solutions are simple. You already know them; you may just not have noticed what they are for.

Overfitting

Dropout Layer

  1. For a given layer's output (or input to the next layer), you apply a Dropout layer.

  2. During each forward pass of a training step, each neuron (or activation) in that layer's output has a probability p (e.g., p=0.1) of being randomly set to zero.

  3. The remaining non-zero neurons are scaled up by a factor of 1 / (1 - p). This is called "inverted dropout" and it ensures that the expected sum of the outputs remains the same, which keeps the learning dynamics stable.

Dropout prevents a neuron from becoming overly dependent on the presence of a few specific other neurons.
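
The three steps above can be sketched in a few lines of plain Python. This is a minimal illustration on a flat list of activations, not how a deep learning framework implements it; the helper name inverted_dropout and the explicit rng argument are assumptions made for this example.

```python
import random


def inverted_dropout(activations, p, rng):
    """Zero each activation with probability p and scale survivors by 1 / (1 - p)."""
    if not 0.0 <= p < 1.0:
        raise ValueError("p must be in [0, 1)")
    scale = 1.0 / (1.0 - p)
    return [0.0 if rng.random() < p else a * scale for a in activations]


rng = random.Random(0)           # fixed seed so the run is repeatable
activations = [1.0] * 100_000    # toy layer output: every neuron fires 1.0

dropped = inverted_dropout(activations, p=0.1, rng=rng)

# About 10% of the values are now zero, the rest are 1 / 0.9,
# so the average stays close to the original 1.0.
mean_after = sum(dropped) / len(dropped)
zero_fraction = dropped.count(0.0) / len(dropped)
print(mean_after, zero_fraction)
```

This is the training-time behavior only; at inference time dropout is switched off and the activations pass through unchanged.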

Weight Decay (L2 Regularization)

L_total = L_main + (λ/2) * Σ(w^2)

  • Σ(w^2) is the sum of the squares of all the individual weight parameters w in the entire model. This is the squared L2 norm of the weight vector.

  • λ (lambda) is a hyperparameter called the weight decay rate. It controls how strongly large weights are penalized.

By adding this term to the loss, we are telling the optimizer: "Minimize the prediction error, BUT ALSO try to keep the weights as small as possible."

Large weights are often a sign of overfitting. They mean the model is making very sharp, specific decisions based on small changes in the input.
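
Here is a small sketch of what the penalty does during an update. Differentiating (λ/2) * Σ(w^2) with respect to a weight w gives λ * w, so each step pulls every weight a little toward zero. The example assumes plain SGD, and the names l2_penalty and sgd_step_with_weight_decay are hypothetical.

```python
def l2_penalty(weights, lam):
    """The extra loss term: (lam / 2) * sum of squared weights."""
    return 0.5 * lam * sum(w * w for w in weights)


def sgd_step_with_weight_decay(weights, grads, lr, lam):
    """One SGD step on L_total; the penalty adds lam * w to each gradient."""
    return [w - lr * (g + lam * w) for w, g in zip(weights, grads)]


weights = [3.0, -2.0, 0.5]
zero_grads = [0.0, 0.0, 0.0]   # pretend L_main is flat to isolate the decay

penalty = l2_penalty(weights, lam=0.1)
new_weights = sgd_step_with_weight_decay(weights, zero_grads, lr=0.1, lam=0.1)

# With no gradient from L_main, every weight shrinks by the
# same factor (1 - lr * lam) = 0.99.
print(penalty, new_weights)
```

With a zero main gradient the weights only shrink, which is where the name "weight decay" comes from.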

Exploding/Vanishing Gradients

Layer Normalization

Normalization layers are inserted between other layers (classically after a linear layer and before an activation function; in most modern transformers, at the start of each attention and feed-forward block). Their sole job is to rescale their input tensor to have a standard, predictable distribution, usually with a mean of 0 and a variance of 1. This keeps the numbers flowing through the network in a "well-behaved" range.

Most normalization layers also introduce two learnable parameters, gamma (scale) and beta (shift), applied after normalizing. (RMSNorm usually keeps only gamma.) This allows the network to learn a different mean and variance if, for some reason, that is actually optimal for the next layer: y_out = gamma * y_normalized + beta
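
As a rough sketch of the normalize-then-rescale idea, here is LayerNorm over a single feature vector in plain Python. The function name layer_norm and the default eps=1e-5 are choices made for this example; real implementations work on batched tensors.

```python
import math


def layer_norm(x, gamma, beta, eps=1e-5):
    """Normalize x to mean 0 and variance 1, then apply gamma (scale) and beta (shift)."""
    n = len(x)
    mean = sum(x) / n
    var = sum((v - mean) ** 2 for v in x) / n
    inv_std = 1.0 / math.sqrt(var + eps)
    return [g * (v - mean) * inv_std + b for v, g, b in zip(x, gamma, beta)]


x = [100.0, 200.0, 300.0, 400.0]   # badly scaled activations

# gamma = 1 and beta = 0 leave the normalized values untouched.
y = layer_norm(x, gamma=[1.0] * 4, beta=[0.0] * 4)

y_mean = sum(y) / len(y)
y_var = sum((v - y_mean) ** 2 for v in y) / len(y)
print(y, y_mean, y_var)
```

Whatever the scale of the input, the output lands in the same small range, and gamma and beta give the network a way to move away from mean 0 and variance 1 if training prefers that.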

RMSNorm -> a simpler, cheaper variant of LayerNorm: it skips the mean subtraction and only rescales the input by its root mean square.

  • LayerNorm: y = (x - mean(x)) / sqrt(variance(x) + eps)

  • RMSNorm: y = x / sqrt(mean(x^2) + eps)
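
The RMSNorm formula above, written out as a plain Python sketch on one feature vector. The name rms_norm, the gamma argument, and eps=1e-5 are assumptions for this example rather than any particular library's API.

```python
import math


def rms_norm(x, gamma, eps=1e-5):
    """Divide x by its root mean square, then apply a learnable scale gamma."""
    mean_sq = sum(v * v for v in x) / len(x)
    inv_rms = 1.0 / math.sqrt(mean_sq + eps)
    return [g * v * inv_rms for v, g in zip(x, gamma)]


x = [100.0, 200.0, 300.0, 400.0]
y = rms_norm(x, gamma=[1.0] * 4)

# The mean of the squares is pulled to about 1, but the mean itself
# is not removed, because nothing is subtracted from x.
y_mean_sq = sum(v * v for v in y) / len(y)
y_mean = sum(y) / len(y)
print(y, y_mean_sq, y_mean)
```

Compared with LayerNorm there is one less statistic to compute per vector, which is the main practical appeal.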