There are three main reasons for unstable training of your LLM:
Over-fitting -> the model memorizes the training data instead of learning patterns that generalize.
Exploding Gradients -> by the chain rule, a gradient is a product of many factors (weights and activation derivatives). When many of those factors have magnitude > 1, the product grows exponentially with depth until it overflows to inf or NaN and the training crashes.
Vanishing Gradients -> the mirror image of Exploding Gradients: when many of those factors have magnitude < 1, the product shrinks toward zero and the early layers stop learning.
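A toy illustration of the two gradient problems, assuming every factor in the product is the same number (real networks are messier). The function name gradient_scale and the values 1.5, 0.5 and 300 are made up for this sketch:

```python
import numpy as np

def gradient_scale(factor, depth):
    """Multiply `depth` copies of `factor` together in float32."""
    g = np.float32(1.0)
    with np.errstate(over="ignore"):  # silence the overflow warning on purpose
        for _ in range(depth):
            g = g * np.float32(factor)
    return g

exploding = gradient_scale(1.5, 300)  # factors > 1: overflows float32 to inf
vanishing = gradient_scale(0.5, 300)  # factors < 1: underflows to 0.0
print(exploding, vanishing)

# Once a value is inf, ordinary arithmetic on it can produce NaN.
with np.errstate(invalid="ignore"):
    print(exploding - exploding)
```

Nothing here depends on the exact factor: anything consistently above or below 1 gives the same picture, only faster or slower.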
Surprisingly, the solutions are simple. You have probably seen them already without noticing which problem each one solves.
Over-fitting
Dropout Layer
For a given layer's output (or input to the next layer), you apply a Dropout layer.
During each forward pass of a training step, each neuron (or activation) in that layer's output has a probability p (e.g., p=0.1) of being randomly set to zero.
The neurons that are kept are scaled up by a factor of 1 / (1 - p). This is called "inverted dropout": it keeps the expected value of each output the same as without dropout, so no rescaling is needed at inference time, when dropout is switched off.
Dropout prevents a neuron from becoming overly dependent on the presence of a few specific other neurons.
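The dropout steps above can be sketched in NumPy as follows. This is a minimal sketch, not how any particular framework implements it; the function name inverted_dropout and its arguments are chosen for illustration:

```python
import numpy as np

def inverted_dropout(x, p, rng, training=True):
    """Zero each activation with probability p and rescale the survivors."""
    if not training or p == 0.0:
        return x  # dropout is a no-op at inference time
    keep_mask = rng.random(x.shape) >= p  # True with probability 1 - p
    return x * keep_mask / (1.0 - p)      # scale survivors by 1 / (1 - p)

rng = np.random.default_rng(0)
activations = np.ones((1000, 100))
dropped = inverted_dropout(activations, p=0.1, rng=rng)
print(dropped.mean())  # stays close to the original mean of 1.0
```

About 10% of the entries become zero and the rest become 1 / 0.9, so the average stays roughly where it was.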
Weight Decay (L2 Regularization)
L_total = L_main + (λ/2) * Σ(w^2)
Σ(w^2) is the sum of the squares of all the individual weight parameters w in the entire model. This is the squared L2 norm of the weight vector.
λ (lambda) is a hyperparameter called the weight decay rate; larger values penalize large weights more strongly.
By adding this term to the loss, we are telling the optimizer: "Minimize the prediction error, BUT ALSO try to keep the weights as small as possible."
Large weights are often a sign of overfitting: they mean the model is making very sharp, specific decisions based on small changes in the input.
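To make the formula concrete, here is a small NumPy sketch of the penalty and of one plain SGD step with it. The helper names (l2_penalty, total_loss, sgd_step_with_weight_decay) and the numbers are assumptions for illustration. The only fact used is that the gradient of (λ/2) * w^2 with respect to w is λ * w:

```python
import numpy as np

def l2_penalty(weights, lam):
    """(lam / 2) * sum of squared weights over all parameter arrays."""
    return 0.5 * lam * sum(np.sum(w ** 2) for w in weights)

def total_loss(main_loss, weights, lam):
    """L_total = L_main + (lam / 2) * sum(w^2)."""
    return main_loss + l2_penalty(weights, lam)

def sgd_step_with_weight_decay(w, grad_main, lr, lam):
    """One SGD step; the penalty adds lam * w to the gradient."""
    return w - lr * (grad_main + lam * w)

weights = [np.array([1.0, 2.0]), np.array([[3.0]])]
print(total_loss(2.0, weights, lam=0.1))

# With a zero main gradient, each step just shrinks the weights a little.
w = np.array([1.0, -2.0])
print(sgd_step_with_weight_decay(w, np.zeros(2), lr=0.1, lam=0.5))
```

The second print shows where the name "weight decay" comes from: with no other gradient signal, every step multiplies the weights by (1 - lr * λ).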
Exploding/Vanishing Gradients
Layer Normalization
Normalization layers are inserted between other layers (classically after a linear layer and before an activation function; in transformers, usually around each attention and feed-forward block). Their sole job is to rescale their input tensor to have a standard, predictable distribution, usually with a mean of 0 and a variance of 1. Layer Normalization computes these statistics over the features of each individual token. This keeps the numbers flowing through the network in a "well-behaved" range.
Most normalization layers also introduce two learnable parameters, gamma (scale) and beta (shift), applied after normalizing. This allows the network to learn a different mean and variance if that is actually better for the next layer: y_out = gamma * y_normalized + beta
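A minimal NumPy sketch of Layer Normalization with gamma and beta, assuming the last axis holds the features of one token. The function name layer_norm and the eps default are choices made for this example:

```python
import numpy as np

def layer_norm(x, gamma, beta, eps=1e-5):
    """Normalize each row over its last axis, then scale and shift."""
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    x_hat = (x - mean) / np.sqrt(var + eps)  # mean 0, variance ~1 per row
    return gamma * x_hat + beta              # learnable scale and shift

x = np.array([[1.0, 2.0, 3.0, 4.0],
              [10.0, 20.0, 30.0, 40.0]])
y = layer_norm(x, gamma=np.ones(4), beta=np.zeros(4))
print(y.mean(axis=-1), y.var(axis=-1))
```

Both rows come out with nearly the same normalized values even though one is ten times larger than the other, which is the "well-behaved range" effect described above.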
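RMSNorm -> a simpler, cheaper variant of LayerNorm: it skips the mean subtraction and only divides by the root mean square of the input, and it usually keeps just the gamma (scale) parameter.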
LayerNorm: y = (x - mean(x)) / sqrt(variance(x) + eps)
RMSNorm: y = x / sqrt(mean(x^2) + eps)
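The RMSNorm formula can be sketched the same way. The learnable gamma is an addition here (the formula above shows only the normalization step), and the name rms_norm and the eps default are assumptions for this example:

```python
import numpy as np

def rms_norm(x, gamma, eps=1e-6):
    """Divide each row by its root mean square; no mean subtraction."""
    rms = np.sqrt(np.mean(x ** 2, axis=-1, keepdims=True) + eps)
    return gamma * x / rms

x = np.array([[3.0, 4.0]])
y = rms_norm(x, gamma=np.ones(2))
print(y, np.mean(y ** 2, axis=-1))
```

The output has a mean square of about 1, but unlike with LayerNorm its mean is not forced to zero. That is the whole difference: one statistic to compute per row instead of two.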