Goyalayus

Have you ever noticed that in a deep learning codebase, weights are rarely initialized as plain $N(0, 1)$?

You usually see something like this instead:

W_{ij} \sim N (0, \frac{1}{n _{in}})

or sometimes:

W_{ij} \sim N (0, \frac{2}{n _{in}})

In code this often looks like drawing normal random numbers and dividing by the square root of the input dimension.

Why do we do this?

Because if we initialize every weight with variance $1$ , then as signals move forward through many layers, and as gradients move backward through those same layers, their scale can multiply again and again. After enough multiplications, the numbers either become huge or almost zero.

That is the exploding and vanishing gradients problem.

This post is the mathematics behind it.

The short version is:

Deep learning is repeated multiplication. Repeated multiplication is controlled by variance, derivatives, Jacobians, and singular values.

Once that sentence becomes obvious, Xavier initialization, He initialization, residual connections, normalization, and gradient clipping all start feeling like the same family of fixes.

Start With One Neuron

Consider a single neuron in one layer. Ignore the bias for now.

y = w_{1} x_{1} + w_{2} x_{2} + \dots + w_{n} x_{n}

or compactly:

y = \sum_{i=1}^{n} w_{i} x_{i}

Here $n$ is the number of input dimensions, usually called $n_{in}$ .

Let us make three simple assumptions:

The inputs are normalized:

E[x_{i}] = 0, \quad \mathrm{Var}(x_{i}) = 1

The weights are initialized independently:

E[w_{i}] = 0, \quad \mathrm{Var}(w_{i}) = σ_{w}^{2}

The $x_{i}$ and $w_{i}$ are independent of each other.

We want to know the variance of $y$ .

Since each term is $w_{i} x_{i}$ :

\mathrm{Var}(w_{i} x_{i}) = E[w_{i}^{2} x_{i}^{2}] - (E[w_{i} x_{i}])^{2}

Because the weights and inputs are independent and both have mean zero:

E [w_{i} x_{i}] = 0

and:

E [w_{i}^{2} x_{i}^{2}] = E [w_{i}^{2}] E [x_{i}^{2}]

Since $E[w_{i}^{2}] = \mathrm{Var}(w_{i}) = σ_{w}^{2}$ and $E[x_{i}^{2}] = \mathrm{Var}(x_{i}) = 1$:

\mathrm{Var}(w_{i} x_{i}) = σ_{w}^{2}

Now $y$ is the sum of $n$ such terms. They are uncorrelated, because the weights are independent with mean zero, so variances add:

\mathrm{Var}(y) = \sum_{i=1}^{n} \mathrm{Var}(w_{i} x_{i})

\mathrm{Var}(y) = n σ_{w}^{2}

This is the whole problem in one line.

If we initialize with $σ_{w}^{2} = 1$, then:

\mathrm{Var}(y) = n

So if the layer has 768 inputs, the output variance becomes about 768 after just one linear layer. That is already too large.

To keep the output variance around $1$, we want:

n σ_{w}^{2} = 1

so:

σ_{w}^{2} = \frac{1}{n}

and therefore:

σ_{w} = \frac{1}{\sqrt{n}}

This is the basic statistical reason behind dividing by $\sqrt{n_{in}}$: a standard normal draw scaled by $1 / \sqrt{n_{in}}$ has variance $1 / n_{in}$.
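A quick way to convince yourself of this is to simulate it. The sketch below is only an illustration using Python's standard library: the function name `neuron_output_variance`, the trial count, the seed, and the choice of 128 inputs are all assumptions made for the example, not something taken from a particular codebase.

```python
import math
import random
import statistics


def neuron_output_variance(n_in, sigma_w, trials=2000, seed=0):
    """Estimate Var(y) for y = sum_i w_i * x_i by Monte Carlo.

    Both x_i ~ N(0, 1) and w_i ~ N(0, sigma_w^2) are redrawn on every trial.
    """
    rng = random.Random(seed)
    outputs = []
    for _ in range(trials):
        y = sum(rng.gauss(0.0, sigma_w) * rng.gauss(0.0, 1.0) for _ in range(n_in))
        outputs.append(y)
    return statistics.pvariance(outputs)


n_in = 128
var_unscaled = neuron_output_variance(n_in, sigma_w=1.0)
var_scaled = neuron_output_variance(n_in, sigma_w=1.0 / math.sqrt(n_in))

print(f"sigma_w = 1         -> Var(y) ~ {var_unscaled:.1f} (theory: {n_in})")
print(f"sigma_w = 1/sqrt(n) -> Var(y) ~ {var_scaled:.3f} (theory: 1)")
```

The estimates are noisy, so expect them to land near the theoretical values rather than exactly on them.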

Why This Becomes Worse Across Layers

One layer is not the problem. Deep learning means stacking many layers.

Suppose a simple feed-forward network is:

h_{0} = x

z_{l} = W_{l} h_{l - 1}

h_{l} = ϕ (z_{l})

where $ϕ$ is an activation function like tanh, sigmoid, ReLU, or GELU.

If we temporarily ignore the activation, and each layer multiplies variance by some factor $c$, then after $L$ layers:

\mathrm{Var}(h_{L}) \approx c^{L} \, \mathrm{Var}(x)

If $c > 1$ , activations explode.

If $c < 1$ , activations vanish.

If $c \approx 1$ , the network has a chance of preserving signal scale.

This is why tiny errors in scale become massive in deep networks. A factor of $1.2$ seems harmless for one layer, but over 100 layers:

1.2^{100} \approx 8.3 \times 10^{7}

A factor of $0.8$ seems harmless too, but:

0.8^{100} \approx 2.0 \times 10^{-10}

So a deep network does not merely care whether a layer is mathematically valid. It cares whether repeated composition keeps signals in a usable numerical range.
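To see the compounding with actual matrices instead of a scalar $c$, here is a small sketch that pushes one random vector through a stack of purely linear layers. It is a toy under stated assumptions: the helper `final_mean_square`, the width of 64, the depth of 10, and the three gains are made up for the demonstration, and the activation is ignored as in the argument above.

```python
import math
import random


def final_mean_square(depth, width, gain, seed=0):
    """Push one random vector through `depth` linear layers (no activation).

    Weights are drawn as N(0, gain / width), so each layer multiplies the
    expected squared scale of the signal by roughly `gain` (the factor c).
    """
    rng = random.Random(seed)
    h = [rng.gauss(0.0, 1.0) for _ in range(width)]
    std = math.sqrt(gain / width)
    for _ in range(depth):
        h = [sum(rng.gauss(0.0, std) * x for x in h) for _ in range(width)]
    return sum(x * x for x in h) / width


depth, width = 10, 64
results = {c: final_mean_square(depth, width, gain=c) for c in (0.5, 1.0, 2.0)}
for c, ms in results.items():
    print(f"c = {c}: mean square after {depth} layers ~ {ms:.3g} (c^L = {c ** depth:.3g})")
```

Because every run reuses the same seed, the three networks share the same random draws and differ only in scale, which makes the $c^{L}$ factor easy to read off by comparing rows. The absolute numbers still fluctuate with the seed, since a single random network of finite width is only approximately variance-preserving.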

Backpropagation Is The Same Problem In Reverse

Now let us look at gradients.

Suppose the loss is $L$ (as a subscript, $L$ still denotes the last layer, as in $h_{L}$) and we want gradients with respect to earlier layers.

Backpropagation repeatedly applies the chain rule.

For a network:

h_{L} = f_{L} (f_{L - 1} (\dots f_{1} (x)))

we get:

\frac{\partial L}{\partial h_{l-1}} = \left(\frac{\partial h_{l}}{\partial h_{l-1}}\right)^{T} \frac{\partial L}{\partial h_{l}}

The object:

J_{l} = \frac{\partial h _{l}}{\partial h _{l - 1}}

is the Jacobian of layer $l$ .

So the gradient at an early layer is:

\frac{\partial L}{\partial h _{0}} = J_{1}^{T} J_{2}^{T} \dots J_{L}^{T} \frac{\partial L}{\partial h _{L}}

Again, repeated multiplication.

This time we are multiplying Jacobian matrices instead of scalar variances.

If this product grows in norm, gradients explode. If this product shrinks in norm, gradients vanish.

That is the mathematical heart of the issue.

The Jacobian Of A Layer

For one layer:

z_{l} = W_{l} h_{l - 1}

h_{l} = ϕ (z_{l})

The Jacobian is:

J_{l} = \frac{\partial h _{l}}{\partial h _{l - 1}}

Using the chain rule:

J_{l} = D_{l} W_{l}

where:

D_{l} = \mathrm{diag}(ϕ'(z_{l}))

So each layer's Jacobian has two parts:

$W_{l}$ , the weight matrix.
$D_{l}$ , the diagonal matrix of activation derivatives.

This is important because gradients can vanish or explode because of either part.

Bad weight scale can break training.

Bad activation derivatives can also break training.

Usually both interact.
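The factorization $J_{l} = D_{l} W_{l}$ is easy to check numerically. The following sketch assumes a tanh activation and a tiny 2-by-3 weight matrix with made-up values, and compares the analytic Jacobian against central finite differences. All function names here are hypothetical helpers written for this post.

```python
import math


def layer(W, h):
    """h_l = tanh(W h_{l-1}) for a small dense layer given as nested lists."""
    return [math.tanh(sum(w * x for w, x in zip(row, h))) for row in W]


def analytic_jacobian(W, h):
    """J = D W with D = diag(tanh'(z)) and tanh'(z) = 1 - tanh(z)^2."""
    z = [sum(w * x for w, x in zip(row, h)) for row in W]
    d = [1.0 - math.tanh(zi) ** 2 for zi in z]
    return [[d_i * w for w in row] for d_i, row in zip(d, W)]


def numeric_jacobian(W, h, eps=1e-6):
    """Central finite differences: J[i][j] ~ d h_l[i] / d h_{l-1}[j]."""
    cols = []
    for j in range(len(h)):
        plus = h[:j] + [h[j] + eps] + h[j + 1:]
        minus = h[:j] + [h[j] - eps] + h[j + 1:]
        fp, fm = layer(W, plus), layer(W, minus)
        cols.append([(a - b) / (2 * eps) for a, b in zip(fp, fm)])
    # cols[j][i] holds dh_i/dh_j, so transpose to get rows indexed by output i.
    return [list(row) for row in zip(*cols)]


W = [[0.5, -1.0, 0.3], [1.5, 0.2, -0.7]]  # 2 outputs, 3 inputs (made-up values)
h_prev = [0.1, -0.4, 0.8]

J_analytic = analytic_jacobian(W, h_prev)
J_numeric = numeric_jacobian(W, h_prev)
max_err = max(abs(a - b) for ra, rb in zip(J_analytic, J_numeric) for a, b in zip(ra, rb))
print("max |analytic - numeric| =", max_err)
```

Each row of $W_{l}$ is simply rescaled by the activation derivative of its own output unit, which is why a saturated unit silences an entire row of the Jacobian.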

Singular Values: The Real Scale Of A Matrix

For scalars, repeated multiplication is easy to understand. If you multiply by $2$ many times, you explode. If you multiply by $0.5$ many times, you vanish.

For matrices, the closest idea is singular values.

A matrix $A$ stretches some directions and shrinks others. Its largest singular value tells us the maximum stretch:

∥A x∥ \leq σ_{\max}(A) \, ∥x∥

If the Jacobians have singular values much larger than $1$ , gradients can explode.

If the Jacobians have singular values much smaller than $1$ , gradients vanish.

For many layers:

J_{1}^{T} J_{2}^{T} \dots J_{L}^{T} g

is controlled by the product of the singular values along the directions the gradient travels.

A rough bound is:

∥J_{1}^{T} J_{2}^{T} \dots J_{L}^{T} g∥ \leq \left(\prod_{l=1}^{L} σ_{\max}(J_{l})\right) ∥g∥

So even if every layer has maximum singular value only $1.1$, after 100 layers the upper bound contains:

1.1^{100} \approx 13{,}780

And if every layer has singular values around $0.9$:

0.9^{100} \approx 0.000027

This is why people often talk about keeping Jacobian singular values near $1$ . The ideal is sometimes called dynamical isometry: signals and gradients pass through depth without being crushed or inflated too much.
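Here is a minimal sketch of the bound in action. To stay dependency-free it handles only 2-by-2 matrices, where the largest singular value has a closed form through the eigenvalues of $A^{T} A$; in practice you would reach for a linear algebra library instead. The Jacobian `J` is a made-up example that is reused at every layer, which is an assumption of this toy and not how a real network behaves.

```python
import math


def sigma_max_2x2(A):
    """Largest singular value of a 2x2 matrix via the eigenvalues of A^T A."""
    (a, b), (c, d) = A
    p = a * a + c * c          # (A^T A)[0][0]
    q = a * b + c * d          # (A^T A)[0][1]
    r = b * b + d * d          # (A^T A)[1][1]
    largest_eig = (p + r) / 2 + math.sqrt(((p - r) / 2) ** 2 + q * q)
    return math.sqrt(largest_eig)


def matvec_T(A, g):
    """Compute A^T g, the operation backprop applies at each layer."""
    (a, b), (c, d) = A
    return [a * g[0] + c * g[1], b * g[0] + d * g[1]]


def norm(v):
    return math.hypot(*v)


J = [[1.0, 0.4], [-0.2, 0.9]]   # made-up Jacobian, reused at every layer
s = sigma_max_2x2(J)

g = [1.0, 0.0]
g0_norm = norm(g)
depth = 100
for _ in range(depth):
    g = matvec_T(J, g)

bound = (s ** depth) * g0_norm
print(f"sigma_max = {s:.4f}")
print(f"||gradient|| after {depth} layers = {norm(g):.4g}, bound = {bound:.4g}")
```

The product of largest singular values is only an upper bound. The actual gradient norm can be far smaller when the gradient does not keep lining up with the most-stretched direction, which is why the text above calls the bound rough.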

Eigenvalues: The RNN Version

Eigenvalues become especially intuitive in recurrent neural networks.

A simple RNN has hidden state:

h_{t} = ϕ (W_{h} h_{t - 1} + W_{x} x_{t} + b)

The same recurrent matrix $W_{h}$ is reused at every time step.

During backpropagation through time, the gradient from time $T$ back to time $t$ contains products like:

\frac{\partial h_{T}}{\partial h_{t}} = \prod_{k=t+1}^{T} \frac{\partial h_{k}}{\partial h_{k-1}}

For the RNN:

\frac{\partial h _{k}}{\partial h _{k - 1}} = D_{k} W_{h}

where:

D_{k} = \mathrm{diag}(ϕ'(W_{h} h_{k-1} + W_{x} x_{k} + b))

So:

\frac{\partial h _{T}}{\partial h _{t}} = (D_{T} W_{h}) (D_{T - 1} W_{h}) \dots (D_{t + 1} W_{h})

If we ignore the changing activation derivative for a moment, the core repeated object is:

W_{h}^{T - t}

Now eigenvalues tell the story.

If $W_{h}$ has an eigenvector $v$ with eigenvalue $λ$ , then:

W_{h} v = λ v

and after repeated multiplication:

W_{h}^{k} v = λ^{k} v

If $∣ λ ∣ > 1$ , that direction explodes.

If $∣ λ ∣ < 1$ , that direction vanishes.

If $∣ λ ∣ \approx 1$ , that direction can preserve information over time.

This is why vanilla RNNs struggle with long-range dependencies. To remember something from 500 time steps ago, the gradient has to survive hundreds of repeated multiplications.

That is a brutal requirement.
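The relation $W_{h}^{k} v = λ^{k} v$ can be watched directly. The sketch below uses a made-up symmetric 2-by-2 recurrent matrix whose eigenvectors are known by inspection, and it ignores the activation derivative exactly as the simplification above does.

```python
import math


def matvec(W, v):
    """Multiply a 2x2 matrix (nested lists) by a length-2 vector."""
    return [W[0][0] * v[0] + W[0][1] * v[1], W[1][0] * v[0] + W[1][1] * v[1]]


def norm_after(W, v, steps):
    """Return ||W^steps v|| by applying W repeatedly."""
    for _ in range(steps):
        v = matvec(W, v)
    return math.hypot(*v)


# Made-up symmetric recurrent matrix with eigenvalues 1.2 and 0.8:
#   W [1,  1] = 1.2 * [1,  1]
#   W [1, -1] = 0.8 * [1, -1]
W_h = [[1.0, 0.2], [0.2, 1.0]]
inv_sqrt2 = 1.0 / math.sqrt(2.0)
v_grow = [inv_sqrt2, inv_sqrt2]
v_shrink = [inv_sqrt2, -inv_sqrt2]

steps = 50
grow = norm_after(W_h, v_grow, steps)
shrink = norm_after(W_h, v_shrink, steps)
print(f"|lambda| = 1.2: norm after {steps} steps = {grow:.4g}")
print(f"|lambda| = 0.8: norm after {steps} steps = {shrink:.4g}")
```

A generic vector is a mix of both eigenvectors, so after many steps it is dominated by the growing direction while whatever it carried along the shrinking direction is effectively erased.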

Activation Functions Can Kill Gradients Too

Weights are only half the story. Activations matter because their derivatives appear inside every Jacobian.

Sigmoid

The sigmoid is:

σ (x) = \frac{1}{1 + e ^{- x}}

Its derivative is:

σ^{'} (x) = σ (x) (1 - σ (x))

The maximum value of this derivative is $0.25$ .

So every sigmoid layer contributes a derivative factor at most $0.25$ .

Across many layers, this is terrible:

0.25^{20} \approx 9.1 \times 10^{-13}

This is why old deep networks with sigmoid activations were difficult to train.

Also, sigmoid saturates. For large positive or negative inputs, the output becomes close to $1$ or $0$ , and the derivative becomes almost zero.

So once a unit saturates, learning through it becomes tiny.

Tanh

Tanh is better than sigmoid because it is zero-centered:

\tanh(x) \in (-1, 1)

Its derivative is:

\frac{d}{dx} \tanh(x) = 1 - \tanh^{2}(x)

The maximum derivative is $1$ , but for large $∣ x ∣$ , tanh also saturates and the derivative becomes close to zero.

So tanh can still vanish if activations drift into saturation.

ReLU

ReLU is:

ϕ(x) = \max(0, x)

Its derivative is:

ϕ'(x) = \begin{cases} 1, & x > 0 \\ 0, & x < 0 \end{cases}

ReLU helps because positive units pass gradient with derivative $1$ .

But roughly half the units may be inactive at initialization if pre-activations are symmetric around zero. Those inactive units have derivative $0$ .

So ReLU does not shrink every active path, but it drops many paths entirely.

This is why ReLU networks usually use He initialization instead of Xavier.
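The three derivatives are short enough to tabulate side by side. This is a plain-Python sketch written for this post; the sample points are arbitrary, and the value returned by `d_relu` at exactly zero is a convention, since the derivative is undefined there.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def d_sigmoid(x):
    s = sigmoid(x)
    return s * (1.0 - s)


def d_tanh(x):
    return 1.0 - math.tanh(x) ** 2


def d_relu(x):
    # The derivative at exactly 0 is undefined; returning 0 there is a convention.
    return 1.0 if x > 0 else 0.0


for x in (-6.0, -2.0, 0.0, 2.0, 6.0):
    print(f"x = {x:+.1f}  sigmoid' = {d_sigmoid(x):.4f}  "
          f"tanh' = {d_tanh(x):.4f}  relu' = {d_relu(x):.0f}")

# Best case for a chain of 20 sigmoid layers: every unit sits at its peak slope.
best_case_sigmoid_chain = d_sigmoid(0.0) ** 20
print(f"0.25^20 = {best_case_sigmoid_chain:.2e}")
```

Sigmoid and tanh both collapse toward zero slope away from the origin, while ReLU is all or nothing: the slope is either exactly one or exactly zero.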

Xavier Initialization

Xavier initialization tries to keep variance stable through layers.

For activations that are roughly symmetric and not half-zeroed like ReLU, a common choice is:

\mathrm{Var}(W_{ij}) = \frac{1}{n_{in}}

or, balancing both forward and backward flow:

\mathrm{Var}(W_{ij}) = \frac{2}{n_{in} + n_{out}}

The second form is often called Glorot or Xavier initialization.

Why include $n_{out}$?

Forward propagation cares about $n_{in}$ because each output sums over input dimensions.

Backward propagation cares about $n_{out}$ because each input receives gradient contributions from output dimensions.

So Xavier balances both directions:

\mathrm{Var}(h_{l}) \approx \mathrm{Var}(h_{l-1})

and:

\mathrm{Var}\left(\frac{\partial L}{\partial h_{l-1}}\right) \approx \mathrm{Var}\left(\frac{\partial L}{\partial h_{l}}\right)

It is not magic. It is variance bookkeeping.
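The bookkeeping can be made concrete for a non-square layer. The sketch below is an illustration only: the layer shape of 256 inputs and 1024 outputs is hypothetical, the helper names are made up, and the activation is ignored so that the per-layer factors are just $n_{in} \mathrm{Var}(W)$ forward and $n_{out} \mathrm{Var}(W)$ backward.

```python
import math


def glorot_variance(n_in, n_out):
    """Var(W_ij) = 2 / (n_in + n_out), the Glorot/Xavier compromise."""
    return 2.0 / (n_in + n_out)


def variance_gains(n_in, n_out, weight_var):
    """Per-layer variance multipliers for a linear layer (activation ignored).

    Forward:  each output sums n_in terms  -> factor n_in * Var(W).
    Backward: each input sums n_out terms  -> factor n_out * Var(W).
    """
    return n_in * weight_var, n_out * weight_var


n_in, n_out = 256, 1024  # hypothetical layer shape
for name, var in [("1/n_in", 1.0 / n_in),
                  ("1/n_out", 1.0 / n_out),
                  ("Glorot", glorot_variance(n_in, n_out))]:
    fwd, bwd = variance_gains(n_in, n_out, var)
    print(f"{name:8s} std = {math.sqrt(var):.4f}  forward x{fwd:.2f}  backward x{bwd:.2f}")
```

When $n_{in} \neq n_{out}$, no single variance makes both factors equal to one. The Glorot choice splits the difference, and for a square layer it reduces to $1 / n_{in}$.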

He Initialization

For ReLU, about half the units are zeroed out at initialization.

So if we used:

\mathrm{Var}(W_{ij}) = \frac{1}{n_{in}}

then the variance after ReLU would roughly be cut in half.

To compensate, He initialization uses:

\mathrm{Var}(W_{ij}) = \frac{2}{n_{in}}

or:

W_{ij} \sim N (0, \frac{2}{n _{in}})

This keeps the post-ReLU signal variance closer to stable.

In code, this is why you often see:

\mathrm{std} = \sqrt{\frac{2}{n_{in}}}

for ReLU-style networks.
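The factor of two can be checked with a small Monte Carlo experiment. This is a sketch under stated assumptions: the input is fixed to entries of $\pm 1$ so its mean square is exactly one, the quantity tracked is the mean square $E[\mathrm{relu}(z)^{2}]$ (the thing that feeds the next layer's pre-activation variance), and the function name, neuron count, and seed are arbitrary choices.

```python
import math
import random


def post_relu_mean_square(n_in, weight_var, neurons=4000, seed=0):
    """Monte Carlo estimate of E[relu(z)^2] for z = sum_i w_i x_i.

    The input is fixed to x_i = +/-1 so that its mean square is exactly 1;
    each simulated neuron gets fresh weights w_i ~ N(0, weight_var).
    """
    rng = random.Random(seed)
    x = [rng.choice((-1.0, 1.0)) for _ in range(n_in)]
    std = math.sqrt(weight_var)
    total = 0.0
    for _ in range(neurons):
        z = sum(rng.gauss(0.0, std) * xi for xi in x)
        total += max(0.0, z) ** 2
    return total / neurons


n_in = 100
ms_xavier_like = post_relu_mean_square(n_in, weight_var=1.0 / n_in)
ms_he = post_relu_mean_square(n_in, weight_var=2.0 / n_in)
print(f"Var(W) = 1/n_in -> E[relu(z)^2] ~ {ms_xavier_like:.3f} (theory: 0.5)")
print(f"Var(W) = 2/n_in -> E[relu(z)^2] ~ {ms_he:.3f} (theory: 1.0)")
```

With variance $1 / n_{in}$ the signal loses about half its mean square at every ReLU layer, which compounds to $0.5^{L}$ over depth. Doubling the weight variance cancels that loss.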

For transformers using GELU, initialization is more architecture-specific, but the same principle remains: keep signal and gradient scales from drifting too fast with depth.

A Tiny Numerical Example

Suppose every layer accidentally multiplies gradient norm by $0.7$ .

After 10 layers:

0.7^{10} \approx 0.028

After 50 layers:

0.7^{50} \approx 1.8 \times 10^{-8}

After 100 layers:

0.7^{100} \approx 3.2 \times 10^{-16}

That gradient is basically gone.

Now suppose every layer multiplies gradient norm by $1.3$ .

After 10 layers:

1.3^{10} \approx 13.8

After 50 layers:

1.3^{50} \approx 497{,}929

After 100 layers:

1.3^{100} \approx 2.48 \times 10^{11}

That gradient is unusably large.

So the problem is not that gradients are mysterious. The problem is that exponentials are unforgiving.
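The same arithmetic takes a few lines of code, and it is also easy to turn the question around and ask how many layers it takes to cross a given threshold. The helper names below are made up for this example, and the thresholds of $10^{-6}$ and $10^{6}$ are arbitrary.

```python
import math


def compounded_scale(per_layer_factor, depth):
    """Gradient-norm scale after `depth` layers that each multiply it by the same factor."""
    return per_layer_factor ** depth


def layers_until(per_layer_factor, target_scale):
    """Smallest depth at which per_layer_factor ** depth has reached target_scale.

    Assumes the factor and the target sit on the same side of 1
    (both below 1 for vanishing, both above 1 for exploding).
    """
    return math.ceil(math.log(target_scale) / math.log(per_layer_factor))


for factor in (0.7, 1.3):
    for depth in (10, 50, 100):
        print(f"{factor}^{depth} = {compounded_scale(factor, depth):.3g}")

print("layers until 0.7^L < 1e-6:", layers_until(0.7, 1e-6))
print("layers until 1.3^L > 1e6: ", layers_until(1.3, 1e6))
```

The depth needed to reach a fixed threshold grows only logarithmically with the threshold, so even a factor that looks close to one runs out of numerical room in a few dozen layers.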

Why RNNs Were Especially Painful

Feed-forward networks multiply across depth.

RNNs multiply across time.

A sequence length of 1,000 means the model may need useful gradients across 1,000 recurrent steps.

For a vanilla RNN, the gradient from a late output to an early hidden state contains:

\prod_{k=t+1}^{T} D_{k} W_{h}

Even if the recurrent matrix is well-scaled at initialization, training changes it. And even if $W_{h}$ is fine, the activation derivatives can shrink gradients.

This is one reason LSTMs and GRUs were such a big deal. They create gated paths where information can flow additively, not only through repeated nonlinear matrix multiplication.

A simplified LSTM cell state update is:

c_{t} = f_{t} ⊙ c_{t - 1} + i_{t} ⊙ \tilde{c}_{t}

The old memory $c_{t-1}$ is scaled elementwise by a forget gate $f_{t}$ and then carried forward by addition, with no weight matrix or squashing nonlinearity in between.

Treating the gates as constants, the derivative along this path is:

\frac{\partial c_{t}}{\partial c_{t-1}} = \mathrm{diag}(f_{t})

If $f_{t}$ is near $1$, the gradient can pass through many time steps more easily.

This does not make LSTMs perfect, but it explains why they helped long before transformers became dominant.
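A scalar toy version makes the cell-state path tangible. The sketch below treats the gates as fixed numbers, which is a simplification: in a real LSTM they depend on the hidden state and the input, and that adds further gradient paths. The gate value of 0.99 and the write signal of 0.05 are made up.

```python
import math


def cell_state_after(c0, forget_gates, inputs):
    """Run c_t = f_t * c_{t-1} + u_t for scalar gates, where u_t stands for i_t * c~_t.

    The gates are treated as fixed numbers here; in a real LSTM they depend on
    the hidden state and input, which adds further gradient paths.
    """
    c = c0
    for f, u in zip(forget_gates, inputs):
        c = f * c + u
    return c


steps = 100
forget_gates = [0.99] * steps       # made-up: gate almost fully open
inputs = [0.05] * steps             # made-up write signal i_t * c~_t

# Analytic gradient along the cell-state path: product of the forget gates.
analytic = math.prod(forget_gates)

# Finite-difference check of d c_T / d c_0.
eps = 1e-4
numeric = (cell_state_after(1.0 + eps, forget_gates, inputs)
           - cell_state_after(1.0 - eps, forget_gates, inputs)) / (2 * eps)

print(f"d c_T / d c_0 with f = 0.99: {analytic:.4f} (finite difference: {numeric:.4f})")
print(f"same depth with a per-step factor of 0.5: {0.5 ** steps:.3g}")
```

The gradient is still a product of many factors, so it still decays. The difference is that the network can learn to hold those factors close to one for the memories it wants to keep.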

Residual Connections

Residual connections attack the same problem from another angle.

Instead of making a layer learn:

h_{l + 1} = F (h_{l})

a residual block does:

h_{l + 1} = h_{l} + F (h_{l})

Now the Jacobian is:

\frac{\partial h _{l + 1}}{\partial h _{l}} = I + \frac{\partial F}{\partial h _{l}}

The identity matrix $I$ gives gradients a direct route backward.

So even if $\frac{\partial F}{\partial h _{l}}$ is small, the gradient is not forced to pass only through $F$ .

This is a major reason very deep ResNets and transformers train at all.

Residual connections do not remove the need for good initialization or normalization, but they make the optimization landscape much more forgiving.
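To see the effect of the identity term, compare the end-to-end derivative of a plain stack with that of a residual stack. The following is a scalar toy, not a real network: the branch $F(h) = a \tanh(h)$, the scale $a = 0.1$, and the depth of 30 are assumptions picked to make the branch deliberately weak.

```python
import math


def input_gradient(h0, depth, scale, residual):
    """Return d h_depth / d h_0 for a scalar toy network via the chain rule.

    Branch: F(h) = scale * tanh(h), so F'(h) = scale * (1 - tanh(h)^2).
    Plain layer:    h <- F(h)        with local derivative F'(h).
    Residual layer: h <- h + F(h)    with local derivative 1 + F'(h).
    """
    h, grad = h0, 1.0
    for _ in range(depth):
        f_prime = scale * (1.0 - math.tanh(h) ** 2)
        if residual:
            grad *= 1.0 + f_prime
            h = h + scale * math.tanh(h)
        else:
            grad *= f_prime
            h = scale * math.tanh(h)
    return grad


depth, scale = 30, 0.1   # made-up values: a deliberately weak branch
plain = input_gradient(0.5, depth, scale, residual=False)
skip = input_gradient(0.5, depth, scale, residual=True)
print(f"plain stack:    d h_L / d h_0 = {plain:.3g}")
print(f"residual stack: d h_L / d h_0 = {skip:.3g}")
```

The flip side is visible in the same formula: a product of factors $1 + F'$ can grow if the branches are not small, which is one reason residual networks still lean on careful initialization and normalization.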

Normalization

Normalization methods also fight scale drift.

BatchNorm normalizes using batch statistics. LayerNorm normalizes within each example across features.

A simplified LayerNorm is:

μ = \frac{1}{d} \sum_{i=1}^{d} x_{i}

σ^{2} = \frac{1}{d} \sum_{i=1}^{d} (x_{i} - μ)^{2}

\mathrm{LayerNorm}(x)_{i} = γ_{i} \frac{x_{i} - μ}{\sqrt{σ^{2} + ϵ}} + β_{i}

This keeps activations in a controlled range before the next transformation.
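Written out in plain Python, the three formulas look like this. It is a minimal sketch for a single feature vector, with `gamma` and `beta` defaulting to ones and zeros and `eps` set to an assumed $10^{-5}$. Deep learning frameworks ship their own batched, differentiable implementations.

```python
import math


def layer_norm(x, gamma=None, beta=None, eps=1e-5):
    """LayerNorm over one feature vector x (a list of floats).

    gamma and beta default to ones and zeros, i.e. a freshly initialized layer.
    """
    d = len(x)
    gamma = gamma if gamma is not None else [1.0] * d
    beta = beta if beta is not None else [0.0] * d
    mu = sum(x) / d
    var = sum((xi - mu) ** 2 for xi in x) / d          # biased variance, as in the formula
    inv_std = 1.0 / math.sqrt(var + eps)
    return [g * (xi - mu) * inv_std + b for xi, g, b in zip(x, gamma, beta)]


small = [1.0, -2.0, 3.0, 0.0]
large = [1000.0, -2000.0, 3000.0, 0.0]   # same pattern at 1000x the scale
out_small = layer_norm(small)
out_large = layer_norm(large)
print(out_small)
print(out_large)
```

Both inputs map to almost the same output, which is the point: whatever scale the previous block produced, the next transformation sees roughly zero mean and unit variance.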

In transformers, LayerNorm is one of the quiet heroes. Without it, residual streams and attention/MLP blocks can drift in scale as depth increases.

There are two common layouts:

Post-norm:

h_{l+1} = \mathrm{LayerNorm}(h_{l} + F(h_{l}))

Pre-norm:

h_{l+1} = h_{l} + F(\mathrm{LayerNorm}(h_{l}))

Pre-norm transformers are usually easier to train at large depth because the residual path remains more direct.

Again, this is gradient-flow engineering.

Gradient Clipping

Gradient clipping is mostly a fix for exploding gradients.

If the gradient vector is $g$ and its norm is too large, we rescale it:

g \leftarrow g \cdot \frac{τ}{∥ g ∥}

when:

∥ g ∥ > τ

Here $τ$ is the clipping threshold.

So if the gradient norm becomes $1000$ and the threshold is $1$ , clipping rescales the whole gradient to norm $1$ .

This does not solve vanishing gradients. It also does not fix the underlying reason the gradient exploded. But it prevents a single bad update from destroying the model parameters.

RNN training historically used gradient clipping a lot because exploding gradients were common during backpropagation through time.

Modern large-model training also often clips gradients because occasional spikes are normal at scale.
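Norm clipping itself is only a few lines. The sketch below works on a flat list of floats and uses a hypothetical helper name, `clip_by_global_norm`. Training frameworks provide their own versions that operate on all parameter gradients at once.

```python
import math


def clip_by_global_norm(grad, max_norm):
    """Rescale `grad` (a flat list of floats) so its L2 norm is at most `max_norm`.

    Returns the (possibly rescaled) gradient and the original norm.
    Gradients already inside the threshold are returned unchanged.
    """
    norm = math.sqrt(sum(g * g for g in grad))
    if norm <= max_norm:
        return list(grad), norm
    scale = max_norm / norm
    return [g * scale for g in grad], norm


grad = [600.0, -800.0]                 # norm 1000, matching the example in the text
clipped, original_norm = clip_by_global_norm(grad, max_norm=1.0)
print(original_norm, clipped)          # direction is preserved, only the length changes
```

Returning the original norm alongside the clipped gradient is a deliberate choice here: logging it is a cheap way to notice when spikes start happening more often.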

A Concrete Mental Model

Imagine a signal moving through a deep network.

At each layer, three things happen:

The weight matrix stretches or shrinks the vector.
The activation derivative keeps, shrinks, or kills gradient paths.
The next layer repeats the process.

Forward pass:

h_{0} \to h_{1} \to h_{2} \to \dots \to h_{L}

Backward pass:

\frac{\partial L}{\partial h _{L}} \to \frac{\partial L}{\partial h _{L - 1}} \to \dots \to \frac{\partial L}{\partial h _{0}}

The forward pass can lose information if activations collapse or explode.

The backward pass can lose learning signal if Jacobian products shrink or explode.

Training works best when both stay in a reasonable range.

That is why initialization is not a small implementation detail. It determines the starting geometry of learning.

Why Plain Normal(0, 1) Is Bad

Let us return to the original question.

If:

W_{ij} \sim N (0, 1)

and a layer has $n_{in}$ inputs, then:

\mathrm{Var}(y) = n_{in}

For $n_{in} = 1024$:

\mathrm{Var}(y) = 1024

so the standard deviation is:

\sqrt{1024} = 32

After one layer, activations are already much larger than the normalized input.

If you use tanh or sigmoid, this pushes units into saturation, where derivatives are near zero. So the forward activations become uninformative and the backward gradients vanish.

If you use ReLU, large activations can propagate and make later layers unstable. Gradients can become huge because matrix products keep stretching them.

So $N(0, 1)$ is usually wrong not because randomness is bad, but because the randomness has the wrong scale for the number of summed inputs.

The Main Fixes, In One Place

Initialization controls the starting scale:

\mathrm{Var}(W_{ij}) \approx \frac{1}{n_{in}}

or for ReLU:

\mathrm{Var}(W_{ij}) \approx \frac{2}{n_{in}}

Activation choice controls derivative behavior. Sigmoid and tanh can saturate. ReLU and GELU usually preserve gradients better, but still need good scale.

Normalization keeps activations from drifting too far during training.

Residual connections create direct gradient paths:

\frac{\partial}{\partial h _{l}} (h_{l} + F (h_{l})) = I + F^{'} (h_{l})

Gradient clipping prevents occasional explosions from causing catastrophic updates.

Gated architectures like LSTMs create paths where memory and gradients can survive over time.

All of these are different ways of fighting the same mathematical enemy: bad products of many Jacobians.

Final Intuition

Exploding and vanishing gradients are not random bugs in neural networks.

They are what you should expect when you repeatedly multiply by matrices and activation derivatives.

In a deep feed-forward network, the multiplication happens across layers.

In an RNN, it happens across time.

The gradient is basically carried by a product like:

J_{1}^{T} J_{2}^{T} \dots J_{L}^{T}

or for an RNN:

(D_{T} W_{h}) (D_{T - 1} W_{h}) \dots (D_{t + 1} W_{h})

If the product's effective singular values are mostly above $1$ , gradients explode.

If they are mostly below $1$ , gradients vanish.

If they stay near $1$ , learning can flow.

That is why we divide by $\sqrt{n}$.

That is why Xavier and He initialization exist.

That is why residual connections, normalization, gates, and clipping are everywhere.

Deep learning works when the model is expressive enough to learn complicated functions, but numerically disciplined enough that the learning signal can actually travel through it.

Mathematics behind Exploding and Vanishing Gradients
