Goyalayus

Original Substack post

i always struggled to remember the formulas for the gradient of a particular layer in the transformer. recently, while studying distributed training, I realized that I cannot move forward without a crystal clear understanding of how gradients flow through the network

so here is a blog which will teach you all you need to be a wizard of gradients

Rule 1

suppose we want to find the gradient dL/dW, where L is the loss and W is a weight matrix. the rule says that the shape of dL/dW is always equal to the shape of W. memorize this.

x→[YOUR LAYER (f)]→y→[…Rest of Network…]→L

clearly x is the input here, f(x) is the layer about which we care, wow it rhymes

y is the output of the layer, y = f(x) and L is the final loss.

what we are interested in is dL/dX and dL/dF. here F stands for whatever learnable parameters the layer has: it can be a weight matrix or the parameters of some other function

Memorize

dL/dX = dL/dY * dY/dX
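here is a tiny sanity check of that chain rule with plain numbers. the two functions (f(x) = x^2 and a made-up "rest of network" that just multiplies by 3) are my own picks for illustration, not anything specific to transformers. the sketch compares the chain-rule answer with a finite-difference estimate of the whole pipeline.

```python
# toy setup (assumed for illustration): y = f(x) = x**2, L = 3 * y
def f(x):
    return x ** 2

def rest_of_network(y):
    return 3.0 * y

x = 2.0
y = f(x)

dL_dy = 3.0            # d(3*y)/dy
dy_dx = 2.0 * x        # d(x**2)/dx
dL_dx = dL_dy * dy_dx  # chain rule: dL/dX = dL/dY * dY/dX

# compare against a central finite difference of the whole pipeline
eps = 1e-6
numeric = (rest_of_network(f(x + eps)) - rest_of_network(f(x - eps))) / (2 * eps)

print(dL_dx, numeric)  # the two numbers should agree closely
```

the finite-difference trick is handy whenever you are unsure about a hand-derived gradient: nudge the input, watch the loss, compare.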

generally f comes in two types: matrix multiplications (weight matrices) and element-wise operations. we are going to look at element-wise operations first

an “element-wise” operation means the math happens to each number in the matrix independently. none of the numbers “talk” to each other.

Examples:

Y = X + Z (Matrix Addition)

Y = ReLU(X) (Activation)

Y = X^2 (Square every element)

Rule for Element Wise Operations (memorize this)

dL/dX = dL/dY ⊙ dY/dX

you might be wondering what the circle with a dot inside it (⊙) means

that is the hadamard product (element-wise multiplication). it means: multiply the top-left of a with the top-left of b, top-right with top-right, etc. no fancy row-column dot products here. just simple multiplication.
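a small sketch of the element-wise rule for Y = X^2, where the local derivative is 2X: the gradient wrt X is the upstream gradient dL/dY multiplied element by element with that local derivative. the values in dL_dY are made up, and the hadamard helper is a hypothetical name written with plain Python lists so the example runs without any libraries.

```python
def hadamard(A, B):
    """element-wise product of two same-shaped nested lists"""
    return [[a * b for a, b in zip(ra, rb)] for ra, rb in zip(A, B)]

X = [[1.0, -2.0],
     [3.0, 0.5]]

# upstream gradient dL/dY (made-up values, same shape as Y)
dL_dY = [[1.0, 0.5],
         [2.0, -1.0]]

# local derivative of Y = X^2 is 2X, computed per element
dY_dX = [[2.0 * x for x in row] for row in X]

dL_dX = hadamard(dL_dY, dY_dX)
print(dL_dX)  # [[2.0, -2.0], [12.0, -1.0]]
```

notice that dL_dX has the same shape as X, which is Rule 1 again.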

example 1

Y = ReLU(x) (rule: if x>0, keep it. if x≤0, set to 0)

so dY/dX = 1 if x>0, or 0 if x≤0

what does this mean philosophically? if x>0, pass my gradients (y speaking) through as they are. if not, stop and make the gradients 0 (do not change the weights, they did not contribute to the loss)
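in code, the ReLU backward pass is just a mask. this is a minimal sketch with plain Python lists; relu_backward is a hypothetical helper name and the upstream gradient values are made up. it uses the usual convention of treating the derivative at exactly x = 0 as 0.

```python
def relu_backward(X, dL_dY):
    """pass the upstream gradient where x > 0, zero it everywhere else"""
    return [[g if x > 0 else 0.0 for x, g in zip(rx, rg)]
            for rx, rg in zip(X, dL_dY)]

X = [[1.5, -0.3],
     [0.0, 2.0]]

# made-up upstream gradient dL/dY
dL_dY = [[0.1, 0.2],
         [0.3, 0.4]]

dL_dX = relu_backward(X, dL_dY)
print(dL_dX)  # [[0.1, 0.0], [0.0, 0.4]]
```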

example 2

Residual Connection Y = X + Z

dY/dX = dY/dZ = 1

so the gradients flow unchanged from Y to both X and Z
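a sketch of what that looks like in a backward pass: both branches of the addition simply receive their own copy of the upstream gradient. add_backward is a hypothetical helper name and the numbers are made up.

```python
def add_backward(dL_dY):
    # dY/dX = dY/dZ = 1, so each input gets a copy of the upstream gradient
    dL_dX = [row[:] for row in dL_dY]
    dL_dZ = [row[:] for row in dL_dY]
    return dL_dX, dL_dZ

# made-up upstream gradient dL/dY
dL_dY = [[0.5, -1.0],
         [2.0, 0.0]]

dL_dX, dL_dZ = add_backward(dL_dY)
print(dL_dX == dL_dY, dL_dZ == dL_dY)  # True True
```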

note: we talked about ReLU above and only calculated dY/dX, not dY/dW, because ReLU does not have any parameters to tune. if we were using GeGLU we would also have calculated dY/dW, because we want its parameters to learn too.

now that we have covered activation functions, let's move on to matrix multiplications

Y = XW

we want to learn two things dL/dX and dL/dW

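so here are the formulas, please memorize

dL/dX = dL/dY · W^T

dL/dW = X^T · dL/dY

(Rule 1 is a good way to remember where the transposes go: these are the only arrangements whose shapes match X and W)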
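here is a sketch that checks the two matmul gradients against a finite-difference estimate, and also checks Rule 1 on the shapes. everything concrete here is an assumption for illustration: the sizes, the numbers, the helper names, and the toy loss L = sum(G ⊙ Y), chosen only because it makes dL/dY equal to G. plain Python lists are used so it runs with no libraries.

```python
def matmul(A, B):
    """plain-Python matrix product of nested lists"""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def transpose(A):
    return [list(col) for col in zip(*A)]

def shape(A):
    return (len(A), len(A[0]))

# made-up sizes: X is (2, 3), W is (3, 4), so Y = XW is (2, 4)
X = [[1.0, 2.0, -1.0],
     [0.5, 0.0, 3.0]]
W = [[0.1, -0.2, 0.3, 0.0],
     [1.0, 0.5, -1.0, 2.0],
     [0.0, 0.4, 0.2, -0.3]]

# toy loss (an assumption for this sketch): L = sum(G * Y) element-wise,
# which makes the upstream gradient dL/dY exactly G
G = [[1.0, 0.0, -1.0, 2.0],
     [0.5, 1.0, 0.0, -2.0]]

def loss(X, W):
    Y = matmul(X, W)
    return sum(g * y for g_row, y_row in zip(G, Y) for g, y in zip(g_row, y_row))

dL_dY = G
dL_dX = matmul(dL_dY, transpose(W))  # dL/dX = dL/dY . W^T : (2, 4) @ (4, 3) -> (2, 3)
dL_dW = matmul(transpose(X), dL_dY)  # dL/dW = X^T . dL/dY : (3, 2) @ (2, 4) -> (3, 4)

# Rule 1: a gradient has the same shape as the thing it is taken with respect to
assert shape(dL_dX) == shape(X)
assert shape(dL_dW) == shape(W)

# central finite difference on a single weight entry W[1][2]
eps = 1e-6
W_plus = [row[:] for row in W]
W_minus = [row[:] for row in W]
W_plus[1][2] += eps
W_minus[1][2] -= eps
numeric = (loss(X, W_plus) - loss(X, W_minus)) / (2 * eps)

print(dL_dW[1][2], numeric)  # both should be close to -2.0
```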

you are now all set for calculating any gradient in transformers

but let's cover one special hard case: the gradient of the loss with respect to the logits

Softmax

z = logits (shape: b, t, v where v is vocab size).

p = softmax(z) (probabilities).

L = -log(p_correct_token) (the loss).

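deriving the softmax gradient is messy (lots of jacobian matrices), but the final result is elegantly simple

dL/dZ = P - Y, where Z and P are the logits and probabilities from above and Y is the one-hot vector of the correct token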
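if you do not want to take the P - Y result on faith, here is a sketch that checks it numerically for a single position. the vocab size of 4, the logit values, and the target index are all made up, and the helper names are hypothetical. the real logits have shape (b, t, v), and this only looks at one (b, t) slot.

```python
import math

def softmax(z):
    m = max(z)  # subtract the max for numerical stability
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(z, target):
    return -math.log(softmax(z)[target])

# one position with a made-up vocab of 4
z = [2.0, -1.0, 0.5, 0.1]
target = 2

p = softmax(z)
y = [1.0 if k == target else 0.0 for k in range(len(z))]
dL_dz = [pk - yk for pk, yk in zip(p, y)]  # dL/dZ = P - Y

# central finite difference on each logit
eps = 1e-6
numeric = []
for k in range(len(z)):
    z_plus = z[:]
    z_minus = z[:]
    z_plus[k] += eps
    z_minus[k] -= eps
    numeric.append((cross_entropy(z_plus, target) - cross_entropy(z_minus, target)) / (2 * eps))

for analytic, estimate in zip(dL_dz, numeric):
    print(round(analytic, 6), round(estimate, 6))  # each pair should match
```

two things worth noticing: the entries of dL_dz sum to zero, and only the correct token gets a negative gradient (its logit gets pushed up, every other logit gets pushed down). if the loss is averaged over all b*t positions, each position's gradient is additionally divided by that count.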

and from here it is all matrix multiplications and activation functions which we have already covered. so no worries.

COMPUTING GRADIENTS FOR THE SAKE OF IT