Goyalayus

Block Matrix Multiplication

Block matrix multiplication is not a new kind of multiplication. It is ordinary matrix multiplication with a better schedule.

That distinction matters. The mathematical object does not change. If

C = A B

then every entry of $C$ is still the dot product of one row of $A$ with one column of $B$ :

C_{ij} = k = 1 \sum n A_{ik} B_{k j}

The blocked version computes the same sums, in the same field, with the same final answer. What changes is the order in which we visit memory.

For small matrices, this sounds like a bookkeeping detail. For large matrices, it is the difference between a program that spends most of its time doing arithmetic and a program that spends most of its time waiting for data to arrive.

The Basic Matrix Multiplication Picture

Suppose $A$ is an $m \times n$ matrix and $B$ is an $n \times p$ matrix. Their product $C = A B$ is an $m \times p$ matrix.

A simple triple-loop implementation looks like this:

for i in 0..m-1:
  for j in 0..p-1:
    C[i,j] = 0
    for k in 0..n-1:
      C[i,j] += A[i,k] * B[k,j]

This is the definition translated directly into code. For every output cell, walk across a row of $A$ , walk down a column of $B$ , multiply matching entries, and add them.

The arithmetic cost is easy to count. There are $m \cdot p$ output entries, and each one does $n$ multiply-adds, so the work is approximately:

O (mn p)

For square $N \times N$ matrices, that becomes $O (N^{3})$ .

But the arithmetic count is only half the story. Modern processors can execute huge numbers of multiply-add operations per second. The harder question is: how many times do we have to fetch the same values from memory?

Why the Naive Schedule Wastes Memory

Computers have a memory hierarchy. Registers are tiny and extremely fast. CPU caches are larger but slower. Main memory is much larger and much slower. GPUs have their own version of the same idea: registers, shared memory or local memory, caches, and global memory.

Fast matrix multiplication is mostly an attempt to keep useful values close to the arithmetic units for as long as possible.

In the naive loop above, each output entry $C_{ij}$ streams through a row of $A$ and a column of $B$ . Depending on layout, one of those accesses is usually friendly and the other is awkward.

In row-major storage, consecutive elements of a row sit next to each other in memory. So reading $A [i, k]$ as $k$ increases is nice: we walk through memory in order. But reading $B [k, j]$ as $k$ increases jumps from one row of $B$ to the next while holding the column fixed. That can skip across memory with a large stride.

Even worse, the same values are needed many times. A value like $A [i, k]$ is reused for many different columns $j$ . A value like $B [k, j]$ is reused for many different rows $i$ . The naive schedule often lets those values fall out of fast memory before all their useful work is done.

So the problem is not that $C_{ij} = \sum_{k} A_{ik} B_{k j}$ is wrong. The problem is that the straightforward order of evaluation has poor locality.

Block Decomposition

The blocked idea is to stop thinking of a matrix as only a grid of individual numbers. Instead, view it as a grid of smaller matrices.

For example, split $A$ and $B$ into four blocks each:

A = [A_{11} A_{21} A_{12} A_{22}] B = [B_{11} B_{21} B_{12} B_{22}]

Then their product is:

C = A B = [C_{11} C_{21} C_{12} C_{22}]

where

C_{11} = A_{11} B_{11} + A_{12} B_{21}

C_{12} = A_{11} B_{12} + A_{12} B_{22}

C_{21} = A_{21} B_{11} + A_{22} B_{21}

C_{22} = A_{21} B_{12} + A_{22} B_{22}

This is the same rule as scalar matrix multiplication, just lifted one level up. Instead of multiplying numbers and adding them, we multiply submatrices and add them.

More generally, if $A$ is divided into blocks $A_{ik}$ and $B$ into blocks $B_{k j}$ , then each block of $C$ is:

C_{ij} = k \sum A_{ik} B_{k j}

Notice how similar this is to the scalar formula. The indices mean the same thing. The only difference is that each symbol now represents a tile rather than a single entry.

A Small Worked Example

Take two $4 \times 4$ matrices and split each into $2 \times 2$ blocks.

Let

A = 15913261014371115481216

and

B = 1012011021011201

Split $A$ into:

A_{11} = [1526], A_{12} = [3748]

A_{21} = [9131014], A_{22} = [11151216]

and split $B$ into:

B_{11} = [1001], B_{12} = [2112]

B_{21} = [1210], B_{22} = [0101]

Now compute the top-left block of $C$ :

C_{11} = A_{11} B_{11} + A_{12} B_{21}

The first product is simple because $B_{11}$ is the identity matrix:

A_{11} B_{11} = [1526]

The second product is:

A_{12} B_{21} = [3748] [1210] = [112337]

C_{11} = [1526] + [112337] = [1228513]

If you compute the full product directly, the top-left $2 \times 2$ corner is exactly the same:

C = A B = 12284460513212982440569254157

Blocking did not approximate anything. It only grouped the work.

The Cache Locality Intuition

The reason blocking helps is reuse.

Imagine we want to compute a tile $C_{ij}$ . We load one tile from $A$ , one tile from $B$ , and use them to update the tile in $C$ . If the tile size is chosen well, those pieces fit in cache or shared memory.

A blocked CPU-style loop has the shape:

for ii in blocks_of_rows(A):
  for jj in blocks_of_cols(B):
    for kk in blocks_of_inner_dimension:
      multiply A[ii,kk] with B[kk,jj]
      accumulate into C[ii,jj]

The important part is that each loaded tile participates in many arithmetic operations before being evicted.

Suppose a tile is $b \times b$ . Loading one tile of $A$ brings in $b^{2}$ numbers. Loading one tile of $B$ brings in another $b^{2}$ numbers. Multiplying those two tiles performs about $b^{3}$ multiply-adds.

That means the amount of arithmetic per loaded value grows with $b$ :

arithmetic intensity \approx \frac{2 b ^{3}}{2 b ^{2}} = b

This is not a perfect hardware model, but it captures the main idea. Larger tiles allow more computation per byte fetched. The tile cannot be too large, though, because it must fit in the fast memory level we are targeting.

This is why real GEMM kernels are full of carefully chosen block sizes. There are blocks for L1 cache, blocks for L2 cache, register tiles, vector widths, memory alignment, and sometimes packing layouts. The beautiful high-level formula eventually becomes a very practical question: what tile shape lets this machine reuse data the most?

Why This Matters on GPUs

On a GPU, the same idea becomes even more visible.

A GPU has many threads. Global memory is large but expensive to access. Shared memory is much faster but limited. A common tiled matrix multiplication kernel assigns a thread block to a tile of $C$ .

The thread block repeatedly does this:

Load a tile of $A$ from global memory into shared memory.
Load a tile of $B$ from global memory into shared memory.
Synchronize the threads so the tile is fully available.
Compute partial dot products using the shared tiles.
Move to the next tile along the inner dimension $k$ .
Accumulate until the output tile is complete.

The win is that many threads reuse the same shared-memory tiles. Without tiling, each thread might fetch overlapping data from global memory independently. With tiling, the block cooperates: load once, reuse many times.

This also explains why matrix multiplication is such a natural fit for GPUs. It has enormous regularity. The same operation is performed across a grid. The memory access pattern can be made predictable. Each output tile has enough independent arithmetic to keep many lanes busy.

Tensor cores push this even further. They operate on small matrix fragments directly, doing matrix multiply-accumulate operations as hardware primitives. But the surrounding logic is still a blocking story: move fragments of $A$ and $B$ close to the compute units, reuse them intensely, and accumulate into fragments of $C$ .

What Blocking Does Not Change

Blocking does not reduce the asymptotic arithmetic count of ordinary matrix multiplication. If we use the classical algorithm, square matrix multiplication is still $O (N^{3})$ .

Algorithms like Strassen change the arithmetic structure. Blocking changes the execution schedule.

That makes blocking less glamorous but incredibly important. Most practical performance wins in dense linear algebra do not come from discovering a new formula. They come from arranging the old formula so the hardware spends less time idle.

The same pattern appears everywhere in systems and machine learning:

GEMM libraries such as BLAS implementations use multilevel blocking.
CUDA kernels use shared-memory tiling.
Transformer inference depends heavily on fast matrix multiply kernels.
Attention implementations tile queries, keys, and values to reduce high-bandwidth-memory traffic.
Compiler optimizations often try to restructure loops to improve locality.

In all of these cases, the question is not only “how many operations?” but also “where is the data when those operations happen?”

Choosing a Block Size

A good block size depends on the hardware and the data type.

If the block is too small, we do not reuse enough data. The program spends too much time on loop overhead and memory movement.

If the block is too large, the tile does not fit in the intended fast memory. Then the cache starts thrashing, shared memory pressure increases, registers spill, or occupancy drops.

So the best size is a compromise among several limits:

cache capacity or shared memory capacity
register file pressure
vector width or tensor core fragment shape
memory alignment
number of active threads or warps
cost of synchronization
shape of the matrices themselves

This is why highly optimized matrix multiplication code often looks complicated. The underlying equation is short, but the implementation is negotiating with the hardware at every level.

The Mental Model

Block matrix multiplication is the same math with better locality.

The scalar formula is:

C_{ij} = k \sum A_{ik} B_{k j}

The block formula is:

C_{ij} = k \sum A_{ik} B_{k j}

The notation is identical because the idea is identical. The only difference is what the symbols represent: numbers in the first case, tiles in the second.

That is the whole trick. Bring a chunk of data close. Do a lot of useful work with it. Write back the result. Then move to the next chunk.

Fast matrix multiplication is not just about multiplying quickly. It is about making every byte earn its trip through the memory hierarchy.