Visualizing cache inefficiency with $L=4$ (4 elements per cache line).
1. The Inner Loop (k-loop)
To compute one element $C[i][j]$, we read one row of $A$
and one column of $B$.
Loop: $C[i][j] += A[i][k] * B[k][j]$
2. Misses for Matrix A
Matrix A is accessed in row-major order. With $L=4$, we get
1 miss every 4 elements. $$\text{Misses per row of A} = n / 4$$
3. The Problem: Matrix B
Matrix B is accessed in column-major order. Each access
$B[k][j]$ jumps to a new row, causing a miss for every
single element. $$\text{Misses per column of B} = n$$
4. Misses for One $C[i][j]$
Summing misses from A and B for one single dot product
calculation: $$\text{Misses} = \frac{n}{4} + n = \frac{5}{4}n$$
This results in 1.25 misses per multiplication/addition pair.
5. Total Matrix Misses
We repeat this for all $n^2$ elements of Matrix C. $$\text{Total}
= n^2 \times \frac{5}{4}n = \mathbf{\frac{5}{4}n^3}$$