No Blocking: $\frac{5}{4} n^3$ Misses

Visualizing cache inefficiency with $L=4$ (4 elements per cache line).

1. The Inner Loop (k-loop)

To compute one element $C[i][j]$, we read one row of $A$ and one column of $B$.

Loop: $C[i][j] += A[i][k] * B[k][j]$

2. Misses for Matrix A

Matrix A is accessed in row-major order. With $L=4$, we get 1 miss every 4 elements. $$\text{Misses per row of A} = n / 4$$

3. The Problem: Matrix B

Matrix B is accessed in column-major order. Each access $B[k][j]$ jumps to a new row, causing a miss for every single element. $$\text{Misses per column of B} = n$$

4. Misses for One $C[i][j]$

Summing misses from A and B for one single dot product calculation: $$\text{Misses} = \frac{n}{4} + n = \frac{5}{4}n$$

This results in 1.25 misses per multiplication/addition pair.

5. Total Matrix Misses

We repeat this for all $n^2$ elements of Matrix C. $$\text{Total} = n^2 \times \frac{5}{4}n = \mathbf{\frac{5}{4}n^3}$$
Total Misses = $\frac{5}{4} n^3$

Access Pattern