Tiled Matrix Multiplication: Physical Trace

This trace tracks hardware cache blocks (\(L=4\) floats, i.e. 16 bytes per block) and software tiles (\(T=2\)) during a tiled matrix multiplication.

General Efficiency Formulas

Standard (Non-Tiled)

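$$\text{Total Misses} \approx \frac{n^3}{L} \times (\underbrace{0}_{A} + \underbrace{1}_{B} + \underbrace{1}_{C}) = \frac{2n^3}{4} = 0.50n^3$$
Assumes \(kij\) loop order with \(L=4\): the element of A stays fixed (stationary) throughout the inner loop, while B and C are walked along rows and re-fetched on every pass, each costing one miss per block of \(L\) floats.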
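
As an illustration, the \(kij\) order assumed above could be written in C roughly as follows. The function name `mmm_kij` and the flat row-major layout are assumptions made for this sketch; they do not come from the original page.

```c
#include <assert.h>

/* Untiled multiply in kij order: c += a * b.
 * Matrices are n x n, stored row-major in flat arrays (an assumption of this sketch). */
void mmm_kij(int n, const float *a, const float *b, float *c)
{
    assert(n > 0);
    for (int k = 0; k < n; k++) {
        for (int i = 0; i < n; i++) {
            /* a[i][k] is read once and stays fixed for the whole j loop. */
            float r = a[i * n + k];
            for (int j = 0; j < n; j++) {
                /* b[k][j] and c[i][j] are both walked along a row,
                 * so each crosses into a new block every L floats. */
                c[i * n + j] += r * b[k * n + j];
            }
        }
    }
}
```

Holding `a[i][k]` in a local variable makes the "stationary A" assumption explicit: only B and C generate memory traffic inside the innermost loop.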

Tiled Optimization

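$$\text{Total Misses} \approx \frac{n^3}{2T} = \frac{n^3}{4} = 0.25n^3$$
Assumes \(3T^2 < C\), where \(C\) here denotes the cache capacity in floats (not the matrix C), so that one tile each of A, B, and C fits in cache at once. With \(L=4\), each tile step loads a tile of A and a tile of B at roughly \(T^2/L\) misses apiece, which gives the \(n^3/(2T)\) estimate; \(T=2\) then yields \(n^3/4\). Total misses decrease as the tile size \(T\) increases, as long as the capacity condition still holds.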
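
The two estimates can be compared numerically with a small helper. This is only a sketch of the model, assuming the tiled count has the general form \(2n^3/(LT)\), which reduces to \(n^3/(2T)\) for \(L=4\). The function names are hypothetical.

```c
#include <assert.h>

/* Modeled misses for the untiled kij loop: (n^3 / L) * (0 + 1 + 1). */
double misses_untiled(int n, int L)
{
    assert(n > 0 && L > 0);
    return 2.0 * n * n * n / L;
}

/* Modeled misses for the tiled loop, assuming (n/T)^3 tile steps that each
 * load one tile of A and one tile of B at T*T/L misses per tile. */
double misses_tiled(int n, int L, int T)
{
    assert(n > 0 && L > 0 && T > 0);
    return 2.0 * n * n * n / ((double)L * T);
}
```

For \(n=4\), \(L=4\), \(T=2\), this model gives 32 misses untiled and 16 tiled, matching the \(0.50n^3\) and \(0.25n^3\) figures.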

[Interactive trace: matrices A × B = C, with the active float, its hardware block (16B), and the current software tile (\(T=2\)) highlighted. The control panel shows the current loop indices i, j, k, il, jl, kl and the physical cache statistics (current step misses and total physical misses) over 64 steps. At step 1, A, B, and C each miss, for 3 misses.]

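
The miss counters reported by the trace can be approximated offline by replaying the tiled loop nest's memory accesses through a small cache model. The sketch below assumes a fully associative LRU cache whose capacity (in blocks) is a free parameter, and it assumes the three matrices are laid out back to back in row-major order. The page states neither the cache organization nor the layout, so all names and parameters here are hypothetical.

```c
#include <assert.h>
#include <string.h>

#define BLOCK_FLOATS 4   /* L: floats per hardware block (16 bytes) */
#define MAX_LINES 64     /* upper bound on simulated cache lines */

/* Fully associative LRU cache holding block tags (assumed organization). */
typedef struct {
    long tag[MAX_LINES];
    long last_use[MAX_LINES];
    int  lines;    /* capacity in blocks */
    long clock;    /* logical time, bumped on every access */
    long misses;
} Cache;

static void cache_init(Cache *c, int lines)
{
    assert(lines > 0 && lines <= MAX_LINES);
    memset(c, 0, sizeof *c);
    c->lines = lines;
    for (int i = 0; i < lines; i++)
        c->tag[i] = -1;              /* -1 marks an empty line */
}

/* Touch the float at flat index addr; count a miss if its block is absent. */
static void cache_touch(Cache *c, long addr)
{
    long tag = addr / BLOCK_FLOATS;
    int victim = 0;
    c->clock++;
    for (int i = 0; i < c->lines; i++) {
        if (c->tag[i] == tag) {      /* hit: refresh recency and stop */
            c->last_use[i] = c->clock;
            return;
        }
        if (c->last_use[i] < c->last_use[victim])
            victim = i;              /* empty or least recently used line */
    }
    c->misses++;
    c->tag[victim] = tag;
    c->last_use[victim] = c->clock;
}

/* Replay the tiled loop nest and return the total number of misses. */
long tiled_misses(int n, int T, int cache_lines)
{
    assert(T > 0 && n % T == 0 && n % BLOCK_FLOATS == 0);
    Cache c;
    cache_init(&c, cache_lines);
    long A = 0, B = (long)n * n, C = 2L * n * n;   /* base offsets in floats */
    for (int i = 0; i < n; i += T)
        for (int j = 0; j < n; j += T)
            for (int k = 0; k < n; k += T)
                for (int il = i; il < i + T; il++)
                    for (int jl = j; jl < j + T; jl++)
                        for (int kl = k; kl < k + T; kl++) {
                            cache_touch(&c, A + il * n + kl);
                            cache_touch(&c, B + kl * n + jl);
                            cache_touch(&c, C + il * n + jl);
                        }
    return c.misses;
}
```

With \(n=4\) the three matrices occupy 12 blocks in total, so a model cache of 12 or more lines sees only the 12 cold misses. Smaller capacities add conflict-free capacity misses on top of that, and the exact count depends on the assumed replacement policy.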
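/* T is the tile size; it is assumed to be defined elsewhere and to divide n evenly. */
/* n is declared first so that the array parameter types can refer to it. */
void mmm(int n, float a[n][n], float b[n][n], float c[n][n]) {
    // THREE OUTER LOOPS ITERATE OVER TILES
    for (int i = 0; i < n; i += T)
        for (int j = 0; j < n; j += T)
            for (int k = 0; k < n; k += T)
                // THREE INNER LOOPS ITERATE WITHIN TILES
                for (int il = i; il < i + T; il++)
                    for (int jl = j; jl < j + T; jl++)
                        for (int kl = k; kl < k + T; kl++)
                            c[il][jl] += a[il][kl] * b[kl][jl];
}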
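
One way to gain confidence in a tiled loop nest like this is to compare its result against a plain triple loop on a small input. The sketch below does that with a runtime tile size. The names `mmm_ref`, `mmm_tiled`, and `tiled_matches_reference`, the flat row-major arrays, and the test data are all assumptions made for illustration.

```c
#include <assert.h>

/* Reference: plain ijk triple loop, c += a * b, flat row-major arrays. */
void mmm_ref(int n, const float *a, const float *b, float *c)
{
    for (int i = 0; i < n; i++)
        for (int j = 0; j < n; j++)
            for (int k = 0; k < n; k++)
                c[i * n + j] += a[i * n + k] * b[k * n + j];
}

/* Tiled variant with a runtime tile size T, which must divide n. */
void mmm_tiled(int n, int T, const float *a, const float *b, float *c)
{
    assert(T > 0 && n % T == 0);
    for (int i = 0; i < n; i += T)
        for (int j = 0; j < n; j += T)
            for (int k = 0; k < n; k += T)
                for (int il = i; il < i + T; il++)
                    for (int jl = j; jl < j + T; jl++)
                        for (int kl = k; kl < k + T; kl++)
                            c[il * n + jl] += a[il * n + kl] * b[kl * n + jl];
}

/* Returns 1 if both variants agree on a small n x n input (n <= 8). */
int tiled_matches_reference(int n, int T)
{
    float a[64], b[64], c_ref[64] = {0}, c_tile[64] = {0};
    assert(n > 0 && n <= 8);
    /* Small integer values keep every float sum exact, so the different
     * summation orders of the two variants cannot cause rounding mismatches. */
    for (int x = 0; x < n * n; x++) {
        a[x] = (float)(x % 7);
        b[x] = (float)(x % 5) - 2.0f;
    }
    mmm_ref(n, a, b, c_ref);
    mmm_tiled(n, T, a, b, c_tile);
    for (int x = 0; x < n * n; x++)
        if (c_ref[x] != c_tile[x])
            return 0;
    return 1;
}
```

Tiling only reorders the iterations, so both variants perform the same multiply-accumulate operations per output element. For general floating-point data, a tolerance-based comparison would be more appropriate than exact equality.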