Tiled Matrix Multiplication: Physical Trace

Tracking hardware cache blocks of $L=4$ floats (16 bytes) and software tiles of $T \times T$ floats with $T=2$.
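The two granularities can be made concrete with a small index calculation. The helpers `block_of` and `tile_of` are hypothetical and exist only for illustration. They assume 4-byte floats in a row-major matrix whose first element starts on a block boundary.

```c
#include <assert.h>

#define L 4  /* floats per 16-byte hardware block (4-byte floats assumed) */
#define T 2  /* software tile edge, in floats */

/* Hardware block holding m[i][j] in a row-major n x n float matrix,
   assuming m[0][0] starts on a block boundary. */
int block_of(int n, int i, int j) {
    assert(0 <= i && i < n && 0 <= j && j < n);
    return (i * n + j) / L;
}

/* Software tile holding m[i][j], tiles numbered row-major over an
   (n/T) x (n/T) grid. */
int tile_of(int n, int i, int j) {
    assert(n % T == 0);
    return (i / T) * (n / T) + j / T;
}
```

For $n=4$, `a[0][2]` shares hardware block 0 with `a[0][0]` but lies in tile 1, while `a[1][0]` is in tile 0 but block 1. With $T=2 < L=4$, each block spans two tiles and each tile spans two blocks.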

General Efficiency Formulas

Standard (Non-Tiled)

$$\text{Total Misses} \approx \frac{n^3}{L}\,(\underbrace{0}_{A} + \underbrace{1}_{B} + \underbrace{1}_{C}) = \frac{2n^3}{L} = 0.50n^3$$

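Assumes $kij$ loop order. $a[i][k]$ is held in a register for the whole inner $j$ loop (0 misses for A), while rows of B and C are scanned with stride 1, so each misses once per $L$-float block. Because $n$ is assumed large, those rows do not stay cached between passes, so the misses recur on every one of the $n^2$ $(k, i)$ passes.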
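A minimal sketch of the $kij$ baseline, written in the same C as the tiled `mmm()` listing. The name `mmm_kij` and the n-first parameter order are choices made for this example.

```c
/* Standard (non-tiled) kij order: c += a * b for n x n row-major floats. */
void mmm_kij(int n, float a[n][n], float b[n][n], float c[n][n]) {
    for (int k = 0; k < n; k++)
        for (int i = 0; i < n; i++) {
            float r = a[i][k];          /* A: one load per (k, i) pass, then kept in a register */
            for (int j = 0; j < n; j++)
                c[i][j] += r * b[k][j]; /* B and C: stride-1 row scans, one miss per L floats */
        }
}
```

Only the order of the three loops differs from the textbook $ijk$ version; the arithmetic is identical, so the result is the same product.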

Tiled Optimization

$$\text{Total Misses} \approx \frac{n^3}{2T} = \frac{n^3}{4} = 0.25n^3$$

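Assumes three $T \times T$ tiles (one each from A, B, and C) fit in cache together, $3T^2 < C$, where $C$ here is the cache capacity in floats rather than matrix C. Within that bound, larger tiles give fewer misses.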
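One way to arrive at the tiled count, assuming each $T \times T$ tile occupies exactly $T^2/L$ blocks (true when $T$ is a multiple of $L$) and ignoring misses on C, which total only about $n^2/L$: each C tile is updated with $n/T$ tiles of A and $n/T$ tiles of B, and there are $(n/T)^2$ C tiles.

$$\text{Total Misses} \approx \underbrace{\frac{2n}{T}}_{\text{A and B tiles per C tile}} \times \underbrace{\frac{T^2}{L}}_{\text{misses per tile}} \times \underbrace{\left(\frac{n}{T}\right)^2}_{\text{C tiles}} = \frac{2n^3}{LT}$$

With $L=4$ this is $n^3/(2T)$, the tiled formula above; compared with the non-tiled $2n^3/L$, tiling divides the miss count by $T$.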

[Interactive trace: steps through the loop indices i, j, k, il, jl, kl, highlighting the active float, its hardware block, and its $T=2$ software tile, with per-step and cumulative cache misses for A, B, and C.]

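    #define T 2  // software tile edge (T = 2 in this trace)
    // Tiled c += a * b (n x n, row-major); n must be a multiple of T and comes first so the VLA parameters can use it
    void mmm(int n, float a[n][n], float b[n][n], float c[n][n]) {
        // THREE OUTER LOOPS ITERATE OVER TILES
        for (int i = 0; i < n; i += T)
            for (int j = 0; j < n; j += T)
                for (int k = 0; k < n; k += T)
                    // THREE INNER LOOPS ITERATE WITHIN TILES
                    for (int il = i; il < i + T; il++)
                        for (int jl = j; jl < j + T; jl++)
                            for (int kl = k; kl < k + T; kl++)
                                c[il][jl] += a[il][kl] * b[kl][jl];
    }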
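To check the formulas against a concrete trace, the sketch below replays the $kij$ baseline and the tiled loop nest through a toy cache model. Everything about the model is an assumption, not the visualization's own simulator: a fully associative LRU cache of 16 blocks (64 floats), 16-byte blocks ($L=4$), $n=64$, and A, B, C stored back to back and block-aligned. The names `touch`, `addr`, `misses_kij`, `misses_tiled`, `N`, and `CACHE_BLOCKS` are hypothetical.

```c
#include <assert.h>
#include <string.h>

/* Assumed toy model (not the trace's own simulator): fully associative
   LRU cache, 16-byte blocks (L = 4 floats), 16 blocks = 64 floats total. */
#define N 64             /* matrix dimension */
#define L 4              /* floats per hardware block */
#define CACHE_BLOCKS 16  /* cache capacity in blocks */

enum { MA, MB, MC };             /* A, B, C stored back to back */

static long cache[CACHE_BLOCKS]; /* resident block ids, MRU first */
static int used;                 /* valid entries in cache[] */
static long misses;

/* Flat float index of m[i][j]; each matrix starts block-aligned. */
static long addr(int m, int i, int j) {
    return (long)m * N * N + (long)i * N + j;
}

/* Access one float: count a miss if its block is absent, then make the
   block most recently used (evicting the LRU block when full). */
static void touch(long idx) {
    long blk = idx / L;
    int pos = 0;
    while (pos < used && cache[pos] != blk) pos++;
    if (pos == used) {                  /* miss */
        misses++;
        if (used < CACHE_BLOCKS) used++;
        pos = used - 1;                 /* empty slot or LRU victim */
    }
    memmove(&cache[1], &cache[0], (size_t)pos * sizeof cache[0]);
    cache[0] = blk;
}

static void reset(void) { used = 0; misses = 0; }

/* Baseline kij order: a[i][k] is read once per (k, i) pass. */
long misses_kij(void) {
    reset();
    for (int k = 0; k < N; k++)
        for (int i = 0; i < N; i++) {
            touch(addr(MA, i, k));
            for (int j = 0; j < N; j++) {
                touch(addr(MB, k, j));
                touch(addr(MC, i, j));
            }
        }
    return misses;
}

/* Same tiled loop nest as mmm(), with tile size t. */
long misses_tiled(int t) {
    assert(N % t == 0);
    reset();
    for (int i = 0; i < N; i += t)
        for (int j = 0; j < N; j += t)
            for (int k = 0; k < N; k += t)
                for (int il = i; il < i + t; il++)
                    for (int jl = j; jl < j + t; jl++)
                        for (int kl = k; kl < k + t; kl++) {
                            touch(addr(MC, il, jl)); /* c[il][jl] */
                            touch(addr(MA, il, kl)); /* a[il][kl] */
                            touch(addr(MB, kl, jl)); /* b[kl][jl] */
                        }
    return misses;
}
```

Under these assumptions `misses_kij()` comes to $2n^3/L + n^2 = 135168$ (the extra $n^2$ is one miss on `a[i][k]` per pass) and `misses_tiled(4)` to $2n^3/(LT) + n^2/L = 33792$ (the extra $n^2/L$ is C's compulsory misses). $T=4$ satisfies $3T^2 = 48 < 64$ and makes each tile row fill exactly one block. With the trace's $T=2$, tiling still beats $kij$, but the count lands above $n^3/(2T)$ because each tile row covers only half a block.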