Reducing Memory Cost: Tiling & Reuse

Visualizing how matrix partitioning minimizes expensive off-chip memory access.

Efficiency Logic: Deep Dive

Arithmetic Intensity (AI) measures operations per byte of memory traffic. It determines whether a program is "Memory Bound" (limited by memory bandwidth) or "Compute Bound" (limited by arithmetic throughput).
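One way to make the memory-bound versus compute-bound distinction concrete is to compare a kernel's intensity with the hardware's balance point, meaning peak compute throughput divided by peak memory bandwidth. The sketch below assumes this roofline-style rule. The function name `classify` and the peak figures are hypothetical and are not taken from the visualization.

```python
def classify(intensity_macs_per_byte, peak_macs_per_s, peak_bytes_per_s):
    """Label a kernel as memory bound or compute bound (roofline-style rule)."""
    # Balance point: the intensity at which compute time equals memory time.
    balance = peak_macs_per_s / peak_bytes_per_s
    if intensity_macs_per_byte < balance:
        return "Memory Bound"
    return "Compute Bound"


# Hypothetical accelerator: 1e12 MACs/s peak compute, 1e11 bytes/s DRAM bandwidth.
# Its balance point is 10 MACs per byte.
print(classify(2.0, 1e12, 1e11))   # below the balance point
print(classify(50.0, 1e12, 1e11))  # above the balance point
```

Raising intensity through reuse is what moves a kernel from the first category toward the second.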

Large Memory (DRAM) Cost

$$\text{Total Reads} = (M \times CHW) + (CHW \times N)$$

In this ideal case, every filter and input element is fetched from DRAM exactly once.
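The read count above holds only when nothing has to be fetched twice. As a rough sketch of what happens otherwise, the code below counts DRAM reads when the output is computed in tiles of `tile_m` by `tile_n`. It assumes (this is not stated in the source) that each output tile loads its full filter rows and input columns once, and that the tile sizes divide $M$ and $N$ evenly. The function names are illustrative.

```python
def ideal_reads(m, n, chw):
    """Elements read when filters (M x CHW) and inputs (CHW x N) are fetched once."""
    return m * chw + chw * n


def tiled_reads(m, n, chw, tile_m, tile_n):
    """Elements read when each tile_m x tile_n output tile re-fetches its operands.

    Assumes tile_m divides m and tile_n divides n (illustrative simplification).
    """
    if m % tile_m or n % tile_n:
        raise ValueError("tile sizes must divide M and N in this sketch")
    num_tiles = (m // tile_m) * (n // tile_n)
    reads_per_tile = tile_m * chw + chw * tile_n
    return num_tiles * reads_per_tile


M, N, CHW = 64, 64, 128
print(ideal_reads(M, N, CHW))        # 16384
print(tiled_reads(M, N, CHW, 1, 1))  # 1048576 (no reuse between outputs)
print(tiled_reads(M, N, CHW, 8, 8))  # 131072
print(tiled_reads(M, N, CHW, M, N))  # 16384 (one tile covers everything)
```

Under these assumptions, larger tiles cut DRAM reads, and a single tile covering the whole output recovers the fetched-once count.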

$$AI = \frac{\text{Total MACs}}{\text{Total Bytes Transferred}}$$

Assuming Float32 (4 bytes per element):

$$AI = \frac{\text{Reuse Factor}}{4}$$
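The factor of 4 follows directly from the byte count. A short derivation, writing $R$ for Total Reads in elements (a symbol introduced here for brevity) and assuming that only reads contribute to the traffic:

$$AI = \frac{\text{Total MACs}}{4 \times R} = \frac{1}{4} \cdot \frac{\text{Total MACs}}{R} = \frac{\text{Reuse Factor}}{4}$$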

The Reuse Factor measures operations per **element** fetched. It is the unitless version of Arithmetic Intensity.

$$\text{Total MACs} = M \times N \times CHW$$

$$\text{Reuse} = \frac{\text{Total MACs}}{\text{Total Reads (Elements)}}$$
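To see the formulas produce numbers, here is a minimal sketch that evaluates them for the fetched-once case. The layer sizes are arbitrary illustrative values, and the helper names are hypothetical.

```python
BYTES_PER_ELEMENT = 4  # Float32, as assumed in the text


def total_macs(m, n, chw):
    """Multiply-accumulates for an (M x CHW) by (CHW x N) product."""
    return m * n * chw


def reuse_factor(macs, reads_elements):
    """MACs per element fetched."""
    return macs / reads_elements


def arithmetic_intensity(macs, reads_elements, bytes_per_element=BYTES_PER_ELEMENT):
    """MACs per byte transferred, counting reads only."""
    return macs / (reads_elements * bytes_per_element)


M, N, CHW = 64, 64, 128                  # illustrative sizes
macs = total_macs(M, N, CHW)             # 524288
reads = M * CHW + CHW * N                # 16384, every element fetched once
reuse = reuse_factor(macs, reads)        # 32.0
ai = arithmetic_intensity(macs, reads)   # 8.0
print(f"MACs={macs} reads={reads} reuse={reuse:.1f}x AI={ai:.2f} MACs/byte")
```

In the fetched-once case the reuse factor simplifies to $MN / (M + N)$, which is 32 for these sizes.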

[Interactive panel: for the selected memory scenario, reports Memory Reads, MAC Ops, Reuse Factor (Ops/Elem), Intensity (Ops/Byte), and the Active Tile [M, N].]

[Figure: Filters ($M \times CHW$), Input fmaps ($CHW \times N$), and Output fmaps ($M \times N$), alongside a Processing Element (PE) with Local Memory (SRAM).]