[Figure: block diagram. Components: Host Interface (PCIe Gen3 x16 Interface), DDR3 DRAM Chips, DDR3 Interfaces, Weight FIFO (Weight Fetcher), Unified Buffer (Local Activation Storage), Systolic Data Setup, Matrix Multiply Unit (64K per cycle), Accumulators, Activation, Normalize / Pool, Control, Instr. Link bandwidth labels: 10 GiB/s, 14 GiB/s, 30 GiB/s, 167 GiB/s.]
[Figure: Systolic Array. Labels: Input Data, Weights are preloaded, Partial Sums, Done, adders (+). Legend: Data, Control.]
Software has the illusion that each 256B input is read at once and instantly updates one location in each of the 256 accumulator RAMs.
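The illusion described here can be illustrated with a small cycle-by-cycle simulation. The Python sketch below models a weight-stationary systolic array in which weights are preloaded, inputs enter from the left with a one-cycle skew per row, and partial sums flow downward into per-column accumulators. It is only a sketch under stated assumptions: the function names `systolic_matmul` and `reference_matmul`, the toy 2x3 array size (standing in for the 256-wide unit), and the exact skew timing are hypothetical choices for illustration, not details taken from the source.

```python
def reference_matmul(inputs, weights):
    """What software 'sees': each input vector updates every output at once."""
    n, m = len(weights), len(weights[0])
    return [[sum(x[k] * weights[k][j] for k in range(n)) for j in range(m)]
            for x in inputs]


def systolic_matmul(inputs, weights):
    """Cycle-by-cycle model of a weight-stationary systolic array (assumed timing).

    Cell (k, j) holds the preloaded weight weights[k][j]. Activations move
    right one cell per cycle; partial sums move down one cell per cycle.
    """
    n, m = len(weights), len(weights[0])
    batch = len(inputs)
    x_reg = [[0] * m for _ in range(n)]   # activation latched in each cell
    p_reg = [[0] * m for _ in range(n)]   # partial sum leaving each cell
    acc = [[0] * m for _ in range(batch)] # one accumulator column per output

    # The last result leaves the bottom-right cell at cycle batch + n + m - 3.
    for t in range(batch + n + m - 2):
        new_x = [[0] * m for _ in range(n)]
        new_p = [[0] * m for _ in range(n)]
        for k in range(n):
            for j in range(m):
                if j == 0:
                    # Row k receives element k of input vector b, skewed by k cycles.
                    b = t - k
                    x_in = inputs[b][k] if 0 <= b < batch else 0
                else:
                    x_in = x_reg[k][j - 1]
                p_in = p_reg[k - 1][j] if k > 0 else 0
                new_x[k][j] = x_in
                new_p[k][j] = p_in + x_in * weights[k][j]
        x_reg, p_reg = new_x, new_p

        # The bottom row emits finished sums; column j is j cycles behind column 0.
        for j in range(m):
            b = t - (n - 1) - j
            if 0 <= b < batch:
                acc[b][j] = p_reg[n - 1][j]
    return acc


if __name__ == "__main__":
    inputs = [[1, 2], [3, 4], [5, 6]]
    weights = [[1, 0, 2], [0, 1, 3]]
    print(systolic_matmul(inputs, weights) == reference_matmul(inputs, weights))
```

Although the results in the simulated array emerge staggered over several cycles, the accumulated values match the plain matrix product, which is the sense in which the pipelining is invisible to software.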