NVIDIA Tensor Cores

Hardware-Level Matrix Multiply-Accumulate (MMA)


Hardware Metrics

Register Fetches: 128
Relative Power: 1x
*Tensor Cores reduce power by minimizing high-energy register file accesses.
[Diagram: Register File A (FP16) × Register File B (FP16) = Accumulator C (FP32)]

CUDA Core Parallelism

A warp of 32 threads executes the same instruction across different scalar data. Here, we show 4 threads calculating 4 elements of the result tile. Each thread must fetch its own operands, putting high pressure on the register file.
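The per-thread access pattern can be sketched in plain Python (not real CUDA; the 2×2 tile and the fetch counter are illustrative, simulating four threads of a warp each computing one output element):

```python
# Each "thread" (row, col) computes one element of C = A @ B,
# fetching its own copies of the A and B operands from registers.
A = [[1.0, 2.0],
     [3.0, 4.0]]
B = [[5.0, 6.0],
     [7.0, 8.0]]

C = [[0.0, 0.0], [0.0, 0.0]]
fetches = 0
for row in range(2):
    for col in range(2):          # (row, col) plays the role of a thread ID
        acc = 0.0
        for k in range(2):        # each thread walks the shared dimension alone
            a = A[row][k]         # independent register fetch
            b = B[k][col]         # independent register fetch
            fetches += 2
            acc += a * b          # scalar fused multiply-add
        C[row][col] = acc

print(C)        # [[19.0, 22.0], [43.0, 50.0]]
print(fetches)  # 16 -- every thread re-reads rows of A and columns of B
```

The redundancy is the point: each element of A and B is fetched multiple times, one copy per thread that needs it, which is exactly the register-file pressure a Tensor Core's shared MMA data path avoids.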

The MMA Data Path

[Diagram: two FP16 inputs feed the internal FMA multiplier; products flow into an FP32 accumulation stage. Rounding happens once, at the final output.]
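A minimal sketch of this data path, using Python's stdlib `struct` (format `'e'` is IEEE 754 half precision) to stand in for FP16 registers; the wide Python float stands in for the FP32 accumulator:

```python
import struct

def to_fp16(x: float) -> float:
    # Round-trip through IEEE 754 half precision (struct format 'e')
    return struct.unpack('<e', struct.pack('<e', x))[0]

# FP16 operands, both exactly representable in half precision
a = to_fp16(1.0009765625)   # 1 + 2**-10, the smallest FP16 step above 1.0
b = to_fp16(2.0)

# Internal FMA: the product and the running sum are kept at wide
# precision -- no rounding between the multiply and the add
acc = 10.0 + a * b          # 12.001953125, held exactly in the accumulator

# Rounding happens once, only when the result is narrowed at the output
out = to_fp16(acc)
print(acc, out)             # 12.001953125 12.0
```

Narrowing `acc` back to FP16 loses the low bits in a single rounding step; accumulating in FP16 throughout would have lost them at every add.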

Architectural Advantage

Tensor Cores are specialized for mixed precision. By using FP16 for the large weight matrices and FP32 for the accumulation, the GPU gains the speed of half precision without the instability of cumulative rounding errors found in pure 16-bit math.
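That instability is easy to demonstrate. The sketch below (stdlib-only; `struct` format `'e'` emulates FP16 rounding) sums 4096 ones, once with an FP16 accumulator and once with a wide accumulator:

```python
import struct

def to_fp16(x: float) -> float:
    # Round-trip through IEEE 754 half precision (struct format 'e')
    return struct.unpack('<e', struct.pack('<e', x))[0]

n = 4096
inputs = [to_fp16(1.0)] * n   # FP16 input values

# Pure FP16: round after every add. Above 2048 the FP16 spacing is 2,
# so 2048 + 1 rounds back to 2048 and the sum stalls permanently.
acc16 = 0.0
for x in inputs:
    acc16 = to_fp16(acc16 + x)

# Mixed precision: FP16 inputs, wide (FP32-style) accumulator
acc32 = 0.0
for x in inputs:
    acc32 += x

print(acc16)  # 2048.0 -- half the true sum
print(acc32)  # 4096.0 -- exact
```

This is precisely the failure mode the FP32 accumulator avoids: long dot products in deep-learning layers involve thousands of additions, so rounding at every step is not a small perturbation but a systematic bias.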

Throughput Speedup: Up to 16x