Hardware-Level Matrix Multiply-Accumulate (MMA)
In the classic SIMT model, a warp of 32 threads executes the same instruction across different scalar data. Here, we show four threads each calculating one element of the result tile. Every thread must fetch its own operands into registers, which puts high pressure on the register file.
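A minimal CUDA sketch of this SIMT path, assuming square row-major matrices; the kernel name `simt_mma_tile` and the launch geometry are illustrative, not a specific library API. Each thread owns one output element and issues its own scalar loads and fused multiply-adds:

```cuda
// Illustrative sketch: one thread per output element, scalar FMAs.
__global__ void simt_mma_tile(const float* A, const float* B, float* C, int N) {
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    if (row >= N || col >= N) return;

    float acc = 0.0f;
    for (int k = 0; k < N; ++k) {
        // Each thread fetches its own pair of operands into registers,
        // so operand traffic and register pressure scale with thread count.
        acc = fmaf(A[row * N + k], B[k * N + col], acc);
    }
    C[row * N + col] = acc;
}
```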
Tensor Cores are specialized for mixed-precision arithmetic. By using FP16 for the large weight matrices and FP32 for the accumulation, the GPU gains the speed of half-precision without the cumulative rounding errors that plague pure 16-bit math.
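CUDA exposes this mixed-precision path through the warp-level `nvcuda::wmma` API, where a whole warp cooperatively computes one tile: the operand fragments are declared as `half` while the accumulator fragment is `float`. A minimal single-tile sketch, assuming a 16x16x16 shape and contiguous tile pointers (the kernel name is illustrative):

```cuda
#include <cuda_fp16.h>
#include <mma.h>
using namespace nvcuda;

// One warp computes a 16x16 output tile on the Tensor Cores:
// FP16 operand fragments, FP32 accumulator fragment.
__global__ void wmma_mixed_precision(const half* A, const half* B, float* C) {
    wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::row_major> a_frag;
    wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::col_major> b_frag;
    wmma::fragment<wmma::accumulator, 16, 16, 16, float> c_frag;

    wmma::fill_fragment(c_frag, 0.0f);       // FP32 accumulator starts at zero
    wmma::load_matrix_sync(a_frag, A, 16);   // warp-wide load of the FP16 A tile
    wmma::load_matrix_sync(b_frag, B, 16);   // warp-wide load of the FP16 B tile
    wmma::mma_sync(c_frag, a_frag, b_frag, c_frag); // C += A*B in one instruction
    wmma::store_matrix_sync(C, c_frag, 16, wmma::mem_row_major);
}
```

Note the contrast with the SIMT sketch above: operand staging and the multiply-accumulate are issued once per warp rather than once per thread, which is where the register-file pressure is relieved.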