Hardware-Level Matrix Multiply-Accumulate (MMA)
In the classic SIMT model, a warp of 32 threads executes the same instruction across different scalar data. Here, we show four threads each calculating one element of the result tile. Every thread must fetch its own operands into registers, which puts high pressure on the register file.
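A minimal CUDA sketch of this SIMT path, assuming square row-major matrices; the kernel name `simt_mma_tile` and the launch geometry are illustrative, not a specific library API. Each thread owns one output element and issues its own scalar loads and fused multiply-adds:

```cuda
// Illustrative sketch: one thread per output element, scalar FMAs.
__global__ void simt_mma_tile(const float* A, const float* B, float* C, int N) {
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    if (row >= N || col >= N) return;

    float acc = 0.0f;
    for (int k = 0; k < N; ++k) {
        // Each thread fetches its own pair of operands into registers,
        // so operand traffic and register pressure scale with thread count.
        acc = fmaf(A[row * N + k], B[k * N + col], acc);
    }
    C[row * N + col] = acc;
}
```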
Tensor Cores are specialized for mixed-precision arithmetic. By using FP16 for the large weight matrices and FP32 for the accumulation, the GPU gains the speed of half-precision without the cumulative rounding errors that plague pure 16-bit math.
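CUDA exposes this mixed-precision path through the warp-level `nvcuda::wmma` API, where a whole warp cooperatively computes one tile: the operand fragments are declared as `half` while the accumulator fragment is `float`. A minimal single-tile sketch, assuming a 16x16x16 shape and contiguous tile pointers (the kernel name is illustrative):

```cuda
#include <cuda_fp16.h>
#include <mma.h>
using namespace nvcuda;

// One warp computes a 16x16 output tile on the Tensor Cores:
// FP16 operand fragments, FP32 accumulator fragment.
__global__ void wmma_mixed_precision(const half* A, const half* B, float* C) {
    wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::row_major> a_frag;
    wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::col_major> b_frag;
    wmma::fragment<wmma::accumulator, 16, 16, 16, float> c_frag;

    wmma::fill_fragment(c_frag, 0.0f);       // FP32 accumulator starts at zero
    wmma::load_matrix_sync(a_frag, A, 16);   // warp-wide load of the FP16 A tile
    wmma::load_matrix_sync(b_frag, B, 16);   // warp-wide load of the FP16 B tile
    wmma::mma_sync(c_frag, a_frag, b_frag, c_frag); // C += A*B in one instruction
    wmma::store_matrix_sync(C, c_frag, 16, wmma::mem_row_major);
}
```

Note the contrast with the SIMT sketch above: operand staging and the multiply-accumulate are issued once per warp rather than once per thread, which is where the register-file pressure is relieved.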