Advanced Quantization: Systems & Training

An interactive tour of systems-level quantization concepts: quantization-aware training (QAT), inference-time activation strategies, and modern weight-only methods for LLMs.

Straight-Through Estimator (STE) in QAT

During training we still need gradients to update the FP32 master weights, but the quantizer's step function has a gradient of zero almost everywhere (and is undefined at the step boundaries). The Straight-Through Estimator (STE) sidesteps this by treating the quantizer as the identity on the backward pass, routing the gradient straight through.

How to use this simulator:

  1. ACTION: Click the "Apply Gradients" button at the bottom; it simulates one step of a real training loop.
  2. The FP32 weights update, and the quantized integers flash whenever they snap to a new step.
[Simulator diagram — Forward pass: weight r (FP32) → Quantizer → quantized weight Q (INT), with example integer values 1, 2, -2, 2. Backward pass: gradient dL/dQ (FP) → STE (Identity) → gradient dL/dr (FP32), with example gradient values 0.6, -0.5, -0.8, 0.7. Readout: Epoch: 0.]
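The forward/backward split in the simulator can be sketched in a few lines of NumPy. This is a minimal illustration, not a framework implementation: the tiny signed grid (integers −2…2, matching the simulator's example values), the scale, the learning rate, and all function names are assumptions chosen for the example.

```python
import numpy as np

# Tiny signed grid matching the simulator (assumed: integers -2..2).
QMIN, QMAX = -2, 2

def quantize(r, scale):
    """Forward pass: snap FP32 weights to the nearest integer step."""
    return np.clip(np.round(r / scale), QMIN, QMAX).astype(np.int32)

def ste_grad(dL_dQ, r, scale):
    """Backward pass: round() has zero gradient almost everywhere, so the
    STE copies dL/dQ straight through as dL/dr, zeroing only weights that
    fell outside the representable range (a common clipped-STE variant)."""
    in_range = (r / scale >= QMIN) & (r / scale <= QMAX)
    return dL_dQ * in_range

# One simulated "Apply Gradients" step (illustrative values):
r = np.array([0.6, 2.3, -1.1, 0.9], dtype=np.float32)       # FP32 master weights
dL_dQ = np.array([0.6, -0.5, -0.8, 0.7], dtype=np.float32)  # upstream gradients
lr, scale = 0.5, 1.0

q_before = quantize(r, scale)
r = r - lr * ste_grad(dL_dQ, r, scale)  # update the FP32 master weights
q_after = quantize(r, scale)            # integers may snap to a new step
```

Here the first weight moves from 0.6 to 0.3 and its integer snaps from 1 to 0, while the out-of-range weight (2.3) receives no gradient under the clipped variant.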

Inference Activation Strategies

Weights are always quantized offline. Activations are quantized with parameters found either before deployment, from calibration data (Static), or on the fly from each incoming batch (Dynamic).

[Interactive demo — a 1D distribution plot of the incoming activation batch, with r_min and r_max markers and an "Original" trace. Buttons run a strategy: Static Quantization (fast) or Dynamic Quantization (slower); a result readout ("Run a strategy below...") shows how each strategy handles the batch, e.g. "Clean" when the range fits.]
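The trade-off can be sketched with 8-bit affine (asymmetric) quantization; the helper names and the specific values below are illustrative. Static uses a range frozen from calibration data, so a runtime outlier saturates; dynamic recomputes the range from each batch.

```python
import numpy as np

def quant_params(r_min, r_max, n_bits=8):
    """Affine (asymmetric) parameters for an unsigned n-bit grid."""
    scale = (r_max - r_min) / (2**n_bits - 1)
    zero_point = round(-r_min / scale)
    return scale, zero_point

def quantize(x, scale, zp, n_bits=8):
    return np.clip(np.round(x / scale) + zp, 0, 2**n_bits - 1)

# Static: r_min/r_max frozen from a calibration set before deployment.
calib = np.random.default_rng(0).normal(0, 1, 10_000)
s_static, z_static = quant_params(calib.min(), calib.max())

# Dynamic: r_min/r_max recomputed from each incoming batch at runtime.
batch = np.array([-0.4, 0.1, 9.0, 0.9])  # 9.0 exceeds the calibration range
s_dyn, z_dyn = quant_params(batch.min(), batch.max())

q_static = quantize(batch, s_static, z_static)  # the outlier saturates at 255
q_dyn = quantize(batch, s_dyn, z_dyn)           # the full batch range is kept
```

Static costs nothing at runtime (the parameters are constants baked into the graph); dynamic pays a min/max pass per batch but adapts to each distribution.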

Weight-Only & Grouped Quantization (LLMs)

Data Flow Architecture

Weights (INT8) ⇨ Dequant ⇨ FP16 Math
Weight-Only (W8A16): Fast memory access, high-precision math.
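The W8A16 data flow can be sketched as follows, with symmetric per-tensor INT8 weights and NumPy float16 standing in for the FP16 math; the function names and shapes are illustrative.

```python
import numpy as np

def quantize_weights(w_fp32, n_bits=8):
    """Symmetric per-tensor INT8 storage: w ≈ scale * w_int8."""
    scale = np.abs(w_fp32).max() / (2**(n_bits - 1) - 1)
    w_int8 = np.round(w_fp32 / scale).astype(np.int8)
    return w_int8, np.float16(scale)

def w8a16_matmul(x_fp16, w_int8, scale):
    """Dequantize just before the matmul; the math itself runs in FP16."""
    w_fp16 = w_int8.astype(np.float16) * scale
    return x_fp16 @ w_fp16

rng = np.random.default_rng(1)
w = rng.normal(0, 0.02, (64, 32)).astype(np.float32)  # 1 byte/weight when stored
x = rng.normal(0, 1, (4, 64)).astype(np.float16)      # activations stay FP16

w_int8, scale = quantize_weights(w)
y = w8a16_matmul(x, w_int8, scale)  # FP16 result; half the weight memory traffic
```

The weight matrix moves through memory at 1 byte per element instead of 2, which is the dominant cost in memory-bound LLM inference, while the accumulation keeps FP16 precision.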

Grouped (Block) Quantization
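As a sketch of the idea (the group size of 64 and the 4-bit grid are common choices but assumed here, as are the function names): each contiguous block of weights gets its own scale, so an outlier only coarsens its own group rather than the whole tensor.

```python
import numpy as np

def quantize_grouped(w, group_size=64, n_bits=4):
    """Each contiguous group of weights gets its own scale, so one outlier
    only coarsens its own group instead of the whole tensor."""
    qmax = 2**(n_bits - 1) - 1
    groups = w.reshape(-1, group_size)
    scales = np.abs(groups).max(axis=1, keepdims=True) / qmax  # one scale/group
    q = np.clip(np.round(groups / scales), -qmax - 1, qmax).astype(np.int8)
    return q, scales

def dequantize_grouped(q, scales, shape):
    return (q.astype(np.float32) * scales).reshape(shape)

w = np.random.default_rng(0).normal(0, 0.05, 256).astype(np.float32)
w[10] = 2.0  # a single outlier, landing in group 0
q, scales = quantize_grouped(w)
w_hat = dequantize_grouped(q, scales, w.shape)
# Only group 0 is forced to a coarse scale; groups 1-3 keep fine resolution.
```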

Granularity & Calibration Strategies

Optimizing precision by picking better ranges (Calibration) and better scale buckets (Granularity).

Calibration (Picking the Range)

[Histogram of the data distribution on a value axis from -10.0 to +10.0.]
Min-Max Calibration: Range stretches to the extreme outliers. Bins are large, so precision is extremely low for the main data distribution.
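One way to see the cost of min-max calibration is to compare its quantization step size against a percentile-clipped range on the same data; the 0.01/99.99 percentile choice and the synthetic outliers below are illustrative, not a prescribed recipe.

```python
import numpy as np

rng = np.random.default_rng(0)
acts = rng.normal(0, 1, 100_000)
acts[:10] = 10.0  # a handful of extreme outliers

def bin_width(r_min, r_max, n_bits=8):
    """Size of one quantization step for a chosen range."""
    return (r_max - r_min) / (2**n_bits - 1)

# Min-max calibration: the range stretches all the way to the outliers.
w_minmax = bin_width(acts.min(), acts.max())

# Percentile calibration: clip the top/bottom 0.01% so the 256 bins
# cover the bulk of the distribution instead.
lo, hi = np.percentile(acts, [0.01, 99.99])
w_pct = bin_width(lo, hi)
# w_pct is noticeably smaller => finer resolution for typical values,
# at the cost of saturating the few clipped outliers.
```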

Granularity (Bucketing)

[Scale readouts, all forced equal under per-tensor mode — Channel 1 Scale: 0.1, Channel 2 Scale: 0.1, Channel 3 Scale: 0.1]
Per-Tensor: one scale for all channels. Because Channel 3 has a huge outlier, every channel must use the same wide, low-precision scale, and the small-magnitude channels lose nearly all resolution.
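The effect can be sketched numerically. The three channels and the outlier value below are illustrative; the comparison measures the reconstruction error on a small-magnitude channel under a shared (per-tensor) scale versus its own (per-channel) scale.

```python
import numpy as np

QMAX = 127  # symmetric INT8

# Three channels of a weight tensor; Channel 3 carries a huge outlier.
channels = [
    np.array([0.10, -0.08, 0.05]),
    np.array([0.09, 0.07, -0.06]),
    np.array([0.11, -0.09, 8.0]),  # the outlier
]

def scale_for(x):
    return np.abs(x).max() / QMAX

def max_error(x, s):
    """Worst-case reconstruction error when quantizing x with scale s."""
    q = np.clip(np.round(x / s), -QMAX - 1, QMAX)
    return np.abs(x - q * s).max()

# Per-tensor: one scale shared by all channels, dictated by the outlier.
s_tensor = scale_for(np.concatenate(channels))

# Per-channel: each channel picks its own scale.
s_channels = [scale_for(c) for c in channels]

err_shared = max_error(channels[0], s_tensor)    # coarse for small channels
err_own = max_error(channels[0], s_channels[0])  # fine
```

With per-channel scales, Channel 1's error drops by more than an order of magnitude, because its scale is set by its own 0.11 maximum rather than Channel 3's 8.0 outlier.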