Advanced Quantization: Systems & Training

An interactive tour of systems-level quantization concepts: quantization-aware training (QAT), inference-time activation strategies, and modern weight-only methods for LLMs.

Straight-Through Estimator (STE) in QAT

During training we still need gradients to update the FP32 master weights, but the quantizer's step function has a gradient of zero almost everywhere (and is undefined at the step boundaries). The Straight-Through Estimator (STE) sidesteps this by treating the quantizer as the identity on the backward pass, routing the gradient straight through.

How to use this simulator:

  1. ACTION: Click the "Apply Gradients" button at the bottom; it simulates one step of a real training loop.
  2. The FP32 weights update, and the quantized integers flash whenever they snap to a new step.
[Simulator diagram — Forward pass: weight r (FP32) → Quantizer → quantized weight Q (INT), with example integer values 1, 2, -2, 2. Backward pass: gradient dL/dQ (FP) → STE (Identity) → gradient dL/dr (FP32), with example gradient values 0.6, -0.5, -0.8, 0.7. Readout: Epoch: 0.]
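The forward/backward split in the simulator can be sketched in a few lines of NumPy. This is a minimal illustration, not a framework implementation: the tiny signed grid (integers −2…2, matching the simulator's example values), the scale, the learning rate, and all function names are assumptions chosen for the example.

```python
import numpy as np

# Tiny signed grid matching the simulator (assumed: integers -2..2).
QMIN, QMAX = -2, 2

def quantize(r, scale):
    """Forward pass: snap FP32 weights to the nearest integer step."""
    return np.clip(np.round(r / scale), QMIN, QMAX).astype(np.int32)

def ste_grad(dL_dQ, r, scale):
    """Backward pass: round() has zero gradient almost everywhere, so the
    STE copies dL/dQ straight through as dL/dr, zeroing only weights that
    fell outside the representable range (a common clipped-STE variant)."""
    in_range = (r / scale >= QMIN) & (r / scale <= QMAX)
    return dL_dQ * in_range

# One simulated "Apply Gradients" step (illustrative values):
r = np.array([0.6, 2.3, -1.1, 0.9], dtype=np.float32)       # FP32 master weights
dL_dQ = np.array([0.6, -0.5, -0.8, 0.7], dtype=np.float32)  # upstream gradients
lr, scale = 0.5, 1.0

q_before = quantize(r, scale)
r = r - lr * ste_grad(dL_dQ, r, scale)  # update the FP32 master weights
q_after = quantize(r, scale)            # integers may snap to a new step
```

Here the first weight moves from 0.6 to 0.3 and its integer snaps from 1 to 0, while the out-of-range weight (2.3) receives no gradient under the clipped variant.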

Inference Activation Strategies

Weights are always quantized offline. Activations are quantized with parameters found either before deployment, from calibration data (Static), or on the fly from each incoming batch (Dynamic).

[Interactive demo — a 1D distribution plot of the incoming activation batch, with r_min and r_max markers and an "Original" trace. Buttons run a strategy: Static Quantization (fast) or Dynamic Quantization (slower); a result readout ("Run a strategy below...") shows how each strategy handles the batch, e.g. "Clean" when the range fits.]
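The trade-off can be sketched with 8-bit affine (asymmetric) quantization; the helper names and the specific values below are illustrative. Static uses a range frozen from calibration data, so a runtime outlier saturates; dynamic recomputes the range from each batch.

```python
import numpy as np

def quant_params(r_min, r_max, n_bits=8):
    """Affine (asymmetric) parameters for an unsigned n-bit grid."""
    scale = (r_max - r_min) / (2**n_bits - 1)
    zero_point = round(-r_min / scale)
    return scale, zero_point

def quantize(x, scale, zp, n_bits=8):
    return np.clip(np.round(x / scale) + zp, 0, 2**n_bits - 1)

# Static: r_min/r_max frozen from a calibration set before deployment.
calib = np.random.default_rng(0).normal(0, 1, 10_000)
s_static, z_static = quant_params(calib.min(), calib.max())

# Dynamic: r_min/r_max recomputed from each incoming batch at runtime.
batch = np.array([-0.4, 0.1, 9.0, 0.9])  # 9.0 exceeds the calibration range
s_dyn, z_dyn = quant_params(batch.min(), batch.max())

q_static = quantize(batch, s_static, z_static)  # the outlier saturates at 255
q_dyn = quantize(batch, s_dyn, z_dyn)           # the full batch range is kept
```

Static costs nothing at runtime (the parameters are constants baked into the graph); dynamic pays a min/max pass per batch but adapts to each distribution.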

Weight-Only & Grouped Quantization (LLMs)

Data Flow Architecture

Weights (INT8) ⇨ Dequant ⇨ FP16 Math
Weight-Only (W8A16): Fast memory access, high-precision math.
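The W8A16 data flow can be sketched as follows, with symmetric per-tensor INT8 weights and NumPy float16 standing in for the FP16 math; the function names and shapes are illustrative.

```python
import numpy as np

def quantize_weights(w_fp32, n_bits=8):
    """Symmetric per-tensor INT8 storage: w ≈ scale * w_int8."""
    scale = np.abs(w_fp32).max() / (2**(n_bits - 1) - 1)
    w_int8 = np.round(w_fp32 / scale).astype(np.int8)
    return w_int8, np.float16(scale)

def w8a16_matmul(x_fp16, w_int8, scale):
    """Dequantize just before the matmul; the math itself runs in FP16."""
    w_fp16 = w_int8.astype(np.float16) * scale
    return x_fp16 @ w_fp16

rng = np.random.default_rng(1)
w = rng.normal(0, 0.02, (64, 32)).astype(np.float32)  # 1 byte/weight when stored
x = rng.normal(0, 1, (4, 64)).astype(np.float16)      # activations stay FP16

w_int8, scale = quantize_weights(w)
y = w8a16_matmul(x, w_int8, scale)  # FP16 result; half the weight memory traffic
```

The weight matrix moves through memory at 1 byte per element instead of 2, which is the dominant cost in memory-bound LLM inference, while the accumulation keeps FP16 precision.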

Grouped (Block) Quantization
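As a sketch of the idea (the group size of 64 and the 4-bit grid are common choices but assumed here, as are the function names): each contiguous block of weights gets its own scale, so an outlier only coarsens its own group rather than the whole tensor.

```python
import numpy as np

def quantize_grouped(w, group_size=64, n_bits=4):
    """Each contiguous group of weights gets its own scale, so one outlier
    only coarsens its own group instead of the whole tensor."""
    qmax = 2**(n_bits - 1) - 1
    groups = w.reshape(-1, group_size)
    scales = np.abs(groups).max(axis=1, keepdims=True) / qmax  # one scale/group
    q = np.clip(np.round(groups / scales), -qmax - 1, qmax).astype(np.int8)
    return q, scales

def dequantize_grouped(q, scales, shape):
    return (q.astype(np.float32) * scales).reshape(shape)

w = np.random.default_rng(0).normal(0, 0.05, 256).astype(np.float32)
w[10] = 2.0  # a single outlier, landing in group 0
q, scales = quantize_grouped(w)
w_hat = dequantize_grouped(q, scales, w.shape)
# Only group 0 is forced to a coarse scale; groups 1-3 keep fine resolution.
```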

Granularity & Calibration Strategies

Optimizing precision by picking better ranges (Calibration) and better scale buckets (Granularity).

Calibration (Picking the Range)

[Histogram of the data distribution on a value axis from -10.0 to +10.0.]
Min-Max Calibration: Range stretches to the extreme outliers. Bins are large, so precision is extremely low for the main data distribution.
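One way to see the cost of min-max calibration is to compare its quantization step size against a percentile-clipped range on the same data; the 0.01/99.99 percentile choice and the synthetic outliers below are illustrative, not a prescribed recipe.

```python
import numpy as np

rng = np.random.default_rng(0)
acts = rng.normal(0, 1, 100_000)
acts[:10] = 10.0  # a handful of extreme outliers

def bin_width(r_min, r_max, n_bits=8):
    """Size of one quantization step for a chosen range."""
    return (r_max - r_min) / (2**n_bits - 1)

# Min-max calibration: the range stretches all the way to the outliers.
w_minmax = bin_width(acts.min(), acts.max())

# Percentile calibration: clip the top/bottom 0.01% so the 256 bins
# cover the bulk of the distribution instead.
lo, hi = np.percentile(acts, [0.01, 99.99])
w_pct = bin_width(lo, hi)
# w_pct is noticeably smaller => finer resolution for typical values,
# at the cost of saturating the few clipped outliers.
```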

Granularity (Bucketing)

[Scale readouts, all forced equal under per-tensor mode — Channel 1 Scale: 0.1, Channel 2 Scale: 0.1, Channel 3 Scale: 0.1]
Per-Tensor: one scale for all channels. Because Channel 3 has a huge outlier, every channel must use the same wide, low-precision scale, and the small-magnitude channels lose nearly all resolution.
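The effect can be sketched numerically. The three channels and the outlier value below are illustrative; the comparison measures the reconstruction error on a small-magnitude channel under a shared (per-tensor) scale versus its own (per-channel) scale.

```python
import numpy as np

QMAX = 127  # symmetric INT8

# Three channels of a weight tensor; Channel 3 carries a huge outlier.
channels = [
    np.array([0.10, -0.08, 0.05]),
    np.array([0.09, 0.07, -0.06]),
    np.array([0.11, -0.09, 8.0]),  # the outlier
]

def scale_for(x):
    return np.abs(x).max() / QMAX

def max_error(x, s):
    """Worst-case reconstruction error when quantizing x with scale s."""
    q = np.clip(np.round(x / s), -QMAX - 1, QMAX)
    return np.abs(x - q * s).max()

# Per-tensor: one scale shared by all channels, dictated by the outlier.
s_tensor = scale_for(np.concatenate(channels))

# Per-channel: each channel picks its own scale.
s_channels = [scale_for(c) for c in channels]

err_shared = max_error(channels[0], s_tensor)    # coarse for small channels
err_own = max_error(channels[0], s_channels[0])  # fine
```

With per-channel scales, Channel 1's error drops by more than an order of magnitude, because its scale is set by its own 0.11 maximum rather than Channel 3's 8.0 outlier.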