Knowledge Distillation - EE508 Interactive Visualizer

🤔 Why Knowledge Distillation?

Large models are accurate but expensive to deploy. We need smarter ways to get small, fast models without sacrificing too much accuracy.

⚖ The Big vs Small Model Dilemma

🏫

Teacher Model (e.g. ResNet-152)

Parameters: 60M

MACs / inference: 11.6B

Memory Footprint: ~240 MB

Latency (CPU): ~180 ms

Top-1 Accuracy: 78.3%

🎓

Student Model (e.g. MobileNetV2)

Parameters: 3.4M

MACs / inference: 300M

Memory Footprint: ~14 MB

Latency (CPU): ~22 ms

Top-1 Accuracy: 71.8%

📊 Hardware Cost Comparison

Parameters: 94% reduction (teacher 60M vs. student 3.4M)

Inference MACs: 97% reduction (teacher 11.6B vs. student 300M)

Latency: 88% lower (teacher 180 ms vs. student 22 ms)

🔑 Key Insight: Knowledge Distillation is a model-level optimization that produces smaller, faster models for edge hardware, directly reducing FLOPs, memory footprint, and latency.
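The reduction figures above follow from simple arithmetic on the teacher and student numbers. The sketch below reproduces them; the helper names are illustrative, and the memory estimate assumes FP32 weights (4 bytes per parameter), which the page does not state explicitly.

```python
# Figures copied from the teacher/student cards above.
teacher = {"params": 60e6, "macs": 11.6e9, "latency_ms": 180.0}
student = {"params": 3.4e6, "macs": 300e6, "latency_ms": 22.0}

def reduction_pct(big, small):
    """Percent reduction when going from `big` to `small`."""
    return 100.0 * (1.0 - small / big)

def fp32_megabytes(num_params):
    """Weight storage in MB, ASSUMING 4 bytes per parameter (FP32)."""
    return num_params * 4 / 1e6

for key in teacher:
    print(f"{key}: {reduction_pct(teacher[key], student[key]):.1f}% reduction")

print(f"teacher weights: {fp32_megabytes(teacher['params']):.0f} MB")
print(f"student weights: {fp32_megabytes(student['params']):.1f} MB")
print(f"speedup: {teacher['latency_ms'] / student['latency_ms']:.1f}x")
```

Under the FP32 assumption the weight sizes come out close to the ~240 MB and ~14 MB footprints quoted on the cards.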

🏷 Hard Labels vs Soft Labels

Temperature is applied to the raw logits, not the probabilities. The logits never change — only the softmax reshapes them.

🖼 Choose an Input Image Class

Step 1 — Raw logits from teacher (pre-softmax, these never change)

The teacher's raw network outputs (Logits). Temperature only affects how these are converted into probabilities below.

↓ Apply softmax with different temperatures ↓

❌ Hard label (one-hot, T=1)

All inter-class info lost. Student model only learns "this is a cat."

✅ Soft label (teacher, T=2)

"Dark knowledge" — a cat resembles a dog more than a plane. Richer signal.

🌑 What is "Dark Knowledge"?

The teacher's logits encode inter-class similarity. At T=1, standard softmax hides this by pushing the max value to near 100%. At T>1, the distribution softens, revealing "dark knowledge" — the teacher's uncertainty between related classes.

        Hard label: [1, 0, 0, 0, 0] ← class identity only, no inter-class information

        Soft label: [0.55, 0.32, 0.07, 0.04, 0.02] ← rich inter-class signal

🔑 Key Insight: The teacher's logits never change — temperature only reshapes the probability distribution. Soft labels preserve inter-class relationships that hard labels discard entirely.
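To make the hard-versus-soft contrast concrete, here is a minimal sketch of temperature-scaled softmax applied to one fixed set of logits. The class names and logit values are made up for illustration and are not the visualizer's data.

```python
import math

def softmax_t(logits, T=1.0):
    """Softmax over logits divided by temperature T."""
    scaled = [z / T for z in logits]
    m = max(scaled)                       # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

classes = ["cat", "dog", "fox", "car", "plane"]   # hypothetical classes
logits = [6.0, 4.0, 2.5, 0.5, 0.0]                # hypothetical teacher logits

# One-hot hard label for the top class; the logits themselves are never modified.
top = logits.index(max(logits))
hard = [1 if i == top else 0 for i in range(len(logits))]

p_t1 = softmax_t(logits, T=1.0)
p_t2 = softmax_t(logits, T=2.0)

print("hard:", hard)
print("T=1 :", [round(p, 3) for p in p_t1])
print("T=2 :", [round(p, 3) for p in p_t2])
```

With these example logits, raising T from 1 to 2 lowers the top-class probability and raises the probability of the runner-up class, while the ranking of classes stays the same.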

🌡 Temperature Scaling

Temperature T reshapes the same logits into different probability distributions. Drag to see live changes.

🏛 Interactive Temperature Control

T = 1.0

Slider range: Sharp (0.1) to Flat (10)

Normal softmax

p_i = exp(z_i / T) / Σ_j exp(z_j / T)

Live readouts: max probability and entropy of the softmax probability output.

📐 Why temperature matters

❄ T < 1 (Cold)

Amplifies differences. Winner takes almost all probability mass.

⚖ T = 1 (Normal)

Standard softmax. Used during standard training and inference.

🔥 T > 1 (Hot)

Smooths distribution. More "dark knowledge" exposed for distillation.

Both teacher and student use the same T during distillation. At inference time, T resets to 1.

🔑 Key Insight: Higher T softens the teacher's confidence, revealing inter-class relationships. Values around T = 2–5 are a common starting point that balances information richness against stability; Hinton et al. (2015) report that the best temperature depends on the student's capacity.
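The max-probability and entropy readouts can be reproduced with a short sweep over T. This is a sketch with hypothetical logits, not the visualizer's own code; entropy is reported in bits.

```python
import math

def softmax_t(logits, T):
    """Temperature-scaled softmax with max-subtraction for stability."""
    m = max(z / T for z in logits)
    exps = [math.exp(z / T - m) for z in logits]
    s = sum(exps)
    return [e / s for e in exps]

def entropy_bits(probs):
    """Shannon entropy in bits; terms with p == 0 contribute nothing."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

logits = [6.0, 4.0, 2.5, 0.5, 0.0]   # hypothetical logits, not from the page

for T in (0.1, 0.5, 1.0, 2.0, 5.0, 10.0):
    p = softmax_t(logits, T)
    print(f"T={T:>4}: max prob={max(p):.3f}  entropy={entropy_bits(p):.3f} bits")
```

As T grows, the max probability falls and the entropy rises toward its upper bound of log2(5) bits for five classes.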

🔄 Distillation Training Process

The student learns from two signals simultaneously. Control the balance with α.

🌊 Knowledge Flow Diagram

🖼 Input Image (x) feeds both models:

🏫 Teacher (large pretrained model, frozen weights) → Soft probs (T > 1) ✨

🎓 Student (compact model, being trained) → Student probs (T > 1)

📉 KL Divergence (Distillation Loss): teacher soft probs vs. student probs, both at T > 1

📉 Cross Entropy (Hard Label Loss): ground-truth label vs. student probs at T = 1

⚡ Combined Loss → Backprop → Update Student

🎚 Loss Weighting — Control α

α = 0.70 (slider from hard only, α = 0, to soft only, α = 1)

🌡 Distillation Loss (KL-div): 70% weight. Teacher → Student, uses T.

🎯 CE Loss (Hard Labels): 30% weight. Ground truth, standard cross entropy.

Combined Loss Function

L = 0.70 × T² × KL(p_T ∥ p_S) + 0.30 × CE(y, p_S)
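In general form the loss is L = α × T² × KL(p_T ∥ p_S) + (1 − α) × CE(y, p_S), with the KL term computed on distributions softened at temperature T and the CE term on the student's T = 1 output. The following is a minimal pure-Python sketch for a single example; the function names are illustrative, and a real training loop would use a framework's batched, differentiable losses instead.

```python
import math

def softmax_t(logits, T=1.0):
    """Temperature-scaled softmax with max-subtraction for stability."""
    m = max(z / T for z in logits)
    exps = [math.exp(z / T - m) for z in logits]
    s = sum(exps)
    return [e / s for e in exps]

def kl_divergence(p, q):
    """KL(p || q) in nats for two discrete distributions."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def cross_entropy(label_index, probs):
    """Cross entropy against a one-hot label at `label_index`."""
    return -math.log(probs[label_index])

def distillation_loss(teacher_logits, student_logits, label_index, T=2.0, alpha=0.7):
    p_teacher = softmax_t(teacher_logits, T)        # soft targets at temperature T
    p_student_soft = softmax_t(student_logits, T)   # student softened with the same T
    p_student_hard = softmax_t(student_logits, 1.0) # hard-label term uses T = 1
    soft = (T ** 2) * kl_divergence(p_teacher, p_student_soft)
    hard = cross_entropy(label_index, p_student_hard)
    return alpha * soft + (1.0 - alpha) * hard

# Hypothetical logits for one image whose true class is index 0.
print(distillation_loss([6.0, 4.0, 2.5, 0.5, 0.0], [3.0, 3.5, 1.0, 0.0, 0.5], 0))
```

Setting alpha to 0 recovers plain cross-entropy training, and a student whose logits match the teacher's exactly has a zero distillation term.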

🔑 Key Insight: The T² scaling factor compensates for the fact that gradients produced by the soft targets scale as 1/T². Without it, the distillation signal would be numerically overwhelmed by the hard cross-entropy loss.
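The 1/T² behaviour can be checked numerically. For the softened KL term, the gradient with respect to student logit z_i is (q_i − p_i) / T, where p and q are the teacher and student distributions at temperature T; at high T the difference q_i − p_i itself shrinks roughly as 1/T. The sketch below uses hypothetical logits and prints the gradient norm with and without the T² factor.

```python
import math

def softmax_t(logits, T):
    """Temperature-scaled softmax with max-subtraction for stability."""
    m = max(z / T for z in logits)
    exps = [math.exp(z / T - m) for z in logits]
    s = sum(exps)
    return [e / s for e in exps]

def soft_grad(teacher_logits, student_logits, T):
    """Gradient of KL(p_teacher || p_student) w.r.t. student logits, both softened at T."""
    p = softmax_t(teacher_logits, T)
    q = softmax_t(student_logits, T)
    return [(qi - pi) / T for pi, qi in zip(p, q)]

def norm(v):
    return math.sqrt(sum(x * x for x in v))

teacher_logits = [6.0, 4.0, 2.5, 0.5, 0.0]   # hypothetical
student_logits = [3.0, 3.5, 1.0, 0.0, 0.5]   # hypothetical

for T in (1, 2, 5, 10, 20, 50):
    g = soft_grad(teacher_logits, student_logits, T)
    print(f"T={T:>2}: |grad|={norm(g):.5f}  T^2*|grad|={T * T * norm(g):.4f}")
```

The unscaled norm keeps shrinking as T grows, while the T²-scaled value levels off, which is why the T² factor keeps the soft term's contribution comparable across temperatures.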

⚡ Hardware Impact

How much does distillation actually help on real hardware? Hover the scatter plot for details.

📈 Accuracy vs Model Size Tradeoff

Legend: Large teacher · Small (trained from scratch) · Small (distilled)

🗺 Roofline Model — where KD moves your workload

Knowledge Distillation (coupled with INT8 quantization) can move a workload off the memory-bandwidth-bound slope of the roofline and toward the compute-bound ceiling, improving actual hardware utilization.

Teacher (FP32): Low arithmetic intensity. Stuck on the memory slope. Each 32-bit DRAM access costs ~640 pJ of energy.

Student (INT8) after KD: High arithmetic intensity. The entire model fits in SRAM (~5 pJ per 32-bit access), reaching the peak compute ceiling.
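The roofline bound behind this picture is attainable performance = min(peak compute, arithmetic intensity × memory bandwidth). The sketch below evaluates it for two workloads; the hardware numbers and arithmetic-intensity values are hypothetical placeholders chosen only to land on either side of the ridge point, not measurements of the teacher or student.

```python
def attainable_ops_per_s(arith_intensity, peak_ops_per_s, bandwidth_bytes_per_s):
    """Roofline bound: the lower of the compute ceiling and the memory slope."""
    return min(peak_ops_per_s, arith_intensity * bandwidth_bytes_per_s)

PEAK = 1e12         # hypothetical accelerator peak: 1 TOPS
BW = 10e9           # hypothetical memory bandwidth: 10 GB/s
ridge = PEAK / BW   # arithmetic intensity (ops/byte) where the two bounds meet

workloads = (
    ("low-intensity workload", 20.0),    # hypothetical ops per byte
    ("high-intensity workload", 400.0),  # hypothetical ops per byte
)

for name, ai in workloads:
    perf = attainable_ops_per_s(ai, PEAK, BW)
    bound = "memory-bound" if ai < ridge else "compute-bound"
    print(f"{name}: AI={ai:.0f} ops/byte -> {perf / 1e9:.0f} GOPS ({bound})")
```

Any change that raises ops per byte moved, such as keeping weights in on-chip memory or shrinking them to INT8, pushes a workload to the right on this plot.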

🚀 Hardware Deployment Savings

🔢 Parameters: 60M → 3.4M (94% reduction)

⚙ MACs / Inference: 11.6B → 300M (97% reduction)

💾 Memory Footprint: 240 MB → 14 MB (94% reduction)

⏱ Latency: 180 ms → 22 ms (88% lower, about 8.2× faster)

🔋 Energy / Inference: 4.2 mJ → 0.3 mJ (93% less energy)

🎯 Accuracy Drop: 78.3% → 74.7% (only 3.6 percentage points)

🔑 Key Insight: Knowledge Distillation is fundamentally a hardware deployment strategy. A distilled student that is small enough to fit in fast on-chip SRAM avoids most DRAM energy penalties and serves requests at a fraction of the teacher's latency (22 ms vs. 180 ms in this example).