🧠

Knowledge Distillation

Interactive Visualizer — EE508: Hardware Foundations for ML

Model Reduction

🤔 Why Knowledge Distillation?

Large models are accurate but expensive to deploy. We need smarter ways to get small, fast models without sacrificing too much accuracy.

⚖ The Big vs Small Model Dilemma

🏫

Teacher Model

e.g. ResNet-152
Parameters: 60M
MACs / inference: 11.6B
Memory Footprint: ~240 MB
Latency (CPU): ~180 ms
Top-1 Accuracy: 78.3%
🎓

Student Model

e.g. MobileNetV2
Parameters: 3.4M
MACs / inference: 300M
Memory Footprint: ~14 MB
Latency (CPU): ~22 ms
Top-1 Accuracy: 71.8%

📊 Hardware Cost Comparison

Parameters: 94% reduction (teacher 60M → student 3.4M)
Inference MACs: 97% reduction (teacher 11.6B → student 300M)
Latency: 88% lower (teacher 180 ms → student 22 ms)
🔑 Key Insight: Knowledge distillation is a model-level optimization that produces smaller, faster models for edge hardware, directly reducing FLOPs, memory footprint, and latency.
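
The reduction percentages in the comparison can be reproduced from the quoted teacher and student figures. The snippet below is a minimal sketch; the dictionary layout and the helper name `reduction_pct` are illustrative choices, not part of the original page.

```python
# Figures quoted in the comparison (teacher: ResNet-152, student: MobileNetV2).
teacher = {"params": 60e6, "macs": 11.6e9, "latency_ms": 180.0}
student = {"params": 3.4e6, "macs": 300e6, "latency_ms": 22.0}


def reduction_pct(before: float, after: float) -> float:
    """Percentage by which `after` is smaller than `before`."""
    return 100.0 * (before - after) / before


# One entry per metric: how much smaller the student is than the teacher.
reductions = {key: reduction_pct(teacher[key], student[key]) for key in teacher}

for name, pct in reductions.items():
    print(f"{name:>10}: {pct:.1f}% reduction")
```

Rounded to whole percentages, these match the 94%, 97%, and 88% shown in the bars.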

🏷 Hard Labels vs Soft Labels

Temperature is applied to the raw logits, not the probabilities. The logits never change — only the softmax reshapes them.

🖼 Choose an Input Image Class

Step 1: Raw logits from the teacher (pre-softmax)
These are the teacher's raw network outputs and they never change. Temperature only affects how they are converted into probabilities below.
↓   Apply softmax with different temperatures   ↓

❌ Hard label (one-hot, T=1)

All inter-class info lost. Student model only learns "this is a cat."

✅ Soft label (teacher, T=2)

"Dark knowledge" — a cat resembles a dog more than a plane. Richer signal.

🌑 What is "Dark Knowledge"?

The teacher's logits encode inter-class similarity. At T=1, standard softmax hides this by pushing the max value to near 100%. At T>1, the distribution softens, revealing "dark knowledge" — the teacher's uncertainty between related classes.

Hard label: [1, 0, 0, 0, 0] ← class identity only (at most log2 5 ≈ 2.3 bits)
Soft label: [0.55, 0.32, 0.07, 0.04, 0.02] ← rich inter-class signal
🔑 Key Insight: The teacher's logits never change — temperature only reshapes the probability distribution. Soft labels preserve inter-class relationships that hard labels discard entirely.
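
To make the hard-versus-soft contrast concrete, here is a small sketch that applies a temperature-scaled softmax to one fixed set of logits. The logit values and the class names other than cat, dog, and plane are hypothetical, chosen only for illustration; they are not the values used by the visualizer.

```python
import math


def softmax(logits, T=1.0):
    """Temperature-scaled softmax: p_i = exp(z_i / T) / sum_j exp(z_j / T)."""
    scaled = [z / T for z in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical teacher logits for a "cat" image (illustrative values only).
classes = ["cat", "dog", "fox", "car", "plane"]
logits = [6.0, 4.0, 2.5, 0.5, 0.0]

hard_label = [1.0 if c == "cat" else 0.0 for c in classes]  # one-hot ground truth
p_t1 = softmax(logits, T=1.0)  # standard softmax, sharply peaked
p_t2 = softmax(logits, T=2.0)  # softened distribution used as the soft label

for name, h, a, b in zip(classes, hard_label, p_t1, p_t2):
    print(f"{name:>5}: hard={h:.0f}  T=1: {a:.3f}  T=2: {b:.3f}")
```

The logits are identical in both calls; only the division by T before the softmax changes how much probability the non-winning classes receive.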

🌡 Temperature Scaling

Temperature T reshapes the same logits into different probability distributions.

🏛 Interactive Temperature Control

T = 1.0 (normal softmax)
Slider range: 0.1 (sharp) to 10 (flat)
p_i = exp(z_i / T) / Σ_j exp(z_j / T)
Readouts: max probability, entropy

Softmax probability output

📐 Why temperature matters

❄ T < 1 (Cold)

Amplifies differences. The winner takes almost all of the probability mass.

⚖ T = 1 (Normal)

Standard softmax. Used during standard training and inference.

🔥 T > 1 (Hot)

Smooths the distribution. More "dark knowledge" is exposed for distillation.

Both teacher and student use the same T during distillation. At inference time, T resets to 1.

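🔑 Key Insight: Higher T softens the teacher's confidence, revealing inter-class relationships. Following Hinton et al.'s original distillation work, moderate temperatures around T = 2–5 are a common choice: high enough to expose inter-class information, low enough to keep training stable.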
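
The max-probability and entropy readouts described above can be approximated with a short sweep over T. This is a sketch under assumed inputs: the logits are hypothetical, and entropy is reported in bits, which may differ from the unit the visualizer uses.

```python
import math


def softmax(logits, T):
    """Temperature-scaled softmax with max-subtraction for stability."""
    scaled = [z / T for z in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def entropy_bits(p):
    """Shannon entropy in bits; zero-probability terms contribute nothing."""
    return -sum(q * math.log2(q) for q in p if q > 0)


logits = [6.0, 4.0, 2.5, 0.5, 0.0]  # hypothetical teacher logits

rows = []
for T in (0.1, 0.5, 1.0, 2.0, 5.0, 10.0):
    p = softmax(logits, T)
    rows.append((T, max(p), entropy_bits(p)))
    print(f"T={T:>4}: max prob={max(p):.3f}  entropy={entropy_bits(p):.3f} bits")
```

As T grows, the maximum probability falls and the entropy rises toward its upper bound of log2(5) bits for five classes.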

🔄 Distillation Training Process

The student learns from two signals simultaneously. Control the balance with α.

🌊 Knowledge Flow Diagram

🖼 Input Image (x)
↑ frozen weights
🏫 Teacher
large pretrained model
↑ being trained
🎓 Student
compact model
Soft probs (T > 1) ✨
Student probs (T > 1)
📉 KL Divergence
Distillation Loss
📉 Cross Entropy
Hard Label Loss (T = 1)
⚡ Combined Loss → Backprop → Update Student

🎚 Loss Weighting — Control α

Slider: hard only (α = 0) to soft only (α = 1)
🌡 Distillation Loss (KL-div): 70%
Teacher → Student, uses T
🎯 CE Loss (Hard Labels): 30%
Ground truth, standard cross entropy
Combined Loss Function
L = 0.70 × T² × KL(p_T ‖ p_S) + 0.30 × CE(y, p_S)
🔑 Key Insight: The T² scaling factor compensates for the fact that gradients produced by the soft targets scale as 1/T². Without it, the distillation signal would be numerically overwhelmed by the hard cross-entropy loss.
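
The combined loss and the role of the T² factor can be sketched in plain Python for a single example. This is an illustration under stated assumptions, not the page's implementation: the logits are hypothetical, α weights the distillation term as in the slider, and the helper names are made up here. The gradient helper uses the closed form d KL(p_T ‖ p_S) / d z_S = (p_S − p_T) / T, whose magnitude shrinks roughly as 1/T² once T is large relative to the logits.

```python
import math


def softmax(logits, T=1.0):
    """Temperature-scaled softmax with max-subtraction for stability."""
    scaled = [z / T for z in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q):
    """KL(p || q) = sum_i p_i * log(p_i / q_i)."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def distillation_loss(teacher_logits, student_logits, label, T=2.0, alpha=0.7):
    """alpha * T^2 * KL(p_T || p_S) + (1 - alpha) * CE(y, p_S at T = 1)."""
    p_teacher = softmax(teacher_logits, T)       # soft targets (teacher is frozen)
    p_student_soft = softmax(student_logits, T)  # student at the same T
    p_student_hard = softmax(student_logits, 1.0)
    kd = kl_divergence(p_teacher, p_student_soft)
    ce = -math.log(p_student_hard[label])        # cross entropy with one-hot y
    return alpha * T ** 2 * kd + (1.0 - alpha) * ce


def kl_grad_norm(teacher_logits, student_logits, T):
    """L2 norm of d KL / d z_S, using the closed form (p_S - p_T) / T."""
    p_t = softmax(teacher_logits, T)
    p_s = softmax(student_logits, T)
    return math.sqrt(sum(((s - t) / T) ** 2 for s, t in zip(p_s, p_t)))


teacher_logits = [5.0, 3.0, 1.0, 0.0, -1.0]  # hypothetical values
student_logits = [2.0, 2.5, 0.5, 0.0, -0.5]  # hypothetical values
label = 0                                    # index of the true class

loss = distillation_loss(teacher_logits, student_logits, label, T=2.0, alpha=0.7)
print(f"combined loss: {loss:.4f}")

# Multiplying the raw gradient norm by T^2 gives a roughly constant value at high T.
for T in (50.0, 100.0):
    g = kl_grad_norm(teacher_logits, student_logits, T)
    print(f"T={T:>5}: |grad|={g:.3e}  T^2*|grad|={T ** 2 * g:.4f}")
```

Doubling T in the high-temperature regime cuts the unscaled gradient to about a quarter, which is why the T² multiplier keeps the soft-target term on a comparable scale as T changes.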

⚡ Hardware Impact

How much does distillation actually help on real hardware?

📈 Accuracy vs Model Size Tradeoff

Legend: large teacher · small (trained from scratch) · small (distilled)

🗺 Roofline Model — where KD moves your workload

Knowledge distillation, coupled with INT8 quantization, moves a model off the memory-bandwidth-bound slope of the roofline and toward the compute-bound ceiling, improving actual hardware utilization.

Teacher (FP32): Low arithmetic intensity. Stuck on the memory slope. Every DRAM fetch costs ~640 pJ of energy.
Student (INT8) after KD: High arithmetic intensity. Entire model fits in SRAM (~5 pJ/access), hitting the peak compute ceiling.
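
The roofline bound itself is a one-line formula: attainable throughput is the smaller of peak compute and bandwidth times arithmetic intensity. The sketch below uses entirely hypothetical hardware numbers and workload intensities, chosen only to show a point on each side of the ridge; they are not measurements of the teacher or student.

```python
def attainable_gops(intensity_ops_per_byte, peak_gops, bandwidth_gb_s):
    """Roofline bound: min(peak compute, memory bandwidth * arithmetic intensity)."""
    return min(peak_gops, bandwidth_gb_s * intensity_ops_per_byte)


# Hypothetical accelerator (assumed values, for illustration only).
PEAK_GOPS = 1000.0   # peak compute throughput, in Gops/s
DRAM_BW_GB_S = 25.0  # off-chip memory bandwidth, in GB/s

# Ridge point: the intensity at which a workload stops being memory-bound.
ridge = PEAK_GOPS / DRAM_BW_GB_S

# Hypothetical workloads, one on each side of the ridge (ops per byte moved).
workloads = {"low intensity": 10.0, "high intensity": 80.0}
results = {
    name: attainable_gops(ai, PEAK_GOPS, DRAM_BW_GB_S)
    for name, ai in workloads.items()
}

print(f"ridge point: {ridge:.0f} ops/byte")
for name, gops in results.items():
    bound = "memory-bound" if workloads[name] < ridge else "compute-bound"
    print(f"{name}: {gops:.0f} Gops/s attainable ({bound})")
```

A workload below the ridge point is limited by how fast data arrives, so cutting bytes moved per operation (smaller weights, on-chip reuse) is what lifts it toward the ceiling.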

🚀 Hardware Deployment Savings

🔢 Parameters: 60M → 3.4M (94% reduction)
MACs / Inference: 11.6B → 300M (97% reduction)
💾 Memory Footprint: 240 MB → 14 MB (94% reduction)
Latency: 180 ms → 22 ms (88% lower)
🔋 Energy / Inference: 4.2 mJ → 0.3 mJ (93% less energy)
🎯 Accuracy Drop: 78.3% → 74.7% (only 3.6 percentage points)
🔑 Key Insight: Knowledge distillation is fundamentally a hardware deployment strategy. A distilled model can fit in fast on-chip SRAM, avoid most of the costly DRAM accesses, and serve requests at much lower latency.
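
Whether a model "fits in SRAM" comes down to parameter count times bytes per parameter. The sketch below reproduces the ~240 MB and ~14 MB footprints from FP32 weights and shows the INT8 case. It counts weights only (activations and runtime buffers are ignored), and the 8 MB SRAM budget is a hypothetical figure for illustration.

```python
BYTES_PER_PARAM = {"fp32": 4, "fp16": 2, "int8": 1}


def footprint_mb(num_params: float, dtype: str) -> float:
    """Weight storage in MB (1 MB = 1e6 bytes); activations are not counted."""
    return num_params * BYTES_PER_PARAM[dtype] / 1e6


teacher_fp32 = footprint_mb(60e6, "fp32")   # 60M parameters at 4 bytes each
student_fp32 = footprint_mb(3.4e6, "fp32")  # 3.4M parameters at 4 bytes each
student_int8 = footprint_mb(3.4e6, "int8")  # same student after INT8 quantization

SRAM_BUDGET_MB = 8.0  # hypothetical on-chip SRAM capacity

for name, mb in [("teacher fp32", teacher_fp32),
                 ("student fp32", student_fp32),
                 ("student int8", student_int8)]:
    fits = "fits" if mb <= SRAM_BUDGET_MB else "does not fit"
    print(f"{name}: {mb:.1f} MB ({fits} in {SRAM_BUDGET_MB:.0f} MB SRAM)")
```

Under this assumed budget, only the quantized student's weights stay entirely on chip, which is the case the roofline discussion relies on.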