Interactive Visualizer — EE508: Hardware Foundations for ML
Model Reduction
🤔 Why Knowledge Distillation?
Large models are accurate but expensive to deploy. We need smarter ways to get small, fast models without sacrificing too much accuracy.
⚖ The Big vs Small Model Dilemma
🏫 Teacher Model (e.g. ResNet-152)
Parameters: 60M
MACs / inference: 11.6B
Memory Footprint: ~240 MB
Latency (CPU): ~180 ms
Top-1 Accuracy: 78.3%
🎓 Student Model (e.g. MobileNetV2)
Parameters: 3.4M
MACs / inference: 300M
Memory Footprint: ~14 MB
Latency (CPU): ~22 ms
Top-1 Accuracy: 71.8%
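The memory figures above are consistent with storing every parameter as a 4-byte FP32 value. That precision is an assumption here (the cards do not state it), and `footprint_mb` is a hypothetical helper name. A minimal sketch of the arithmetic, counting weights only and ignoring activations:

```python
def footprint_mb(num_params: float, bytes_per_param: int = 4) -> float:
    """Approximate weight storage in MB (1 MB = 1e6 bytes), weights only."""
    return num_params * bytes_per_param / 1e6

# Assumption: FP32 weights (4 bytes each) for both models.
teacher_mb = footprint_mb(60e6)     # teacher with 60M parameters
student_mb = footprint_mb(3.4e6)    # student with 3.4M parameters

# Same student with 1-byte INT8 weights, for comparison.
student_int8_mb = footprint_mb(3.4e6, bytes_per_param=1)

print(f"teacher FP32: {teacher_mb:.1f} MB")
print(f"student FP32: {student_mb:.1f} MB")
print(f"student INT8: {student_int8_mb:.1f} MB")
```

Under these assumptions the teacher comes out at 240 MB and the student at roughly 14 MB, matching the cards.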
📊 Hardware Cost Comparison
Parameters: Teacher 60M → Student 3.4M (94% reduction)
Inference MACs: Teacher 11.6B → Student 300M (97% reduction)
Latency: Teacher 180 ms → Student 22 ms (88% lower)
🔑 Key Insight: Knowledge Distillation is a model-level optimization that produces smaller, faster models for edge hardware, directly reducing FLOPs, memory footprint, and latency.
🏷 Hard Labels vs Soft Labels
Temperature divides the raw logits before the softmax; it is not applied to the probabilities. The logits themselves never change: only the resulting probability distribution does.
Step 1 — Raw logits from teacher (pre-softmax, these never change)
The teacher's raw network outputs (Logits). Temperature only affects how these are converted into probabilities below.
↓ Apply softmax with different temperatures ↓
❌ Hard label (one-hot ground truth)
All inter-class info lost. Student model only learns "this is a cat."
✅ Soft label (teacher, T=2)
"Dark knowledge" — a cat resembles a dog more than a plane. Richer signal.
🌑 What is "Dark Knowledge"?
The teacher's logits encode inter-class similarity. At T=1, standard softmax hides this by pushing the max value to near 100%. At T>1, the distribution softens, revealing "dark knowledge" — the teacher's uncertainty between related classes.
Hard label: [1, 0, 0, 0, 0] ← only the correct class
Soft label: [0.55, 0.32, 0.07, 0.04, 0.02] ← rich inter-class signal
🔑 Key Insight: The teacher's logits never change — temperature only reshapes the probability distribution. Soft labels preserve inter-class relationships that hard labels discard entirely.
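One rough way to quantify the difference between the two label vectors above is Shannon entropy, which measures how spread out a target distribution is. This is a sketch: entropy is only a proxy for the "richness" of the signal, and `entropy_bits` is a hypothetical helper name.

```python
import math

def entropy_bits(probs):
    """Shannon entropy in bits; terms with p == 0 contribute nothing."""
    return sum(-p * math.log2(p) for p in probs if p > 0)

hard_label = [1, 0, 0, 0, 0]
soft_label = [0.55, 0.32, 0.07, 0.04, 0.02]

h_hard = entropy_bits(hard_label)
h_soft = entropy_bits(soft_label)
h_max = math.log2(len(soft_label))  # uniform distribution over 5 classes

print(f"hard label entropy: {h_hard:.2f} bits")
print(f"soft label entropy: {h_soft:.2f} bits (upper bound {h_max:.2f})")
```

The one-hot target has zero entropy because it says nothing about the non-target classes, while the soft target spreads probability mass across them in a specific order the student can learn from.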
🌡 Temperature Scaling
Temperature T reshapes the same logits into different probability distributions.
🏛 Interactive Temperature Control
Slider: T from 0.1 (sharp) to 10 (flat), shown at T = 1.0 (normal softmax)
$$p_i = \frac{\exp(z_i / T)}{\sum_j \exp(z_j / T)}$$
Live readouts: max probability and entropy of the softmax probability output.
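The temperature softmax and the two readouts (max probability and entropy) can be reproduced offline with a few lines. This is a minimal sketch: the logits are hypothetical values chosen for illustration, not the ones used by the visualizer, and the function names are made up here.

```python
import math

def softmax_t(logits, T=1.0):
    """Softmax of logits / T, shifted by the max for numerical stability."""
    scaled = [z / T for z in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def entropy_bits(probs):
    """Shannon entropy in bits."""
    return sum(-p * math.log2(p) for p in probs if p > 0)

# Hypothetical teacher logits for 5 classes (assumption, for illustration only).
logits = [5.0, 3.0, 1.0, 0.0, -1.0]

for T in (0.1, 1.0, 2.0, 10.0):
    p = softmax_t(logits, T)
    print(f"T={T:>4}: max prob={max(p):.3f}, entropy={entropy_bits(p):.3f} bits")
```

As T grows, the max probability falls and the entropy rises toward that of a uniform distribution, while the input logits stay fixed.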
📐 Why temperature matters
❄ T < 1 (Cold)
Amplifies differences. Winner takes almost all probability mass.
⚖ T = 1 (Normal)
Standard softmax. Used during standard training and inference.
🔥 T > 1 (Hot)
Smooths distribution. More "dark knowledge" exposed for distillation.
Both teacher and student use the same T during distillation. At inference time, T resets to 1.
🔑 Key Insight: Higher T softens the teacher's confidence, revealing inter-class relationships. T = 2–5 is a commonly used range, following Hinton et al.'s distillation work, that balances information richness against stability.
🔄 Distillation Training Process
The student learns from two signals simultaneously. Control the balance with α.
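The two signals can be combined into a single loss. The sketch below follows one common formulation, and several choices in it are assumptions rather than details taken from this page: α weights the soft (teacher) term and 1 − α the hard (ground-truth) term, the soft term is a KL divergence multiplied by T², and the function names are hypothetical.

```python
import math

def log_softmax_t(logits, T=1.0):
    """Log-softmax of logits / T, computed stably via the log-sum-exp trick."""
    scaled = [z / T for z in logits]
    m = max(scaled)
    log_norm = m + math.log(sum(math.exp(s - m) for s in scaled))
    return [s - log_norm for s in scaled]

def distillation_loss(student_logits, teacher_logits, true_class, T=2.0, alpha=0.5):
    # Soft term: KL(teacher || student), both softened with the same T.
    log_p = log_softmax_t(teacher_logits, T)
    log_q = log_softmax_t(student_logits, T)
    kl = sum(math.exp(lp) * (lp - lq) for lp, lq in zip(log_p, log_q))
    # Hard term: ordinary cross-entropy against the true class at T = 1.
    ce = -log_softmax_t(student_logits, 1.0)[true_class]
    # Assumption: alpha weights the soft term, (1 - alpha) the hard term.
    return alpha * (T ** 2) * kl + (1.0 - alpha) * ce

# Hypothetical logits for a 5-class problem; class 0 is the true label.
teacher = [5.0, 3.0, 1.0, 0.0, -1.0]
student = [2.0, 2.5, 0.5, 0.0, -0.5]
loss = distillation_loss(student, teacher, true_class=0, T=2.0, alpha=0.5)
print(f"combined loss: {loss:.4f}")
```

Setting `alpha=0.0` recovers plain supervised training, and `alpha=1.0` trains on the teacher's soft targets alone.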
🔑 Key Insight: The T² scaling factor compensates for the fact that gradients produced by the soft targets scale as 1/T². Without it, the distillation signal would be numerically overwhelmed by the hard cross-entropy loss.
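The 1/T² behaviour can be checked numerically. The gradient of the soft cross-entropy with respect to a student logit is (q_i − p_i)/T, where q and p are the softened student and teacher distributions. The scaling is cleanest when T is large relative to the logits, so this sketch uses deliberately high temperatures and hypothetical logits.

```python
import math

def softmax_t(logits, T):
    """Softmax of logits / T, shifted by the max for numerical stability."""
    scaled = [z / T for z in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def soft_grad_norm(student_logits, teacher_logits, T):
    """L2 norm of the soft-target gradient w.r.t. student logits: (q - p) / T."""
    q = softmax_t(student_logits, T)
    p = softmax_t(teacher_logits, T)
    return math.sqrt(sum(((qi - pi) / T) ** 2 for qi, pi in zip(q, p)))

# Hypothetical logits (assumption, for illustration only).
teacher = [5.0, 3.0, 1.0, 0.0, -1.0]
student = [2.0, 2.5, 0.5, 0.0, -0.5]

for T in (50.0, 100.0):
    g = soft_grad_norm(student, teacher, T)
    print(f"T={T:g}: |grad|={g:.3e}, T^2 * |grad|={T * T * g:.4f}")
```

Doubling T shrinks the raw gradient norm by roughly a factor of four, while the T²-scaled value stays nearly constant, which is what keeps the soft term comparable to the hard term as T changes.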
⚡ Hardware Impact
How much does distillation actually help on real hardware?
📈 Accuracy vs Model Size Tradeoff
Scatter plot series: large teacher; small (trained from scratch); small (distilled).
🗺 Roofline Model — where KD moves your workload
Knowledge Distillation (coupled with INT8 quantization) moves a model off the memory-bandwidth-bound slope of the roofline and up to the compute-bound ceiling, improving hardware utilization.
Teacher (FP32): Low Arithmetic Intensity. Stuck on the memory slope. Every DRAM fetch costs ~640 pJ of energy.
Student (INT8) after KD: High Arithmetic Intensity. Entire model fits in SRAM (~5 pJ/access), hitting the peak compute ceiling.
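The roofline bound itself is a one-line formula: attainable throughput is the smaller of the compute ceiling and memory bandwidth times arithmetic intensity. In the sketch below the peak throughput, bandwidth, and both intensity values are hypothetical numbers chosen to show the two regimes; only the per-access energies come from the text above.

```python
def attainable_gops(intensity_ops_per_byte, peak_gops, bandwidth_gb_s):
    """Roofline bound: min(compute ceiling, bandwidth * arithmetic intensity)."""
    return min(peak_gops, bandwidth_gb_s * intensity_ops_per_byte)

# Hypothetical accelerator (assumption, not a real device).
PEAK_GOPS = 1000.0   # compute ceiling
DRAM_GB_S = 25.0     # off-chip memory bandwidth

# Intensity at which the slope meets the ceiling.
ridge = PEAK_GOPS / DRAM_GB_S

# Hypothetical arithmetic intensities for the two workloads.
low = attainable_gops(2.0, PEAK_GOPS, DRAM_GB_S)    # below the ridge: memory-bound
high = attainable_gops(80.0, PEAK_GOPS, DRAM_GB_S)  # above the ridge: compute-bound

# Per-access energies quoted in the text: ~640 pJ (DRAM) vs ~5 pJ (SRAM).
energy_ratio = 640.0 / 5.0

print(f"ridge point: {ridge:.0f} ops/byte")
print(f"low intensity:  {low:.0f} GOPS")
print(f"high intensity: {high:.0f} GOPS")
print(f"DRAM/SRAM energy per access: {energy_ratio:.0f}x")
```

A workload below the ridge point gains throughput only by raising its arithmetic intensity or cutting memory traffic, which is why a smaller, lower-precision model can move into the compute-bound regime.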
🚀 Hardware Deployment Savings
🔢 Parameters: 60M → 3.4M (94% reduction)
⚙ MACs / Inference: 11.6B → 300M (97% reduction)
💾 Memory Footprint: 240 MB → 14 MB (94% reduction)
⏱ Latency: 180 ms → 22 ms (88% lower)
🔋 Energy / Inference: 4.2 mJ → 0.3 mJ (93% less energy)
🎯 Accuracy Drop: 78.3% → 74.7% (only 3.6 percentage points)
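The percentages above follow directly from the before and after figures. A short sketch of the arithmetic, using only the numbers quoted in this section (`percent_reduction` is a hypothetical helper name):

```python
def percent_reduction(before, after):
    """Relative reduction from before to after, in percent."""
    return 100.0 * (before - after) / before

# (teacher, student) pairs taken from the figures in this section.
metrics = {
    "parameters (M)": (60.0, 3.4),
    "MACs (M)": (11600.0, 300.0),
    "memory (MB)": (240.0, 14.0),
    "latency (ms)": (180.0, 22.0),
    "energy (mJ)": (4.2, 0.3),
}

for name, (teacher, student) in metrics.items():
    pct = percent_reduction(teacher, student)
    print(f"{name:16s} {pct:5.1f}% lower ({teacher / student:.1f}x)")

# Accuracy is compared in percentage points, not as a relative change.
accuracy_drop_points = 78.3 - 74.7
print(f"accuracy drop: {accuracy_drop_points:.1f} percentage points")
```

Expressing latency as a ratio is often clearer than a percentage: 180 ms to 22 ms is about an 8x speedup.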
🔑 Key Insight: Knowledge Distillation is fundamentally a hardware deployment strategy. A distilled model small enough to fit in fast on-chip SRAM avoids most of the DRAM energy penalty and serves requests at far lower latency (about 22 ms versus 180 ms in this example).