EE508 · ML Fundamentals

Bias-Variance & Regularization Explorer

A model can fit the training data extremely well and still generalize poorly. This page shows how model complexity, bias, variance, L1/L2 regularization, and dropout interact to shape training behavior and test performance.

Section 1

Generalization Explorer

As model complexity rises, training error often falls monotonically, but validation error may eventually rise. That turning point is the practical signature of overfitting.

Figure: Fit vs. Truth. Legend: true function, fitted model, train sample, validation sample.
Metrics shown: Train MSE, Val MSE, Gen gap, and a fit status (for example, "Good fit").

Generalization gap = Val MSE − Train MSE.

Underfitting: at low degree the curve is too rigid. Both train and validation errors stay high.
Overfitting: at high degree the curve threads through training noise. Train error keeps falling but validation error rises.
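
The underfitting and overfitting regimes can be reproduced with a short degree sweep. This is a minimal sketch, not the page's own implementation: the true function `sin(2*pi*x)`, the noise level, the sample sizes, and the helper names `make_data` and `mse` are all assumptions chosen for illustration.

```python
import numpy as np

def make_data(n, rng, noise=0.2):
    """Sample n noisy points from an assumed true function sin(2*pi*x)."""
    x = rng.uniform(0.0, 1.0, size=n)
    y = np.sin(2 * np.pi * x) + rng.normal(0.0, noise, size=n)
    return x, y

def mse(coefs, x, y):
    """Mean squared error of a polynomial (highest power first) on (x, y)."""
    return float(np.mean((np.polyval(coefs, x) - y) ** 2))

rng = np.random.default_rng(0)
x_train, y_train = make_data(30, rng)
x_val, y_val = make_data(200, rng)

results = {}  # degree -> (train MSE, val MSE, generalization gap)
for degree in (1, 3, 9):
    coefs = np.polyfit(x_train, y_train, degree)  # least-squares polynomial fit
    train = mse(coefs, x_train, y_train)
    val = mse(coefs, x_val, y_val)
    results[degree] = (train, val, val - train)
    print(f"degree={degree}  train={train:.4f}  val={val:.4f}  gap={val - train:.4f}")
```

Because each higher degree contains the lower ones as special cases, the least-squares train MSE cannot increase with degree. How much the validation MSE rises at high degree depends on the sample size and noise level.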

Section 2

Bias-Variance Tradeoff Explorer

Bias comes from a model family being too rigid. Variance comes from the fitted model changing too much when the dataset changes. A single fit cannot show variance: it only becomes visible across repeated resampled datasets.

A. Many Fits

Each faint curve is a polynomial fit on a fresh resample of training data.

B. Average Predictor vs. Truth

If the average across resamples sits away from the truth, bias is high.

C. Estimated Decomposition

Empirical estimates over the prediction grid. Total proxy = bias² + variance + noise floor.

Empirical estimates shown: Bias², Variance, Bias² + Variance.

This section uses repeated synthetic resampling to build empirical intuition for bias and variance. The numbers are simulation estimates, not closed-form results.
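
The resampling procedure behind panels A to C can be sketched as follows. The true function, noise level, grid, and the function name `estimate_bias_variance` are assumptions for illustration; the page does not state its exact settings. Bias² is the squared distance between the average predictor and the truth, and variance is the spread of the individual fits around that average, both averaged over the prediction grid.

```python
import numpy as np

def true_fn(x):
    # Assumed ground truth; the page's actual function is not specified.
    return np.sin(2 * np.pi * x)

def estimate_bias_variance(degree, n_resamples=200, n_train=30, noise=0.2, seed=0):
    """Estimate grid-averaged bias^2 and variance of polynomial fits."""
    rng = np.random.default_rng(seed)
    grid = np.linspace(0.0, 1.0, 101)
    preds = np.empty((n_resamples, grid.size))
    for i in range(n_resamples):
        # Fresh training set for every fit (panel A: many fits).
        x = rng.uniform(0.0, 1.0, size=n_train)
        y = true_fn(x) + rng.normal(0.0, noise, size=n_train)
        preds[i] = np.polyval(np.polyfit(x, y, degree), grid)
    mean_pred = preds.mean(axis=0)  # panel B: average predictor
    bias_sq = float(np.mean((mean_pred - true_fn(grid)) ** 2))
    variance = float(np.mean(preds.var(axis=0)))
    return bias_sq, variance

estimates = {}  # degree -> (bias^2, variance)
for degree in (1, 3, 7):
    estimates[degree] = estimate_bias_variance(degree)
    b, v = estimates[degree]
    print(f"degree={degree}  bias^2={b:.4f}  variance={v:.4f}  sum={b + v:.4f}")
```

Adding the noise variance (`noise ** 2` in this sketch) to bias² + variance gives the total proxy with its noise floor.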

Section 3

L1 vs L2 Regularization Lab

Regularization adds a penalty that discourages overly complex solutions. L1 tends to prefer sparse solutions; L2 tends to shrink all weights more smoothly.

Controls: mode, strength λ.

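L1 sets coefficients exactly to zero at moderate λ, while L2 only shrinks them toward zero and generally never reaches exact sparsity at any finite λ. That asymmetry is part of the lesson.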

A. Fitted Curve

Legend: true function, unregularized fit, current fit, data.

B. Coefficients

L1 drives many coefficients exactly to zero. L2 shrinks all of them smoothly but rarely zeroes any.

C. Objective Breakdown

The data loss is MSE. Readouts: data MSE, penalty, total.

Stacked: data loss (blue) + penalty (purple/green). Watch how moving λ trades them off.
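
The trade-off can be tabulated for the L2 case, which has a closed-form solution. This is a sketch under assumptions: the data, the polynomial degree, the λ values, and the helper `ridge_fit` are illustrative, and the objective is taken to be data MSE + λ · Σ w² with the intercept penalized like any other weight. The L1 case needs an iterative solver and is not shown here.

```python
import numpy as np

rng = np.random.default_rng(0)
n, degree = 40, 8
x = rng.uniform(-1.0, 1.0, size=n)
y = np.sin(np.pi * x) + rng.normal(0.0, 0.2, size=n)  # assumed synthetic data
X = np.vander(x, degree + 1, increasing=True)         # columns 1, x, ..., x^degree

def ridge_fit(X, y, lam):
    """Minimize mean((X @ w - y)**2) + lam * sum(w**2) in closed form."""
    n_samples, n_features = X.shape
    A = X.T @ X + n_samples * lam * np.eye(n_features)
    return np.linalg.solve(A, X.T @ y)

breakdown = []  # rows of (lambda, data MSE, penalty, total)
for lam in (0.001, 0.01, 0.1, 1.0):
    w = ridge_fit(X, y, lam)
    data_mse = float(np.mean((X @ w - y) ** 2))
    penalty = float(lam * np.sum(w ** 2))
    breakdown.append((lam, data_mse, penalty, data_mse + penalty))
    print(f"lam={lam:<6} data={data_mse:.4f} penalty={penalty:.4f} total={data_mse + penalty:.4f}")
```

As λ grows, the data MSE can only go up and the squared weight norm can only go down. The penalty term itself (λ times that norm) need not move monotonically.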

Geometry mini-card: why L1 is sparse and L2 is round

L2: circular constraint region. The MSE contour usually touches it at a smooth boundary point, where both weights are typically nonzero.

L1: diamond constraint region with sharp corners on the axes. Solutions often land on a corner, where one weight is exactly zero.
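
The corner picture has a simple algebraic counterpart when the problem decouples into one weight at a time (for instance with orthonormal features, an assumption made here for simplicity). Minimizing (w − a)² + λ|w| gives soft-thresholding, which outputs exact zeros; minimizing (w − a)² + λw² gives uniform shrinkage, which does not. The function names and the example weights below are hypothetical.

```python
import numpy as np

def l1_solution(a, lam):
    """Elementwise minimizer of (w - a)**2 + lam * |w| (soft-thresholding)."""
    return np.sign(a) * np.maximum(np.abs(a) - lam / 2.0, 0.0)

def l2_solution(a, lam):
    """Elementwise minimizer of (w - a)**2 + lam * w**2 (uniform shrinkage)."""
    return a / (1.0 + lam)

a = np.array([2.0, -0.6, 0.3, 0.05])  # hypothetical unregularized weights
lam = 1.0
w_l1 = l1_solution(a, lam)  # weights with |a| <= lam / 2 become exactly zero
w_l2 = l2_solution(a, lam)  # every weight is halved, none becomes zero

print("L1:", w_l1, "nonzero:", np.count_nonzero(w_l1))
print("L2:", w_l2, "nonzero:", np.count_nonzero(w_l2))
```

With these values L1 keeps two nonzero weights and zeroes the two smallest, while L2 keeps all four.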

Section 4

Dropout Intuition Explorer

Dropout reduces overfitting by forcing a neural network to succeed under many random subnetworks during training instead of relying on a single brittle path.

A. Subnetwork Diagram

Training mode: a fresh random mask drops some hidden units on every forward pass.

Readouts: active units, subnetworks seen.

B. Mask History

Each row is one forward pass during training. Filled cells show active hidden units in that pass — dropout samples a fresh subnetwork every time.

Training: each forward pass uses a fresh binary mask. With inverted dropout, surviving activations are scaled by $1/(1-p)$, where $p$ is the drop probability, so the expected activation is preserved.
Inference: dropout is off. The full network is used. With inverted dropout, no extra rescaling is needed at test time.
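
The training and inference behavior described above can be sketched in a few lines of NumPy. The function name `dropout_forward` and its signature are illustrative assumptions, not a library API; `p` is taken to be the probability of dropping a unit, with 0 ≤ p < 1.

```python
import numpy as np

def dropout_forward(h, p, rng, training=True):
    """Inverted dropout on activations h. Returns (output, mask)."""
    if not training or p == 0.0:
        # Inference: the full network is used and nothing is rescaled.
        return h, np.ones_like(h)
    # Fresh binary mask on every call: 1 keeps a unit, 0 drops it.
    mask = (rng.random(h.shape) >= p).astype(h.dtype)
    # Scale survivors by 1 / (1 - p) so the expected activation is unchanged.
    return h * mask / (1.0 - p), mask

rng = np.random.default_rng(0)
h = np.ones((10000, 8))  # hypothetical hidden activations, all equal to 1
out_train, mask = dropout_forward(h, 0.5, rng, training=True)
out_eval, _ = dropout_forward(h, 0.5, rng, training=False)

print("train mean:", out_train.mean())  # close to 1 on average
print("eval mean: ", out_eval.mean())   # exactly 1, dropout is off
```

Each row of `mask` corresponds to one sampled subnetwork, which is what the mask history panel displays row by row.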
