Bias-Variance & Regularization Explorer


Section 1

Generalization Explorer

As model complexity rises, training error often falls monotonically, but validation error may eventually rise. That turning point is the practical signature of overfitting.

Fit vs. Truth

Legend: true function, fitted model, training sample, validation sample.

Controls

Dataset

Polynomial degree: 3

Training samples: 25

Validation samples: 40

Toggles: show true function, show training points, show validation points.

Metrics

Readouts: Train MSE, Val MSE, Gen gap, and a fit status label (shown here as "Good fit").

Generalization gap = Val MSE − Train MSE.

Underfitting: at low degree the curve is too rigid. Both train and validation errors stay high.

Overfitting: at high degree the curve threads through training noise. Train error keeps falling but validation error rises.
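
The degree sweep described above can be sketched in a few lines of NumPy. This is a minimal illustration, not the page's actual implementation: the ground-truth function `true_fn`, the noise level 0.3, the input range [0, 1], and the seed are all assumptions chosen for the example. Only the sample sizes (25 training, 40 validation) mirror the controls.

```python
import numpy as np
from numpy.polynomial import Polynomial

def true_fn(x):
    # Hypothetical ground truth; the page's real function is not specified.
    return np.sin(2 * np.pi * x)

def make_data(n, sigma, rng):
    # Noisy samples of the assumed true function on [0, 1].
    x = rng.uniform(0.0, 1.0, size=n)
    y = true_fn(x) + rng.normal(0.0, sigma, size=n)
    return x, y

def mse(model, x, y):
    return float(np.mean((model(x) - y) ** 2))

rng = np.random.default_rng(0)
x_tr, y_tr = make_data(25, 0.3, rng)   # training sample
x_va, y_va = make_data(40, 0.3, rng)   # validation sample

results = {}
for degree in range(1, 10):
    # Least-squares polynomial fit on the training sample only.
    model = Polynomial.fit(x_tr, y_tr, degree)
    train, val = mse(model, x_tr, y_tr), mse(model, x_va, y_va)
    results[degree] = (train, val, val - train)  # last entry: generalization gap
    print(f"degree={degree}  train={train:.3f}  val={val:.3f}  gap={val - train:+.3f}")
```

Because each higher degree contains the lower-degree fits as special cases, the training MSE cannot increase as the degree grows. The validation MSE has no such guarantee, which is why the gap is the quantity to watch.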


Section 2

Bias-Variance Tradeoff Explorer

Bias comes from a model family being too rigid. Variance comes from the fitted model changing too much when the dataset changes. A single fit cannot teach this — variance only becomes visible across repeated resampled datasets.

A. Many Fits

Each faint curve is a polynomial fit on a fresh resample of training data.

B. Average Predictor vs. Truth

If the average across resamples sits away from the truth, bias is high.

C. Estimated Decomposition

Empirical estimates over the prediction grid. Total proxy = bias² + variance + noise floor.
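
For reference, the standard pointwise decomposition behind this proxy can be written as follows. The notation is introduced here and is not taken from the page: data are assumed to follow $y = f(x) + \varepsilon$ with zero-mean noise of variance $\sigma^2$ independent of the training set, and $\hat f_D$ denotes the model fitted on training set $D$.

$$
\mathbb{E}_{D,\varepsilon}\big[(y - \hat f_D(x))^2\big]
= \underbrace{\big(\mathbb{E}_D[\hat f_D(x)] - f(x)\big)^2}_{\text{bias}^2}
+ \underbrace{\mathbb{E}_D\big[\big(\hat f_D(x) - \mathbb{E}_D[\hat f_D(x)]\big)^2\big]}_{\text{variance}}
+ \underbrace{\sigma^2}_{\text{noise floor}}
$$

The panel's estimates replace each expectation over $D$ with an average over the resampled fits, and then average over the prediction grid.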

Controls

Polynomial degree: 5

Noise level (σ): 0.6

Resamples: 25

Samples per dataset: 25

Toggles: show average predictor, highlight one fit.

Empirical estimates

Readouts: Bias², Variance, and Bias² + Var.

This section uses repeated synthetic resampling to build empirical intuition for bias and variance — it is not a closed-form theorem result.
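
A minimal sketch of that resampling procedure is shown below. It assumes a hypothetical ground truth `true_fn`, inputs drawn uniformly from [0, 1], and a 200-point prediction grid. None of these come from the page, and the function name `estimate_bias_variance` is illustrative. The defaults for noise, resamples, and samples per dataset mirror the controls.

```python
import numpy as np
from numpy.polynomial import Polynomial

def true_fn(x):
    # Hypothetical ground truth; the page's real function is not specified.
    return np.sin(2 * np.pi * x)

def estimate_bias_variance(degree, sigma=0.6, n_resamples=25, n_samples=25, seed=0):
    """Empirical bias^2 and variance of a polynomial fit, averaged over a grid."""
    rng = np.random.default_rng(seed)
    grid = np.linspace(0.0, 1.0, 200)
    preds = np.empty((n_resamples, grid.size))
    for r in range(n_resamples):
        # Fresh training set for every resample.
        x = rng.uniform(0.0, 1.0, size=n_samples)
        y = true_fn(x) + rng.normal(0.0, sigma, size=n_samples)
        model = Polynomial.fit(x, y, degree, domain=[0.0, 1.0])
        preds[r] = model(grid)
    mean_pred = preds.mean(axis=0)                      # average predictor
    bias_sq = float(np.mean((mean_pred - true_fn(grid)) ** 2))
    variance = float(np.mean(preds.var(axis=0)))        # spread across resamples
    return bias_sq, variance

for degree in (1, 5, 9):
    b, v = estimate_bias_variance(degree)
    print(f"degree={degree}  bias^2={b:.3f}  variance={v:.3f}  sum={b + v:.3f}")
```

With only 25 resamples the estimates are noisy, and the bias² estimate picks up a small share of the variance, so the numbers are best read as trends across degrees rather than exact values.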


Section 3

L1 vs L2 Regularization Lab

Regularization adds a penalty that discourages overly complex solutions. L1 tends to prefer sparse solutions; L2 tends to shrink all weights more smoothly.

Mode

Side-by-side compare

Strength λ

λ: 0.10

L1 sets coefficients exactly to zero at a finite λ, while L2 only shrinks them toward zero without reaching it. That asymmetry is part of the lesson.

Dataset

A. Fitted Curve

Legend: true function, unregularized fit, current fit, data.

B. Coefficients

L1 drives many coefficients exactly to zero. L2 shrinks all of them smoothly but leaves them nonzero.
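
The contrast is easiest to see in a simplified setting that is an assumption of this sketch, not the page's setup: an orthonormal design ($X^\top X = I$) with objectives $\tfrac12\|y - Xw\|^2 + \lambda\|w\|_1$ and $\tfrac12\|y - Xw\|^2 + \tfrac{\lambda}{2}\|w\|_2^2$. In that case both solutions have closed forms in terms of the unregularized weights. The example weights below are made up.

```python
import numpy as np

def l1_shrink(w_ols, lam):
    # Soft-thresholding: subtract lam from each magnitude, clip at zero.
    return np.sign(w_ols) * np.maximum(np.abs(w_ols) - lam, 0.0)

def l2_shrink(w_ols, lam):
    # Uniform rescaling: every weight shrinks by the same factor.
    return w_ols / (1.0 + lam)

w_ols = np.array([2.0, -0.5, 0.05, 0.8, -0.02])  # illustrative unregularized weights
lam = 0.1

w_l1 = l1_shrink(w_ols, lam)
w_l2 = l2_shrink(w_ols, lam)

print("L1 nonzero count:", np.count_nonzero(w_l1))
print("L2 nonzero count:", np.count_nonzero(w_l2))
```

Any weight whose magnitude is below λ is zeroed by the L1 rule, whereas the L2 rule never produces an exact zero from a nonzero weight. Polynomial features are generally not orthonormal, so the lab's coefficients will not follow these formulas exactly; the qualitative pattern is what carries over.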

C. Objective Breakdown

Loss = MSE

Readouts: Data MSE, Penalty, Total.

Stacked: data loss (blue) + penalty (purple/green). Watch how moving λ trades them off.
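
The three stacked quantities can be computed as in the sketch below. The exact penalty scaling used by the lab is not stated, so this assumes the common forms $\lambda\sum_j |w_j|$ for L1 and $\lambda\sum_j w_j^2$ for L2, applied to every weight. The function name `objective_breakdown` and the tiny dataset are hypothetical.

```python
import numpy as np

def objective_breakdown(w, X, y, lam, kind="l2"):
    """Split a regularized objective into data loss, penalty, and total."""
    data_mse = float(np.mean((X @ w - y) ** 2))
    if kind == "l1":
        penalty = lam * float(np.sum(np.abs(w)))
    elif kind == "l2":
        penalty = lam * float(np.sum(w ** 2))
    else:
        raise ValueError("kind must be 'l1' or 'l2'")
    return {"data_mse": data_mse, "penalty": penalty, "total": data_mse + penalty}

# Tiny illustrative problem: two samples, two features.
X = np.array([[1.0, 0.0], [0.0, 1.0]])
y = np.array([1.0, 2.0])
w = np.array([1.0, 1.5])

l1_terms = objective_breakdown(w, X, y, lam=0.1, kind="l1")
l2_terms = objective_breakdown(w, X, y, lam=0.1, kind="l2")
print(l1_terms)
print(l2_terms)
```

Raising λ leaves the data term unchanged for a fixed `w` but makes the penalty a larger share of the total, which pushes the optimizer toward smaller weights.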

Geometry mini-card: why L1 is sparse and L2 is round

L2: circular constraint region. The MSE contours usually first touch it at a smooth boundary point, where no weight is exactly zero.

L1: diamond constraint region with sharp corners on the axes. Solutions often land on a corner, where one weight is exactly zero.
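
The two regions come from writing each penalized problem in its equivalent constrained form. The budget $t$ is notation introduced here: a larger λ corresponds to a smaller $t$.

$$
\min_{w}\ \mathrm{MSE}(w)\ \ \text{s.t.}\ \ \sum_j w_j^2 \le t
\qquad\text{versus}\qquad
\min_{w}\ \mathrm{MSE}(w)\ \ \text{s.t.}\ \ \sum_j |w_j| \le t
$$

In two dimensions the first feasible set is a disk and the second is a diamond whose corners sit at $(\pm t, 0)$ and $(0, \pm t)$, which is where one coordinate vanishes.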


Section 4

Dropout Intuition Explorer

Dropout reduces overfitting by forcing a neural network to succeed under many random subnetworks during training instead of relying on a single brittle path.

A. Subnetwork Diagram

Training mode: a fresh random mask drops some hidden units on every forward pass.

Controls

Dropout rate p: 0.40

Hidden width: 10

Hidden layers

Readouts: Active units, Subnetworks seen.

B. Mask History

Each row is one forward pass during training. Filled cells show active hidden units in that pass — dropout samples a fresh subnetwork every time.

Training: each forward pass uses a fresh binary mask. Surviving activations are typically scaled by $1/(1-p)$ (inverted dropout) so that each unit's expected value is preserved.

Inference: dropout is off. The full network is used. With inverted dropout, no extra rescaling is needed at test time.
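
The training and inference behaviour described above can be sketched in NumPy as follows. The function name `dropout_forward` and its signature are illustrative and do not correspond to any particular framework's API. The drop rate 0.4 and hidden width 10 mirror the controls.

```python
import numpy as np

def dropout_forward(h, p, rng, training=True):
    """Inverted dropout on an activation array h with drop probability p."""
    if not 0.0 <= p < 1.0:
        raise ValueError("p must satisfy 0 <= p < 1")
    if not training or p == 0.0:
        # Inference: dropout is off and no rescaling is needed.
        return h
    # Fresh binary mask per call: each unit is kept with probability 1 - p.
    mask = rng.random(h.shape) >= p
    # Scale survivors by 1/(1-p) so the expected activation is unchanged.
    return h * mask / (1.0 - p)

rng = np.random.default_rng(0)
h = np.ones(10)  # one hidden layer of width 10, all activations equal to 1

for step in range(3):
    out = dropout_forward(h, p=0.4, rng=rng)
    print(f"pass {step}: active units = {np.count_nonzero(out)}")

print("inference:", dropout_forward(h, p=0.4, rng=rng, training=False))
```

Each training pass prints a different active-unit count because the mask is redrawn, which corresponds to one new row in the mask history. At inference the input comes back unchanged.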

References

Slide anchors and further reading

Slides/ML_Lecture3Sp26_ML_Review.pdf page 2 — definition and causes of overfitting.
Slides/ML_Lecture3Sp26_ML_Review.pdf pages 3-5 — bias and variance definitions.
Slides/ML_Lecture3Sp26_ML_Review.pdf page 6 — preventing overfitting (data, simpler models, regularization, dropout).
Slides/ML_Lecture3Sp26_ML_Review.pdf page 7 — regularization definition, L1 / L2 effect summary.
Slides/ML_Lecture3Sp26_ML_Review.pdf page 9 — L1 vs L2 comparison.
Slides/ML_Lecture3Sp26_ML_Review.pdf pages 10-11 — L2 and L1 geometric intuition.
Slides/ML_LectureSp26_CNN.pdf pages 69-76 — variance stability and batch normalization.