Regression Model Zoo & Outlier Sensitivity

Section 1

Regression Playground

Pick a ground-truth shape, then add noise and outliers. The plot below shows the data the four models will see — outliers are highlighted so the contamination stays visible.

Ground-truth preset

$y_{\text{true}}(x) = 0.15x^3 - 0.8x$, with $x \in [-4, 4]$.

Data shape

Samples: 30

Noise σ: 0.40

Outliers

Count: 3

Magnitude: 4.0

Position

Legend: training samples · outliers · $y_{\text{true}}(x)$

Manual outliers stay until you resample or change the preset. Validation points are drawn from the same ground truth without contamination.
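The data setup above can be sketched in a few lines. This is only an illustration under stated assumptions: the page does not spell out how it injects outliers, so the rule here (shift a randomly chosen training point up or down by the magnitude) is a guess, and the names `make_dataset` and `is_outlier` are hypothetical.

```python
import numpy as np

def y_true(x):
    """Cubic ground-truth preset: 0.15 x^3 - 0.8 x."""
    return 0.15 * x**3 - 0.8 * x

def make_dataset(n=30, sigma=0.40, n_outliers=3, magnitude=4.0, seed=0):
    """Noisy samples of y_true on [-4, 4] with a few contaminated points."""
    rng = np.random.default_rng(seed)
    x = np.sort(rng.uniform(-4.0, 4.0, size=n))
    y = y_true(x) + rng.normal(0.0, sigma, size=n)

    # Assumption: an outlier is a training point shifted vertically by
    # +magnitude or -magnitude. The page's exact rule is not documented.
    idx = rng.choice(n, size=n_outliers, replace=False)
    y[idx] += magnitude * rng.choice([-1.0, 1.0], size=n_outliers)

    is_outlier = np.zeros(n, dtype=bool)
    is_outlier[idx] = True
    return x, y, is_outlier

x, y, is_outlier = make_dataset()
```

Keeping the boolean mask `is_outlier` alongside the data is what lets a plot highlight the contaminated points separately from the clean ones.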

Section 2

Model Zoo Comparison

Four model families on one dataset. Toggle visibility, compare fits, and tune the per-model knobs to feel each model's inductive bias.

Focus

Linear · Polynomial · Tree · SVR · ε-tube · SVR support vector

Polynomial

Degree d: 3

Decision tree

Max depth: 4

Support-vector regression

ε (tube width): 0.30

C (penalty): 5.0

Kernel

Comparison table

Columns: Model · Flexibility · Prediction style · Train MSE · Validation MSE · Outlier sensitivity

Validation MSE is computed on a clean held-out grid drawn from the same ground-truth function, so it rewards models that capture the underlying signal rather than the noise or the outliers.
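A minimal sketch of the two MSE columns, using a cubic polynomial as the stand-in model. The assumptions are loud ones: the validation targets here are noise-free values of the ground truth on a 200-point grid, and the three outliers are simple additive shifts; the page may do either differently.

```python
import numpy as np

def y_true(x):
    return 0.15 * x**3 - 0.8 * x

def mse(y_hat, y):
    """Mean squared error between predictions and targets."""
    return float(np.mean((y_hat - y) ** 2))

rng = np.random.default_rng(1)
x_train = rng.uniform(-4.0, 4.0, 30)
y_train = y_true(x_train) + rng.normal(0.0, 0.40, 30)
y_train[:3] += 4.0  # assumed contamination: three points shifted upward

# Clean held-out grid: same ground truth, no noise, no outliers (assumption).
x_val = np.linspace(-4.0, 4.0, 200)
y_val = y_true(x_val)

coeffs = np.polyfit(x_train, y_train, deg=3)
train_mse = mse(np.polyval(coeffs, x_train), y_train)
val_mse = mse(np.polyval(coeffs, x_val), y_val)
```

Because `y_val` never sees the contamination, a model that bends toward the outliers pays for it in `val_mse` even if doing so lowers `train_mse`.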

Section 3

Outlier Sensitivity Lab

Now make outlier robustness the focus. Push the magnitude and count up — linear regression gets pulled globally, high-degree polynomials contort, trees absorb the shock locally with extra splits, and SVR's ε-tube and $C$ knob let it trade fit for robustness.

Outlier knobs (mirrored)

Count: 3

Magnitude: 4.0

Position

SVR · C in focus

C: 5.0

Small $C$ → tube violations cost little → smoother, outlier-resistant fit.
Large $C$ → tube violations cost a lot → tight fit; the SVR chases difficult points.

Show fit before outliers

Without outliers

With outliers

Influence summary

Curve shift averages $|f_{\text{with}}(x) - f_{\text{without}}(x)|$ over a dense grid in $[-4, 4]$. Larger values mean outliers reshaped the prediction more.
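The curve-shift metric is simple enough to write down directly. In this sketch the grid size of 400 points and the function name `curve_shift` are assumptions; `f_with` and `f_without` stand for any two fitted prediction functions.

```python
import numpy as np

def curve_shift(f_with, f_without, lo=-4.0, hi=4.0, n_grid=400):
    """Mean absolute gap between two prediction curves on a dense grid."""
    grid = np.linspace(lo, hi, n_grid)
    return float(np.mean(np.abs(f_with(grid) - f_without(grid))))

# Toy check: two parallel lines that differ by a constant offset of 0.5.
shift = curve_shift(lambda x: 0.3 * x + 0.5, lambda x: 0.3 * x)
```

For the toy pair the gap is 0.5 everywhere, so `shift` comes out as 0.5; a model whose curve barely moves when outliers are added would score close to zero.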

SVR · C demonstration

$C = 0.5$ — relaxed

$C = 5$ — balanced

$C = 80$ — strict

Same data, same kernel, different penalty. With small $C$, residuals outside the ε-tube cost little, so the fit stays smooth and largely ignores outliers. With large $C$, those same residuals cost a lot and the fit bends to accommodate them, which is exactly the trade-off described on slide 18 of the lecture deck.
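To put numbers on "cost little" versus "cost a lot", here is the ε-insensitive penalty for one residual under the three $C$ settings. The helper names are hypothetical, and the residual of 4.0 is just the default outlier magnitude taken as if the fit had not moved at all.

```python
def eps_insensitive(residual, eps=0.30):
    """Zero inside the tube, linear in the excess outside it."""
    return max(0.0, abs(residual) - eps)

def violation_cost(residual, C, eps=0.30):
    """Penalty a single point contributes to the SVR objective."""
    return C * eps_insensitive(residual, eps)

outlier_residual = 4.0  # assumed: an outlier sitting 4.0 away from the curve
costs = {C: violation_cost(outlier_residual, C) for C in (0.5, 5.0, 80.0)}
```

The excess over the tube is $4.0 - 0.30 = 3.7$, so the same point costs about 1.85 at $C = 0.5$, 18.5 at $C = 5$, and 296 at $C = 80$. A residual of 0.2 sits inside the tube and costs nothing at any $C$.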

Takeaway. Outliers reveal each model's inductive bias. Linear models can be pulled globally, high-degree polynomials can contort, trees react locally through extra splits, and SVR can trade off robustness and fit through $C$ and $\varepsilon$.

Section 4

How The Models Work

Each card pairs a one-glance illustration with the structural assumption the model is making. Read these alongside the comparison plot above to see why the curves take the shapes they do.

Linear regression

Fits one global straight line $\hat{y} = w x + b$ by minimizing squared residuals. Simple, interpretable, and the squared loss penalizes large residuals heavily — so a single far-off outlier can drag the line substantially.
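A small worked example of that drag, on toy data chosen so the arithmetic is easy to follow. The data and the helper name `fit_line` are illustrative, not taken from the page.

```python
import numpy as np

def fit_line(x, y):
    """Least-squares slope w and intercept b for y ≈ w*x + b."""
    A = np.column_stack([x, np.ones_like(x)])
    (w, b), *_ = np.linalg.lstsq(A, y, rcond=None)
    return w, b

x = np.linspace(-4.0, 4.0, 9)   # -4, -3, ..., 4
y = 0.5 * x                     # perfectly linear toy data
w0, b0 = fit_line(x, y)

y_out = y.copy()
y_out[-1] += 9.0                # one far-off outlier at x = 4
w1, b1 = fit_line(x, y_out)
```

With nine points and $\sum x^2 = 60$, the single outlier changes the slope by $4 \cdot 9 / 60 = 0.6$ and the intercept by $9 / 9 = 1$, so the line moves from $0.5x$ to $1.1x + 1$. Every prediction changes, not just the one near $x = 4$.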

Polynomial regression

Adds polynomial features $x, x^2, \dots, x^d$ and fits a linear model in that lifted space. Higher degree means more flexibility, but also more capacity to wiggle around noise and outliers — classic bias–variance tension.
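The lifting step can be sketched as follows, with inputs rescaled to $[-1, 1]$ and a small ridge term added to the normal equations. The ridge value `1e-8` and the function name `fit_poly` are assumptions chosen for illustration; for simplicity this sketch also regularizes the constant term.

```python
import numpy as np

def fit_poly(x, y, degree=3, ridge=1e-8, x_lo=-4.0, x_hi=4.0):
    """Polynomial least squares via ridge-stabilized normal equations."""
    def lift(x_raw):
        # Rescale [x_lo, x_hi] to [-1, 1], then build columns 1, z, ..., z^d.
        z = 2.0 * (np.asarray(x_raw, dtype=float) - x_lo) / (x_hi - x_lo) - 1.0
        return np.vander(z, degree + 1, increasing=True)

    Phi = lift(x)
    A = Phi.T @ Phi + ridge * np.eye(degree + 1)
    w = np.linalg.solve(A, Phi.T @ y)
    return lambda x_new: lift(x_new) @ w

x = np.linspace(-4.0, 4.0, 30)
y = 0.15 * x**3 - 0.8 * x          # noise-free cubic preset
predict = fit_poly(x, y, degree=3)
max_err = float(np.max(np.abs(predict(x) - y)))
```

On noise-free cubic data a degree-3 fit recovers the curve up to the tiny bias from the ridge term. Raising `degree` on noisy, contaminated data is where the extra wiggle shows up.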

Decision tree regression

Recursively partitions the input space and predicts the regional mean. The result is piecewise constant — flat steps with hard jumps at split boundaries. Distant outliers usually distort just one or two leaves rather than the global shape.
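A compact sketch of a 1D regression tree in this spirit: try every split between sorted neighbours, keep the one with the lowest total squared error, and recurse until the depth limit. The nested-dict representation and the names `build_tree` and `predict_one` are illustrative choices, not the page's code.

```python
import numpy as np

def build_tree(x, y, depth=0, max_depth=4, min_leaf=1):
    """Greedy squared-error splits; each leaf stores the regional mean."""
    node = {"value": float(np.mean(y))}
    if depth >= max_depth or len(y) < 2 * min_leaf:
        return node
    order = np.argsort(x)
    xs, ys = x[order], y[order]
    best_sse, best_i = float(np.sum((ys - ys.mean()) ** 2)), None
    for i in range(min_leaf, len(ys) - min_leaf + 1):
        if xs[i] == xs[i - 1]:
            continue  # cannot split between identical x values
        left, right = ys[:i], ys[i:]
        sse = float(np.sum((left - left.mean()) ** 2)
                    + np.sum((right - right.mean()) ** 2))
        if sse < best_sse - 1e-12:
            best_sse, best_i = sse, i
    if best_i is None:
        return node  # no split improves the error, so stay a leaf
    node["threshold"] = float(0.5 * (xs[best_i - 1] + xs[best_i]))
    node["left"] = build_tree(xs[:best_i], ys[:best_i], depth + 1, max_depth, min_leaf)
    node["right"] = build_tree(xs[best_i:], ys[best_i:], depth + 1, max_depth, min_leaf)
    return node

def predict_one(node, x):
    """Walk from the root to a leaf and return that leaf's mean."""
    while "threshold" in node:
        node = node["left"] if x <= node["threshold"] else node["right"]
    return node["value"]

x = np.array([0.0, 1.0, 2.0, 3.0])
y = np.array([1.0, 1.0, 5.0, 5.0])
tree = build_tree(x, y, max_depth=2)
```

On this step-shaped toy data the tree splits once at 1.5 and predicts 1.0 on the left and 5.0 on the right. An outlier added to one side would change only the mean of the leaf that contains it.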

Support-vector regression

Allows free errors inside an ε-tube around the fit and penalizes only points outside it. Points on or outside the tube boundary become support vectors. The penalty $C$ controls how hard the model insists on shrinking those violations: smaller $C$ → smoother and more robust, larger $C$ → tighter fit.

Implementation notes. Linear and polynomial fits use least squares solved via the normal equations, with a small Tikhonov (ridge) regularizer for numerical stability at high degree; inputs are rescaled to $[-1, 1]$ before lifting. The decision tree is a faithful 1D CART: greedy squared-error splits down to the depth limit, with leaves predicting the regional mean. SVR is solved with dual coordinate descent (libsvm-style) on the ε-insensitive loss: each per-coordinate update is a closed-form soft-threshold followed by a clip to $\beta_i \in [-C, C]$, with linear, RBF, and polynomial kernels available.
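The soft-threshold-then-clip update can be sketched for an RBF kernel as below. This is a simplified reading of the description, not the page's actual solver: it omits the bias term, uses a fixed number of sweeps instead of a convergence test, and the kernel width `gamma=0.5` and all function names are assumptions. It minimizes $\tfrac12 \beta^\top K \beta - y^\top \beta + \varepsilon \lVert \beta \rVert_1$ subject to $\beta_i \in [-C, C]$.

```python
import numpy as np

def rbf_kernel(a, b, gamma=0.5):
    """Pairwise RBF kernel matrix between two 1D arrays."""
    return np.exp(-gamma * (a[:, None] - b[None, :]) ** 2)

def fit_svr_dcd(x, y, C=5.0, eps=0.30, gamma=0.5, n_sweeps=100):
    """Coordinate descent on the bias-free SVR dual in beta."""
    K = rbf_kernel(x, x, gamma)
    beta = np.zeros(len(x))
    for _ in range(n_sweeps):
        for i in range(len(x)):
            # Residual with coordinate i's own contribution removed.
            r = y[i] - (K[i] @ beta - K[i, i] * beta[i])
            # Soft-threshold by eps, scale by K_ii, then clip to [-C, C].
            b = np.sign(r) * max(abs(r) - eps, 0.0) / K[i, i]
            beta[i] = min(max(b, -C), C)
    return beta

def predict_svr(beta, x_train, x_new, gamma=0.5):
    """Prediction f(x) = sum_i beta_i * k(x_i, x)."""
    return rbf_kernel(x_new, x_train, gamma) @ beta

x = np.linspace(-4.0, 4.0, 30)
y = 0.15 * x**3 - 0.8 * x
y[15] += 4.0  # one assumed outlier near the middle of the range

beta_relaxed = fit_svr_dcd(x, y, C=0.5)
beta_strict = fit_svr_dcd(x, y, C=80.0)
```

The clip is where $C$ acts: no single training point can carry a coefficient larger than $C$ in magnitude, which bounds how much any one outlier can pull the curve. Comparing `predict_svr(beta_relaxed, x, x)` with `predict_svr(beta_strict, x, x)` near the shifted point is one way to see the effect.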

Sources

Slide anchors

ML_Lecture3Sp26_ML_Review.pdf · page 13 — linear & polynomial regression
pages 14–15 — decision-tree regression (and random forest)
pages 16–17 — support-vector regression and the ε-tube
page 18 — outlier resistance and the role of $C$
pages 20–29, 68 — gradient boosting and random forest aggregation (extension)