
The bias-variance tradeoff is a practical issue that arises in many predictive modeling problems, particularly regression.

There is a fundamental tradeoff between bias and variance, closely related to the no free lunch theorem.
Overly complex models tend to suffer from high variance, while overly simplistic models tend to suffer from high bias.

Bias

a.k.a. approximation error.
Bias: the expected difference between the model's predictions and the true targets. It can be reduced by making the model more complex and flexible. A high-bias model is one with few parameters, e.g. a linear predictor; a low-bias model is one with many parameters, e.g. a large neural network.
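
As a rough illustration (a toy sketch, assuming a sine-plus-noise generating process and numpy's polyfit as the fitting routine; `f_true` and `x0` are illustrative names, not from these notes), a straight-line model fit to data from a nonlinear $E[y|x]$ keeps a systematic error at a given $x$ no matter how much data it sees:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy setup: the true regression function E[y|x] is nonlinear.
f_true = lambda x: np.sin(2 * np.pi * x)

x = rng.uniform(0, 1, size=10_000)                     # plenty of data
y = f_true(x) + rng.normal(0, 0.1, size=x.size)

# High-bias model: a straight line (only 2 parameters).
a, b = np.polyfit(x, y, deg=1)

x0 = 0.25
print("E[y|x0]         :", f_true(x0))
print("linear fit f(x0):", a * x0 + b)
print("approx. bias    :", a * x0 + b - f_true(x0))    # large, and more data will not shrink it
```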

Variance

a.k.a. estimation error.
Variance: the variability of the model's predictions across different training sets. It can be reduced by increasing the number of observations.
For a particular $x$, the variance measures how much the prediction $f(x;\theta)$ varies across training sets of a given size.
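
A minimal sketch of this idea (again a toy sine-plus-noise setup; the helper `predictions_at_x0` is an illustrative name, not from these notes): refit the same model on many data sets of the same size and look at the spread of its prediction at a fixed $x_0$; increasing $N$ shrinks that spread.

```python
import numpy as np

rng = np.random.default_rng(1)
f_true = lambda x: np.sin(2 * np.pi * x)

def predictions_at_x0(N, degree, x0, n_datasets=2000):
    """Refit a polynomial of the given degree on many data sets of size N
    and return the prediction each fit makes at the single point x0."""
    preds = np.empty(n_datasets)
    for d in range(n_datasets):
        x = rng.uniform(0, 1, size=N)
        y = f_true(x) + rng.normal(0, 0.3, size=N)
        preds[d] = np.polyval(np.polyfit(x, y, deg=degree), x0)
    return preds

for N in (20, 200):
    preds = predictions_at_x0(N=N, degree=5, x0=0.25)
    print(f"N={N:4d}  variance of f(x0; theta) across data sets: {preds.var():.4f}")
```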

Theoretical Properties

Increases in model complexity tend to increase variance and decrease bias.
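
One way to see this empirically is to sweep the model complexity, here the polynomial degree in a toy simulation, and estimate the bias$^2$ and variance of the prediction at a fixed $x_0$ over many simulated training sets (the generating process and settings below are assumptions for illustration only):

```python
import numpy as np

rng = np.random.default_rng(2)
f_true = lambda x: np.sin(2 * np.pi * x)
N, x0, n_datasets = 30, 0.25, 2000

for degree in (1, 3, 9):
    preds = np.empty(n_datasets)
    for d in range(n_datasets):
        x = rng.uniform(0, 1, size=N)
        y = f_true(x) + rng.normal(0, 0.3, size=N)
        preds[d] = np.polyval(np.polyfit(x, y, deg=degree), x0)
    bias2 = (preds.mean() - f_true(x0)) ** 2   # (E_D[f(x0)] - E[y|x0])^2
    var = preds.var()                          # E_D[(f(x0) - E_D[f(x0)])^2]
    print(f"degree={degree}:  bias^2={bias2:.4f}  variance={var:.4f}")
```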

$MSE_x(\theta) = \int \left(y - f(x;\theta)\right)^2 p(y|x)\, dy = \sigma_{y|x}^2 + \left(E[y|x] - f(x;\theta)\right)^2$
$MSE = E_{p(x)} [ MSE_x ] $
$MSE = \sigma_y^2 + \text{Bias}^2 + \text{Variance}$

  * $\sigma_y^2$ represents the inherent variability (noise) in $y$
  * Bias is “how closely we can approximate $E[y|x]$” in theory, with optimal parameters
  * Variance is “how sensitive our parameter estimates are to the training data”, i.e. how much the parameters will vary across training sets of a given size

So, averaging over data sets of size $N$, $MSE_x = \sigma_{y|x}^2 + \text{Bias}^2 + \text{Variance}$.
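
This decomposition can be checked numerically. The sketch below (a toy simulation with an assumed noise level $\sigma = 0.3$ and a cubic-polynomial model, not part of the notes) estimates each term at a fixed $x_0$ and compares their sum to a direct Monte Carlo estimate of $MSE_{x_0}$:

```python
import numpy as np

rng = np.random.default_rng(3)
f_true = lambda x: np.sin(2 * np.pi * x)
sigma = 0.3                              # noise sd, so sigma_{y|x}^2 = 0.09
N, degree, x0, n_datasets = 30, 3, 0.25, 5000

preds = np.empty(n_datasets)
sq_err = np.empty(n_datasets)
for d in range(n_datasets):
    x = rng.uniform(0, 1, size=N)
    y = f_true(x) + rng.normal(0, sigma, size=N)
    fx0 = np.polyval(np.polyfit(x, y, deg=degree), x0)
    preds[d] = fx0
    # squared error against a fresh target y drawn at x0
    sq_err[d] = (f_true(x0) + rng.normal(0, sigma) - fx0) ** 2

noise, bias2, var = sigma**2, (preds.mean() - f_true(x0))**2, preds.var()
print("sigma^2 + Bias^2 + Variance:", noise + bias2 + var)
print("directly estimated MSE_x   :", sq_err.mean())   # the two should roughly agree
```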

For a data set $D$ of size $N$, we assume each observation is an $(x_i, y_i)$ pair.

$p(D)$ is a distribution over all possible data sets of size $N$ (the frequentist view). With respect to $p(D)$, $f(x;\theta)$ is a random quantity, because $\theta$ is estimated from the random data set $D$.

Define $\bar f_x = E_{p(D)}[f(x;\theta)]$, the average prediction at $x$ over data sets of size $N$. Then
\begin{align*}
    E_{p(D)}\left[\left(E[y|x] - f(x;\theta)\right)^2\right]
    &= E_{p(D)}\left[\left( E[y|x] - \bar f_x + \bar f_x - f(x;\theta)\right)^2\right]
\\ &= E_{p(D)}\left[\left( E[y|x] - \bar f_x \right)^2\right] + E_{p(D)}\left[\left(\bar f_x - f(x;\theta)\right)^2\right] & \text{cross terms cancel out}
\\ &= \text{Bias}^2 + \text{Variance}
\end{align*}
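
The cross term vanishes because $E[y|x]$ and $\bar f_x$ are constants with respect to $p(D)$:
\begin{align*}
E_{p(D)}\left[2\left(E[y|x] - \bar f_x\right)\left(\bar f_x - f(x;\theta)\right)\right]
  &= 2\left(E[y|x] - \bar f_x\right) E_{p(D)}\left[\bar f_x - f(x;\theta)\right]
\\ &= 2\left(E[y|x] - \bar f_x\right)\left(\bar f_x - \bar f_x\right) = 0
\end{align*}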

This is the expected squared error at $x$ (relative to $E[y|x]$), now averaged over all possible data sets of size $N$.