The **bias-variance tradeoff** is a practical issue that applies to many predictive modeling problems, particularly regression methods.

There is a fundamental tradeoff between bias and variance; relatedly, the **no free lunch** theorem says that no single model is best across all problems.
**Overcomplete** or over-complex models tend to suffer from high variance, while over-simplistic models tend to suffer from high bias.

**Bias** (a.k.a. **approximation error**): the expected difference between our model's predictions and the true targets. Bias can be reduced by making the model more complex and flexible. A high-bias model is one with few parameters, e.g. a linear predictor; a low-bias model is one with many parameters, e.g. a large neural network.

**Variance** (a.k.a. **estimation error**): the variability of the model's predictions due to the particular observations it was trained on. Variance can be reduced by increasing the number of observations.

Increases in model complexity tend to increase variance and decrease bias.
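
The sketch below illustrates this numerically. It is a minimal NumPy example under an assumed setup (quadratic ground truth $E[y|x] = 1 + 2x - 1.5x^2$, Gaussian noise, data sets of size 30), none of which comes from the notes themselves: for each polynomial degree it refits the model on many simulated data sets and estimates the squared bias and variance of the prediction at a fixed test point. Low degrees should show high bias and low variance; high degrees the opposite.

```python
import numpy as np

# Hypothetical setup (an assumption for illustration): quadratic ground truth with Gaussian noise.
rng = np.random.default_rng(0)
true_f = lambda x: 1.0 + 2.0 * x - 1.5 * x**2            # E[y|x]
sigma = 0.3                                              # noise std
N, n_datasets = 30, 2000                                 # data set size and number of data sets
x_test = 0.8                                             # the particular x we evaluate at

for degree in (1, 2, 9):
    preds = np.empty(n_datasets)
    for d in range(n_datasets):
        # Draw one data set D of size N and fit theta by least squares.
        x = rng.uniform(-1, 1, size=N)
        y = true_f(x) + rng.normal(0, sigma, size=N)
        theta = np.polyfit(x, y, deg=degree)
        preds[d] = np.polyval(theta, x_test)             # f(x_test; theta)
    bias_sq = (true_f(x_test) - preds.mean()) ** 2       # (E[y|x] - f_bar_x)^2
    variance = preds.var()                               # E[(f_bar_x - f(x;theta))^2]
    print(f"degree={degree}: bias^2={bias_sq:.4f}, variance={variance:.4f}")
```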

For a particular $x$,

<latex>\begin{align*}
MSE_x(\theta) &= \int \left(y - f(x;\theta)\right)^2 p(y|x)\, dy = \sigma_{y|x}^2 + \left(E[y|x] - f(x;\theta)\right)^2 \\
MSE(\theta) &= E_{p(x)}\left[ MSE_x(\theta) \right] \\
MSE &= \sigma_y^2 + \text{Bias}^2 + \text{Variance} & \text{(after also averaging over data sets; see below)}
\end{align*}</latex>

- $\sigma_y^2$ represents the inherent variability (noise) in $y$.
- Bias is “how closely we can approximate $E[y|x]$” in theory, with optimal parameters.
- Variance is “how sensitive our parameter estimates are to the training data”: how much the parameters vary across training sets of a given size.

So, for each $x$ (after averaging over data sets of size $N$), $MSE_x = \sigma_{y|x}^2 + \text{Bias}^2 + \text{Variance}$.

For a data set $D$ of size $N$, we assume each observation is an $(x_i, y_i)$ pair.

$p(D)$ is a distribution over all possible data sets of size $N$ (the frequentist view). With respect to $p(D)$, $f(x;\theta)$ is a random quantity, because $\theta$ is estimated from $D$ and is therefore itself random.
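
As a concrete (and entirely assumed) illustration of $\theta$ being random with respect to $p(D)$: the snippet below simulates several data sets of the same size $N$ from a hypothetical linear process and refits the parameters each time; the fitted $\theta$ differs from data set to data set.

```python
import numpy as np

# Assumed data-generating process: y = 2x + noise; each draw below is one data set D ~ p(D).
rng = np.random.default_rng(1)
N = 20

for trial in range(5):
    x = rng.uniform(0, 1, size=N)
    y = 2.0 * x + rng.normal(0, 0.5, size=N)
    slope, intercept = np.polyfit(x, y, deg=1)
    # theta = (slope, intercept) differs from data set to data set,
    # so f(x; theta) is itself a random quantity under p(D).
    print(f"data set {trial}: slope={slope:.3f}, intercept={intercept:.3f}")
```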

Define $\bar f_x = E_{p(D)}[f(x;\theta)]$, the average prediction at $x$ across data sets. Then

<latex>\begin{align*}
E_{p(D)}\left[\left(E[y|x] - f(x;\theta)\right)^2\right] &= E_{p(D)}\left[\left( E[y|x] - \bar f_x + \bar f_x - f(x;\theta)\right)^2\right] \\
&= E_{p(D)}\left[\left( E[y|x] - \bar f_x \right)^2\right] + E_{p(D)}\left[\left(\bar f_x - f(x;\theta)\right)^2\right] & \text{cross terms cancel out} \\
&= \left( E[y|x] - \bar f_x \right)^2 + E_{p(D)}\left[\left(\bar f_x - f(x;\theta)\right)^2\right] & E[y|x] - \bar f_x \text{ is deterministic} \\
&= \text{Bias}^2 + \text{Variance}
\end{align*}</latex>

This is the average squared error at $x$ (excluding the irreducible noise), now further averaged over the possible data sets of size $N$.
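
The identity can also be checked by simulation. The sketch below uses an assumed setup (ground truth $E[y|x] = \sin(2x)$, Gaussian noise, a linear model fit by least squares, $N = 25$) that is not part of the notes: it estimates $E_{p(D)}\left[\left(E[y|x] - f(x;\theta)\right)^2\right]$ directly at one point $x_0$ and compares it to $\text{Bias}^2 + \text{Variance}$ computed from the same predictions.

```python
import numpy as np

# Assumed setup for illustration: E[y|x] = sin(2x), Gaussian noise, linear model fit by least squares.
rng = np.random.default_rng(2)
true_f = lambda x: np.sin(2 * x)
N, n_datasets, x0 = 25, 5000, 0.5

preds = np.empty(n_datasets)
for d in range(n_datasets):
    # One data set D of size N; theta (and hence f(x;theta)) is random under p(D).
    x = rng.uniform(-1, 1, size=N)
    y = true_f(x) + rng.normal(0, 0.2, size=N)
    theta = np.polyfit(x, y, deg=1)
    preds[d] = np.polyval(theta, x0)              # f(x0; theta)

f_bar = preds.mean()                              # f_bar_x = E_{p(D)}[f(x;theta)]
lhs = np.mean((true_f(x0) - preds) ** 2)          # E_{p(D)}[(E[y|x0] - f(x0;theta))^2]
bias_sq = (true_f(x0) - f_bar) ** 2               # Bias^2
variance = preds.var()                            # Variance
print(f"LHS={lhs:.5f}  Bias^2+Variance={bias_sq + variance:.5f}")
```

The two printed numbers agree up to floating point, since the cross term cancels exactly in the sample average as well; adding $\sigma_{y|x}^2$ would recover the full $MSE_x$ decomposition stated above.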