**Linear models** are models that are linear in the parameter $\theta$. So the computation complexity is linear as a function of the number of parameters.
e.g. $f(x ; \theta) = \sum_{j=1}^d \theta_j x_j = \theta^t x$

The functional form can have non-linear functional form while still being linear in the number of parameters. This is still considered a “linear” model.

Training data: set of $(x_i,y_i)$ pairs, where $y \in R$. $f(x; \theta)$ is the prediction model.

By using a linear model and a least-squares fitting method, the objective function remains convex.

A problem with linear models is that they can return results which are negative or otherwise not nicely interpretable as probabilities. A common solution is to use a logistic function and perform Logistic Regression.

- $f(x ; \theta) = \sum_{k=1}^K g_k(x_i;\theta_k)$
- $f(x ; \theta) = \frac{1}{1 + e^{-\theta^t x}}$
- = $g(\theta^t x)$
- This is an example of a generalized linear model where $g()$ is referred to as the
`link function`

A Least Squares error function means $MSE(\underline{\theta})$ is a concave function, so we can use gradient descent to find $\hat \theta_{MSE}$

$\underline{\theta}^{new} = \underline{\theta}^{current} - \alpha \nabla ( MSE(\underline{\theta}^{current})$ where $\alpha$ is a step size and $\nabla$ is a gradient

$mse(\theta) = \frac{1}{N} \sum_{i=1}^N (y_i - f(x_i ; \theta))^2 = (Y_D - X_D \underline{\theta})^T(Y_D-X_D\underline{\theta})$

- * $y_i$ is the target
- * $f(x_i ; \theta) = \underline{\theta}^+ \underline{x}$ is the prediction where $X_D = \begin{pmatrix} \underline{x}_1^T

\vdots

\underline{x}_N^T \end{pmatrix}$ is an N-by-d matrix and $Y_D = \begin{pmatrix} \underline{y}_1

\vdots

\underline{y}_N \end{pmatrix}$ is an N-by-1 vector.

Can show that mse is minimized when

- * $X_D^T X_D \theta = X_D^T Y$ or the form $A \theta = B$, where in this form, theta is a set of d linear equations
- * $\hat \theta_{MSE} = (X_D^T X_D)^{-1} X_D^T Y_D$

If you have a low-dimensional problem and lots of data, this is “easy”. If it is a high-dimensional problem and some of them are codependent, then the problem can be underdetermined and difficult or impossible to solve this way.

Both methods have complexity $O(Nd^2 + d^3)$. The $Nd^2$ is for creating the $X_D^T X_D$ matrix, and the $d^3$ is for solving $d$ linear equations or inverting a $d$-by-$d$ matrix.

Because $MSE(\underline{\theta})$ is a concave function, in principle you can use gradient descent to find $\hat \theta_{MSE}$.

Assume (x_i,y_i) are IID samples from some underlying density $p(x,y) = p(y|x)p(x)$. $\displaystyle \lim_{n \to \infty} mse(\theta) = \frac{1}{N} \sum_{i=1}^N (y_i - f(x_i ; \theta))^2 = \int \int (y_i - f(x_i ; \theta))^2 p(x,y) dy dx $ $\displaystyle = \int_x [ \int_y(y_x - f(x_i ; \theta))^2 p(y|x) dy] p(x) dx = \int MSE(\theta) p(x) dx$

- * $y_x$ is a distribution over y for a fixed x value. It may vary due to noise or variance.
- * So the inne bracketed term becomes the expected value of mse for a given x, due to uncertainty of y.

When $MSE_x(\theta) = \int (y_i - f(x_i ; \theta))^2 p(y|x) dy $ The inner error term is the part affected by the parameters: $ = \int (y - E(y|x) + E(y|x) - f(x ; \theta))^2$ $ = \int (y - E(y|x))^2 p(y|x) dy + \int (E(y|x) - f(x ; \theta))^2 p(y|x)dy$

- * The first term is the “natural” variability in y at a particular x value
- * The second term can be manipulated by controlling theta
- * So the optimal model will minimize the second term by setting $f(x ; \theta) = E(y|x)$

In practice, regression is also limited by the Bias-Variance Tradeoff.