Michael Data

Linear Regression Models

Linear models are models that are linear in the parameters θ: the prediction is a weighted sum of (possibly transformed) inputs, f(x; θ) = θ_0 + θ_1 x_1 + … + θ_d x_d. Computing a prediction therefore costs time linear in the number of parameters.
e.g. The model can have a non-linear functional form in the inputs x, such as f(x; θ) = θ_0 + θ_1 x + θ_2 x^2, while still being linear in the parameters θ. This is still considered a “linear” model.
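A minimal sketch of this point, using a hypothetical quadratic data-generating process: the feature map is non-linear in x, but since the model is linear in θ, ordinary least squares still applies.

```python
import numpy as np

# Hypothetical example: y depends quadratically on x, so the model
# f(x; theta) = theta0 + theta1*x + theta2*x^2 is non-linear in x
# but linear in the parameters theta.
rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, size=100)
y = 2.0 - 3.0 * x + 0.5 * x**2 + rng.normal(scale=0.1, size=100)

# Design matrix with columns [1, x, x^2]: each row's prediction
# is a linear function of theta, so least squares works unchanged.
X = np.column_stack([np.ones_like(x), x, x**2])
theta, *_ = np.linalg.lstsq(X, y, rcond=None)
print(theta)  # should be near the generating coefficients [2.0, -3.0, 0.5]
```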

Training data: set of pairs, where . is the prediction model.

By using a linear model and a least-squares fitting method, the objective function remains convex.

A problem with linear models is that they can return results which are negative or otherwise not nicely interpretable as probabilities. A common solution is to use a logistic function and perform Logistic Regression.
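A small illustration of that fix, with made-up numbers: the raw linear score θᵀx can fall outside [0, 1], but passing it through the logistic (sigmoid) function yields a value interpretable as a probability.

```python
import numpy as np

def sigmoid(z):
    # Logistic function: maps any real-valued linear score into (0, 1),
    # so the output can be read as a probability.
    return 1.0 / (1.0 + np.exp(-z))

theta = np.array([-1.0, 2.0])   # illustrative parameters
x = np.array([1.0, 3.0])        # first entry is the intercept feature
score = theta @ x               # raw linear score; not a valid probability
prob = sigmoid(score)           # squashed into (0, 1), read as P(y=1 | x)
print(score, prob)
```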

Examples of non-linear models

• Models whose functional form is non-linear in the parameters, e.g. f(x; θ) = θ_1 x^{θ_2}
• g(E[y|x]) = θᵀx — this is an example of a generalized linear model, where g is referred to as the link function
Gradient Descent Linear Regression

A least-squares error function means MSE(θ) is a convex function, so we can use gradient descent to find θ_MSE: repeat θ ← θ − α ∇MSE(θ), where α is a step size and ∇MSE(θ) is the gradient.

Minimizing MSE: MSE(θ) = (1/N) Σ_i (y_i − f(x_i; θ))²
• * y is the target
• * ŷ = Xθ is the prediction, where X is an N-by-d matrix of inputs and y is an N-by-1 vector of targets.
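The gradient-descent update above can be sketched as follows (a toy example with simulated data; the step size and iteration count are illustrative choices, not prescribed by the notes):

```python
import numpy as np

# Simulated regression problem: X is N-by-d, y is N-by-1 (here a 1-D array).
rng = np.random.default_rng(1)
N, d = 200, 3
X = rng.normal(size=(N, d))
true_theta = np.array([1.0, -2.0, 0.5])
y = X @ true_theta + rng.normal(scale=0.1, size=N)

# Gradient descent on MSE(theta) = (1/N) * ||X theta - y||^2.
theta = np.zeros(d)
alpha = 0.1                                    # step size
for _ in range(500):
    grad = (2.0 / N) * X.T @ (X @ theta - y)   # gradient of MSE at theta
    theta -= alpha * grad
print(theta)  # converges toward the least-squares solution
```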

Can show that MSE is minimized when ∇MSE(θ) = 0:

• * θ_MSE = (XᵀX)⁻¹ Xᵀ y
• * or the form XᵀX θ = Xᵀ y, where in this form solving for theta means solving a set of d linear equations
• * If you have a low-dimensional problem and lots of data, this is “easy”. If it is a high-dimensional problem and some of the dimensions are codependent (so XᵀX is singular or ill-conditioned), then the problem can be underdetermined and difficult or impossible to solve this way.
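The closed-form route can be sketched as follows; solving the d linear equations directly is generally preferred over explicitly inverting XᵀX (the data here is simulated for illustration):

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(100, 2))                       # N-by-d design matrix
y = X @ np.array([3.0, -1.0]) + rng.normal(scale=0.1, size=100)

# Normal equations: solve X^T X theta = X^T y (d linear equations)
# rather than forming the explicit inverse of X^T X.
theta = np.linalg.solve(X.T @ X, X.T @ y)
print(theta)
```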

Both methods have complexity O(Nd² + d³). The Nd² is for creating the matrix XᵀX, and the d³ is for solving the d linear equations or inverting a d-by-d matrix.

Because MSE(θ) is a convex function, in principle you can use gradient descent to find θ_MSE.

Theoretical Properties

Assume (x_i, y_i) are IID samples from some underlying density p(x, y). The expected error is E[MSE] = ∫ [ ∫ (y − f(x; θ))² p(y|x) dy ] p(x) dx.
• * p(y|x) is a distribution over y for a fixed x value. It may vary due to noise or variance.
• * So the inner bracketed term becomes the expected value of the squared error for a given x, due to uncertainty in y.

The inner error term is the part affected by the parameters. It decomposes as E_y[(y − f(x; θ))² | x] = E_y[(y − E[y|x])² | x] + (E[y|x] − f(x; θ))²:
• * The first term is the “natural” variability in y at a particular x value; it does not depend on θ
• * The second term can be manipulated by controlling theta
• * So the optimal model will minimize the second term by setting f(x; θ) = E[y|x]
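A quick numerical check of the last point, using simulated draws of y at a single fixed x: among a grid of candidate predictions, the one minimizing the empirical squared error is (to within grid resolution) the sample mean, i.e. the estimate of E[y|x].

```python
import numpy as np

# Samples of y at one fixed x; the conditional distribution here is
# illustratively chosen as Normal(4.0, 1.0).
rng = np.random.default_rng(3)
y = rng.normal(loc=4.0, scale=1.0, size=100_000)

# Empirical expected squared error for each candidate prediction c.
candidates = np.linspace(2.0, 6.0, 401)
risks = [np.mean((y - c) ** 2) for c in candidates]
best = candidates[int(np.argmin(risks))]
print(best)  # matches the sample mean of y, the estimate of E[y | x]
```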