Given some observed features, we want to make predictions about another variable. When the target variable is discrete, this is a form of classification. When the target variable is continuous, it is a **regression** problem.

Want to predict the $y$s based on some $x$s. Learn a model $f(x; \theta)$, parameterized by $\theta$.

Linear models with squared error have historically been widely used.

Includes Linear Regression and Logistic Regression.

Generally, **linear models** (models that are linear in the parameters $\theta$) can still be non-linear in the $x$s.

e.g. $f(x; \theta) = \theta_0 + \theta_1 x + \theta_2 x^2$ is non-linear in $x$, but equals $\theta^T \phi(x)$ with basis functions $\phi(x) = (1, x, x^2)^T$, so it is linear in $\theta$.
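For instance (a sketch with simulated data), expanding $x$ into the basis $\phi(x) = (1, x, x^2)$ turns the quadratic model into an ordinary linear least-squares problem:

```python
import numpy as np

rng = np.random.default_rng(6)
x = rng.uniform(-1, 1, 200)
y = 1.0 + 2.0 * x - 3.0 * x**2 + rng.normal(0, 0.1, 200)

# The model theta_0 + theta_1 x + theta_2 x^2 is non-linear in x, but
# linear in theta once x is expanded into features phi(x) = (1, x, x^2).
Phi = np.column_stack([np.ones_like(x), x, x**2])
theta, *_ = np.linalg.lstsq(Phi, y, rcond=None)
print(theta)   # ≈ [1, 2, -3]
```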

If $f$ is non-linear in $\theta$, you get a set of non-linear equations to solve. Gradient techniques are often used here to find the $\theta$ that minimizes mean-squared error.
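A sketch of the gradient approach for a model that is non-linear in its parameter (the exponential model, step size, and noiseless data here are arbitrary choices for illustration):

```python
import numpy as np

# Model non-linear in its parameter: f(x; w) = exp(w * x).
# Minimizing MSE over w has no closed form, so use gradient descent.
x = np.linspace(0.0, 1.0, 50)
y = np.exp(0.5 * x)          # data generated with true w = 0.5 (noiseless sketch)

def grad(w):
    r = y - np.exp(w * x)                       # residuals
    return np.mean(-2.0 * r * x * np.exp(w * x))  # d/dw of mean squared residual

w, lr = 0.0, 0.5
for _ in range(5000):
    w -= lr * grad(w)

print(w)   # ≈ 0.5
```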

There is usually an implicit assumption that the training observations are IID samples from an underlying distribution $P(x, y)$. So the underlying problem contains a joint distribution $P(x, y)$.

There are two types of variation to account for in data:

- measurement noise
- unobserved variables

There are two sources of variability in the regression function:

- variability in $y$ for a given $x$
- the distribution $P(x)$ of input data in input space

Typical modeling framework: $y = f(x; \theta) + e$, where:

- $y$ is observed
- $f(x; \theta)$ is the systematic or predicted part, learned with parameters $\theta$
- $e$ is a zero-mean noise term, considered unpredictable and referred to as the **error term**

Assuming that the $y_i$s are conditionally independent given the $x_i$s and $\theta$, the conditional likelihood is $\prod_{i=1}^N p(y_i \mid x_i, \theta)$.

Say we assume a Gaussian noise model: $p(y \mid x, \theta) = N(y; f(x; \theta), \sigma^2)$. The mean is $f(x; \theta)$, which can be any linear or non-linear function with parameters $\theta$. $\sigma^2$ could also be an unknown parameter, but for now assume it is known.

- $y$ is the observed target, with noise
- $f(x; \theta)$ is the model's prediction for some $x$

Maximizing the likelihood is the same as minimizing MSE: the log-likelihood is $-\frac{1}{2\sigma^2} \sum_i (y_i - f(x_i; \theta))^2 + \text{const}$, so maximizing over $\theta$ is equivalent to minimizing $\frac{1}{N} \sum_i (y_i - f(x_i; \theta))^2$.
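A quick numerical check of this equivalence (a sketch with a simulated linear model and a grid of candidate parameters): the $\theta$ that maximizes the Gaussian likelihood is exactly the $\theta$ that minimizes MSE.

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0, 1, 100)
y = 2.0 * x + rng.normal(0, 0.1, size=x.shape)   # linear model with Gaussian noise

thetas = np.linspace(0, 4, 401)
sigma2 = 0.1 ** 2

# Gaussian log-likelihood of theta (up to terms independent of theta):
#   log L = -(1 / (2 sigma^2)) * sum_i (y_i - theta x_i)^2 + const
loglik = np.array([-np.sum((y - t * x) ** 2) / (2 * sigma2) for t in thetas])
mse = np.array([np.mean((y - t * x) ** 2) for t in thetas])

# The theta maximizing the likelihood is the theta minimizing MSE.
print(thetas[np.argmax(loglik)] == thetas[np.argmin(mse)])  # True
```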

$\sigma^2$ may be modeled as a function of $x$. We may use non-Gaussian noise models, non-IID models, or a Bayesian approach with priors on the parameters $\theta$. This goes beyond minimizing MSE.

With a prior on $\theta$, the objective becomes $\sum_i (y_i - f(x_i; \theta))^2 + \lambda \sum_j \theta_j^2$. The first term is **goodness of fit**. The second is the **regularization** term. The regularized estimate $\hat{\theta}$ is the minimization of this expression.
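For the linear case $f(x; \theta) = X\theta$, this penalized objective has a closed-form minimizer (ridge regression). A minimal numpy sketch, with simulated data and a made-up penalty strength $\lambda = 0.1$:

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(100, 3))
theta_true = np.array([1.0, -2.0, 0.5])
y = X @ theta_true + rng.normal(0, 0.1, size=100)

lam = 0.1   # regularization strength (hypothetical value)

# Minimize ||y - X theta||^2 + lam * ||theta||^2.
# Setting the gradient to zero gives the closed form (X^T X + lam I)^-1 X^T y:
theta_hat = np.linalg.solve(X.T @ X + lam * np.eye(3), X.T @ y)
print(theta_hat)   # close to theta_true, shrunk slightly toward zero
```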

Given data $D = \{(x_i, y_i)\}$, we want to learn $p(y \mid x)$.

So there is no modeling of $p(x)$.

**Gaussian error model for conditional likelihood**: $p(y_i \mid x_i, \theta) = N(y_i; f(x_i; \theta), \sigma^2)$.

This models the $y_i$s as conditionally independent given the $x_i$s and $\theta$.

It is common to have independent priors on the $\theta_j$s: $p(\theta) = \prod_j p(\theta_j)$.

A conjugate prior for $\theta_j$ is another Gaussian, which corresponds to a squared (L2) penalty on the parameters.

The MAP estimate $\hat{\theta}_{\text{MAP}} = \arg\max_\theta \, p(\theta) \prod_i p(y_i \mid x_i, \theta)$ is the maximization of this expression, which is typically done with gradient techniques.

A less common prior is Laplacian, which corresponds to an absolute-value (L1) penalty on the parameters. This is known as the [Lasso method](http://en.wikipedia.org/wiki/Lasso_%28statistics%29#Lasso_method).
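A small sketch of the difference between the two priors, in the special case of an orthonormal design where both penalized solutions have closed forms (uniform shrinkage for L2, soft thresholding for L1); the data and penalty strength are made-up values:

```python
import numpy as np

rng = np.random.default_rng(2)
# Orthonormal design: Q.T @ Q = I, so the penalized problems
# decouple coordinate-by-coordinate.
Q, _ = np.linalg.qr(rng.normal(size=(50, 4)))
theta_true = np.array([3.0, 0.0, -2.0, 0.0])   # truly sparse parameters
y = Q @ theta_true + rng.normal(0, 0.1, size=50)

theta_ols = Q.T @ y          # ordinary least squares for an orthonormal design
lam = 1.0                    # penalty strength (hypothetical value)

# Ridge (Gaussian prior / L2 penalty): uniform shrinkage, nothing becomes zero.
theta_ridge = theta_ols / (1.0 + lam)

# Lasso (Laplacian prior / L1 penalty): soft thresholding; small
# coefficients are set exactly to zero.
theta_lasso = np.sign(theta_ols) * np.maximum(np.abs(theta_ols) - lam / 2, 0.0)

print(theta_ridge)   # all four entries non-zero
print(theta_lasso)   # entries 1 and 3 exactly zero
```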

Minimizing something like this on training data with the intent to use it for predictions later assumes that the $x$s and $y$s are random samples from a fixed underlying distribution $P(x, y)$. This assumption should be kept in mind.

$E_{P(x, y)}[(y - f(x))^2]$ = average theoretical squared error using $f$ as a predictor, with respect to $P(x, y)$.

This is true as long as $(x, y)$ pairs are samples from $P(x, y)$. Ideally, we would like to find the $f$ that minimizes this.

Can rewrite as $E_x\left[ E_{y \mid x}[(y - f(x))^2 \mid x] \right]$.

- $y$ is a random variable
- $f(x)$ is our deterministic prediction of $y$ given $x$

The outer expectation $E_x$ = MSE weighted with respect to $P(x)$. This is relevant in practical problems where $P(x)$ differs between training and test data. Changes in $P(x)$ are sometimes called **covariate shift**.
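A sketch of covariate shift with importance weighting (the distributions, target function, and predictor here are all made-up choices): reweighting training samples by $w(x) = p_{\text{test}}(x)/p_{\text{train}}(x)$ turns training-set MSE into an estimate of test-set MSE.

```python
import numpy as np

rng = np.random.default_rng(3)

def gauss_pdf(x, mu, s):
    return np.exp(-0.5 * ((x - mu) / s) ** 2) / (s * np.sqrt(2 * np.pi))

def true_f(x):
    return np.sin(x)

def f_hat(x):          # a crude fixed predictor: accurate near 0, poor further out
    return x

# Training inputs from P_train(x) = N(0, 1); test inputs from P_test(x) = N(1, 1).
x_tr = rng.normal(0.0, 1.0, 100_000)
x_te = rng.normal(1.0, 1.0, 100_000)

mse_train = np.mean((true_f(x_tr) - f_hat(x_tr)) ** 2)
mse_test = np.mean((true_f(x_te) - f_hat(x_te)) ** 2)

# Importance weights w(x) = p_test(x) / p_train(x): the weighted MSE on
# training inputs estimates the MSE under the test input distribution.
w = gauss_pdf(x_tr, 1.0, 1.0) / gauss_pdf(x_tr, 0.0, 1.0)
mse_weighted = np.mean(w * (true_f(x_tr) - f_hat(x_tr)) ** 2)

print(mse_train, mse_test, mse_weighted)  # weighted estimate tracks the test MSE
```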

Expanding the inner expectation around $E[y \mid x]$, the cross terms drop out: $E_{y \mid x}[(y - f(x))^2 \mid x] = E_{y \mid x}[(y - E[y \mid x])^2 \mid x] + (E[y \mid x] - f(x))^2$.
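A quick Monte Carlo check of this decomposition at a single $x$ (assuming, for the sketch, a Gaussian $y \mid x$ and an arbitrary prediction $f(x)$):

```python
import numpy as np

rng = np.random.default_rng(5)

# Fix a particular x; here y | x ~ N(mu, s^2), so E[y|x] = mu = 1.5.
mu, s = 1.5, 0.4
y = rng.normal(mu, s, 1_000_000)

f_x = 2.0   # some arbitrary prediction f(x)

lhs = np.mean((y - f_x) ** 2)                    # E[(y - f(x))^2 | x]
rhs = np.mean((y - mu) ** 2) + (mu - f_x) ** 2   # noise term + squared distance

print(lhs, rhs)  # agree up to Monte Carlo error
```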

The first term, $E_{y \mid x}[(y - E[y \mid x])^2 \mid x]$, is the intrinsic variation in $y$ at a particular $x$. We have no control over this. We can then work to minimize the second term, $(E[y \mid x] - f(x))^2$, by selecting $f$.

The lower bound is achieved when we have the optimal predictor $f^*(x) = E[y \mid x]$. In this case, the second term is zero.

In theory, we just set $f(x) = E[y \mid x]$ and we have the optimal predictor. In practice, we have to learn $f$ from data and are limited by the Bias-Variance Tradeoff.

- bias: our model class might not be able to exactly approximate the unknown $E[y \mid x]$.
- variance: even if it can, we only have finite data to learn $f$.
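A minimal simulation of the tradeoff (the target function, noise level, and polynomial degrees are made-up choices): across many resampled training sets, a constant fit has high bias and low variance at a query point, while a high-degree polynomial has low bias and high variance.

```python
import numpy as np

rng = np.random.default_rng(4)

def true_f(x):
    return np.sin(2 * np.pi * x)

x0 = 0.25                    # query point; true_f(x0) = 1
n, trials = 20, 500

def fit_and_predict(degree):
    """Fit a degree-`degree` polynomial to a fresh noisy dataset; predict at x0."""
    preds = np.empty(trials)
    for t in range(trials):
        x = rng.uniform(0, 1, n)
        y = true_f(x) + rng.normal(0, 0.3, n)
        coefs = np.polyfit(x, y, degree)
        preds[t] = np.polyval(coefs, x0)
    return preds

results = {}
for degree in (0, 9):
    preds = fit_and_predict(degree)
    results[degree] = ((preds.mean() - true_f(x0)) ** 2, preds.var())
    print(degree, results[degree])
# Degree 0 (a constant fit): high bias, low variance.
# Degree 9: low bias, much higher variance.
```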