# Michael Data

Given some observation features, we want to make predictions about another variable. When the target variable is discrete, this is a classification problem; when the target variable is continuous, it is a regression problem.

Want to predict the $y$s based on some observed $x$s. Learn a model $y \approx f(x; \theta)$,

Linear models with squared error are historically well-used.

Includes Linear Regression and Logistic Regression.

##### Linear Models

Generally, a model is called linear if it is linear in the parameters $\theta$; such a model can still be non-linear in the $x$s.

e.g. $f(x;\theta) = \theta_0 + \theta_1 x + \theta_2 x^2$ is non-linear in $x$, but equals $\theta^T \phi(x)$ with basis functions $\phi(x) = (1, x, x^2)$, so it is linear in $\theta$.
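As a sketch (with coefficients assumed purely for illustration), such a model can be fit by ordinary least squares on the basis-expanded design matrix:

```python
import numpy as np

# Sketch: f(x; theta) = theta0 + theta1*x + theta2*x^2 is non-linear in x
# but linear in theta, so it can be fit by ordinary least squares on the
# basis-expanded design matrix. The coefficients are assumed for illustration.
rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, size=100)
theta_true = np.array([1.0, -2.0, 3.0])
y = theta_true[0] + theta_true[1] * x + theta_true[2] * x**2

# Design matrix of basis functions phi(x) = (1, x, x^2)
X = np.column_stack([np.ones_like(x), x, x**2])
theta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
print(theta_hat)  # recovers [1, -2, 3] (data is noiseless)
```

Because the model is linear in $\theta$, no iterative optimization is needed here.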

##### Non-Linear Models

If $f(x;\theta)$ is non-linear in $\theta$, setting the gradient of the error to zero gives a set of non-linear equations to solve. Gradient techniques are often used here to find the $\theta$ that minimizes mean-squared error.
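A minimal sketch of this, assuming an illustrative model $f(x; a, b) = a e^{bx}$ (not from the notes) that is non-linear in its parameters, with plain gradient descent on the MSE:

```python
import numpy as np

# Sketch: f(x; a, b) = a * exp(b * x) is non-linear in its parameters
# (an assumed example, not from the notes), so the least-squares equations
# are non-linear and we minimize MSE by plain gradient descent instead.
rng = np.random.default_rng(1)
x = rng.uniform(0, 1, size=200)
y = 2.0 * np.exp(0.5 * x)      # noiseless data generated with a=2, b=0.5

a, b = 1.0, 0.0                # initial guess
lr = 0.1                       # step size, chosen small enough to be stable
for _ in range(5000):
    f = a * np.exp(b * x)
    r = f - y                  # residuals
    # Gradients of MSE = mean(r^2) with respect to a and b
    grad_a = 2 * np.mean(r * np.exp(b * x))
    grad_b = 2 * np.mean(r * a * x * np.exp(b * x))
    a -= lr * grad_a
    b -= lr * grad_b

print(a, b)  # approaches (2.0, 0.5)
```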

##### Probabilistic Interpretation of Regression

There is usually an implicit assumption that the training observations $(x_i, y_i)$ are IID samples from an underlying distribution $P(x, y)$. So the underlying problem contains a joint distribution $P(x, y) = P(y \mid x) P(x)$.

There are two types of variation to account for in data:

* measurement noise
* unobserved variables

There are two sources of variability in the regression function:

* variability in $y$ for a given $x$
* distribution of the input data in input space

Typical modeling framework:

* $y = f(x;\theta) + e$ is observed
* $f(x;\theta)$ is the systematic or predicted part, learned with parameters $\theta$
* $e$ is a zero-mean noise term
  * considered unpredictable
  * referred to as the error term

#### Conditional Likelihood for Regression

Assuming that the $y_i$s are conditionally independent given the $x_i$s and $\theta$, the conditional likelihood is $L(\theta) = \prod_i P(y_i \mid x_i, \theta)$.

Say we assume a Gaussian noise model: $y \sim N(f(x;\theta), \sigma^2)$. Here $f(x;\theta)$ is the mean of the Gaussian and can be any linear or non-linear function with parameters $\theta$. The variance $\sigma^2$ could also be an unknown parameter, but for now assume it is known.

* $y$ is the observed value, i.e. $f(x;\theta)$ plus noise
* $f(x;\theta)$ is the model's prediction for some $x$

Maximizing likelihood is the same as minimizing MSE.
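To see this, take the log of the conditional likelihood under the Gaussian noise model:

```latex
\log L(\theta)
  = \sum_{i=1}^{n} \log N\!\big(y_i \mid f(x_i;\theta), \sigma^2\big)
  = -\frac{1}{2\sigma^2} \sum_{i=1}^{n} \big(y_i - f(x_i;\theta)\big)^2
    - \frac{n}{2}\log\big(2\pi\sigma^2\big)
```

With $\sigma^2$ fixed, the second term is constant, so maximizing $\log L(\theta)$ over $\theta$ is exactly minimizing the sum of squared errors.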

$\sigma^2$ may itself be modeled as a function of $x$. One may also use non-Gaussian noise models, non-IID models, or a Bayesian approach with priors on the parameters $\theta$. All of these go beyond minimizing MSE.

#### Bayesian Interpretation of Regression

Writing the negative log posterior as $-\log P(\theta \mid D) = -\log P(D \mid \theta) - \log P(\theta) + \text{const}$: the first term is goodness of fit, and the second is the regularization term. The MAP estimate $\theta_{MAP}$ is the minimizer of this expression.

Where $D = \{(x_i, y_i)\}_{i=1}^n$ is the training data, we want to learn the conditional $P(y \mid x, \theta)$.

So there is no modeling of $P(x)$.

Gaussian error model for the conditional likelihood: $P(y_i \mid x_i, \theta) = N\big(f(x_i;\theta), \sigma^2\big)$.
This models the $y_i$s as conditionally independent given the $x_i$s and $\theta$.

It is common to place independent priors on the $\theta_j$s: $P(\theta) = \prod_j P(\theta_j)$.

A conjugate prior for $\theta$ under the Gaussian likelihood is another Gaussian.

$\theta_{MAP}$ is the maximizer of the posterior $P(\theta \mid D)$, which is typically found with gradient techniques.
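For the special case of a model that is linear in $\theta$, a Gaussian prior makes the MAP estimate available in closed form (ridge regression). A sketch, with the noise and prior variances `sigma2` and `tau2` assumed for illustration:

```python
import numpy as np

# Sketch: for a model linear in theta with Gaussian noise (variance sigma2)
# and an independent zero-mean Gaussian prior on each theta_j (variance
# tau2), the MAP estimate has the ridge-regression closed form
#   theta_MAP = (X^T X + lam * I)^{-1} X^T y,   lam = sigma2 / tau2.
# The names sigma2, tau2, lam are illustrative, not from the notes.
rng = np.random.default_rng(2)
X = rng.normal(size=(50, 3))
theta_true = np.array([1.0, 0.0, -1.0])
sigma2, tau2 = 0.25, 1.0
y = X @ theta_true + rng.normal(scale=np.sqrt(sigma2), size=50)

lam = sigma2 / tau2
theta_map = np.linalg.solve(X.T @ X + lam * np.eye(3), X.T @ y)
print(theta_map)  # shrunk toward zero relative to the least-squares fit
```

A stronger prior (smaller `tau2`) increases `lam` and pulls the estimate further toward zero.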

A less common prior is the Laplacian, which corresponds to an absolute-value (L1) penalty on the parameters. This is known as the [Lasso method](http://en.wikipedia.org/wiki/Lasso_(statistics)#Lasso_method).

##### Minimizing MSE

Minimizing a training error such as $\frac{1}{n}\sum_{i=1}^n \big(y_i - f(x_i)\big)^2$, with the intent to use the fitted model for later predictions, assumes that the $x$s and $y$s are random samples from a fixed underlying distribution $P(x, y)$. This assumption should be kept in mind.

$E_{P(x,y)}\big[(y - f(x))^2\big]$ = average theoretical squared error using $f$ as a predictor, with respect to $P(x, y)$.

This is true as long as the $(x, y)$ pairs are samples from $P(x, y)$. Ideally, we would like to find the $f$ that minimizes this.

Can rewrite as $E_x\big[\,E_{y|x}\big[(y - f(x))^2 \mid x\big]\big]$.

* $y$ is a random variable
* $f(x)$ is our deterministic prediction of $y$ given $x$

$E_x\big[w(x)\, E_{y|x}\big[(y - f(x))^2 \mid x\big]\big]$ = weighted MSE with respect to $P(x)$, for some weighting function $w(x)$. This is relevant in practical problems where $P(x)$ differs between the training and test data. Changes in $P(x)$ are sometimes called covariate shift.
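One common response to covariate shift is importance weighting: reweight each training point by the density ratio of the test and training input distributions, then minimize the weighted MSE. A sketch, assuming both input densities are known Gaussians (an illustration, not part of the notes):

```python
import numpy as np

# Sketch of importance weighting under covariate shift: weight each
# training point by w(x) = P_test(x) / P_train(x) and minimize weighted
# MSE. Both input densities are assumed known Gaussians here, purely
# for illustration.
def normal_pdf(x, mu, sd):
    return np.exp(-0.5 * ((x - mu) / sd) ** 2) / (sd * np.sqrt(2 * np.pi))

rng = np.random.default_rng(3)
x_train = rng.normal(0.0, 1.0, size=500)   # P_train(x) = N(0, 1)
y_train = np.sin(x_train) + rng.normal(scale=0.1, size=500)

# Importance weights for a shifted test input distribution P_test(x) = N(1, 1)
w = normal_pdf(x_train, 1.0, 1.0) / normal_pdf(x_train, 0.0, 1.0)

# Weighted least squares for a linear fit f(x) = theta0 + theta1 * x
X = np.column_stack([np.ones_like(x_train), x_train])
theta = np.linalg.solve(X.T @ (w[:, None] * X), X.T @ (w * y_train))
print(theta)  # linear fit emphasizing the region where P_test puts mass
```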

#### Theoretical Properties of Minimizing MSE

$E_{y|x}\big[(y - f(x))^2 \mid x\big] = E_{y|x}\big[(y - E[y \mid x])^2 \mid x\big] + \big(E[y \mid x] - f(x)\big)^2$, where the cross terms drop out (since $E_{y|x}\big[y - E[y \mid x]\big] = 0$).

The first term is $\sigma^2_{y|x}$, the variation in $y$ at a particular $x$. We have no control over this. We can then work to minimize the second term by selecting $f$.

The lower bound is achieved when we have the optimal predictor $f(x) = E[y \mid x]$. In this case, the expected squared error is $E_x\big[\sigma^2_{y|x}\big]$.

In theory, we just set $f(x) = E[y \mid x]$ and we have the optimal predictor. In practice, we have to learn $f$ from data and are limited by the Bias-Variance Tradeoff.

* bias: $f(x;\theta)$ might not be able to exactly approximate the unknown $E[y \mid x]$.
* variance: even if it can, we only have finite data to learn $\theta$ from.
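The optimality of $f(x) = E[y \mid x]$ above can be checked numerically; a sketch with an assumed data-generating process $y = x^2 + \varepsilon$:

```python
import numpy as np

# Sketch: check numerically that f(x) = E[y|x] attains the lower bound
# E_x[sigma^2_{y|x}]. Data-generating process assumed for illustration:
# y = x^2 + noise, so E[y|x] = x^2 and sigma^2_{y|x} = 0.25 everywhere.
rng = np.random.default_rng(4)
sigma2 = 0.25
x = rng.uniform(-1, 1, size=100_000)
y = x**2 + rng.normal(scale=np.sqrt(sigma2), size=x.size)

mse_optimal = np.mean((y - x**2) ** 2)   # predictor f(x) = E[y|x]
mse_other = np.mean((y - 0.5 * x) ** 2)  # some other predictor
print(mse_optimal, mse_other)  # mse_optimal is close to sigma2 = 0.25
```

Any predictor other than $E[y \mid x]$ pays an extra squared-bias term on top of the irreducible noise variance.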