Given some observation features, want to make predictions on another variable. When the target variable is discrete, this is a form of classification. When the target variable is continuous, it is a **regression** problem.

$D = \{(\underline{x_i},\underline{y_i})\}$
$\underline{x} = \begin{pmatrix} x_1

\vdots

x_d \end{pmatrix} $

Want to predict the $y$s based on some $x$s. Learn a model $f(\underline{x} ; \underline{\theta})$, $\underline{\theta} = \begin{pmatrix} \theta_1

\vdots

\theta_d \end{pmatrix} $

Linear models with squared error are historically well-used.

Includes Linear Regression and Logistic Regression.

Generally, **linear models** which have linear complexity in the number of parameters as data increases, can be non-linear in the $x$s.

e.g. $ f(x;\theta) = \theta_1 x_1 + \theta_2 x_2 + \theta_3 x_1 x_2 $ is non-linear in $x$, but equals $\underline{\theta}^T \underline{x}'$

If $f(x;\theta)$ is non-linear in $\theta$, you get a set of non-linear equations to solve. Gradient techniques are often used here to find the $\theta$ that minimizes mean-squared error.

There is usually an implicit assumption that the training observations are IID samples from an underlying distribution $p(x,y)$. So the underlying problem contains a joint distribution $p(x,y) = p(y|x)p(x)$.

There are two types of variation to account for in data:

- * measurement noise
- * unobserved variables

There are two sources of variability in the regression function:

- * $p(y|x)$ variability in $y$ for a given $x$
- * $p(x)$ distribution of input data in input space

Typical modeling framework: $y_x = E[y|x] + e$

- * $y_x$ is observed
- * $y|x$ is the systematic or predicted part learned with $f(x;\theta)$
- * $e$ is a zero-mean noise
- considered unpredictable
- referred to as the
**error term**

Assuming that $y_i$s are conditionally independent given $\theta$ and $x$, conditional likelihood: <latex>\begin{align*}

- L(\theta) &= p(D_y|D_x , \theta)

&= \prod{i=1}^N p(y_i|x_i,\underline{\theta})
\end{align*}</latex>

Say we assume a Gaussian noise model: $p(y|x,\theta) = \mathcal{N}(f(x;\theta),\sigma^2)$. $f(x;\theta)$ is $E[y|x]$ and can be any linear or non-linear function with parameters $\theta$. $\sigma^2$ could also be an unknown parameter, but for now assume it is known.

$p(y|x,\underline{\theta}) = \frac{1}{\sqrt{2\pi \sigma^2}} e^{-\frac{1}{2\sigma^2} \left( y_i - f(x_i;\theta) \right)^2} $

- * $y_i$ is the observed $y$ with noise
- * $f(x_i;\theta)$ is the model's predition for some $\theta$

Maximizing likelihood is the same as minimizing MSE. <latex>\begin{align*}

- \log L(\theta) &= \sum_{i=1}^N log p(y_i|x_i,\underline{\theta})

&= \sum{i=1}^N \frac{1}{\sqrt{2\pi \sigma^2}} e^{-\frac{1}{2\sigma^2} \left( y_i - f(x_i;\theta) \right)^2}

&\propto -\text{MSE}(\theta)
\end{align*}</latex>

$\sigma^2$ may be modeled as a function of $x$. May use non-Gaussian noise models, non-IID models, or Bayesian with priors on the parameters $\theta$. This goes beyond minimizing MSE.

<latex>\begin{align*}

- p(D_y|D_x , \theta) &= \prod{i=1}^N p(y_i|x_i,\underline{\theta})

p(\theta_j &= \mathcal{N}(0,\sigma^2) & p(\theta) = \prod_{d=1}^D p(\theta_j)

&\rightarrow -\log p(\theta|D_x,D_y)

&\propto -\log p(D_y|D_x,\theta) - \log p(\theta)

&= \frac{1}{2 \sigma^2} \sum_{i=1}^N (y_i - f(x_i,\theta)^2 + \frac{1}{\sigma_\theta^2} \sum_{j=1}^d \theta_j^2

&= \frac{1}{2 \sigma^2} \text{MSE}(\theta) + \frac{1}{2 \sigma^2} \sum_{j=1}^d \theta_j^2
\end{align*}</latex>
The first term is **goodness of fit**. The second is the **regularization** term. $\hat \theta_{MAP}$ is the minimization of this expression.

Where $\theta = y|x$, we want to learn <latex>\begin{align*}

- P(\theta|D_x,D_y) &\propto p(D_x,D_y|\theta) p(\theta)

&= p(D_y|D_x,\theta)p(D_x|\theta)p(\theta)

&\propto p(D_y|D_x,\theta)p(theta)
\end{align*}</latex>

So there is no modeling of $p(D_x)$.

**Gaussian error model for conditional likelihood**: $L(\theta) = p(D_y|D_x,\theta) \propto \prod_{i=1}^N e^{-1/2\left(\frac{y_i-f(x_i;\theta)}{\sigma}}\right)^2$.
This models the $y_i$s as conditionally independent given $x_i$s and $\theta$.
$$p(y_i|x_i) = \mathcal{N}(f(x_i;\theta),\sigma^2)$$
It is common to have independent priors on the $\theta_j$s: $p(\underline{\theta}) = \prod_{j=1}^d p(\theta_j)$

A conjugate prior for $\theta_j$ is another Gaussian. $p(\underline{\theta}) = \prod_{j=1}^d \mathcal{N}(0,\sigma_0^2)$

<latex>\begin{align*}

- \log p(\theta|D_x,D_y) &\propto \log p(D_y|D_x,\underline{\theta}) + \log p(\theta)

&\propto - \frac{1}{2\sigma^2} \sum_{i=1}^N \left(y_i-f(x_i,\underline{\theta})\right)^2 - \frac{1}{2\sigma_0^2} \sum_{j=1}^d \theta_j^2

- \log p(\theta|D_x,D_y) &\propto \sum_{i=1}^N \left(y_i-f(x_i,\underline{\theta})\right)^2 + \lambda \sum_{j=1}^d \theta_j^2

&= \text{squared error\ } + \lambda(\text{sum of squared weights}) & \text{where\ } \lambda=\frac{\sigma^2}{\sigma_0^2}
\end{align*}</latex>
$\hat \theta_{MAP}$ is the maximization of this expression, which is typically done with gradient techniques.

A less common prior is Laplacian, which corresponds to absolute error. This is known as the [http://en.wikipedia.org/wiki/Lasso_(statistics)#Lasso_method Lasso method].

$$\min MSE(\theta) = \min_{\theta} \frac{1}{N}\left(y_i - f(x_i;\theta)\right)^2$$ Minimizing something like this on training data with intent to use it on predictions later assumes that the $x_i$s and $y_i$s are random samples from a fixed underlying distribution $p(x,y)=p(y|x)p(x)$. This assumption should be kept in mind.

$$\lim_{N \to \infty} MSE(\theta) = \int \int \left(y - f(x;\theta)\right)^2 p(x,y) dx dy $$ $ = E_{p(x,y)} \left[ \left(y - f(x;\theta)\right)^2 \right] $ = averate theoretical squared error using f as a predictor, with respect to $p(x,y)$.

This is true as long as $x_i,y_i$ pairs are samples from $p(x,y)$. Ideally, we would like to fin $f,\theta$ to minimize this.

Can rewrite as $\int \left[ \int \left(y_x - f(x_i;\theta)\right)^2 p(y|x) dy \right] p(x) dx $.

- * $y_x$ is a random variable
- * $f(x_i;\theta)$ is our deterministic prediction of y given x

$= \int MSE_x p(x) dx$
= weighted MSE with respect to $p(x)$. This is relevant in practical problems where $p(x)$ is different from training and test data. Changes in $p(x)$ are sometimes called **covariate shift**.

$$ MSE_x = \int \left(y_x - f(x_i;\theta)\right)^2 p(y|x) dy $$ $$ = \int \left(y_x - E[y_x] + E[y_x] - f(x;\theta) \right)^2 p(y|x) dy $$ cross terms drop out $$ = \int \left(y_x - E[y_x] \right)^2 p(y|x) dy + \int \left[E[y_x] - f(x;\theta) \right]^2 p(y|x) dy $$ The first term is $\sigma_{y|x}^2$, variation in y at a particular x. We have no control over this. We can then work to minimize the second term by selecting $\theta$. $$ = \sigma_{y|x}^2 + \int \left[E[y_x] - f(x;\theta) \right]^2 p(y|x) dy $$ The lower bound is achieved when we have the optimal predictor $f(x;\theta) = E[y_x]$. In this case, $MSE_x \geq \sigma_{y|x}^2$.

In theory, we just set $f(x;\theta) = E[y_x]$ and we have the optimal predictor. In practice, we have to learn $E[y_x]$ and are limited by the Bias-Variance Tradeoff.

- * bias: $f(x;\theta)$ might not be able to exactly approximate the unknown $E[y_x]$.
- * variance: even if it can, we only have finite data to learn $E[y_x]$.