
Given some observed features, we want to predict another variable. When the target variable is discrete, this is a classification problem; when it is continuous, it is a regression problem.

$D = \{(\underline{x}_i, y_i)\}_{i=1}^N$
$\underline{x} = \begin{pmatrix} x_1 \\ \vdots \\ x_d \end{pmatrix} $

Want to predict the $y$s based on some $x$s. Learn a model $f(\underline{x} ; \underline{\theta})$, $\underline{\theta} = \begin{pmatrix} \theta_1 \\ \vdots \\ \theta_d \end{pmatrix} $

Linear models with squared error are the classical and most widely used choice.

Includes Linear Regression and Logistic Regression.

Linear Models

Generally, a linear model is one that is linear in the parameters $\underline{\theta}$; it can still be non-linear in the $x$s.

e.g. $ f(x;\theta) = \theta_1 x_1 + \theta_2 x_2 + \theta_3 x_1 x_2 $ is non-linear in $\underline{x}$, but equals $\underline{\theta}^T \underline{x}'$ for the expanded feature vector $\underline{x}' = (x_1, x_2, x_1 x_2)^T$.
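As a quick sketch (in Python with numpy, on made-up data), fitting such a model is still an ordinary least-squares problem once the expanded feature vector $\underline{x}'$ is built; the specific numbers here are only for illustration:

<pre>
import numpy as np

# Model theta1*x1 + theta2*x2 + theta3*x1*x2: non-linear in x, linear in theta,
# so ordinary least squares applies after building x' = (x1, x2, x1*x2).
rng = np.random.default_rng(0)
N = 200
X = rng.uniform(-1, 1, size=(N, 2))                  # raw inputs (x1, x2)
true_theta = np.array([1.0, -2.0, 3.0])              # assumed ground truth for the demo
X_expanded = np.column_stack([X[:, 0], X[:, 1], X[:, 0] * X[:, 1]])
y = X_expanded @ true_theta + rng.normal(0, 0.1, N)  # noisy observations

theta_hat, *_ = np.linalg.lstsq(X_expanded, y, rcond=None)   # least-squares estimate
print(theta_hat)   # close to [1, -2, 3]
</pre>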

Non-Linear Models

If $f(x;\theta)$ is non-linear in $\theta$, you get a set of non-linear equations to solve. Gradient techniques are often used here to find the $\theta$ that minimizes mean-squared error.
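A minimal sketch of this, assuming a toy model $f(x;\theta) = \theta_1 e^{\theta_2 x}$ and synthetic data, is plain gradient descent on the MSE with hand-derived gradients:

<pre>
import numpy as np

# Gradient descent on MSE for a model that is non-linear in theta:
# f(x; theta) = theta[0] * exp(theta[1] * x).  Model and data are invented
# for illustration; there are no closed-form normal equations here.
rng = np.random.default_rng(1)
x = rng.uniform(0, 2, 100)
y = 1.5 * np.exp(0.8 * x) + rng.normal(0, 0.2, 100)   # assumed ground truth (1.5, 0.8)

theta = np.array([1.0, 0.5])       # initial guess
lr = 0.01                          # step size
for _ in range(5000):
    f = theta[0] * np.exp(theta[1] * x)
    resid = y - f
    # Gradient of MSE(theta) = mean((y - f)^2) with respect to each parameter.
    grad = np.array([
        -2.0 * np.mean(resid * np.exp(theta[1] * x)),
        -2.0 * np.mean(resid * theta[0] * x * np.exp(theta[1] * x)),
    ])
    theta -= lr * grad
print(theta)   # approaches (1.5, 0.8)
</pre>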

Probabilistic Interpretation of Regression

There is usually an implicit assumption that the training observations are IID samples from an underlying distribution $p(x,y)$. So the underlying problem contains a joint distribution $p(x,y) = p(y|x)p(x)$.

There are two types of variation to account for in data:

* measurement noise
* unobserved variables

There are two sources of variability in the regression function:

* $p(y|x)$: variability in $y$ for a given $x$
* $p(x)$: the distribution of the input data in input space

Typical modeling framework: $y_x = E[y|x] + e$

* $y_x$ is the observed value
* $E[y|x]$ is the systematic or predicted part, learned with $f(x;\theta)$
* $e$ is zero-mean noise, considered unpredictable and referred to as the error term (simulated in the sketch below)
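A toy simulation of this framework at a single fixed $x$, with an assumed $E[y|x] = 2x + 1$ and Gaussian noise (both chosen only for the demo):

<pre>
import numpy as np

# y_x = E[y|x] + e: for a fixed x, repeated observations scatter around the
# systematic part E[y|x] with zero-mean, unpredictable noise.
rng = np.random.default_rng(2)
x = 1.3
E_y_given_x = 2 * x + 1                        # systematic / predictable part
e = rng.normal(0, 0.5, 10000)                  # zero-mean error term
y_x = E_y_given_x + e                          # observed values at this x
print(y_x.mean(), E_y_given_x)                 # sample mean is close to E[y|x]
</pre>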

Conditional Likelihood for Regression

Assuming that the $y_i$s are conditionally independent given $\underline{\theta}$ and the $x_i$s, the conditional likelihood is:
\begin{align*}
    L(\theta) &= p(D_y|D_x , \underline{\theta})
\\ &= \prod_{i=1}^N p(y_i|x_i,\underline{\theta})
\end{align*}

Say we assume a Gaussian noise model: $p(y|x,\theta) = \mathcal{N}(f(x;\theta),\sigma^2)$. $f(x;\theta)$ is $E[y|x]$ and can be any linear or non-linear function with parameters $\theta$. $\sigma^2$ could also be an unknown parameter, but for now assume it is known.

$p(y_i|x_i,\underline{\theta}) = \frac{1}{\sqrt{2\pi \sigma^2}} e^{-\frac{1}{2\sigma^2} \left( y_i - f(x_i;\theta) \right)^2} $

* $y_i$ is the observed $y$, including noise
* $f(x_i;\theta)$ is the model's prediction for a given $\theta$

Maximizing likelihood is the same as minimizing MSE.
\begin{align*}
    \log L(\theta) &= \sum_{i=1}^N \log p(y_i|x_i,\underline{\theta})
\\ &= \sum_{i=1}^N \left[ -\tfrac{1}{2}\log(2\pi \sigma^2) - \frac{1}{2\sigma^2} \left( y_i - f(x_i;\theta) \right)^2 \right]
\\ &\propto -\text{MSE}(\theta)
\end{align*}
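A quick numerical sanity check of this equivalence, on synthetic data with an assumed model $f(x;\theta) = \theta x$ and known $\sigma$:

<pre>
import numpy as np

# Under Gaussian noise with known sigma, the theta that maximizes the
# conditional log-likelihood is the theta that minimizes MSE.
rng = np.random.default_rng(3)
x = rng.uniform(-1, 1, 50)
y = 2.0 * x + rng.normal(0, 0.3, 50)   # assumed true slope 2.0
sigma = 0.3

thetas = np.linspace(0, 4, 401)
mse = np.array([np.mean((y - t * x) ** 2) for t in thetas])
loglik = np.array([
    np.sum(-0.5 * np.log(2 * np.pi * sigma**2) - (y - t * x) ** 2 / (2 * sigma**2))
    for t in thetas
])
print(thetas[np.argmin(mse)], thetas[np.argmax(loglik)])   # the same theta
</pre>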

$\sigma^2$ may be modeled as a function of $x$. May use non-Gaussian noise models, non-IID models, or Bayesian with priors on the parameters $\theta$. This goes beyond minimizing MSE.

Bayesian Interpretation of Regression

\begin{align*}
    p(D_y|D_x , \underline{\theta}) &= \prod_{i=1}^N p(y_i|x_i,\underline{\theta})
\\ p(\theta_j) &= \mathcal{N}(0,\sigma_\theta^2) & p(\underline{\theta}) &= \prod_{j=1}^d p(\theta_j)
\\ -\log p(\underline{\theta}|D_x,D_y) &\propto -\log p(D_y|D_x,\underline{\theta}) - \log p(\underline{\theta})
\\ &\propto \frac{1}{2 \sigma^2} \sum_{i=1}^N \left(y_i - f(x_i;\underline{\theta})\right)^2 + \frac{1}{2 \sigma_\theta^2} \sum_{j=1}^d \theta_j^2
\\ &= \frac{N}{2 \sigma^2} \text{MSE}(\theta) + \frac{1}{2 \sigma_\theta^2} \sum_{j=1}^d \theta_j^2
\end{align*}
The first term measures goodness of fit; the second is a regularization term. $\hat \theta_{MAP}$ is the minimizer of this expression.

With $\underline{\theta}$ parameterizing $p(y|x)$, we want to learn
\begin{align*}
    p(\underline{\theta}|D_x,D_y) &\propto p(D_x,D_y|\underline{\theta})\, p(\underline{\theta})
\\ &= p(D_y|D_x,\underline{\theta})\, p(D_x|\underline{\theta})\, p(\underline{\theta})
\\ &\propto p(D_y|D_x,\underline{\theta})\, p(\underline{\theta})
\end{align*}

So $p(D_x)$ does not need to be modeled; it does not depend on $\underline{\theta}$ and drops out as a constant.

Gaussian error model for the conditional likelihood: $L(\theta) = p(D_y|D_x,\theta) \propto \prod_{i=1}^N e^{-\frac{1}{2}\left(\frac{y_i-f(x_i;\theta)}{\sigma}\right)^2}$.
This models the $y_i$s as conditionally independent given $x_i$s and $\theta$.

$$p(y_i|x_i) = \mathcal{N}(f(x_i;\theta),\sigma^2)$$

It is common to have independent priors on the $\theta_j$s: $p(\underline{\theta}) = \prod_{j=1}^d p(\theta_j)$

A conjugate prior for $\theta_j$ is another Gaussian. $p(\underline{\theta}) = \prod_{j=1}^d \mathcal{N}(0,\sigma_0^2)$

\begin{align*}
    \log p(\underline{\theta}|D_x,D_y) &\propto \log p(D_y|D_x,\underline{\theta}) + \log p(\underline{\theta})
\\ &\propto - \frac{1}{2\sigma^2} \sum_{i=1}^N \left(y_i-f(x_i;\underline{\theta})\right)^2 - \frac{1}{2\sigma_0^2} \sum_{j=1}^d \theta_j^2
\\ - \log p(\underline{\theta}|D_x,D_y) &\propto \sum_{i=1}^N \left(y_i-f(x_i;\underline{\theta})\right)^2 + \lambda \sum_{j=1}^d \theta_j^2
\\ &= \text{squared error} + \lambda\,(\text{sum of squared weights}) & \text{where } \lambda=\frac{\sigma^2}{\sigma_0^2}
\end{align*}
$\hat \theta_{MAP}$ is the minimizer of this penalized squared error, which is typically found with gradient techniques.
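When $f$ is linear in $\underline{\theta}$, this minimizer also has a closed form (ridge regression). A sketch on synthetic data, with the noise and prior variances assumed known:

<pre>
import numpy as np

# MAP estimate for a linear model with Gaussian noise (variance sigma^2) and an
# independent N(0, sigma_0^2) prior on each theta_j.  Minimizing
# sum (y_i - x_i^T theta)^2 + lambda * ||theta||^2 with lambda = sigma^2/sigma_0^2
# gives the ridge-regression solution below.
rng = np.random.default_rng(4)
N, d = 50, 3
X = rng.normal(size=(N, d))
theta_true = np.array([1.0, 0.0, -0.5])     # assumed ground truth for the demo
sigma, sigma_0 = 0.5, 1.0
y = X @ theta_true + rng.normal(0, sigma, N)

lam = sigma**2 / sigma_0**2
theta_map = np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)
print(theta_map)   # shrunk toward zero relative to ordinary least squares
</pre>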

A less common prior is the Laplacian, which leads to a penalty on the sum of absolute weights. This is known as the [http://en.wikipedia.org/wiki/Lasso_(statistics)#Lasso_method Lasso method].
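A minimal coordinate-descent sketch of the Lasso objective (squared error plus $\lambda$ times the sum of absolute weights); the helper names and settings are just for illustration:

<pre>
import numpy as np

# Coordinate descent for (1/2) * ||y - X theta||^2 + lam * sum_j |theta_j|,
# the MAP objective under a Laplacian prior on the theta_j.
def soft_threshold(rho, lam):
    return np.sign(rho) * max(abs(rho) - lam, 0.0)

def lasso_cd(X, y, lam, n_iters=200):
    n, d = X.shape
    theta = np.zeros(d)
    for _ in range(n_iters):
        for j in range(d):
            # Partial residual with feature j's current contribution removed.
            r_j = y - X @ theta + X[:, j] * theta[j]
            rho = X[:, j] @ r_j
            theta[j] = soft_threshold(rho, lam) / (X[:, j] @ X[:, j])
    return theta

rng = np.random.default_rng(5)
X = rng.normal(size=(100, 5))
y = X @ np.array([2.0, 0.0, 0.0, -1.0, 0.0]) + rng.normal(0, 0.1, 100)
print(lasso_cd(X, y, lam=5.0))   # small coefficients are driven exactly to zero
</pre>

Unlike the Gaussian prior, the Laplacian prior tends to produce exact zeros, so it also performs variable selection.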

Minimizing MSE

$$\min_\theta \text{MSE}(\theta) = \min_{\theta} \frac{1}{N} \sum_{i=1}^N \left(y_i - f(x_i;\theta)\right)^2$$

Minimizing something like this on training data with intent to use it on predictions later assumes that the $x_i$s and $y_i$s are random samples from a fixed underlying distribution $p(x,y)=p(y|x)p(x)$. This assumption should be kept in mind.

$$\lim_{N \to \infty} MSE(\theta) = \int \int \left(y - f(x;\theta)\right)^2 p(x,y) dx dy $$

$ = E_{p(x,y)} \left[ \left(y - f(x;\theta)\right)^2 \right] $
= the average theoretical squared error using $f$ as a predictor, with respect to $p(x,y)$.

This is true as long as the $(x_i,y_i)$ pairs are samples from $p(x,y)$. Ideally, we would like to find $f,\theta$ to minimize this.
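A Monte Carlo sketch of this limit, with an assumed $p(x,y)$ (uniform $x$, $E[y|x]=\sin x$, Gaussian noise) and a fixed, deliberately imperfect predictor:

<pre>
import numpy as np

# Training MSE approaches the expected squared error E[(y - f(x))^2] under
# p(x, y) as N grows, assuming (x_i, y_i) are IID draws from p(x, y).
rng = np.random.default_rng(6)
f = lambda x: 0.9 * x            # a fixed (imperfect) predictor for the demo

for N in [100, 10_000, 1_000_000]:
    x = rng.uniform(-1, 1, N)
    y = np.sin(x) + rng.normal(0, 0.2, N)
    print(N, np.mean((y - f(x)) ** 2))   # stabilizes as N grows
</pre>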

Can rewrite as $\int \left[ \int \left(y_x - f(x;\theta)\right)^2 p(y|x)\, dy \right] p(x)\, dx $.

* $y_x$ is a random variable
* $f(x;\theta)$ is our deterministic prediction of $y$ given $x$

$= \int \text{MSE}_x\, p(x)\, dx$
= the MSE weighted with respect to $p(x)$. This is relevant in practical problems where $p(x)$ differs between the training and test data. Changes in $p(x)$ are sometimes called covariate shift.
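A small illustration, with an assumed true $E[y|x] = x^2$ and a linear fit trained where the training $p(x)$ concentrates on $[0,1]$:

<pre>
import numpy as np

# Covariate shift: the same fitted line gives a very different weighted MSE when
# p(x) changes, because the fit is dominated by the region where the training
# p(x) put its mass.
rng = np.random.default_rng(7)

def make_data(x):
    return x**2 + rng.normal(0, 0.1, len(x))   # true E[y|x] = x^2

x_train = rng.uniform(0, 1, 1000)              # training p(x): mass on [0, 1]
y_train = make_data(x_train)
A = np.column_stack([np.ones_like(x_train), x_train])
theta = np.linalg.lstsq(A, y_train, rcond=None)[0]   # line fit to a curved E[y|x]

for lo, hi, label in [(0, 1, "train-like p(x)"), (2, 3, "shifted p(x)")]:
    x = rng.uniform(lo, hi, 1000)
    y = make_data(x)
    pred = theta[0] + theta[1] * x
    print(label, np.mean((y - pred) ** 2))     # much larger under the shifted p(x)
</pre>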

Theoretical Properties of Minimizing MSE

$$ MSE_x = \int \left(y_x - f(x;\theta)\right)^2 p(y|x) dy $$

$$ = \int \left(y_x - E[y_x] + E[y_x] - f(x;\theta) \right)^2 p(y|x) dy $$

The cross terms drop out because $\int (y_x - E[y_x])\, p(y|x)\, dy = 0$.

$$ = \int \left(y_x - E[y_x] \right)^2 p(y|x) dy + \int \left[E[y_x] - f(x;\theta) \right]^2 p(y|x) dy $$

The first term is $\sigma_{y|x}^2$, the variation in $y$ at a particular $x$. We have no control over this. We can then work to minimize the second term by selecting $\theta$.

$$ = \sigma_{y|x}^2 + \int \left[E[y_x] - f(x;\theta) \right]^2 p(y|x) dy = \sigma_{y|x}^2 + \left(E[y_x] - f(x;\theta)\right)^2 $$
(The second integrand does not depend on $y$, so the integral reduces to the squared difference.)

So $MSE_x \geq \sigma_{y|x}^2$, with the lower bound achieved by the optimal predictor $f(x;\theta) = E[y_x]$.
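A quick Monte Carlo check of this decomposition at a single fixed $x$, with an assumed $E[y|x]$, noise level, and a deliberately biased predictor:

<pre>
import numpy as np

# Check MSE_x = sigma^2_{y|x} + (E[y_x] - f(x))^2 at one fixed x.
# Assumed setup: E[y|x] = x^2, noise sd 0.5, and an offset predictor f(x) = x^2 + 0.3.
rng = np.random.default_rng(9)
x = 1.2
E_y_x, sigma = x**2, 0.5
f_x = x**2 + 0.3                                # our (biased) prediction at this x

y_x = E_y_x + rng.normal(0, sigma, 1_000_000)   # draws from p(y | x)
mse_x = np.mean((y_x - f_x) ** 2)
print(mse_x, sigma**2 + (E_y_x - f_x) ** 2)     # both about 0.25 + 0.09 = 0.34
</pre>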

In theory, we just set $f(x;\theta) = E[y_x]$ and we have the optimal predictor. In practice, we have to learn $E[y_x]$ and are limited by the Bias-Variance Tradeoff.

* bias: $f(x;\theta)$ might not be able to exactly match the unknown $E[y_x]$.
* variance: even if it can, we only have finite data from which to learn $E[y_x]$. (Both effects are illustrated in the sketch below.)
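A small simulation of both effects at a single test point, with an assumed true $E[y|x] = \sin(2\pi x)$ and two model families of different flexibility (a sketch, not a full treatment):

<pre>
import numpy as np

# Bias/variance at a single test point x0.  We repeatedly draw small training
# sets and fit either a straight line (high bias, low variance) or a degree-9
# polynomial (low bias, high variance), then look at the predictions at x0.
rng = np.random.default_rng(8)
x0 = 0.25
true_mean = np.sin(2 * np.pi * x0)   # E[y | x0]

def bias2_and_variance(degree, n_repeats=500, n_train=20):
    preds = []
    for _ in range(n_repeats):
        x = rng.uniform(0, 1, n_train)
        y = np.sin(2 * np.pi * x) + rng.normal(0, 0.3, n_train)
        coefs = np.polyfit(x, y, degree)
        preds.append(np.polyval(coefs, x0))
    preds = np.array(preds)
    return (preds.mean() - true_mean) ** 2, preds.var()

for degree in [1, 9]:
    print(degree, bias2_and_variance(degree))   # (bias^2, variance)
</pre>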