Michael Data

Machine Learning is a class of techniques for learning a model of data, frequently used in data mining and business intelligence.

Learning refers to learning the parameters of a model. It is also referred to as inference in Bayesian methods. Once a model has been learned, it might be used for predictive modeling or exploratory data analysis.

In a generative model, there is also usually a way to “go back” and generate or simulate data by fixing the parameters.

From a Bayesian framework, we would want to estimate something like $P(\theta|D) \propto P(D|\theta) P(\theta)$.
The prior probability over $\theta$ can be controversial, so the likelihood alone is typically used to carry out parameter estimation, either by ignoring the prior or by using a flat one.

Likelihood

Conditional Likelihood

$D = \{(x_i,c_i)\}$ for $i=1\dots N$, $x_i \in \mathbb{R}^d$ and $c_i \in \{0,1\}$.
Can write the likelihood as $\prod_{i=1}^N p(x_i,c_i)$, assuming the data points are conditionally independent given the parameters.

\begin{align*}
L(\theta) &= \prod_{i=1}^N p(c_i|x_i,\theta)\, p(x_i) & \text{assuming CI; but will ignore the $p(x_i)$ term}
\\ L(\theta) &= \prod_{i=1}^N p(c_i|x_i,\theta)
\end{align*}

This is a conditional model, but it hasn't defined the actual probabilities yet. The form most commonly used in machine learning is the logistic function:

$$ p(c_i=1|x_i) = \frac{1}{1 + e^{-(w^T x_i + w_0)}} $$

or equivalently, written componentwise with parameters $\alpha$,

$$ p(c=1|x,\alpha) = \frac{1}{1 + e^{-(\alpha_0 + \sum_{j=1}^d \alpha_j x_j)}}$$

It defines a linear decision boundary. This is logistic regression, an example of a generalized linear model.
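A minimal numeric sketch (the weights and data here are illustrative, not from the notes): evaluating the logistic function shows that a point on the decision boundary $w^T x + w_0 = 0$ gets probability exactly 0.5, while points far to one side saturate toward 0 or 1.

```python
import numpy as np

def logistic(x, w, w0):
    """p(c=1 | x) under a logistic model with weight vector w and bias w0."""
    return 1.0 / (1.0 + np.exp(-(np.dot(w, x) + w0)))

w = np.array([2.0, -1.0])
x_on_boundary = np.array([0.5, 1.0])    # w^T x + w0 = 2*0.5 - 1*1.0 + 0 = 0
print(logistic(x_on_boundary, w, 0.0))  # 0.5, i.e. exactly on the boundary

x_far = np.array([10.0, 0.0])           # w^T x = 20, deep on the c=1 side
print(logistic(x_far, w, 0.0))          # very close to 1
```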

Generative Approach

<Merge with Classification>

Learn a model of the joint distribution $p(x_i,c_i)$ to predict the class label of a new vector $x^*$.
We can compute $p(c=k|x^* , \hat\theta_{ML} , \hat \alpha_{ML})$ using Bayes' rule.

$D = \{(\underline{x}_i,c_i)\}$.
\begin{align*}
L(\theta) = p(D|\theta) &= \prod_{i=1}^N p(x_i|c_i,\theta)\,p(c_i)
\\ &= \prod_{k=1}^K \left[ \prod_{i:\, c_i=k} p(x_i|c_i=k,\theta_k)\,p(c_i=k) \right]
\end{align*}

Key Points

  • Learn a model for how the $x$s are distributed for each class:
      • i.e. $p(x|c=k,\theta_k)$ uses parameters for class $k$
      • requires a distributional/parametric assumption
      • e.g. a multivariate Gaussian model
  • Also have to learn the $p(c=k)$ values, though this is easy.
  • The likelihood decomposes into $K$ separate optimization problems if the $\theta_k$s are unrelated.
  • This approach is theoretically optimal if:
      • the distributional assumptions are correct
      • we can learn the true/optimal parameters
  • Predict using Bayes' rule: $P(c=k|x,\theta) \propto p(x|c=k,\theta_k)\,p(c=k)$

Gaussian Example

$p(x|c=k,\theta_k) = \mathcal{N}(\mu_k,\Sigma_k)$
Need to learn $K$ sets of parameters $\mu_k,\Sigma_k$ for $k=1,\dots,K$.
There is sensitivity to the Gaussian assumption. With $O(d^2)$ parameters per class, this can scale poorly as $d$ increases; in practice, high-dimensional problems often assume the $\Sigma_k$s are diagonal.
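A sketch of the full generative pipeline under the diagonal-covariance simplification just mentioned (function names and the synthetic data are my own, for illustration): fit per-class means, variances, and priors by ML, then apply Bayes' rule to a new point.

```python
import numpy as np

def fit_gaussian_classes(X, c, K):
    """ML estimates per class: mean, diagonal variance, and prior p(c=k)."""
    params = []
    for k in range(K):
        Xk = X[c == k]
        params.append((Xk.mean(axis=0), Xk.var(axis=0), len(Xk) / len(X)))
    return params

def log_gauss_diag(x, mu, var):
    """Log density of a diagonal-covariance Gaussian."""
    return -0.5 * np.sum(np.log(2 * np.pi * var) + (x - mu) ** 2 / var)

def predict(x, params):
    """Bayes' rule: p(c=k|x) proportional to p(x|c=k, theta_k) p(c=k)."""
    scores = np.array([log_gauss_diag(x, mu, var) + np.log(pk)
                       for mu, var, pk in params])
    post = np.exp(scores - scores.max())   # subtract max for stability
    return post / post.sum()

# Two well-separated synthetic classes in 2-d:
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 1.0, (50, 2)), rng.normal(4.0, 1.0, (50, 2))])
c = np.repeat([0, 1], 50)
params = fit_gaussian_classes(X, c, 2)
print(predict(np.array([4.0, 4.0]), params))  # posterior heavily favors class 1
```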

Naive Bayes Example

In Naive Bayes classification, you model $p(x|c=k,\theta_k)$ by assuming the features are conditionally independent given the class: $p(x|c=k,\theta_k) = \prod_{j=1}^d p(x_j|c=k,\theta_{jk})$.
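A minimal sketch for binary features (function names and toy data are illustrative, not from the notes): each class gets one Bernoulli parameter per feature, estimated by ML, and prediction multiplies the per-feature terms before normalizing with Bayes' rule.

```python
import numpy as np

def fit_bernoulli_nb(X, c, K):
    """ML estimates of theta_jk = p(x_j = 1 | c = k) and the class priors."""
    thetas = np.array([X[c == k].mean(axis=0) for k in range(K)])
    priors = np.array([(c == k).mean() for k in range(K)])
    return thetas, priors

def predict_nb(x, thetas, priors):
    """p(c=k|x) proportional to p(c=k) * prod_j theta_jk^x_j (1-theta_jk)^(1-x_j)."""
    like = (thetas ** x * (1 - thetas) ** (1 - x)).prod(axis=1)
    post = like * priors
    return post / post.sum()

# Six binary feature vectors, three per class:
X = np.array([[1, 1], [1, 0], [0, 1], [0, 0], [0, 1], [1, 0]])
c = np.array([0, 0, 0, 1, 1, 1])
thetas, priors = fit_bernoulli_nb(X, c, 2)
print(predict_nb(np.array([1, 1]), thetas, priors))  # roughly [0.8, 0.2]
```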

Markov Model Example

With sequence data, can do classification by learning a Markov Model for each class.
$p(s_i|c_i=k,\theta_k) = \prod_{j=2}^{L_i} p(s_{ij}|s_{i,j-1},\theta_k)$, where $L_i$ is the length of sequence $i$.

  • $s_i$ is sequence number $i$
  • $p(s_{ij}|s_{i,j-1},\theta_k)$ is the transition probability from position $j-1$ to position $j$.

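A sketch of this classifier with assumed names and toy sequences (not from the notes): estimate each class's transition matrix from pair counts, then score a new sequence under each class and pick the higher log-probability.

```python
import numpy as np

def fit_markov(seqs, S):
    """ML transition matrix over S states: T[a, b] = p(next=b | current=a)."""
    counts = np.zeros((S, S))
    for s in seqs:
        for a, b in zip(s[:-1], s[1:]):
            counts[a, b] += 1
    return counts / counts.sum(axis=1, keepdims=True)

def log_seq_prob(s, T):
    """log p(s | theta_k), ignoring the initial-state term p(s_1)."""
    return sum(np.log(T[a, b]) for a, b in zip(s[:-1], s[1:]))

# Class 0 sequences tend to repeat symbols; class 1 sequences alternate:
T0 = fit_markov([[0, 0, 0, 1, 1, 1, 0, 0]], S=2)
T1 = fit_markov([[0, 1, 0, 1, 0, 1, 0, 1]], S=2)

s_new = [1, 0, 1, 0, 1]  # alternating, so class 1 should win
print(log_seq_prob(s_new, T1) > log_seq_prob(s_new, T0))  # True
```

With equal class priors this comparison of log-likelihoods is exactly the Bayes' rule prediction from the generative approach above.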
Bayesian Estimation

Treat the parameters $\theta$ as random variables.
In particular, before observing any data, there is the prior $p(\theta)$, a prior density for $\theta$.

As more data is gathered, the role of the prior is reduced. It is more influential when data is limited.

\begin{align*}
p(\theta|D) &= \frac{p(D|\theta)p(\theta)}{p(D)}
\\ &\propto p(D|\theta)\,p(\theta)
\\ &\propto \text{Likelihood} \times \text{Prior}
\end{align*}

$\frac{p(D|\theta)p(\theta)}{p(D)}$ is known as the posterior density. In comparison to Maximum Likelihood Estimation:

\begin{tabular}{c c l @{\ \ $\Rightarrow$\ \ } l}
Maximum Likelihood Estimation & ML & $\hat \theta_{ML}$ & $\arg\max_{\theta} p(D|\theta)$ \\
Posterior Mode & MAP (maximum a posteriori) & $\hat \theta_{MAP}$ & $\arg\max_{\theta} p(\theta|D)$ \\
Posterior Mean & MPE (mean posterior estimate) & $\hat \theta_{MPE}$ & $E_{p(\theta|D)} [\theta]$ \\
Full posterior density & & & $p(\theta|D)$
\end{tabular}

Gaussian Fish Example

$\theta$ = mean weight of fish in a lake, assuming Gaussian likelihood.

$$ p(\theta) = N(\mu, \sigma^2)$$

Here $\mu$ is the mean of the prior and $\sigma^2$ is variance of the prior.

Bernoulli Parameter Example

$D = \{x_1, \dots , x_N\}$ , $x_i \in \{0,1\}$ , $p(x_i=1) = \theta$.

$L(\theta) = \theta^r (1 - \theta)^{N-r}$ , $r = \sum_{i=1}^N I(x_i=1)$

A common choice for a prior on $\theta$ is the Beta density: $p(\theta) = Beta(\theta|\alpha,\beta)$.
\begin{align*}
Beta(\theta|\alpha,\beta) &= \frac{\Gamma(\alpha+\beta)}{\Gamma(\alpha)\Gamma(\beta)} \theta^{\alpha-1} (1-\theta)^{\beta-1}
\\ &\propto \theta^{\alpha-1} (1-\theta)^{\beta-1}
\end{align*}

Properties of the Beta Prior

\begin{align*}
E_{p(\theta)}[\theta] &= \frac{\alpha}{\alpha + \beta}
\\ \text{mode}[\theta] &= \frac{\alpha-1}{\alpha+\beta-2}
\\ Var[\theta] &= \frac{\alpha \beta}{(\alpha + \beta)^2 (\alpha + \beta + 1)}
\\ &\propto \frac{1}{2 \alpha + 1} \propto \frac{1}{\alpha} & \text{when } \alpha = \beta
\end{align*}

Properties of the Posterior

The posterior is also in a Beta form due to conjugacy.

\begin{align*}
E_{p(\theta|D)}[\theta] &= \frac{\alpha'}{\alpha' + \beta'}
\\ &= \frac{r + \alpha}{N + \alpha + \beta}
\end{align*}
The MPE effectively smooths the maximum likelihood estimate: the larger $\alpha$ and $\beta$ are, the more smoothing. $\alpha$ and $\beta$ are referred to as pseudocounts because they act like counts from previous trials: $\alpha$ is the pseudocount for the number of successes, and $\alpha+\beta$ is the pseudocount for the prior number of trials.

As $N \to \infty$, MPE for $\theta \to \frac{r}{N} = \hat \theta_{ML}$. Variance of $p(\theta|D) \to 0$ as $N\to\infty$.
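These point estimates are simple enough to compute directly; the following sketch (the function name is mine) contrasts ML, posterior mean (MPE), and posterior mode (MAP) for a small example.

```python
def beta_bernoulli_estimates(r, N, alpha, beta):
    """Point estimates of a Bernoulli theta given r successes in N trials
    and a Beta(alpha, beta) prior."""
    ml   = r / N                                      # maximum likelihood
    mpe  = (r + alpha) / (N + alpha + beta)           # posterior mean
    mape = (r + alpha - 1) / (N + alpha + beta - 2)   # posterior mode (MAP)
    return ml, mpe, mape

# 3 successes in 10 trials with a Beta(2, 2) prior:
print(beta_bernoulli_estimates(3, 10, 2, 2))  # (0.3, 0.357..., 0.333...)
```

With $\alpha=\beta=2$ the posterior mean is pulled from the ML estimate 0.3 toward the prior mean 0.5; as $N$ grows, all three estimates converge.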

Conjugacy

The posterior density is the same form as the prior when using a conjugate prior. In this case, the Beta is a conjugate prior to the Bernoulli.

\begin{align*}
L(\theta) &= p(D|\theta) = \prod_{i=1}^N p(x_i|\theta) = \theta^r (1-\theta)^{N-r}
\\ p(\theta|D) &\propto p(D|\theta)\,p(\theta)
\\ &= \theta^r (1-\theta)^{N-r}\, \theta^{\alpha-1} (1-\theta)^{\beta-1}
\\ &= \theta^{r+\alpha-1} (1-\theta)^{N-r+\beta-1}
\\ &= \theta^{\alpha'-1} (1-\theta)^{\beta'-1} & \alpha' = r+\alpha,\ \beta' = N-r+\beta
\end{align*}

Multinomial Parameter Example

$D = \{x_1, \dots , x_N\}$ , $x_i \in \{1, \dots , K\}$, $K>2$.
e.g. $x_i$ = occurrence of a word in a document. $K$ = number of unique words.
$r_k$ = number of $x_i$'s taking value $k$ in $D$. $\sum_{k=1}^K r_k = N$.
$\theta_k = p(x_i=k)$, $\sum_{k=1}^K \theta_k = 1$.

\begin{align*}
L(\theta) = p(D|\theta) &= \prod_{i=1}^N p(x_i|\theta)
\\ &= \prod_{k=1}^K \theta_k^{r_k}
\\ \hat \theta_{k,ML} &= \frac{r_k}{N}
\end{align*}

The maximum likelihood estimate in this case may require smoothing if we don't want it assigning zero probability to values never observed in $D$.

A conjugate prior to the multinomial is the Dirichlet distribution. This is a generalization of the Beta to higher dimensions.

\begin{align*}
p(\theta) &= Dirichlet(\theta|\alpha)
\\ &\propto \prod_{k=1}^K \theta_k^{\alpha_k-1} & \alpha_k > 0
\end{align*}

The $\alpha_k$ parameters are directly analogous to the parameters of the Beta prior in the binomial case.
e.g. for text, the $\alpha$s could be proportional to frequency of words in English text.

The posterior density will have the form $p(\theta|D) = Dirichlet(\theta|\alpha')$ where $\alpha_k' = \alpha_k + r_k$.

The prior mean is $E_{p(\theta)} [\theta_k] = \frac{\alpha_k}{\sum_j \alpha_j}$.
The posterior mean is $E_{p(\theta|D)} [\theta_k] = \frac{r_k+\alpha_k}{N + \sum_j \alpha_j}$.
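The pseudocount smoothing above is one line of code. This sketch (names are mine, for illustration) shows how a flat Dirichlet prior removes the zero-probability problem of the ML estimate:

```python
import numpy as np

def dirichlet_posterior_mean(counts, alphas):
    """E[theta_k | D] = (r_k + alpha_k) / (N + sum_j alpha_j)."""
    counts = np.asarray(counts, dtype=float)
    alphas = np.asarray(alphas, dtype=float)
    return (counts + alphas) / (counts.sum() + alphas.sum())

counts = np.array([5, 0, 3])  # e.g. word counts; the second word never occurs
print(counts / counts.sum())  # ML estimate assigns it probability zero
print(dirichlet_posterior_mean(counts, np.ones(3)))  # smoothed: all nonzero
```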

Gaussian Parameter Example

Common in tracking problems. Assume that the movement of the object has some Gaussian noise to it.

$D = \{x_1, \dots , x_N\}$, $x_i \in \mathbb{R}$, assuming $x_i \sim \mathcal{N}(\mu,\sigma^2)$ and that the $x_i$s are conditionally independent given $\theta$.

Known Variance

$L(\mu) \propto \prod_{i=1}^N e^{-\frac{(x_i-\mu)^2}{2\sigma^2}}$

The conjugate prior is Gaussian.

$p(\mu) = N(\mu_0,s^2)$ where $\mu_0$ is the prior mean and $s^2$ represents the uncertainty about $\mu$ before seeing data.

Posterior
\begin{align*}
p(\mu|D) &\propto p(D|\mu)p(\mu)
\\ &\propto \left( \prod_{i=1}^N e^{-\frac{(x_i-\mu)^2}{2\sigma^2}} \right) e^{-\frac{(\mu-\mu_0)^2}{2 s^2}}
\\ &\propto e^{-\frac{1}{2} \left[ \sum_{i=1}^N \left( \frac{x_i-\mu}{\sigma}\right)^2 + \left( \frac{\mu-\mu_0}{s}\right)^2 \right]}
\\ &\propto e^{-\frac{1}{2} (a\mu^2 + b\mu + c) } & \text{for some } a,b,c
\\ &\propto e^{-\frac{1}{2} \left( \frac{\mu-\mu_N}{\sigma_N}\right)^2} & \text{where\ } \mu_N = \gamma \hat \mu_{ML} + (1-\gamma)\mu_0
\\ & & \hat \mu_{ML} = \frac{1}{N} \sum_{i=1}^N x_i
\\ & & \gamma = \frac{N s^2}{Ns^2 + \sigma^2}
\\ \frac{1}{\sigma_N^2} &= \frac{N}{\sigma^2} + \frac{1}{s^2}
\end{align*}
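Collecting the last steps into code (the function name and data are illustrative): the posterior mean $\mu_N$ is a convex combination of the ML estimate and the prior mean, with weight $\gamma$ on the data, and the posterior variance shrinks as $N$ grows.

```python
import numpy as np

def gaussian_posterior(x, sigma2, mu0, s2):
    """Posterior mean and variance of mu with known sigma^2 and prior N(mu0, s2)."""
    N = len(x)
    mu_ml = np.mean(x)
    gamma = N * s2 / (N * s2 + sigma2)        # weight on the data
    mu_N = gamma * mu_ml + (1 - gamma) * mu0  # posterior mean
    var_N = 1.0 / (N / sigma2 + 1.0 / s2)     # posterior variance sigma_N^2
    return mu_N, var_N

x = np.array([4.8, 5.1, 5.3, 4.9, 5.4])  # N = 5, sample mean 5.1
mu_N, var_N = gaussian_posterior(x, sigma2=1.0, mu0=0.0, s2=1.0)
print(mu_N, var_N)  # mu_N ≈ 4.25 (shrunk toward mu0 = 0), var_N ≈ 0.167
```

With only five observations, $\gamma = 5/6$, so the posterior mean sits five-sixths of the way from the prior mean to the ML estimate; with more data it converges to $\hat\mu_{ML}$.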