# Michael's Wiki

Machine Learning is a class of techniques for learning a model of data, frequently used in data mining and business intelligence.

Learning refers to estimating the parameters of a model; in Bayesian methods this is also referred to as inference. Once a model has been learned, it can be used for predictive modeling or exploratory data analysis.

In a generative model, there is also usually a way to “go back” and generate or simulate data by fixing the parameters.

From a Bayesian framework, we would want to estimate something like $P(\theta|D) \propto P(D|\theta) P(\theta)$. The prior probability over $\theta$ can be controversial, so ignoring it (or using a flat prior), the likelihood alone is typically used to carry out the parameter estimation.

#### Conditional Likelihood

$D = \{(x_i,c_i)\}$ for $i=1\dots N$, $x_i \in \mathbb{R}^d$ and $c_i \in \{0,1\}$. We can write the likelihood as $\prod_{i=1}^N p(x_i,c_i)$ using conditional independence.

<latex>\begin{align*}
L(\theta) &= \prod_{i=1}^N p(c_i|x_i,\theta) p(x_i) & \text{assuming CI} \\
&\propto \prod_{i=1}^N p(c_i|x_i,\theta) & \text{ignoring the $p(x_i)$ terms}
\end{align*}</latex>

This is a conditional model, but we haven't defined the actual probabilities yet. The function most commonly used in machine learning is the logistic function: $$p(c_i=1|x_i) = \frac{1}{1 + e^{-(w^T x_i + w_0)}}$$ or, written out componentwise, $$p(c=1|x,\alpha) = \frac{1}{1 + e^{-(\alpha_0 + \sum_{j=1}^d \alpha_j x_j)}}$$

It defines a linear decision boundary. This is logistic regression, an example of a generalized linear model.
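A minimal sketch of evaluating this model in Python (the `logistic` helper and the example weights are illustrative, not from the text):

```python
import math

def logistic(x, w, w0):
    """p(c=1|x) under a logistic regression model with weights w and bias w0."""
    # Linear score w^T x + w0, squashed through the logistic (sigmoid) function.
    z = sum(wj * xj for wj, xj in zip(w, x)) + w0
    return 1.0 / (1.0 + math.exp(-z))

# The decision boundary p(c=1|x) = 0.5 is exactly the hyperplane w^T x + w0 = 0.
p = logistic([0.0, 0.0], [1.0, -2.0], 0.0)  # a point on the boundary -> 0.5
```

Points on the hyperplane $w^T x + w_0 = 0$ get probability exactly $0.5$, which is why the boundary is linear.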

##### Generative Approach

<Merge with Classification>

Learn a model for a joint distribution $p(x_i,c_i)$ to predict class label of a new vector $x^*$. We can compute $p(c=k|x^* , \hat\theta_{ML} , \hat \alpha_{ML})$ using Bayes' rule.
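The generative recipe (fit a per-class density plus class priors, then predict with Bayes' rule) can be sketched for 1-D data with Gaussian class-conditionals; the function names and toy data here are illustrative assumptions:

```python
import math

def fit_class_gaussians(xs, cs):
    """ML estimates of (mu_k, var_k, p(c=k)) for each class, 1-D data."""
    params = {}
    n = len(xs)
    for k in set(cs):
        xk = [x for x, c in zip(xs, cs) if c == k]
        mu = sum(xk) / len(xk)
        var = sum((x - mu) ** 2 for x in xk) / len(xk)
        params[k] = (mu, var, len(xk) / n)
    return params

def predict_posterior(x, params):
    """p(c=k|x) via Bayes' rule: proportional to N(x; mu_k, var_k) * p(c=k)."""
    scores = {}
    for k, (mu, var, prior) in params.items():
        density = math.exp(-(x - mu) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)
        scores[k] = prior * density
    z = sum(scores.values())           # normalize over classes
    return {k: s / z for k, s in scores.items()}

# Toy data: class 0 clusters near 0, class 1 near 9-10.
params = fit_class_gaussians([0.0, 1.0, 9.0, 10.0], [0, 0, 1, 1])
posterior = predict_posterior(0.5, params)
```

A query near one cluster gets nearly all posterior mass for that class, since the other class's density there is negligible.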

$D = \{(\underline{x}_i,c_i)\}$.

<latex>\begin{align*} L(\theta) = p(D|\theta) &= \prod_{i=1}^N p(x_i|c_i,\theta)p(c_i) \\ &= \prod_{k=1}^K \left[ \prod_{i : c_i=k} p(x_i|c_i=k,\theta_k)p(c_i=k) \right] \end{align*}</latex>

Key Points

* Learn a model for how the $x$s are distributed for each class
  * i.e. $p(x|c=k,\theta_k)$ uses parameters for class $k$
  * requires a distributional/parametric assumption, e.g. a multivariate Gaussian model
* Also have to learn the $p(c=k)$ values, though this is easy
* The likelihood decomposes into $K$ separate optimization problems if the $\theta_k$s are unrelated
* This approach is theoretically optimal if:
  * the distributional assumptions are correct
  * we can learn the true/optimal parameters
* Predict using Bayes' rule: $P(c=k|x,\theta) \propto p(x|c=k,\theta_k)p(c=k)$

#### Gaussian Example

$p(x|c=k,\theta_k) = \mathcal{N}(\mu_k,\Sigma_k)$. Need to learn $K$ sets of parameters $\mu_k,\Sigma_k$, for $k=1\dots K$. There is sensitivity to the Gaussian assumption. With $O(d^2)$ parameters per class, this can scale poorly as $d$ increases. In practice, in high-dimensional problems you can assume that the $\Sigma_k$s are diagonal.

#### Naive Bayes Example

In Naive Bayes classification, you model $p(x|c=k,\theta_k) = \prod_{j=1}^d p(x_j|c=k,\theta_{jk})$, i.e. the features are assumed conditionally independent given the class.

#### Markov Model Example

With sequence data, we can do classification by learning a Markov model for each class. $$p(s_i|c_i=k,\theta_k) = \prod_{j=2}^{L_i} p(s_{ij}|s_{i,j-1},\theta_k)$$

* $s_i$ is sequence number $i$, of length $L_i$
* $p(s_{ij}|s_{i,j-1},\theta_k)$ is the transition probability from position $j-1$ to $j$

##### Bayesian Estimation

Treat the parameters $\theta$ as random variables. In particular, before observing any data, there is the prior $p(\theta)$, a prior density for $\theta$. As more data is gathered, the role of the prior is reduced; it is most influential when data is limited.
<latex>\begin{align*} p(\theta|D) &= \frac{p(D|\theta)p(\theta)}{p(D)} \\ &\propto p(D|\theta)p(\theta) \\ &\propto \text{Likelihood} \times \text{Prior} \end{align*}</latex>

$\frac{p(D|\theta)p(\theta)}{p(D)}$ is known as the posterior density. In comparison to maximum likelihood estimation:

<latex>\begin{tabular}{l l l @{\ $\Rightarrow$\ \ } l} Maximum Likelihood Estimation & ML & $\hat \theta_{ML}$ & $\arg\max_{\theta}\ p(D|\theta)$ \\ Posterior Mode & MAP (maximum a posteriori) & $\hat \theta_{MAP}$ & $\arg\max_{\theta}\ p(\theta|D)$ \\ Posterior Mean & MPE (mean posterior estimate) & $\hat \theta_{MPE}$ & $E_{p(\theta|D)} [\theta]$ \\ Full posterior density & & & \end{tabular}</latex>

#### Gaussian Fish Example

$\theta$ = mean weight of fish in a lake, assuming a Gaussian likelihood. $$p(\theta) = \mathcal{N}(\mu, \sigma^2)$$ Here $\mu$ is the mean of the prior and $\sigma^2$ is the variance of the prior.

#### Bernoulli Parameter Example

$D = \{x_1, \dots , x_N\}$, $x_i \in \{0,1\}$, $p(x_i=1) = \theta$. $L(\theta) = \theta^r (1 - \theta)^{N-r}$, where $r = \sum_{i=1}^N I(x_i=1)$.

A common choice of prior on $\theta$ is the Beta density: $p(\theta) = \text{Beta}(\theta|\alpha,\beta)$.

<latex>\begin{align*} \text{Beta}(\theta|\alpha,\beta) &= \frac{\Gamma(\alpha+\beta)}{\Gamma(\alpha)\Gamma(\beta)} \theta^{\alpha-1} (1-\theta)^{\beta-1} \\ &\propto \theta^{\alpha-1} (1-\theta)^{\beta-1} \end{align*}</latex>

### Properties of the Beta Prior

<latex>\begin{align*} E_{p(\theta)}[\theta] &= \frac{\alpha}{\alpha + \beta} \\ \text{mode}[\theta] &= \frac{\alpha-1}{\alpha+\beta-2} \\ Var[\theta] &= \frac{\alpha \beta}{(\alpha + \beta)^2 (\alpha + \beta + 1)} & \text{for } \alpha = \beta \text{: } \propto \frac{1}{2 \alpha + 1} \propto \frac{1}{\alpha} \end{align*}</latex>

### Properties of the Posterior

The posterior is also of Beta form, due to conjugacy.

<latex>\begin{align*} E_{p(\theta|D)}[\theta] &= \frac{\alpha'}{\alpha' + \beta'} = \frac{r + \alpha}{N + \alpha + \beta} \end{align*}</latex>

The MPE effectively smooths the maximum likelihood estimate.
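The Beta-Bernoulli update is simple enough to compute directly; a sketch in Python (the helper names are ours):

```python
def beta_posterior(data, alpha, beta):
    """Conjugate update: Beta(alpha, beta) prior + Bernoulli data -> Beta(alpha', beta')."""
    r = sum(data)                   # number of successes
    n = len(data)
    return alpha + r, beta + n - r  # alpha' = r + alpha, beta' = N - r + beta

def posterior_mean(a, b):
    """MPE E[theta|D] = alpha' / (alpha' + beta'), a smoothed version of r/N."""
    return a / (a + b)

a, b = beta_posterior([1, 1, 0, 1], alpha=2, beta=2)
mpe = posterior_mean(a, b)  # (3 + 2) / (4 + 2 + 2) = 0.625
```

Here the ML estimate would be $3/4 = 0.75$; the pseudocounts pull the estimate toward the prior mean $0.5$.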
The larger $\alpha$ and $\beta$ are, the more smoothing. $\alpha$ and $\beta$ are referred to as pseudocounts because they effectively count the outcomes of previous (imagined) trials: $\alpha$ is the pseudocount for the number of successes, and $\alpha+\beta$ is the pseudocount for the prior number of trials. As $N \to \infty$, the MPE for $\theta \to \frac{r}{N} = \hat \theta_{ML}$, and the variance of $p(\theta|D) \to 0$.

## Conjugacy

The posterior density has the same form as the prior when using a conjugate prior. In this case, the Beta is a conjugate prior to the Bernoulli.

<latex>\begin{align*} L(\theta) &= p(D|\theta) = \prod_{i=1}^N p(x_i|\theta) = \theta^r (1-\theta)^{N-r} \\ p(\theta|D) &\propto p(D|\theta)p(\theta) \\ &= \theta^r (1-\theta)^{N-r} \theta^{\alpha-1} (1-\theta)^{\beta-1} \\ &= \theta^{r+\alpha-1} (1-\theta)^{N-r+\beta-1} \\ &= \theta^{\alpha'-1} (1-\theta)^{\beta'-1} & \alpha' &= r+\alpha \\ & & \beta' &= N-r+\beta \end{align*}</latex>

#### Multinomial Parameter Example

$D = \{x_1, \dots , x_N\}$, $x_i \in \{1, \dots , K\}$, $K>2$, $p(x_i=k) = \theta_k$. e.g. $x_i$ = occurrence of a word in a document, $K$ = number of unique words. Let $r_k$ = number of $x_i$s taking value $k$ in $D$, so $\sum_{k=1}^K r_k = N$ and $\sum_{k=1}^K \theta_k = 1$.

<latex>\begin{align*} L(\theta) = p(D|\theta) &= \prod_{i=1}^N p(x_i|\theta) = \prod_{k=1}^K \theta_k^{r_k} \\ \hat \theta_{k,ML} &= \frac{r_k}{N} \end{align*}</latex>

The maximum likelihood estimate may require smoothing if we don't want it assigning zero probabilities. A conjugate prior to the multinomial is the Dirichlet distribution, a generalization of the Beta to higher dimensions.

<latex>\begin{align*} p(\theta) &= Dirichlet(\theta|\alpha) \propto \prod_{k=1}^K \theta_k^{\alpha_k-1} & \alpha_k > 0 \end{align*}</latex>

The $\alpha$ parameters are directly analogous to the Beta prior parameters in the binomial case. e.g. for text, the $\alpha$s could be proportional to the frequency of words in English text.
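The resulting smoothing can be seen with the standard Dirichlet posterior mean $E[\theta_k|D] = \frac{r_k+\alpha_k}{N + \sum_k \alpha_k}$; a sketch (function name and toy counts are ours):

```python
def dirichlet_posterior_mean(counts, alphas):
    """Posterior mean E[theta_k|D] = (r_k + alpha_k) / (N + sum(alphas))."""
    n = sum(counts)
    a = sum(alphas)
    return [(r + al) / (n + a) for r, al in zip(counts, alphas)]

# Word counts over a 3-word vocabulary; the symmetric prior [1, 1, 1]
# keeps the unseen middle word from getting zero probability.
smoothed = dirichlet_posterior_mean([5, 0, 5], [1, 1, 1])  # [6/13, 1/13, 6/13]
```

The ML estimate would assign probability $0$ to the unseen word; the pseudocounts give it a small positive probability instead.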
The posterior density has the form $p(\theta|D) = Dirichlet(\theta|\alpha')$ where $\alpha_k' = \alpha_k + r_k$. The prior mean is $E_{p(\theta)} [\theta_k] = \frac{\alpha_k}{\sum_k \alpha_k}$; the posterior mean is $E_{p(\theta|D)} [\theta_k] = \frac{r_k+\alpha_k}{N + \sum_k \alpha_k}$.

#### Gaussian Parameter Example

Common in tracking problems, where the movement of the tracked object has some Gaussian noise. $D = \{x_1, \dots , x_N\}$, $x_i \in \mathbb{R}$, assuming $x_i \sim \mathcal{N}(\mu,\sigma^2)$ and that the $x_i$s are conditionally independent given $\theta$.

### Known Variance

$L(\mu) \propto \prod_{i=1}^N e^{-\frac{(x_i-\mu)^2}{2\sigma^2}}$

The conjugate prior is Gaussian: $p(\mu) = \mathcal{N}(\mu_0,s^2)$, where $\mu_0$ is the mean of the prior and $s^2$ represents uncertainty about the prior.

Posterior

<latex>\begin{align*}
p(\mu|D) &\propto p(D|\mu)p(\mu) \\
&\propto \left( \prod_{i=1}^N e^{-\frac{(x_i-\mu)^2}{2\sigma^2}} \right) e^{-\frac{(\mu-\mu_0)^2}{2 s^2}} \\
&\propto e^{-\frac{1}{2} \left[ \sum_{i=1}^N \left( \frac{x_i-\mu}{\sigma}\right)^2 + \left( \frac{\mu-\mu_0}{s}\right)^2 \right]} \\
&\propto e^{-\frac{1}{2} (a\mu^2 + b\mu + c) } & \text{for some } a,b,c \\
&\propto e^{-\frac{1}{2} \left( \frac{\mu-\mu_N}{\sigma_N}\right)^2} & \text{where } \mu_N = \gamma \hat \mu_{ML} + (1-\gamma)\mu_0 \\
& & \hat \mu_{ML} = \frac{1}{N} \sum_{i=1}^N x_i \\
& & \gamma = \frac{N s^2}{Ns^2 + \sigma^2} \\
\frac{1}{\sigma_N^2} &= \frac{N}{\sigma^2} + \frac{1}{s^2}
\end{align*}</latex>
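The closed-form update translates directly into code; a sketch of the known-variance case (the function name is ours):

```python
def gaussian_posterior(xs, sigma2, mu0, s2):
    """Posterior over mu with known noise variance sigma2 and prior N(mu0, s2).

    Returns (mu_N, sigma_N^2) from the closed-form update above."""
    n = len(xs)
    mu_ml = sum(xs) / n                       # maximum likelihood mean
    gamma = n * s2 / (n * s2 + sigma2)        # weight on the data vs the prior
    mu_n = gamma * mu_ml + (1 - gamma) * mu0  # posterior mean shrinks toward mu0
    var_n = 1.0 / (n / sigma2 + 1.0 / s2)     # 1/sigma_N^2 = N/sigma^2 + 1/s^2
    return mu_n, var_n
```

As $N$ grows, $\gamma \to 1$ and $\sigma_N^2 \to 0$: the posterior mean approaches $\hat\mu_{ML}$ and the posterior concentrates, matching the asymptotic behavior noted earlier.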