# Michael's Wiki

Classic model approach with roots in biology research.

##### Notation

$x = \begin{bmatrix} x_1 \\ \vdots \\ x_d \end{bmatrix}$

$p(x) = \sum_{k=1}^K p_k(x|z_k,\theta_k) p(z_k)$, where $p_k$ is called the kth mixture component. $z$ can be viewed as a binary indicator vector in which exactly one $z_k$ equals one. Let $\alpha$ be the mixture distribution over $z$, so $\alpha_k = p(z_k)$.

example: a mixture of densities from arbitrary families, where $k=1$ indicates a Gaussian, $k=2$ an exponential, and $k=3$ a Gamma.

example: mixture of Gaussians. $\theta_k = \{\mu_k,\Sigma_k\}$, and $p_k()$ is a Gaussian density with parameters $\theta_k$. For mixture weights $\alpha = \{0.6, 0.4\}$, $p(x) = 0.6 p_1(x|\dots) + 0.4 p_2(x|\dots)$.
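The Gaussian-mixture example above can be sketched in code. This is a 1-D illustration with made-up component parameters (the $(\mu_k, \sigma_k)$ values are hypothetical); only the weights $\{0.6, 0.4\}$ come from the example.

```python
import math

def gaussian_pdf(x, mu, sigma):
    """Density of a 1-D Gaussian with mean mu and standard deviation sigma."""
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

def mixture_pdf(x, alpha, theta):
    """p(x) = sum_k alpha_k * p_k(x | theta_k), here with 1-D Gaussian components."""
    return sum(a * gaussian_pdf(x, mu, sigma) for a, (mu, sigma) in zip(alpha, theta))

# Weights from the example above; the (mu_k, sigma_k) values are illustrative.
alpha = [0.6, 0.4]
theta = [(0.0, 1.0), (5.0, 2.0)]
density = mixture_pdf(1.0, alpha, theta)
```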

example: mixture of conditionally independent Bernoulli trials $p_k(x|z_k,\theta_k) = \prod_{j = 1}^d p_k(x_j=1)^{x_j} (1 - p_k(x_j=1))^{1-x_j}$
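One Bernoulli component's density is just the product over dimensions; a minimal sketch, taking `x` as a binary vector and `p` as the vector of per-dimension success probabilities $p_k(x_j = 1)$:

```python
def bernoulli_component(x, p):
    """p_k(x | z_k, theta_k) for a conditionally independent Bernoulli component:
    prod_j p_j^{x_j} * (1 - p_j)^{1 - x_j}, with x a binary vector."""
    prob = 1.0
    for xj, pj in zip(x, p):
        prob *= pj if xj == 1 else 1.0 - pj
    return prob
```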

##### Applications of Mixture Models

Mixture models are a very flexible approach to density estimation: they allow writing a complex density as a combination of simpler ones. They are especially appropriate for modeling systems that really are composed of distinct physical component phenomena.

##### Learning

Assume that each data point is generated from a single component.

• Generative model:
• for i = 1 to N:
• k* ← sample a component for the ith data point, $z \sim p(z=k)$
• $x_i$ ← sample a data point from component k*, $x_i \sim p_{k^*}(x|\theta_{k^*}, z=k^*)$
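The generative loop above can be sketched directly, here for a 1-D Gaussian mixture with illustrative parameters:

```python
import random

def sample_mixture(n, alpha, theta, seed=0):
    """Draw n points from a 1-D Gaussian mixture by following the generative
    model: first sample a component z ~ alpha, then sample x from it."""
    rng = random.Random(seed)
    data = []
    for _ in range(n):
        k = rng.choices(range(len(alpha)), weights=alpha)[0]  # z ~ p(z=k)
        mu, sigma = theta[k]
        data.append(rng.gauss(mu, sigma))  # x ~ p_k(x | theta_k, z=k*)
    return data

points = sample_mixture(100, [0.6, 0.4], [(0.0, 1.0), (5.0, 2.0)])
```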

#### Maximum Likelihood

$\underline{\theta} = \{\underline{\theta}_1 \dots \underline{\theta}_K, \alpha_1 \dots \alpha_K\}$

• $\underline{\theta}_k$ are component parameters
• $\alpha_k = p(z=k)$

<latex>\begin{align*} L(\theta) &= \log p(D|\theta) \\
&= \sum_{i=1}^N \log p(x_i|\theta) \\
&= \sum_{i = 1}^N \log \left( \sum_{k = 1}^K p_k(x_i|z_k,\theta_k) p(z_k) \right) \end{align*}</latex>

The problem with this approach is that the log of the sum over the unknown $z$ values does not decompose, so there is no closed-form maximum likelihood solution even in simple cases.
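The log-likelihood itself is still easy to evaluate for fixed parameters. A sketch for 1-D Gaussian components, using the log-sum-exp trick for numerical stability (the trick is an implementation detail, not part of the derivation above):

```python
import math

def log_likelihood(data, alpha, theta):
    """L(theta) = sum_i log( sum_k alpha_k p_k(x_i | theta_k) ), with 1-D
    Gaussian components, computed via log-sum-exp for numerical stability."""
    total = 0.0
    for x in data:
        logs = [math.log(a) - 0.5 * ((x - mu) / s) ** 2
                - math.log(s * math.sqrt(2 * math.pi))
                for a, (mu, s) in zip(alpha, theta)]
        m = max(logs)
        total += m + math.log(sum(math.exp(l - m) for l in logs))
    return total
```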

#### K-Means

K-Means can be viewed as a non-probabilistic limiting case of EM for Gaussian mixtures: points are hard-assigned to their nearest component instead of receiving soft responsibilities.
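A minimal 1-D sketch of that idea (the initialization scheme here is a naive assumption, not a recommendation):

```python
def kmeans_1d(data, k, iters=20):
    """Minimal 1-D k-means: hard-assign each point to its nearest center
    (the non-probabilistic analogue of EM's E-step), then recompute each
    center as the mean of its assigned points (the analogue of the M-step)."""
    centers = sorted(data)[:k]  # naive initialization: the k smallest points
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for x in data:
            nearest = min(range(k), key=lambda j: abs(x - centers[j]))
            clusters[nearest].append(x)
        centers = [sum(c) / len(c) if c else centers[j]
                   for j, c in enumerate(clusters)]
    return centers

centers = kmeans_1d([-0.1, 0.0, 0.1, 9.9, 10.0, 10.1], 2)
```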

#### Expectation Maximization

Expectation Maximization is typically used to solve these problems. It alternates between an E-step, which computes each component's posterior responsibility for each data point under the current parameters, and an M-step, which re-estimates the parameters using those responsibilities.
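A sketch of EM for a 1-D Gaussian mixture, under simplifying assumptions (quantile-based initialization, fixed iteration count, no convergence check):

```python
import math

def em_gmm_1d(data, k, iters=50):
    """EM for a 1-D Gaussian mixture. E-step: responsibilities
    r_ik = p(z=k | x_i, theta). M-step: weighted MLE updates of
    alpha_k, mu_k, sigma_k."""
    n = len(data)
    xs = sorted(data)
    alpha = [1.0 / k] * k
    mu = [xs[(2 * j + 1) * n // (2 * k)] for j in range(k)]  # spread over quantiles
    sigma = [1.0] * k
    for _ in range(iters):
        # E-step: posterior responsibility of each component for each point
        r = []
        for x in data:
            w = [alpha[j] * math.exp(-0.5 * ((x - mu[j]) / sigma[j]) ** 2) / sigma[j]
                 for j in range(k)]
            s = sum(w)
            r.append([wj / s for wj in w])
        # M-step: re-estimate parameters from the responsibilities
        for j in range(k):
            nj = sum(r[i][j] for i in range(n))
            alpha[j] = nj / n
            mu[j] = sum(r[i][j] * data[i] for i in range(n)) / nj
            var = sum(r[i][j] * (data[i] - mu[j]) ** 2 for i in range(n)) / nj
            sigma[j] = max(math.sqrt(var), 1e-6)  # guard against collapse
    return alpha, mu, sigma

alpha, mu, sigma = em_gmm_1d([-0.1, 0.0, 0.1, 9.9, 10.0, 10.1], 2)
```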

#### Kernel Density Estimation

Kernel Density Estimation can work well in low dimensions, but it scales poorly as the dimension grows.
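A minimal 1-D sketch with a Gaussian kernel, where the bandwidth `h` is a free parameter the user must choose:

```python
import math

def kde(x, data, h):
    """Gaussian kernel density estimate: p_hat(x) = (1/n) sum_i K_h(x - x_i),
    where K_h is a Gaussian kernel with bandwidth h."""
    n = len(data)
    return sum(math.exp(-0.5 * ((x - xi) / h) ** 2)
               for xi in data) / (n * h * math.sqrt(2 * math.pi))
```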