Mixture models are a classic modeling approach with roots in biology research.

$x = \begin{bmatrix} x_1 \\ \vdots \\ x_d \end{bmatrix}$

$p(x) = \sum_{k=1}^K p_k(x|z_k,\theta_k)\, p(z_k)$. Here $p_k$ is called the $k$th mixture component. $z$ can be a binary indicator vector in which exactly one $z_k$ equals one. Then let $\alpha$ be the mixture distribution over $z$, with $\alpha_k = p(z_k)$.

example: mixture of arbitrary densities. $k=1$ indicates a Gaussian, $k=2$ an exponential, $k=3$ a Gamma.

example: mixture of Gaussians. $\theta_k = \{\mu_k,\Sigma_k\}$, and $p_k(\cdot)$ is a Gaussian density with parameters $\theta_k$. For mixture weights $\alpha = \{0.6, 0.4\}$, $p(x) = 0.6\, p_1(x|\dots) + 0.4\, p_2(x|\dots)$.
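The two-component Gaussian mixture with weights $\{0.6, 0.4\}$ can be evaluated directly as a weighted sum of component densities. A minimal sketch (the univariate means and standard deviations below are illustrative choices, not from the notes):

```python
import math

def gaussian_pdf(x, mu, sigma):
    """Univariate Gaussian density N(x | mu, sigma^2)."""
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

def mixture_pdf(x, weights, params):
    """p(x) = sum_k alpha_k * p_k(x | theta_k), for Gaussian components."""
    return sum(a * gaussian_pdf(x, mu, sigma)
               for a, (mu, sigma) in zip(weights, params))

# Mixture weights alpha = {0.6, 0.4} as in the example; (mu, sigma) pairs assumed.
weights = [0.6, 0.4]
params = [(0.0, 1.0), (5.0, 2.0)]
p = mixture_pdf(0.0, weights, params)
```

Each evaluation of $p(x)$ is just a convex combination of the component densities, which is why mixtures integrate to one whenever the weights do.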

example: mixture of conditionally independent Bernoulli trials: $p_k(x|z_k,\theta_k) = \prod_{j=1}^d p_k(x_j=1)^{x_j} \, (1 - p_k(x_j=1))^{1-x_j}$
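The Bernoulli component likelihood is a straightforward product over dimensions. A small sketch of that formula (function name is my own):

```python
def bernoulli_component_likelihood(x, theta_k):
    """p_k(x | theta_k) = prod_j theta_kj^{x_j} * (1 - theta_kj)^{1 - x_j},
    for a binary vector x and per-dimension success probabilities theta_k."""
    p = 1.0
    for xj, tj in zip(x, theta_k):
        # Each factor contributes tj when x_j = 1 and (1 - tj) when x_j = 0.
        p *= tj if xj == 1 else (1.0 - tj)
    return p
```

For example, `bernoulli_component_likelihood([1, 0, 1], [0.9, 0.2, 0.5])` returns $0.9 \cdot 0.8 \cdot 0.5 = 0.36$.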

Mixtures are a very flexible approach to density estimation: they allow writing a complex density as a combination of simpler ones. They are especially appropriate when the system being modeled has real physical component phenomena or subpopulations.

Assume that each data point is generated from only a single component.

- Generative model:
- for i = 1 to N:
  - $k^* \leftarrow$ sample a component for the $i$th data point, $\sim p(z=k)$
  - $x_i \leftarrow$ sample a data point from component $k^*$, $\sim p_{k^*}(x|\theta_{k^*}, z=k^*)$
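The ancestral-sampling loop above can be sketched in a few lines, assuming Gaussian components (the function name and seeding are my own choices):

```python
import random

def sample_mixture(n, weights, params, seed=0):
    """Ancestral sampling from the generative model:
    first draw z ~ alpha, then draw x from the chosen component."""
    rng = random.Random(seed)
    samples = []
    for _ in range(n):
        k = rng.choices(range(len(weights)), weights=weights)[0]  # k* ~ p(z=k)
        mu, sigma = params[k]
        samples.append((k, rng.gauss(mu, sigma)))  # x_i ~ p_{k*}(x | theta_{k*})
    return samples
```

Over many draws, the fraction of points generated by component $k$ converges to $\alpha_k$, consistent with each data point coming from exactly one component.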

$\underline{\theta} = \{\underline{\theta}_1, \dots, \underline{\theta}_K, \alpha_1, \dots, \alpha_K\}$

- $\underline{\theta}_k$ are component parameters
- $\alpha_k = p(z=k)$

<latex>\begin{align*}
L(\theta) &= \log p(D|\theta) \\
&= \sum_{i=1}^N \log p(x_i|\theta) \\
&= \sum_{i=1}^N \log \left( \sum_{k=1}^K p_k(x_i|z_k,\theta_k)\, p(z_k) \right)
\end{align*}</latex>
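The log-likelihood can be evaluated numerically for given parameters; since the sum over components sits inside the logarithm, a log-sum-exp trick keeps the computation stable. A sketch for univariate Gaussian components (the helper name is mine):

```python
import math

def log_likelihood(data, weights, params):
    """L(theta) = sum_i log( sum_k alpha_k p_k(x_i | theta_k) ),
    computed with log-sum-exp for numerical stability."""
    total = 0.0
    for x in data:
        log_terms = []
        for a, (mu, sigma) in zip(weights, params):
            # log of alpha_k * N(x | mu_k, sigma_k^2)
            log_pk = (-0.5 * ((x - mu) / sigma) ** 2
                      - math.log(sigma * math.sqrt(2 * math.pi)))
            log_terms.append(math.log(a) + log_pk)
        m = max(log_terms)
        total += m + math.log(sum(math.exp(t - m) for t in log_terms))
    return total
```

Note that evaluating $L(\theta)$ is easy; it is maximizing it over $\theta$ that is hard, because the log of a sum does not decompose.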

The problem with this approach is that the sum over the unknown $z$ values sits inside the logarithm, so the log-likelihood does not decompose and direct maximization has no closed-form solution, even in simple cases.

K-Means can be viewed as a non-probabilistic, hard-assignment version of EM for Gaussian mixtures.

Expectation Maximization (EM) is typically used to solve these problems.
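As a minimal illustration of EM for a one-dimensional Gaussian mixture, the sketch below alternates an E-step (compute responsibilities $p(z=k \mid x_i, \theta)$) with an M-step (re-estimate $\alpha_k, \mu_k, \sigma_k$ from responsibility-weighted data). The initialization scheme and function name are my own assumptions, and the code is illustrative rather than robust:

```python
import math

def em_gmm_1d(data, K, iters=50):
    """Minimal EM sketch for a 1-D Gaussian mixture (not production-grade)."""
    # Crude init: spread means over the data range, unit variances, uniform weights.
    lo, hi = min(data), max(data)
    mus = [lo + (hi - lo) * (k + 1) / (K + 1) for k in range(K)]
    sigmas = [1.0] * K
    alphas = [1.0 / K] * K
    n = len(data)
    for _ in range(iters):
        # E-step: responsibilities r[i][k] = p(z=k | x_i, theta)
        r = []
        for x in data:
            terms = [alphas[k]
                     * math.exp(-0.5 * ((x - mus[k]) / sigmas[k]) ** 2)
                     / (sigmas[k] * math.sqrt(2 * math.pi))
                     for k in range(K)]
            s = sum(terms)
            r.append([t / s for t in terms])
        # M-step: re-estimate weights, means, and variances from weighted data.
        for k in range(K):
            nk = sum(r[i][k] for i in range(n))
            alphas[k] = nk / n
            mus[k] = sum(r[i][k] * data[i] for i in range(n)) / nk
            var = sum(r[i][k] * (data[i] - mus[k]) ** 2 for i in range(n)) / nk
            sigmas[k] = math.sqrt(max(var, 1e-6))
    return alphas, mus, sigmas
```

On well-separated clusters the responsibilities quickly become nearly hard assignments, which is the sense in which K-Means is the hard-assignment limit of this procedure.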

Kernel Density Estimation can work well in low dimensions but doesn't scale well to higher-dimensional data.