Michael's Wiki

Mixture Models for Binary and Count Data

Binary and Count Data

$x_i = (x_{i1}, \dots, x_{id})$ is a d-dimensional vector, where $x_{ij}$ records whether (binary data) or how many times (count data) feature $j$ occurs in object $i$. For text, this is word $j$ in document $i$.

Typically, the data matrix will be sparse.

A mixture model for such data: $p(x) = \sum_{k=1}^K p_k(x|z_k=1,\phi_k)p(z_k)$ where $K$ is the number of components and $\phi_k$ are the parameters of the k-th component.

Mixture of Multinomials

$p(x) = \sum_{k=1}^K p_k(x|z_k=1,\phi_k)p(z_k)$, where $\phi_k = (\phi_{k1},\phi_{k2},\dots,\phi_{kd})$ and $\sum_{j=1}^d \phi_{kj} = 1$. Each component is a multinomial, like a die with d sides.

Generative Model

  • For each document $i$:
    • $n_i$ is the total word count for the document (assume it's known)
    • sample $k$ from $p(z_k)$
    • for $r = 1$ to $n_i$: sample word $w_r$ from $(\phi_{k1},\phi_{k2},\dots,\phi_{kd})$
    • $x_{ij}$ = number of times word $j$ was sampled
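The generative process above can be sketched in a few lines of Python. This is only an illustrative sketch using the standard library: the function name `sample_multinomial_document` and the argument names `alpha` (for the mixing weights $p(z_k)$) and `phi` are assumptions made for the example, not part of the notes.

```python
import random

def sample_multinomial_document(alpha, phi, n_i, rng=random):
    """Draw one count vector x_i from a mixture of multinomials.

    alpha: mixing weights p(z_k), length K (hypothetical name)
    phi:   K word distributions, each of length d and summing to 1
    n_i:   total word count of the document (assumed known)
    """
    K, d = len(phi), len(phi[0])
    # sample the component k from p(z_k)
    k = rng.choices(range(K), weights=alpha)[0]
    # sample n_i words from the k-th multinomial and tally them
    x_i = [0] * d
    for w_r in rng.choices(range(d), weights=phi[k], k=n_i):
        x_i[w_r] += 1
    return k, x_i
```

Every word in the document comes from the same component `k`, which is the defining restriction of this model.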

Mixture of Conditional Independence Models

For simplicity, assume $x_i$ are binary. $\phi_k = (\phi_{k1},\phi_{k2},\dots,\phi_{kd})$, but now $\phi_{kj} = p(x_{ij}=1|z_k=1)$ and the $\phi_{kj}$ do not sum to 1. Each word is now a binary random variable that is conditionally independent of the other words given the component.

Generative Model

  • For each document $i$:
    • sample $k$ from $p(z_k)$
    • for $j = 1$ to $d$: sample $x_{ij} \in \{0,1\}$ with $p(x_{ij}=1) = \phi_{kj}$
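As a rough illustration of the process above, the following standard-library Python sketch draws one binary vector. The names `sample_binary_document`, `alpha` (for $p(z_k)$) and `phi` are hypothetical and chosen only for this example.

```python
import random

def sample_binary_document(alpha, phi, rng=random):
    """Draw one binary vector x_i from a mixture of conditional independence models.

    alpha: mixing weights p(z_k), length K (hypothetical name)
    phi:   phi[k][j] = probability that x_ij = 1 under component k
    """
    K = len(phi)
    # sample the component k from p(z_k)
    k = rng.choices(range(K), weights=alpha)[0]
    # each x_ij is an independent Bernoulli draw with success probability phi_kj
    x_i = [1 if rng.random() < phi_kj else 0 for phi_kj in phi[k]]
    return k, x_i
```

Unlike the multinomial case, there is no fixed document length here: the number of ones in `x_i` is itself random.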

Latent Dirichlet Allocation Model

Latent Dirichlet Allocation (LDA, http://en.wikipedia.org/wiki/Latent_Dirichlet_allocation), also called a topic model, allows each word in a document to belong to a different multinomial component.

The major difference from the models above is that $z$ is now assigned at the word level rather than the document level, which allows documents to be mixtures of topics. EM doesn't work very well for this model.
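To make the word-level assignment concrete, here is a minimal sketch of the standard LDA generative process for a single document. It assumes a symmetric Dirichlet prior over the document's topic proportions; the function name, the `concentration` parameter and its default value are assumptions made for the example and do not come from these notes.

```python
import random

def sample_lda_document(phi, n_i, concentration=1.0, rng=random):
    """Draw one document where every word picks its own component.

    phi:           K word distributions, each of length d and summing to 1
    n_i:           total word count of the document
    concentration: assumed symmetric Dirichlet parameter (hypothetical default)
    """
    K, d = len(phi), len(phi[0])
    # theta ~ Dirichlet(concentration, ...), drawn as normalized Gamma variates
    g = [rng.gammavariate(concentration, 1.0) for _ in range(K)]
    total = sum(g)
    theta = [v / total for v in g]
    z, x_i = [], [0] * d
    for _ in range(n_i):
        k = rng.choices(range(K), weights=theta)[0]   # component of this word
        j = rng.choices(range(d), weights=phi[k])[0]  # the word itself
        z.append(k)
        x_i[j] += 1
    return theta, z, x_i
```

The only structural change from the mixture of multinomials is that `k` is sampled inside the word loop, using document-specific proportions `theta`.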

Hidden Markov Models

Learning the Parameters

Use the Expectation Maximization algorithm.

  1. In the E-step, use Bayes' rule to compute the membership probabilities $w_{ik} = \frac{\alpha_k p_k(x_i|\phi_k)}{\sum_{m=1}^K \alpha_m p_m(x_i|\phi_m)}$, where $\alpha_k = p(z_k)$
  2. In the M-step, maximize the expected log-likelihood of the data given the membership probabilities
    • $N_k = \sum_{i=1}^N w_{ik}$, $\alpha_k^{new} = \frac{N_k}{N}$
    • For the multinomial model, $\phi_{kj}^{new} = \frac{\sum_{i=1}^N w_{ik}x_{ij}}{\sum_{i=1}^N w_{ik} n_i}$
    • For the conditional independence model, $\phi_{kj}^{new} = \frac{1}{N_k} \sum_{i = 1}^N w_{ik} x_{ij}$, where $x_{ij}$ acts as an indicator variable
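The two steps above can be sketched in plain Python for both models. This is a minimal sketch rather than a tuned implementation: the function names, the `EPS` floor used to keep logarithms finite, and the fixed iteration count are all assumptions made for the example. The E-step works in log space to avoid underflow on long documents.

```python
import math

EPS = 1e-10  # assumed floor that keeps log() and division away from zero

def bernoulli_log_lik(x, phi_k):
    # log p_k(x | phi_k) for the conditional independence model (binary x)
    return sum(math.log(max(p if xj else 1.0 - p, EPS)) for xj, p in zip(x, phi_k))

def multinomial_log_lik(x, phi_k):
    # log p_k(x | phi_k) up to the multinomial coefficient, which does not
    # depend on k and therefore cancels in Bayes' rule
    return sum(xj * math.log(max(p, EPS)) for xj, p in zip(x, phi_k))

def e_step(X, alpha, phi, log_lik):
    # w[i][k] = alpha_k p_k(x_i) / sum_m alpha_m p_m(x_i), computed in log space
    W = []
    for x in X:
        logs = [math.log(max(a, EPS)) + log_lik(x, p) for a, p in zip(alpha, phi)]
        top = max(logs)
        unnorm = [math.exp(v - top) for v in logs]
        total = sum(unnorm)
        W.append([u / total for u in unnorm])
    return W

def m_step(X, W, multinomial):
    N, K, d = len(X), len(W[0]), len(X[0])
    N_k = [sum(W[i][k] for i in range(N)) for k in range(K)]
    alpha = [n / N for n in N_k]
    phi = []
    for k in range(K):
        weighted = [sum(W[i][k] * X[i][j] for i in range(N)) for j in range(d)]
        if multinomial:
            denom = sum(W[i][k] * sum(X[i]) for i in range(N))  # sum_i w_ik n_i
        else:
            denom = N_k[k]
        phi.append([v / max(denom, EPS) for v in weighted])
    return alpha, phi

def fit(X, alpha, phi, multinomial=False, n_iter=20):
    # alternate E- and M-steps for a fixed (assumed) number of iterations
    log_lik = multinomial_log_lik if multinomial else bernoulli_log_lik
    for _ in range(n_iter):
        W = e_step(X, alpha, phi, log_lik)
        alpha, phi = m_step(X, W, multinomial)
    return alpha, phi, W
```

A call such as `fit(X, [0.5, 0.5], phi_init)` needs an asymmetric `phi_init`: if all components start with identical parameters, the membership probabilities stay uniform and the components never separate.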


  • Can place a prior over the parameters while learning. This helps to deal with missing data.
    • For the multinomial model, probably use a Dirichlet prior
    • For the CI model, probably use a Beta prior
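As a sketch of how such priors change the M-step, assume a symmetric Dirichlet prior with parameter $\beta \geq 1$ on each $\phi_k$ in the multinomial model, and a Beta$(a, b)$ prior with $a, b \geq 1$ on each $\phi_{kj}$ in the CI model. The hyperparameter symbols $\beta$, $a$ and $b$ are introduced here for illustration only. The MAP updates then add pseudo-counts to the weighted counts:

$$\phi_{kj}^{new} = \frac{\sum_{i=1}^N w_{ik}x_{ij} + \beta - 1}{\sum_{i=1}^N w_{ik} n_i + d(\beta - 1)} \quad \text{(multinomial model)}$$

$$\phi_{kj}^{new} = \frac{\sum_{i=1}^N w_{ik}x_{ij} + a - 1}{N_k + a + b - 2} \quad \text{(CI model)}$$

With $\beta = 2$ or $a = b = 2$ this is add-one smoothing, so a word that never occurs in a component still gets a non-zero probability. With $\beta = 1$ or $a = b = 1$ the updates reduce to the maximum likelihood updates.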