$\phi_k$ is a d-dimensional vector, where $\phi_{kj}$ corresponds to the conditional probability of word $j$ given component $k$.

Typically, the data matrix will be sparse.

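A mixture model for such data: $p(x_i) = \sum_{k=1}^{K} \alpha_k \, p(x_i \mid \phi_k)$, where $\alpha_k$ are the mixing weights and $\phi_k$ are the parameters of the k-th component.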

Each component is a multinomial, like a die with d sides.

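- * For each document $i = 1, \dots, N$,
- * n_i is the total word count for the document (assume it's known)
- * sample component k from the mixing weights $(\alpha_1, \dots, \alpha_K)$
- * for r = 1 to n_i,
- sample a word from the multinomial with parameters $\phi_k$
- * end
- * $x_{ij}$ = # of times word j was sampled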
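The sampling procedure above can be sketched in a few lines of Python. This is only an illustration: the function name `sample_multinomial_document` and the toy values of `alpha` and `phi` are assumptions made for the example, not part of the notes.

```python
import random

def sample_multinomial_document(alpha, phi, n_i, rng):
    """Sample one document's word-count vector from a mixture of multinomials.

    alpha: list of K mixing weights (assumed to sum to 1)
    phi:   K lists of d word probabilities (each row assumed to sum to 1)
    n_i:   total word count for the document (assumed known)
    """
    K, d = len(alpha), len(phi[0])
    # Pick a single component for the whole document.
    k = rng.choices(range(K), weights=alpha)[0]
    # Roll the d-sided die n_i times and tally how often each word comes up.
    x = [0] * d
    for j in rng.choices(range(d), weights=phi[k], k=n_i):
        x[j] += 1
    return k, x

# Toy example (hypothetical values): K = 2 components, d = 4 words.
rng = random.Random(0)
alpha = [0.5, 0.5]
phi = [[0.6, 0.3, 0.05, 0.05],
       [0.05, 0.05, 0.3, 0.6]]
k, x = sample_multinomial_document(alpha, phi, n_i=20, rng=rng)
print("component:", k, "counts:", x)
```

Every word in the document is drawn from the same component k, which is the key restriction of this model.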

For simplicity, assume the $x_{ij}$ are binary.

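The component parameters are again $\phi_k = (\phi_{k1}, \dots, \phi_{kd})$, but now $\phi_{kj} = P(x_{ij} = 1 \mid z_i = k)$ and they do not sum to 1.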

Now each word is a conditionally independent binary random variable.

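- * For each document $i = 1, \dots, N$,
- * sample component k from the mixing weights $(\alpha_1, \dots, \alpha_K)$
- * for j = 1 to d,
- $x_{ij}$ is sampled from a Bernoulli with parameter $\phi_{kj}$
- * end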
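A minimal sketch of this binary (conditional independence) generative process follows. The function name `sample_binary_document` and the toy parameter values are assumptions for illustration only.

```python
import random

def sample_binary_document(alpha, phi, rng):
    """Sample one binary document from a mixture of independent Bernoullis.

    alpha: list of K mixing weights (assumed to sum to 1)
    phi:   K lists of d probabilities, phi[k][j] = P(word j present | component k);
           each row need NOT sum to 1
    """
    K = len(alpha)
    # Pick a single component for the whole document.
    k = rng.choices(range(K), weights=alpha)[0]
    # Flip one independent coin per vocabulary word.
    x = [1 if rng.random() < p else 0 for p in phi[k]]
    return k, x

# Toy example (hypothetical values): K = 2 components, d = 4 words.
rng = random.Random(0)
alpha = [0.3, 0.7]
phi = [[0.9, 0.8, 0.1, 0.1],
       [0.1, 0.2, 0.9, 0.7]]
k, x = sample_binary_document(alpha, phi, rng)
print("component:", k, "presence vector:", x)
```

Unlike the multinomial model, there is no word count n_i here: each of the d words is simply present or absent.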

[http://en.wikipedia.org/wiki/Latent_Dirichlet_allocation Latent Dirichlet Allocation], **LDA**, or **Topic Model**: Model that allows each word to belong to a different multinomial component.

EM doesn't work very well for this model. The major difference from the mixture models above is that Z is now at the word level rather than the document level, which allows documents to be mixtures of topics.
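To make the word-level Z concrete, here is a rough sketch of how a single document could be generated under a topic model. It assumes the usual LDA setup in which each document draws its own topic proportions from a Dirichlet distribution. The names `sample_dirichlet`, `sample_lda_document`, `topics`, and `conc` are hypothetical, chosen only for this example.

```python
import random

def sample_dirichlet(conc, rng):
    """Draw from a Dirichlet by normalizing independent Gamma draws."""
    g = [rng.gammavariate(a, 1.0) for a in conc]
    s = sum(g)
    return [v / s for v in g]

def sample_lda_document(topics, conc, n_i, rng):
    """Sample one document where every word picks its own topic.

    topics: K lists of d word probabilities (one multinomial per topic)
    conc:   K Dirichlet concentration parameters (an assumption of this sketch)
    n_i:    number of words in the document
    """
    K, d = len(topics), len(topics[0])
    theta = sample_dirichlet(conc, rng)   # per-document topic proportions
    z = []                                # one topic assignment per word
    x = [0] * d                           # resulting word counts
    for _ in range(n_i):
        k = rng.choices(range(K), weights=theta)[0]      # topic for this word
        j = rng.choices(range(d), weights=topics[k])[0]  # word from that topic
        z.append(k)
        x[j] += 1
    return theta, z, x

rng = random.Random(0)
topics = [[0.6, 0.3, 0.05, 0.05],
          [0.05, 0.05, 0.3, 0.6]]
theta, z, x = sample_lda_document(topics, conc=[1.0, 1.0], n_i=10, rng=rng)
print("topic proportions:", theta)
print("per-word topics:  ", z)
print("word counts:      ", x)
```

Compared with the mixture sketches, the component is sampled inside the word loop instead of once per document.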

To learn the parameters of the mixture models above, use the Expectation Maximization (EM) algorithm.

- In the E-step, use Bayes' rule to compute membership probabilities

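#* $w_{ik} = p(z_i = k \mid x_i) = \frac{\alpha_k \, p(x_i \mid \phi_k)}{\sum_{m = 1}^K \alpha_m \, p(x_i \mid \phi_m)}$, with mixing weights $\alpha_k$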

- In the M-step, maximize the log-likelihood of the data

#* $N_k = \sum_{i = 1}^N w_{ik}$, $\alpha_k^{new} = \frac{N_k}{N}$

#* For multinomial model,

#** $\phi_{kj}^{new} = \frac{\sum_{i = 1}^N w_{ik} x_{ij}}{\sum_{i = 1}^N w_{ik} n_i}$
#* For conditional independence model, $\phi_{kj}^{new} = \frac{1}{N_k} \sum_{i = 1}^N w_{ik} x_{ij}$
#** $x_{ij}$ acts as an indicator variable
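The E-step and M-step for the conditional independence (binary) model can be sketched as below. This is a minimal, unoptimized illustration using only the standard library. The function name `em_bernoulli_mixture`, the random initialization, the fixed iteration count, and the clipping constant `eps` are all assumptions of the sketch rather than anything prescribed by the notes.

```python
import math
import random

def em_bernoulli_mixture(X, K, n_iter=50, seed=0, eps=1e-6):
    """Fit a K-component mixture of independent Bernoullis to binary rows X."""
    rng = random.Random(seed)
    N, d = len(X), len(X[0])
    alpha = [1.0 / K] * K
    # Random start so the components are not identical (assumption of this sketch).
    phi = [[rng.uniform(0.25, 0.75) for _ in range(d)] for _ in range(K)]
    W = []
    for _ in range(n_iter):
        # E-step: membership weights w_ik via Bayes' rule, computed in log space.
        W = []
        for x in X:
            logp = []
            for k in range(K):
                lp = math.log(max(alpha[k], eps))
                for j in range(d):
                    p = min(max(phi[k][j], eps), 1.0 - eps)  # avoid log(0)
                    lp += math.log(p) if x[j] else math.log(1.0 - p)
                logp.append(lp)
            m = max(logp)
            unnorm = [math.exp(lp - m) for lp in logp]
            s = sum(unnorm)
            W.append([u / s for u in unnorm])
        # M-step: weighted averages, matching the update formulas above.
        for k in range(K):
            N_k = sum(W[i][k] for i in range(N))
            alpha[k] = N_k / N
            denom = max(N_k, eps)  # guard against an empty component
            phi[k] = [sum(W[i][k] * X[i][j] for i in range(N)) / denom
                      for j in range(d)]
    # W holds the membership weights from the final E-step.
    return alpha, phi, W

# Toy data (hypothetical): two obvious groups of binary documents.
X = [[1, 1, 0, 0]] * 5 + [[0, 0, 1, 1]] * 5
alpha, phi, W = em_bernoulli_mixture(X, K=2)
print("alpha:", alpha)
print("phi:  ", phi)
```

For the multinomial model only the likelihood in the E-step and the denominator in the M-step would change (dividing by $\sum_i w_{ik} n_i$ instead of $N_k$).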

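- * Can place a prior over the parameters while learning. This helps to deal with missing data, such as words that never occur in a component and would otherwise be assigned probability zero.
- For the multinomial model, a Dirichlet prior is the natural choice
- For the CI model, a Beta prior is the natural choice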
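With such priors, the M-step becomes a MAP estimate in which the prior acts like pseudo-counts added to the weighted counts. The sketch below assumes a Beta(a, b) prior on each $\phi_{kj}$ for the CI model and a symmetric Dirichlet with parameter `beta` for the multinomial model. The function names and default hyperparameter values are hypothetical, and the formulas are the usual posterior-mode expressions (valid for a, b, beta >= 1).

```python
def map_update_bernoulli(W, X, k, a=2.0, b=2.0):
    """MAP update of phi_k for the CI model under a Beta(a, b) prior."""
    N, d = len(X), len(X[0])
    N_k = sum(W[i][k] for i in range(N))
    return [(sum(W[i][k] * X[i][j] for i in range(N)) + a - 1.0)
            / (N_k + a + b - 2.0)
            for j in range(d)]

def map_update_multinomial(W, X, k, beta=2.0):
    """MAP update of phi_k for the multinomial model under a symmetric Dirichlet(beta) prior."""
    N, d = len(X), len(X[0])
    # Weighted word counts plus (beta - 1) pseudo-counts per word.
    counts = [sum(W[i][k] * X[i][j] for i in range(N)) + beta - 1.0
              for j in range(d)]
    total = sum(counts)
    return [c / total for c in counts]

# Toy example (hypothetical): word 2 never occurs, yet keeps nonzero probability.
W = [[1.0], [1.0]]          # both documents fully assigned to component 0
X = [[1, 0], [1, 0]]
print(map_update_bernoulli(W, X, k=0))
print(map_update_multinomial(W, X, k=0))
```

Setting a = b = 1 (or beta = 1) removes the pseudo-counts and recovers the maximum-likelihood updates.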