Michael Data

For text analysis of large text corpora, we frequently want to organize, summarize, index, or query them in some way. Identifying topic features can be an interpretable method for .

Topic Modeling Applications
  • * Web-indexing and exploration tools
  • * As part of the discovery process of some lawsuits, large numbers of emails will be provided and have to be sifted through.
  • e.g. 250,000 Enron emails were released as part of the company's scandal.
  • * Prediction and analysis of developing technologies
  • Using patent data or research publications
  • * Humanities and political science
  • “Modeling topic control”
Terminology

Input
*Bag-of-words: data format typically in form of a sparse matrix of word counts per document

  • Memoryless: doesn't take sequences of words into account, or prior words produced
  • * Vocabulary: the set of words present in the model and data
  • * Document: a group of words present in the bag-of-words data matrix

Output

  • * T topics, i.e. T topic-word probability vectors $\phi$
  • * N sets of topic weights, one per document $\theta$
  • * Assignment of each word in each document to 1 of T topics $z$
Statistical Topic Models

For count data, use a simple generative model for sparse counts in bag-of-words format. A document is modeled as a mixture model over topics. Each topic is a probability distribution over the vocabulary words.

The forward generative nature of this model means that it can generate sample documents, though the documents are created in a bag-of-words format and not practical for direct interpretation.

  • generate_document():
  • for each word in the document:
  • sample a topic from p(topics|document)
  • sample a word from p(words|topic)

This pseudocode is known as a forward mechanism. The model we want to learn is essentially inverting this code via Bayes' rule.

$p(w | \phi,z_d)$ is a multinomial distribution over words, given the current topic.
$p(doc | \phi,z_d) = \prod_{i=1}^{L_doc} p(w_i | \phi, z_{d,i})$
Via Bayes' rule: $p(corpus | \phi) = \prod p(doc | \phi, z_d)$.

$$ p(w_i|d) = \sum_{j=1}^T p(w_i|z_j)p(z_j|d) $$

See the Blei paper, page 4, for derivation of likelihood from the generative model.

Learning the Model

Can “easily” use Expectation Maximization for simple models where $Z$ is sampled once per document.

The most commonly used learning algorithm, Latent Dirichlet Allocation, uses multiple topics learned per document. It is typical to perform stochastic search with Gibbs sampling over the unkown variables (historically, also variational inference). In practice, you can just learn the $\underline{z}$ variables:

$$ p(z_i=t|z_{-i}) = \frac{n_{td}^{-1} + \alpha}{\sum_{t'} n_{t'd}^{-i} + T\alpha} * \frac{n_{wt}^{-i} + \beta}{\sum_{w'} n_{w't}^{-i} + W\beta} $$

  • * $n_{td}^{-1}$ is count of topic t assigned to doc d
  • * $n_{wt}^{-i}$ is …
Topics as Matrix Factorization

Directly analogous to principal components and factor models. Where principal components tries to minimize squared error, here we want to maximize the log-probability of observed data.

Go from matrix of docs-by-terms to factorized matrices of docs-by-topics and topics-by-terms.

Methodological Issues
Evaluation of Topic Models

Human inspection: get human experts to choose how many of top-10 words from each topic seem incoherent.

Information retrieval performance: can you use high-probabiligy topic terms as search queries to retrieve documents of that topic?

Feature selection in classification of documents.

Extensions to Topic Models

May add metadata or dependencies between documents.

Geographic information

Temporal Models

Which topics are “growing” over time? What percentage of the words produced within a given year came from that topic?

The Author-Topic Model

  • For each document:
  • For each word in the document:
  • Randomly select an author of the document
  • sample a topic from P(topics|author)
  • sample a word from the topic

See Steyvers et al, 2004 and Rosen-Zvi et al, 2010.

Topical N-Grams

Multilabel Classification

During training, associate each topic with a training label and restrict the sampler to the known labels for the document.

Heuristics

In practice, people tend to fix the number of topics in a heuristic manner rather than relying on non-parametric methods.

Distributed algorithms can approximate the update by allowing documents to be processed in parallel.

Stochastic methods provide more performance improvements. See Hoffman et. al 2013.