Michael Data

(check Smyth Notes http://www.ics.uci.edu/~smyth/courses/cs277/public_slides/text_classification.pdf)

Applications

Spam email detection

Tagging of news articles

  • * e.g. Google News

Classifying web pages into categories

  • * Classification is used for allocation of advertisements
Data Representation

Most common is bag of words, which is counts or binary counts of words for each document. Bigrams or longer phrases are less frequently used.

Classes themselves can vary. They may be general categories e.g. the “Twenty Newsgroups” or a large number of specific “tag” labels. In Sentiment Analysis, labels may be opinions on a person or product e.g. “Like” “hate” “neutral”.

Some applications have multiple labels for single documents. Labels may also have a hierarchical relationship. Classifiers which take advantage of the hierarchy typically do better. The multi-label problems are an area which can use further progress.

Hard Labels

May be manually assigned from a predefined dictionary or human labelers. The human labelers may be domain experts, librarians/editors, or low-paid labelers via Amazon Turk.

In this case, there is still some disagreement among the domain experts. The agreement among experts is a sign of the difficulty of the classification problem. This can be measured as inter-annotator agreement.

Labels can also be generated with the help of semi-supervised learning, where a bootstrapping approach is used to generate more data labels.

Notation

Data typically is in the form of N documents comprised of d terms. Counts are stored in an N-by-d sparse matrix, where each row is a “bag of words” for a document.

Variables:

Probabilistic Classification:

  • * Learn a model for $P(c_k|x)$, the conditional probability of class $c_k$ given $x$.
  • * Note that by Bayes rule, $P(c_k|x) = P(x|c_k)p(c_k)/p(x)$
Classification Methods

Want to model $P(\underline{x} | c_k)$ for each class $k$ and perform classification via Bayes' rule. For 0-1 loss select the class $c_k$ that maximizes $P(c_k|\underline{x})$. By Bayes' rule, this is the same $c_k$ that maximizes $P(\underline{x} | c_k) p(c_k)$.

Models

Naive Bayes Classification

  • * is widely used for spam email.
  • * Still a popular baseline for comparison

Multinomial Classification

  • * Model a document with N words as N independent tosses of a d-sided die.

Markov Chains

  • * Not bag-of-words
  • * Model probability of sequences of text

Methods

Support Vector Machines

  • * Often most accurate in research
  • * Computationally complex and so don't scale well. Not widely used in practice

Logistic Regression

  • * Widely used in industry.
  • * Competetive with SVMs.

Predicting Class Labels

Need to compute $p(c|x) \propto p(x|c) p(c)$.

For both the Naive Bayes Bernoulli and multinomial models, the term $p(c|x)$ is the product of many probabilities and may result in underflow issues during computation.
In practice, $\logp(c|x) \propto \log p(x|c) + \log p(c)$ is used instead.

Feature Selection for Text Classification

The number of words used in the model is important because many are uninformative.
To choose words to use, it is common to start with an empty set then incrementally add words using high mutual information or some other heuristic.