Michael Data

Problems
Preprocessing

Text preprocessing decisions:

  • Punctuation
  • Apostrophes (e.g. aren't)
  • Hyphens (e.g. clear-headed)
  • Acronyms: can be expanded with a look-up table
  • XML tags
  • Programming-language-specific
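The decisions above can be made explicit as switches in a small tokenizer. This is only a sketch: the function name `preprocess`, its flags, and the tiny `ACRONYMS` table are hypothetical and chosen for illustration.

```python
import re

# Hypothetical look-up table; a real one would be domain-specific.
ACRONYMS = {
    "nlp": "natural language processing",
    "ir": "information retrieval",
}

def preprocess(text, keep_apostrophes=True, keep_hyphens=True,
               expand_acronyms=False):
    """Split text into lowercase tokens under the chosen decisions."""
    # Drop XML-style tags such as <p> or </p>.
    text = re.sub(r"<[^>]+>", " ", text)

    # Characters allowed *inside* a token (never at its edges).
    inner = ""
    if keep_apostrophes:
        inner += "'"
    if keep_hyphens:
        inner += "-"

    if inner:
        pattern = rf"[a-z0-9]+(?:[{inner}][a-z0-9]+)*"
    else:
        pattern = r"[a-z0-9]+"
    tokens = re.findall(pattern, text.lower())

    if expand_acronyms:
        expanded = []
        for tok in tokens:
            # Unknown tokens map to themselves.
            expanded.extend(ACRONYMS.get(tok, tok).split())
        tokens = expanded
    return tokens
```

With the defaults, "aren't" and "clear-headed" each stay a single token; with both flags off they split into "aren", "t" and "clear", "headed". All other punctuation is always discarded in this sketch.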

Open-source lexer and parser generators: ANTLR, JFlex, JavaCC.

Stop words: very common words (e.g. “the”, “of”) that are not useful for non-statistical techniques and are usually removed. Removing them is not necessary when certain statistical techniques are used.
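Stop-word removal is a simple filter over the token list. A minimal sketch, assuming a small hand-picked stop list (`STOP_WORDS` here is illustrative, not a standard list):

```python
# Illustrative stop list; real lists are usually much longer.
STOP_WORDS = frozenset({"a", "an", "and", "the", "of", "to", "in", "is"})

def remove_stop_words(tokens, stop_words=STOP_WORDS):
    """Return the tokens that are not in the stop list (case-insensitive)."""
    return [t for t in tokens if t.lower() not in stop_words]
```

For example, `remove_stop_words(["the", "cat", "is", "in", "the", "hat"])` keeps only "cat" and "hat".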

Normalization: “canonicalize” tokens to remove superficial differences, e.g. USA and U.S.A. → usa; C.A.T. → cat.
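One possible normalization rule that reproduces these examples is to delete periods and lowercase. The helper name `normalize` is an assumption, and real systems usually apply more rules than this.

```python
def normalize(token):
    """Canonicalize a token by removing periods and lowercasing it."""
    return token.replace(".", "").lower()
```

Note the trade-off: under this rule the acronym "C.A.T." and the word "cat" become the same token, so the distinction between them is lost.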

Tokenization

Convert a document into bag-of-words word counts. The common definition of a token is “any nonempty sequence of characters”.
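A minimal sketch of the bag-of-words conversion, assuming tokens are separated by whitespace and case is ignored (both are assumptions, not part of the definition above):

```python
from collections import Counter

def bag_of_words(document):
    """Map each token in the document to the number of times it occurs."""
    # str.split() with no arguments never yields empty strings,
    # so every token is a nonempty sequence of characters.
    tokens = document.lower().split()
    return Counter(tokens)
```

Word order is discarded: "the cat saw the other cat" and "the other cat saw the cat" produce the same counts.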

Stemming

The goal is to reduce all morphological variants of a word to a single term in order to reduce the dimensionality of the feature space.

  • Reduce verbs to an infinitive or stem form.
  • Remove the suffix from plural and singular forms.

Stemming is not a perfect technique, and some information is lost. For example, in the popular Porter stemming algorithm, both “university” and “universal” become “univers”.
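To illustrate the idea of suffix stripping, here is a deliberately naive stemmer. It is not the Porter algorithm, which applies many more rules in several ordered steps; the suffix list and the `min_stem` threshold are assumptions chosen for the example.

```python
# Suffixes are checked in this order, longest first.
SUFFIXES = ("ing", "ed", "es", "s")

def naive_stem(word, min_stem=3):
    """Strip the first matching suffix, keeping at least min_stem letters."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem:
            return word[: -len(suffix)]
    return word
```

"walking", "walked", and "walks" all reduce to "walk", but the same rules turn "horses" into "hors" while leaving "horse" unchanged, which shows how easily a stemmer can separate variants that belong together.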