The **Naive Bayes** assumption in classification is that the features of an observation are conditionally independent of one another given the class.
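
Formally, for a feature vector $\mathbf{x} = (x_1, \ldots, x_d)$ and class $c_k$, the class-conditional distribution factorizes as

$$
P(x_1, \ldots, x_d \mid c_k) = \prod_{j=1}^{d} P(x_j \mid c_k).
$$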

(see lecture notes http://www.ics.uci.edu/~smyth/courses/cs277/public_slides/text_classification.pdf)

- For each class $c_k$ (implemented in the sketch below):
  - For each feature $x_j$:
    - $P(x_j = 1 \mid c_k) = n_{jk} / n_k$, where $n_{jk}$ is the number of class-$k$ documents containing feature $j$
  - $P(c_k) = n_k / N$, where $n_k$ is the number of documents in class $c_k$ and $N$ is the total number of documents
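
A minimal Python sketch of this training loop, assuming each document is represented as a set of word indices (present/absent, i.e. Bernoulli features); the function name and data layout are illustrative, not from the slides:

```python
from collections import defaultdict

def train_bernoulli_nb(docs, labels):
    """Estimate Bernoulli Naive Bayes parameters by maximum likelihood.

    docs:   list of sets, each the words present in one document
    labels: list of class labels, one per document
    Returns (priors, cond): priors[c] = P(c_k), cond[c][j] = P(x_j = 1 | c_k).
    """
    N = len(docs)
    n_k = defaultdict(int)                        # n_k: documents in class k
    n_jk = defaultdict(lambda: defaultdict(int))  # n_jk[k][j]: class-k docs containing word j

    for doc, c in zip(docs, labels):
        n_k[c] += 1
        for j in doc:
            n_jk[c][j] += 1

    priors = {c: n_k[c] / N for c in n_k}                  # P(c_k) = n_k / N
    cond = {c: {j: n_jk[c][j] / n_k[c] for j in n_jk[c]}   # P(x_j = 1 | c_k) = n_jk / n_k
            for c in n_k}
    # Note: words never seen with class c are absent from cond[c],
    # i.e. implicitly probability zero -- the problem smoothing fixes below.
    return priors, cond

# Tiny usage example with made-up documents:
docs = [{"free", "money"}, {"meeting", "tomorrow"}, {"free", "offer"}]
labels = ["spam", "ham", "spam"]
priors, cond = train_bernoulli_nb(docs, labels)
# priors["spam"] == 2/3; cond["spam"]["free"] == 1.0
```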

Training complexity is $O(T + kd)$: linear in $T$, the total number of word tokens, plus a term for filling in the $k \times d$ parameter table, where $k$ is the number of classes and $d$ the number of features.

Smoothing is essential to account for zero counts in the training data; a Beta prior on each feature probability is common. Without smoothing, the classifier can learn $P(x_j = 1 \mid c_k) = 0$ for a feature never seen with class $c_k$ in training, and any future document containing that feature is then assigned zero probability for the class, no matter how strong the other evidence.
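
Concretely, with a $\mathrm{Beta}(\alpha, \beta)$ prior on each $P(x_j = 1 \mid c_k)$, the posterior-mean estimate is

$$
P(x_j = 1 \mid c_k) = \frac{n_{jk} + \alpha}{n_k + \alpha + \beta},
$$

and $\alpha = \beta = 1$ gives Laplace (add-one) smoothing, $(n_{jk} + 1)/(n_k + 2)$, which is never exactly zero.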

…

Naive Bayes was widely used in early spam email classifiers.