Want a mapping from features to class labels: a function from $x \in \mathbb{R}^d$ to $y \in \{1, \dots, C\}$.

We will focus on the setting where the parameters of the classification model are assumed to be known.

**generative models** = model $p(x \mid y)$ and $p(y)$ (e.g. naive Bayes)

- Requires assumptions about the underlying distributions

**regression approach** = model $p(y \mid x)$ directly (e.g. logistic regression)

- More widely used; doesn't have to assume knowledge of the underlying distribution. Useful when a ranking is desired.

**look for decision boundaries directly** = no probabilities; just find boundaries that minimize loss (e.g. decision trees)

- Useful when a decision needs to be made but a confidence measure is not wanted.

Here, we refer to $g_c(x)$ as the **discriminant function**. The set of points where $g_1(x) = g_2(x)$ defines a decision boundary in $x$-space. The decision boundary will be linear if both discriminant functions are linear.

For each class $c$, compute the discriminant function $g_c(x)$. Then, to make a prediction, select $\hat{y} = \arg\max_c g_c(x)$. Examples of discriminant functions: $p(y = c \mid x)$, $\log p(y = c \mid x)$, or $p(x \mid y = c)\, p(y = c)$, which each give the same prediction (the $\arg\max$ is unchanged by monotone transformations). These are all optimal if the probability models are valid.

Linear discriminants can be written in the form $g_c(x) = w_c^\top x + w_{c0}$, where $w_c$ is a weight vector. This is 'linear' because the number of parameters is linear in the dimension of the data.

For each class $c$, the decision region $R_c$ is defined as the region of input space where the discriminant function returns that class. The boundary between any two regions can be found by solving for where their respective discriminant functions are equal, $g_i(x) = g_j(x)$. For two linear discriminant functions, the decision boundary will also be linear.
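As a minimal sketch of prediction with linear discriminants (the weights below are hypothetical, chosen only for illustration), classification is just an argmax over the $g_c(x)$:

```python
import numpy as np

# Hypothetical linear discriminants g_c(x) = w_c . x + w_c0 for 3 classes in 2-D.
W = np.array([[1.0, 0.0],     # weight vector for class 0
              [0.0, 1.0],     # class 1
              [-1.0, -1.0]])  # class 2
w0 = np.array([0.0, 0.0, 0.5])

def predict(x):
    """Evaluate each discriminant and pick the class with the largest value."""
    g = W @ x + w0
    return int(np.argmax(g))

print(predict(np.array([2.0, 0.5])))  # g = [2.0, 0.5, -2.0] -> class 0
```

The decision boundary between classes $i$ and $j$ is the set where rows $i$ and $j$ give equal scores, which here is a line.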

See section 4.2 of the text or Gaussian Classification Decision Boundary.

Assume a Gaussian model for each class, and that the parameters $\mu_c$, $\Sigma_c$ have already been estimated. Taking $g_c(x) = \log p(x \mid y = c) + \log p(y = c)$ and dropping constants gives

$$g_c(x) = -\tfrac{1}{2} x^\top \Sigma_c^{-1} x + \mu_c^\top \Sigma_c^{-1} x - \tfrac{1}{2} \mu_c^\top \Sigma_c^{-1} \mu_c - \tfrac{1}{2} \log |\Sigma_c| + \log p(y = c)$$

The first term is quadratic in $x$, the second is linear, and the remaining terms are independent of $x$.

Special case: when each class has a common covariance matrix $\Sigma$, the quadratic term is the same for every class, so the first term of the discriminant function above can be ignored for classification, and the decision boundary will be linear.

This is sometimes referred to as **linear discriminant analysis**.
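The shared-covariance case can be sketched as follows (the means, covariance, and priors are made up for illustration); with the quadratic term dropped, each discriminant is linear in $x$:

```python
import numpy as np

# Hypothetical class-conditional Gaussians with a *shared* covariance (LDA case).
mu = [np.array([0.0, 0.0]), np.array([2.0, 0.0])]
Sigma = np.array([[1.0, 0.0], [0.0, 1.0]])
prior = [0.5, 0.5]
Sigma_inv = np.linalg.inv(Sigma)

def g(c, x):
    """Gaussian discriminant; the shared quadratic term -x' S^-1 x / 2 is dropped."""
    return (mu[c] @ Sigma_inv @ x
            - 0.5 * mu[c] @ Sigma_inv @ mu[c]
            + np.log(prior[c]))

x = np.array([0.5, 1.0])
pred = 0 if g(0, x) > g(1, x) else 1  # here class 0 wins
```

Solving $g_0(x) = g_1(x)$ for these parameters gives the vertical line $x_1 = 1$, midway between the two means, as expected with equal priors.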

A non-probabilistic classifier would miss the class priors.

Can think of a classifier as a map from data $x$ to predicted class labels $\hat{y}(x)$. The input space can then be broken up into **decision regions** corresponding to the class assigned to data from each region.

An **optimal classifier** minimizes the `expected loss` of future predictions, $\mathbb{E}_{x, y}[L(y, \hat{y}(x))]$.

Let $L(y, \hat{y})$ be the `loss` incurred by predicting $\hat{y}$ when the true label is $y$. A common scheme is the 0-1 loss, which basically just counts the number of errors without regard to the degree of the error. Depending on the problem, we may need to distinguish between degrees of error, or between type-1 and type-2 errors.

Choosing $\hat{y}(x)$ optimally:

For a given input $x$, we have the posterior $p(y = c \mid x)$ for each class $c$.

Expected loss for a single prediction will be $\mathbb{E}[L(y, \hat{y}) \mid x] = \sum_c L(c, \hat{y})\, p(y = c \mid x)$.

Unless there is some kind of adversarial/game-theoretic issue, there is no advantage to randomizing this prediction: we should choose $\hat{y}(x) = \arg\min_{\hat{y}} \sum_c L(c, \hat{y})\, p(y = c \mid x)$, the prediction with minimum expected loss.
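A minimal sketch of this decision rule, with a made-up posterior and a made-up asymmetric loss matrix (a false negative costs ten times a false positive):

```python
import numpy as np

# Hypothetical posterior over 2 classes and a loss matrix L[y, yhat].
posterior = np.array([0.7, 0.3])   # p(y = c | x)
L = np.array([[0.0, 1.0],          # true class 0: predicting 1 costs 1
              [10.0, 0.0]])        # true class 1: predicting 0 costs 10

# Expected loss of each candidate prediction: sum_c L[c, yhat] * p(y = c | x).
expected_loss = posterior @ L      # -> [3.0, 0.7]
yhat = int(np.argmin(expected_loss))
```

Note the effect of the asymmetric loss: the minimum-expected-loss prediction is class 1 even though class 0 has the higher posterior probability.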

Overall expected loss: $\mathbb{E}_x\!\left[\sum_c L(c, \hat{y}(x))\, p(y = c \mid x)\right] = \int \sum_c L(c, \hat{y}(x))\, p(y = c \mid x)\, p(x)\, dx$.

If the function used to make predictions is limited in flexibility, we may want to focus on the regions where data tends to occur: the $p(x)$ factor in the integral weights errors by how likely each input is.

For 0-1 loss functions, the optimal prediction is $\hat{y}(x) = \arg\max_c p(y = c \mid x)$, where $c$ ranges over the class labels, and the minimum achievable expected loss is $\mathbb{E}_x[1 - \max_c p(y = c \mid x)]$.

This is called the **Bayes error rate**. In practice, it is almost always greater than zero.
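For a toy problem where the true posteriors are known, $\mathbb{E}_x[1 - \max_c p(y = c \mid x)]$ can be estimated by Monte Carlo. This sketch assumes a made-up 1-D problem with two equally likely unit-variance Gaussian classes:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy 1-D problem: two equally likely classes, x | y=c ~ N(mu_c, 1).
mu = np.array([-1.0, 1.0])

def posterior(x):
    """p(y = c | x) for each class, via Bayes' rule with equal priors."""
    lik = np.exp(-0.5 * (x[:, None] - mu) ** 2)  # unnormalized N(mu_c, 1) density
    return lik / lik.sum(axis=1, keepdims=True)

# Monte Carlo estimate of the Bayes error rate E_x[1 - max_c p(y = c | x)].
y = rng.integers(0, 2, size=200_000)
x = rng.normal(mu[y], 1.0)
bayes_err = np.mean(1.0 - posterior(x).max(axis=1))
```

For these parameters the exact Bayes error rate is $\Phi(-1) \approx 0.159$: even the optimal classifier errs wherever the two class densities overlap.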

Model both $p(x \mid y)$ and $p(y)$. The problem tends to be estimating the likelihood $p(x \mid y)$ for high-dimensional data. One reason this is difficult is the amount of data needed to estimate a high-dimensional density. Another issue is that many more assumptions tend to be made about the independence of each dimension.

Likelihood = $\prod_{i=1}^{n} p(x_i \mid y_i)\, p(y_i)$

Can optimize for each class separately, given independence assumptions.
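Because the likelihood factorizes over classes (and, under the naive Bayes independence assumption, over dimensions), each class's parameters can be fit from that class's examples alone. A sketch with a tiny made-up dataset, assuming Gaussian per-dimension densities:

```python
import numpy as np

def fit_gaussian_nb(X, y, n_classes):
    """Fit Gaussian naive Bayes: since the likelihood prod_i p(x_i|y_i) p(y_i)
    factorizes over classes, each class's per-dimension mean and variance are
    estimated from that class's rows only."""
    params = []
    for c in range(n_classes):
        Xc = X[y == c]
        # (per-dimension mean, per-dimension variance, class prior)
        params.append((Xc.mean(axis=0), Xc.var(axis=0) + 1e-9, len(Xc) / len(X)))
    return params

# Tiny made-up dataset: 2 features, 2 classes.
X = np.array([[0.0, 1.0], [0.2, 0.8], [2.0, 3.0], [1.8, 3.2]])
y = np.array([0, 0, 1, 1])
params = fit_gaussian_nb(X, y, n_classes=2)
```

No optimization over all classes jointly is needed; the fit decomposes into independent per-class, per-dimension estimates.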

Model $p(y \mid x)$ directly, and avoid modeling $p(x)$.

Likelihood = $\prod_{i=1}^{n} p(y_i \mid x_i)\, p(x_i)$

Optimize the conditional likelihood $\prod_i p(y_i \mid x_i)$; ignore the $p(x_i)$ factor, which does not depend on the classifier's parameters.
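For logistic regression, maximizing the conditional likelihood has no closed form, but the conditional log-likelihood is concave and can be maximized by gradient ascent. A sketch on made-up data (all names and parameters here are illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)

# Made-up binary data; model p(y=1 | x) = sigmoid(w . x), bias folded into x.
X = np.c_[rng.normal(size=(100, 1)), np.ones(100)]
w_true = np.array([2.0, -0.5])
y = (rng.random(100) < 1 / (1 + np.exp(-X @ w_true))).astype(float)

def cond_log_lik(w):
    """Conditional log-likelihood sum_i log p(y_i | x_i); p(x_i) is ignored."""
    p = 1 / (1 + np.exp(-X @ w))
    return np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))

# Gradient ascent on the conditional log-likelihood.
w = np.zeros(2)
for _ in range(500):
    p = 1 / (1 + np.exp(-X @ w))
    w += 0.1 * X.T @ (y - p) / len(y)
```

Note that nothing here models how the inputs $x_i$ are distributed; only $p(y \mid x)$ is fit.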