Want a mapping from features to class labels, from $x$ to $C \in {c_1, \dots c_K}$.

We will focus on assuming the parameters of the classification model are known.

**generative models** = model $p(x|\theta) p(\theta)$ (e.g. naive Bayes)

- * Requires assumptions about the underlying distributions

**regression approach** = model $p(\theta|x)$ directly (e.g. logistic regression)

- * More widely used, doesn't have to assume knowledge of the underlying distribution. Useful when ranking is desired.

**look for decision boundaries directly** = no probabilities. Just find boundaries that minimize loss. (e.g. decision trees)

- * Useful when a decision needs to be made but confidence is not wanted.

<latex>\begin{align*}

- \text{Compute} &\text{ arg max}_k p(c=k|x)

\text{or} &\text{ arg max}_k p(x|c=k)p(c=k)

\text{or} &\text{ arg max}_k \log p(x|c=k) + \log p(c=k)
\end{align*}</latex>

Here, we refer to $\log p(x|c=k) + \log p(c=k) = g_k(x)$ as the **discriminant function**. $g_1(x) - g_2(x) = 0$ defines a decision boundary in x-space. The decision boundary will be linear if both discriminant functions are linear.

For each class $c_1 \dots c_K$, compute $g_k(x) = k^{th}$ discriminant function. Then to make a prediction, select $\hat c_k = argmax_k g_k(x)$. Examples of discriminant functions: $g_k(x) = p(c_k|x)$ or $g_k(x) = p(x|c_k)P(c_k)$ or $g_k(x) = \log p(x|c_k) \log P(c_k)$, which each give the same prediction. These are all optimal if the $p$ functions are valid.

Linear discriminants can be written in the frm $g_k(x) = w_k^{+} x + \alpha_k$ where $w$ is a $dx1$ weight vector. This is 'linear' because the number of parameters is linear to the dimensions of the data.

For $x \in \mathbb{R}$, decision region is defined as the region where the discriminant function returns a particular class. The boundary between any two regions can be found by solving for where their respective discriminant functions are equal. For two linear discriminant functions, the decision boundary will also be linear.

See section 4.2 of the text or Gaussian Classification Decision Boundary. Assume a gaussian model for each class, and that the parameters have already been estimated. <latex>\begin{align*}

- p(x|c) &\sim \mathcal{N}

g_k(x) &= \log p(x|c_k) + \log p(c_k)

\log p(x|c_k) &= -\frac{1}{2} \log |\Epsilon_k| - \frac{1}{2} (x-\mu_k)^T\Epsilon_k^{-1}(x-\mu_k)

&= - \frac{1}{2} x^T\Epsilon_k^{-1}x + \left(\Epsilon_k^{-1}\mu_k\right)^T x - \frac{1}{2} \mu_k^T \Epsilon_k^{-1} \mu_k
\end{align*}</latex>

The first term is quadratic in $x$. The second in linear and the third is independent of $x$.

Special case: When each class has a common covariance matrix, the decision boundary will be linear. So the first term of the discriminant function above can be ignored to do classification.

This is sometimes referred to as **linear discriminant analysis**.

A non-probabilistic classifier would miss the class priors.

Can think of a classifier as a map from data x to predicted class labels ${c_1,c_2,\dots c_k}$. The input space can then be broken up into **decision regions** corresponding to the class assigned to data from each region.
An **optimal classifier** minimizes the `expected loss`

of future predictions.

Let $\lambda_{kj} \geq 0$ be the `loss`

incurred by predictions. A common scheme is the 0-1 loss, which basically just counts the number of errors without regard to the degree of the error. Depending on the problem, may need to distinguish between error degree or type-1 and type-2 errors.

Choosing optimally: For $x$, we have $p(c_1|x), p(c_2|x),\dots p(c_k|x)$. Expected loss for a single prediction $c_k$ will be $\sum_{d=1}^K \lambda_{kd} p(c_d|x)$. Unless there is some kind of adversarial/game-theoretic issue, there is no advantage to randomizing this prediction. Should choose the minimum expected loss.

Overall expected loss: If the function used to make predictions is limited, may want to focus on places where data tends to occur: <latex>\begin{align*}

- E(\lambda) &= \int \lambda_{k^*}(x) p(x) dx

&= \int \sum_{d=1}^K \lambda_{kd} p(c_d|x) p(x) dx

&= \sum_{k=1}^K \int_{R_k} \sum_{d=1}^K \lambda_{kd} p(c_d|x) p(x) dx & \text{where decision region k} = R_k
\end{align*}</latex>

For 0-1 loss functions, $E(\lambda) = \sum_{k=1}^K \int_{R_k} (1- \max_k p(c_k|x) p(x) dx $.
This is called the **Bayes error rate**. In practice, it is almost always greater than zero.

$D = {x_i,c_i}$ where $c_i$ are class labels. Model both $p(x|c_k,\theta_k)$ and $p(c_k)$. The problem tends to be estimating the likelihood for high-dimensional data. One reason this is difficult is the difficulty of processing so much data. Another issue is that many more assumptions tend to be made about the independence of each dimension.

Likelihood = $\prod_{i=1}^N p(x_i|c_i)p(c_i)$ Can optimize for each class separately, given independence assumptions.

Model $p(c_k|x)$ directly, and avoid $p(x|c_k)$.

Likelihood = $\prod_{i=1}^N p(x_i|c_i)p(c_i)$ $ = \prod_{i=1}^N p(c_i|x_i)p(x_i)$ $ = \prod_{i=1}^N p(c_i|x_i)$ Optimize the conditional likelihood, ignore the prior.