See Notes on Logistic Regression by Charles Elkan.

Considered `discriminative`: as opposed to generative models, it computes $p(c_i|x_i)$ directly.

Very widely used. The linear function is the probability model, and the $f$ function is a transformation that enforces the constraints of a valid probability. Logistic Model: $p(c=1|x) = f(w^t x + w_0)$, where $f(z) = \frac{1}{1 + e^{-z}}$. $ p(c_i|x_i) = \frac{1}{1+e^{-(w^t x_i + w_0)}} $

$ p(c_i|x_i) = \frac{1}{1+e^{-(\sum_{j=1}^d \beta_j x_j)}} $ with weights $\beta_1 \dots \beta_d$. The argument can also be interpreted as a linear weighted sum of the inputs.

This defines a linear decision boundary. It can also be written as $w^t x + w_0 = \log \left( \frac{p(c=1|x)}{1-p(c=1|x)} \right)$, which is known as the **log-odds** function. It converts a probability in the range $(0,1)$ to a score in the range $(-\infty, +\infty)$.
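As a quick sanity check, the sigmoid and the log-odds are inverses of each other; a minimal sketch in plain Python:

```python
import math

def sigmoid(z):
    """Map a real-valued score to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def log_odds(p):
    """Inverse of the sigmoid: map (0, 1) back to (-inf, +inf)."""
    return math.log(p / (1.0 - p))

z = 2.0
p = sigmoid(z)                        # ≈ 0.8808
assert abs(log_odds(p) - z) < 1e-9    # round trip recovers the score
```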

This is the same general form as a Gaussian classification model with equal covariances. Unlike the Gaussian model, which assumes a model for each class, the weights of the logistic model are unconstrained and may move freely. A perceptron classification model also uses a linear decision function, but with a single threshold. In these ways, the logistic classifier is a generalization of perceptrons and Gaussian classifiers.

The “log-loss function” assigns the appropriate log-loss to each class: $ \log L(\theta) = \sum_{i=1}^N c_i \log f(x_i;\theta) + (1-c_i) \log \left(1-f(x_i;\theta)\right) $ This is considered a more principled loss function than MSE, but doesn't always perform differently in practice.
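A minimal sketch of evaluating this sum for a few hypothetical predictions (the labels and probabilities below are made up for illustration):

```python
import math

def log_loss(c, p):
    """Sum of per-example contributions:
    c_i * log f(x_i) + (1 - c_i) * log(1 - f(x_i))."""
    return sum(ci * math.log(pi) + (1 - ci) * math.log(1 - pi)
               for ci, pi in zip(c, p))

labels = [1, 0, 1]
probs = [0.9, 0.2, 0.8]   # hypothetical model outputs f(x_i)
print(log_loss(labels, probs))
```

Note that each term only "charges" the model for the probability it assigned to the correct class, so the sum equals the log of the product of those probabilities.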

Let $f(x_i) = \frac{1}{1 + e^{-w^t x_i}}$. (The $w_0$ has been absorbed into $w$ by appending a constant 1 feature to each $x_i$.)

Likelihood $ = \prod_{i=1}^N p(c_i|x_i) = \prod_{i=1}^N f(x_i|w)^{c_i} (1 - f(x_i|w))^{1-c_i}$ Log-Likelihood $ = \sum_{i=1}^N c_i \log f(x_i|w) + (1-c_i) \log(1 - f(x_i|w))$

Now the goal is to maximize this log-likelihood with respect to the weights. There is typically no closed-form solution for the weights. The function is concave, which means there is a single global maximum. Solve it using an iterative gradient-based search:
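One possible sketch of such a search: the gradient of the log-likelihood with respect to $w$ is $\sum_i (c_i - f(x_i))\, x_i$, so plain gradient ascent looks like the following (the learning rate, iteration count, and toy data are illustrative assumptions):

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def fit_logistic(X, c, lr=0.1, n_iters=1000):
    """Gradient ascent on the log-likelihood.
    Assumes each x in X already includes a constant-1 feature for w_0."""
    d = len(X[0])
    w = [0.0] * d
    for _ in range(n_iters):
        # Gradient of the log-likelihood: sum_i (c_i - f(x_i)) x_i
        grad = [0.0] * d
        for x, ci in zip(X, c):
            f = sigmoid(sum(wj * xj for wj, xj in zip(w, x)))
            for j in range(d):
                grad[j] += (ci - f) * x[j]
        # Ascend (maximize), so add the gradient
        w = [wj + lr * gj for wj, gj in zip(w, grad)]
    return w

# Tiny separable 1-D example; first feature is the constant 1 for the bias.
X = [[1.0, -2.0], [1.0, -1.0], [1.0, 1.0], [1.0, 2.0]]
c = [0, 0, 1, 1]
w = fit_logistic(X, c)
```

On separable data like this the unregularized weights keep growing with more iterations, which is one motivation for the regularization penalties discussed later.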

This method increases the opportunity for overfitting.

Replace $x$ by mapping it to a non-linear feature space: $x \to [x_1, x_1^2, x_1 x_2, x_2, x_2^2, \dots ]$. The features are replaced by some function of them, $\phi(x)$.

Logistic regression then learns a linear model `in this space`, which may be a non-linear model in the original feature space. You typically wouldn't do this in high dimensions, but it can help when the data is otherwise hard to separate.
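For example, a quadratic feature map for two-dimensional inputs (the function name here is my own) might look like:

```python
def quadratic_features(x):
    """Map [x1, x2] to [1, x1, x2, x1^2, x1*x2, x2^2].
    A linear boundary in this space can be a conic section
    (e.g. a circle or ellipse) in the original space."""
    x1, x2 = x
    return [1.0, x1, x2, x1 * x1, x1 * x2, x2 * x2]

print(quadratic_features([2.0, 3.0]))  # [1.0, 2.0, 3.0, 4.0, 6.0, 9.0]
```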

Want to learn a general model which will be effective on future/unseen data. In order to avoid overfitting to the training data, want to have some kind of penalty for unnecessarily large weights.

a.k.a. **ridge regression**.
Instead of maximizing the log-likelihood alone, $\text{maximize}_{w} \; \log L(w) - \lambda \sum_{j=1}^d w_j^2 $. In this case, weights “have to justify their existence in the model”: the $\lambda$ term pressures weights to be as small as they can. $\lambda$ can be set through cross-validation.

a.k.a. the “Lasso method”. More inclined to drive weights to zero than L2 regularization. By identifying a smaller set of predictors, it can aid in interpreting the weights. $\text{maximize}_{w} \; \log L(w) - \lambda \sum_{j=1}^d |w_j| $
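Both penalties can be sketched as a simple modification of the objective value; the helper below is hypothetical, just to make the two penalty terms concrete:

```python
def penalized_log_likelihood(log_l, w, lam, norm="l2"):
    """Subtract an L2 (ridge) or L1 (lasso) penalty from an
    already-computed log-likelihood value `log_l`."""
    if norm == "l2":
        penalty = sum(wj * wj for wj in w)    # lambda * sum w_j^2
    else:
        penalty = sum(abs(wj) for wj in w)    # lambda * sum |w_j|
    return log_l - lam * penalty

w = [3.0, -0.5, 0.0]
print(penalized_log_likelihood(-10.0, w, lam=0.1, norm="l2"))  # -10 - 0.1*9.25
print(penalized_log_likelihood(-10.0, w, lam=0.1, norm="l1"))  # -10 - 0.1*3.5
```

Note the zero weight contributes nothing to either penalty, which is why sparse weight vectors are "cheap" under these objectives.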

MAP methods: $\text{maximize}_{w} \; \log L(w) + \log p(w)$. Now the penalty term is a Bayesian prior. L2 regularization corresponds to a Gaussian prior with mean zero; L1 regularization corresponds to a Laplacian prior with mean zero.

These methods don't average over weights, which could be beneficial for interpreting the weights.

Say we have a binary classification problem: $y_i \in \{0,1\}$. Can train a classifier using regression techniques and MSE as the loss function.

$$ E[y|x] = \sum_y y \, p(y|x) = 1 \cdot p(y=1|x) + 0 \cdot p(y=0|x) = p(y=1|x) $$

$$p(c=1|x) = \frac{1}{1 + e^{-(w x + w_0)}}$$

As $x \to \infty, p(c=1|x) \to 1$. As $x \to -\infty, p(c=1|x) \to 0$ (for $w > 0$).

K classes, where $c_i \in \{1,2,\dots,K\}$. $p(c=k|x,w) = \frac{e^{w_k^t x}}{\sum_{m=1}^K e^{w_m^t x}}$
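A sketch of computing these class probabilities (the softmax); subtracting the max score before exponentiating is a standard numerical-stability trick, and the weight vectors below are illustrative:

```python
import math

def softmax_probs(ws, x):
    """p(c = k | x, w) = exp(w_k . x) / sum_m exp(w_m . x)."""
    scores = [sum(wj * xj for wj, xj in zip(wk, x)) for wk in ws]
    m = max(scores)                       # subtract max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]

ws = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]  # hypothetical weights, K = 3
p = softmax_probs(ws, [2.0, 1.0])
assert abs(sum(p) - 1.0) < 1e-12             # probabilities sum to 1
```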

Parameters: $K$ weight vectors $w_k$, each of dimension $d$.

Learning algorithm: a straightforward extension of the binary case, but now with additional subscripts.

This is more effective than trying to independently learn $O(K^2)$ pairwise boundaries between the classes.