See Notes on Logistic Regression by Charles Elkan.

Considered `discriminative`

: As opposed to generative models, it computes $p(y \mid x)$ directly.

Very widely used. A linear function supplies the underlying score, and the sigmoid function is a transformation that enforces the constraints of a valid probability.

Logistic Model: $p(y=1 \mid x; w) = \sigma\left(b + \sum_j w_j x_j\right)$, where $\sigma(z) = \frac{1}{1 + e^{-z}}$.

with weights $w = (w_1, \dots, w_d)$ and intercept $b$.

Can also be interpreted as a linear weighted sum of the inputs: $b + \sum_j w_j x_j$.

This defines a linear decision boundary. It can also be written as $\log \frac{p}{1-p} = b + \sum_j w_j x_j$, which is known as the **log-odds** function. It converts the probability from the range $(0,1)$ to the range $(-\infty, +\infty)$.
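As a quick numerical sketch (the weights and input here are made up for illustration, not from the notes), the sigmoid maps a linear score to a probability, and the log-odds function inverts it:

```python
import numpy as np

def sigmoid(z):
    """Map a real-valued score z in (-inf, +inf) to a probability in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def log_odds(p):
    """Inverse of the sigmoid: map a probability back to a real-valued score."""
    return np.log(p / (1.0 - p))

# A linear score from hypothetical weights w, intercept b, and input x.
w = np.array([0.5, -1.0])
b = 0.25
x = np.array([2.0, 1.0])
z = b + w @ x                        # 0.25 + 1.0 - 1.0 = 0.25
p = sigmoid(z)
print(round(float(log_odds(p)), 6))  # recovers z: 0.25
```

Round-tripping through `sigmoid` and `log_odds` recovers the original linear score, which is exactly the [0,1] ↔ (-inf, +inf) conversion described above.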

This is the same general form as a Gaussian classification model using equal covariances. Unlike the Gaussian model, which assumes a Gaussian distribution for each class, the weights of the logistic model are unconstrained and may move freely. A perceptron classification model also uses a linear decision function, but with a single hard threshold. In these ways, the logistic classifier is a generalization of perceptrons and Gaussian classifiers.

The “log-loss function” gives the appropriate log-loss for each class: $L(y, p) = -\left[\, y \log p + (1 - y) \log(1 - p) \,\right]$

This is considered a more principled loss function than MSE, but doesn't always perform differently in practice.
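A quick numerical comparison (with hypothetical values) shows one sense in which log-loss is more principled: it penalizes a confident wrong prediction far more heavily than MSE does:

```python
import numpy as np

def log_loss(y, p):
    # Per-example log-loss: -[y log p + (1 - y) log(1 - p)]
    return -(y * np.log(p) + (1 - y) * np.log(1 - p))

def mse(y, p):
    # Per-example squared error.
    return (y - p) ** 2

# True label 1, but the model confidently predicts p = 0.01.
print(float(log_loss(1, 0.01)))  # ~4.6: large penalty for confident error
print(float(mse(1, 0.01)))       # ~0.98: bounded penalty
```

MSE is bounded by 1 per example no matter how wrong the model is, while log-loss grows without bound as the predicted probability of the true class goes to zero.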

Let $p_i = p(y=1 \mid x_i; w)$. (The $b$ has been absorbed into the $x$s by fixing $x_{i0} = 1$, so $w_0$ plays the role of $b$.)

Likelihood: $L(w) = \prod_i p_i^{y_i} (1 - p_i)^{1 - y_i}$

Log-Likelihood: $\ell(w) = \sum_i \left[\, y_i \log p_i + (1 - y_i) \log(1 - p_i) \,\right]$

Now the goal is to maximize this Log-Likelihood with respect to the weights. There is typically no closed-form solution for the weights. The function is concave, which means there is a single global maximum. Solve it using an iterative gradient-based search: $w_j \leftarrow w_j + \alpha \sum_i (y_i - p_i)\, x_{ij}$, where $\alpha$ is the learning rate.
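A minimal sketch of this gradient ascent, on a toy separable dataset with batch updates (the function name and data are illustrative, not from the notes):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_logistic(X, y, lr=0.1, iters=1000):
    """Batch gradient ascent on the log-likelihood.
    X: (n, d) design matrix with a leading column of 1s (bias absorbed).
    y: (n,) labels in {0, 1}.
    """
    w = np.zeros(X.shape[1])
    for _ in range(iters):
        p = sigmoid(X @ w)
        grad = X.T @ (y - p)       # gradient of the log-likelihood
        w += lr * grad / len(y)    # averaged update for a stable step size
    return w

# Toy data: the class is determined by the sign of the second feature.
X = np.array([[1.0, -2.0], [1.0, -1.0], [1.0, 1.0], [1.0, 2.0]])
y = np.array([0, 0, 1, 1])
w = fit_logistic(X, y)
preds = (sigmoid(X @ w) > 0.5).astype(int)
print(preds)  # recovers the labels: [0 0 1 1]
```

Because the log-likelihood is concave, this simple ascent converges toward the (single) global maximum regardless of initialization.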

This method increases the opportunity for overfitting.

Replace $x$ by mapping it to a non-linear feature space: $x \mapsto \phi(x)$.

The features are replaced by some function of them

Logistic regression then learns a linear model `in this space`, which may be a non-linear model in the original feature space. You typically wouldn't do this in high dimensions, but it can help when linearly separating the data is a problem.
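As an illustration, a hypothetical quadratic feature map $\phi$ makes a circular decision boundary (non-linear in the original space) expressible as a linear boundary in the expanded space:

```python
import numpy as np

def phi(x):
    # Hypothetical quadratic feature map for 2-D input:
    # [1, x1, x2, x1^2, x2^2, x1*x2]
    x1, x2 = x
    return np.array([1.0, x1, x2, x1 ** 2, x2 ** 2, x1 * x2])

# A circular boundary x1^2 + x2^2 = 1 is non-linear in (x1, x2), but in
# phi-space it is the linear rule -1*phi[0] + 1*phi[3] + 1*phi[4] = 0.
w_phi = np.array([-1.0, 0.0, 0.0, 1.0, 1.0, 0.0])
inside = np.array([0.1, 0.2])    # inside the unit circle
outside = np.array([2.0, 2.0])   # outside the unit circle
print(float(w_phi @ phi(inside)) < 0)   # True: negative score inside
print(float(w_phi @ phi(outside)) > 0)  # True: positive score outside
```

Logistic regression trained on $\phi(x)$ could learn weights like `w_phi` directly, even though no linear boundary in the original two features separates these classes.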

Want to learn a general model which will be effective on future/unseen data.

In order to avoid overfitting to the training data, want to have some kind of penalty for unnecessarily large weights.

**L2 regularization**, a.k.a. **ridge regression**.

Instead of maximizing the log-likelihood alone, maximize the penalized objective $\ell(w) - \lambda \sum_j w_j^2$. In this case, weights “have to justify their existence in the model”: the $\lambda$ term pressures weights to be as small as they can. $\lambda$ can be set through cross-validation.
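A sketch of the gradient update for the L2-penalized objective $\ell(w) - \lambda \sum_j w_j^2$ (names and data are illustrative, not from the notes):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_ridge_logistic(X, y, lam=0.1, lr=0.1, iters=1000):
    """Gradient ascent on the penalized log-likelihood LL(w) - lam * ||w||^2."""
    w = np.zeros(X.shape[1])
    for _ in range(iters):
        p = sigmoid(X @ w)
        # The -2 * lam * w term is the gradient of the penalty: it
        # continually shrinks weights that the data do not justify.
        grad = X.T @ (y - p) / len(y) - 2 * lam * w
        w += lr * grad
    return w

X = np.array([[1.0, -2.0], [1.0, -1.0], [1.0, 1.0], [1.0, 2.0]])
y = np.array([0.0, 0.0, 1.0, 1.0])
w_penalized = fit_ridge_logistic(X, y, lam=1.0)
w_unpenalized = fit_ridge_logistic(X, y, lam=0.0)
# A larger lambda yields smaller weights.
print(np.linalg.norm(w_penalized) < np.linalg.norm(w_unpenalized))  # True
```

On separable data like this toy set, the unpenalized weights keep growing with more iterations; the penalty is what keeps them finite and small.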

**L1 regularization**, a.k.a. the “Lasso method”: maximize $\ell(w) - \lambda \sum_j |w_j|$.

More inclined to drive weights exactly to zero than L2. By identifying a smaller set of predictors, it can aid in interpreting the weights.

MAP methods:

Now the penalty term is basically a Bayesian prior. L2 regularization corresponds to a Gaussian prior with mean zero. L1 regularization corresponds to a Laplacian prior with mean zero.
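The correspondence can be sketched as a short derivation (standard Bayesian reasoning, not spelled out in these notes):

```latex
\hat{w}_{\mathrm{MAP}}
  = \arg\max_w \; \log p(y \mid X, w) + \log p(w)

% Gaussian prior: \log p(w) = -\sum_j w_j^2 / (2\sigma^2) + \mathrm{const}
%   => L2 penalty with \lambda = 1 / (2\sigma^2)
% Laplacian prior: \log p(w) = -\sum_j |w_j| / b + \mathrm{const}
%   => L1 penalty with \lambda = 1 / b
```

So a stronger penalty $\lambda$ simply corresponds to a tighter prior (smaller $\sigma^2$ or $b$) around zero.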

These methods don't average over weights, which could be beneficial for interpreting the weights.

Say we have a binary classification problem: $y \in \{0, 1\}$. Can train a classifier using regression techniques and MSE as the loss function.

K classes, where $y \in \{1, \dots, K\}$.

Parameters: $K$ weight vectors, one per class, each with the same dimension as $x$.

Learning algorithm: straightforward extensions of the binary case, but now there are additional subscripts.

This works better than trying to independently learn boundaries between the classes.
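A sketch of the multiclass extension, assuming the standard softmax formulation with one weight vector per class (the weights and input here are made up):

```python
import numpy as np

def softmax(z):
    """Turn K real-valued class scores into K probabilities summing to 1."""
    z = z - z.max()      # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum()

# K = 3 hypothetical weight vectors, one row per class; the bias is
# absorbed by making the first component of x equal to 1.
W = np.array([[1.0, 0.0, 0.5],
              [0.0, 1.0, -0.5],
              [0.5, 0.5, 0.0]])
x = np.array([1.0, 2.0, 3.0])
p = softmax(W @ x)                   # scores: [2.5, 0.5, 1.5]
print(round(float(p.sum()), 6))      # probabilities sum to 1: 1.0
print(int(p.argmax()))               # predicted class: 0
```

With $K = 2$ this reduces to the binary sigmoid model, which is why the learning algorithm is a straightforward extension of the binary case.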