Michael Data

See Notes on Logistic Regression by Charles Elkan.

Logistic Regression

Considered discriminative: as opposed to generative models, it computes $p(c_i|x_i)$ directly.

Very widely used. The linear function $w^t x + w_0$ defines the model, and $f$ is a transformation that maps it into $[0,1]$ so the output behaves as a probability.
Logistic Model: $p(c=1|x) = f(w^t x + w_0)$, where $f(z) = \frac{1}{1 + e^{-z}}$.
$ p(c_i|x_i) = \frac{1}{1+e^{-(w^t x_i + w_0)}} $

The argument of $f$ can also be interpreted as a linear weighted sum of the inputs:
$ p(c_i|x_i) = \frac{1}{1+e^{-\sum_{j=1}^d \beta_j x_{ij}}} $ with weights $\beta_1, \dots, \beta_d$.

This defines a linear decision boundary. It can also be written as $w^t x + w_0 = \log \left( \frac{p(c=1|x)}{1-p(c=1|x)} \right)$, which is known as the log-odds (logit). It converts a probability in $[0,1]$ to a value in $(-\infty, +\infty)$.
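One way to see where the log-odds form comes from is to invert the sigmoid; writing $p$ for $p(c=1|x)$:

$$ p = \frac{1}{1+e^{-(w^t x + w_0)}} \;\Rightarrow\; \frac{p}{1-p} = e^{w^t x + w_0} \;\Rightarrow\; \log \frac{p}{1-p} = w^t x + w_0 $$

The decision boundary $p = \tfrac{1}{2}$ is then exactly the set where $w^t x + w_0 = 0$, which is linear in $x$.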

This is the same general form as a Gaussian classification model with equal covariances. Unlike the Gaussian model, which assumes a model for each class, the weights of the logistic model are unconstrained and may move freely. A perceptron classification model also uses a linear decision function, but with a hard threshold rather than a probability. In these ways, the logistic classifier is a generalization of perceptrons and Gaussian classifiers.

Log-Loss

The “log-loss” function scores each example by the log-probability assigned to its true class:
$ \log L(\theta) = \sum_{i=1}^N c_i \log f(x_i;\theta) + (1-c_i) \log \left(1-f(x_i;\theta)\right) $
This is considered a more principled loss function than MSE, but doesn't always perform differently in practice.

Learning the Weights

Let $f(x_i) = \frac{1}{1 + e^{-w^t x_i}}$. (The bias $w_0$ has been absorbed into $w$ by appending a constant 1 feature to each $x_i$.)

Likelihood $ = \prod_{i=1}^N p(c_i|x_i) = \prod_{i=1}^N f(x_i|w)^{c_i} (1 - f(x_i|w))^{1-c_i}$
Log-Likelihood $ = \sum_{i=1}^N c_i \log f(x_i|w) + (1-c_i) \log(1 - f(x_i|w))$

Now the goal is to maximize this log-likelihood with respect to the weights. There is typically no closed-form solution for the weights, but the function is concave, which means there is a single global maximum. Solve it with an iterative gradient-based search; the gradient of the log-likelihood is $\nabla_w \log L(w) = \sum_{i=1}^N \left( c_i - f(x_i|w) \right) x_i$.
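As an illustration (not from the original notes), here is a minimal batch gradient-ascent sketch in Python/NumPy; the names `sigmoid`, `fit_logistic`, `lr`, and `n_iters` are my own, and a real implementation would typically use a line search or a second-order method such as Newton's method.

```python
import numpy as np

def sigmoid(z):
    # Logistic function f(z) = 1 / (1 + exp(-z)).
    return 1.0 / (1.0 + np.exp(-z))

def fit_logistic(X, c, lr=0.1, n_iters=1000):
    """Maximize the log-likelihood by batch gradient ascent.

    X : (N, d) array; include a constant-1 column so w_0 is absorbed into w.
    c : (N,) array of 0/1 labels.
    """
    w = np.zeros(X.shape[1])
    for _ in range(n_iters):
        p = sigmoid(X @ w)          # p(c_i = 1 | x_i, w) for each example
        grad = X.T @ (c - p)        # gradient of the log-likelihood
        w += lr * grad / len(c)     # step uphill on the concave objective
    return w
```

Because the objective is concave, a sufficiently small step size converges to the global maximum.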

Non-linearity in the Inputs

Replace $x$ by mapping it to a non-linear feature space: $x \to [x_1, x_1^2, x_1 x_2, x_2, x_2^2, \dots ]$
That is, the features are replaced by some function $\phi(x)$ of them.

Logistic regression then learns a linear model in this feature space, which may be a non-linear model in the original feature space. Because the expanded space has more dimensions, this method also increases the opportunity for overfitting, so you typically wouldn't do this in high dimensions; it can help when the data are not linearly separable in the original features (a sketch of such a mapping follows).
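For example, a quadratic feature map for 2-D inputs could look like the sketch below; the helper name `quadratic_features` is hypothetical and pairs with the `fit_logistic` sketch above.

```python
import numpy as np

def quadratic_features(X):
    """Map 2-D inputs x = (x1, x2) to phi(x) = [1, x1, x2, x1^2, x1*x2, x2^2]."""
    x1, x2 = X[:, 0], X[:, 1]
    return np.column_stack([np.ones(len(X)), x1, x2, x1**2, x1 * x2, x2**2])

# Linear weights on phi(x) give a quadratic decision boundary in the original x space:
# w = fit_logistic(quadratic_features(X), c)
```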

Regularization and Priors

Want to learn a general model that will be effective on future/unseen data.
To avoid overfitting to the training data, add some kind of penalty for unnecessarily large weights.

L2 Regularization

a.k.a. ridge regression.
Instead of maximizing the log-likelihood alone, $\text{maximize}_{w} \; \log L(w) - \lambda \sum_{j=1}^d w_j^2$. In this case, weights “have to justify their existence in the model”: the $\lambda$ term pressures the weights to be as small as they can. $\lambda$ can be set through cross-validation.

L1 Regularization

a.k.a. the “Lasso method”
More inclined than L2 to drive weights all the way to zero. By identifying a smaller set of predictors, it can aid in interpreting the weights.
$ \text{maximize}_{w} \; \log L(w) - \lambda \sum_{j=1}^d |w_j| $
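As a sketch of how the penalty changes the gradient-based search (my own illustration, using the same setup as `fit_logistic` above):

```python
import numpy as np

def penalized_grad(X, c, w, lam, penalty="l2"):
    """Gradient of the regularized objective log L(w) - lambda * penalty(w)."""
    p = 1.0 / (1.0 + np.exp(-(X @ w)))
    grad = X.T @ (c - p)                 # gradient of the log-likelihood
    if penalty == "l2":
        grad -= 2.0 * lam * w            # d/dw of lambda * sum_j w_j^2
    else:  # "l1"
        grad -= lam * np.sign(w)         # subgradient of lambda * sum_j |w_j|
    return grad
```

In practice the intercept weight is usually left unpenalized, and plain subgradient steps on the L1 objective do not produce exact zeros; coordinate-wise or proximal methods are typically used for the Lasso.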

Bayesian

MAP methods:
$ \text{maximize}_{w} \; \log L(w) + \log p(w) $
Now the penalty term is basically a Bayesian prior. L2 regularization corresponds to a Gaussian prior with mean zero. L1 regularization corresponds to a Laplacian prior with mean zero.
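To check the L2 case, take a zero-mean Gaussian prior $p(w) \propto \exp\left( -\frac{1}{2\sigma^2} \sum_{j=1}^d w_j^2 \right)$, so that

$$ \log p(w) = -\frac{1}{2\sigma^2} \sum_{j=1}^d w_j^2 + \text{const} $$

and maximizing $\log L(w) + \log p(w)$ is the same as maximizing $\log L(w) - \lambda \sum_{j=1}^d w_j^2$ with $\lambda = \frac{1}{2\sigma^2}$. Similarly, a zero-mean Laplacian prior $p(w) \propto \exp\left( -\frac{1}{b} \sum_{j=1}^d |w_j| \right)$ gives the L1 penalty with $\lambda = \frac{1}{b}$.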

These methods don't average over the weights (they use a single point estimate), which can be beneficial for interpreting the weights.

Logistic Regression Classification

Say we have a binary classification problem with $y_i \in \{0,1\}$. A classifier can be trained using regression techniques with MSE as the loss function, since the regression then estimates $E[y|x]$:

$$ E[y|x] = \sum_y y p(y|x) = 1 * p(y=1|x) + 0 * p(y=0|x) = p(y=1|x) $$

1-Dimensional Example

$$p(c=1|x) = \frac{1}{1 + e^{-(w x + w_0)}}$$

As $x \to \infty$, $p(c=1|x) \to 1$.
As $x \to -\infty$, $p(c=1|x) \to 0$ (assuming $w > 0$).
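The decision threshold sits where the two classes are equally likely:

$$ p(c=1|x) = \tfrac{1}{2} \;\iff\; w x + w_0 = 0 \;\iff\; x = -\frac{w_0}{w} $$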

Multiclass Logistic Regression

K classes, where $c_i \in \{1,2,\dots,K\}$.
$p(c=k|x,w) = \frac{e^{w_k^t x}}{\sum_{m=1}^K e^{w_m^t x}}$

Parameters: $K$ weight vectors $w_k$, one per class, each of dimension $d$.

Learning algorithm: a straightforward extension of the binary case, but with additional subscripts.

This works better than trying to independently learn $O(K^2)$ pairwise decision boundaries between the classes.
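A minimal sketch of the softmax probabilities (my own illustration; the max-subtraction is a standard numerical-stability trick, not something from the notes):

```python
import numpy as np

def softmax_probs(X, W):
    """p(c=k | x, w) = exp(w_k^t x) / sum_m exp(w_m^t x).

    X : (N, d) inputs; W : (d, K) matrix whose columns are the K weight vectors w_k.
    """
    scores = X @ W                                  # (N, K) matrix of w_k^t x
    scores -= scores.max(axis=1, keepdims=True)     # shift for numerical stability
    e = np.exp(scores)
    return e / e.sum(axis=1, keepdims=True)         # each row sums to 1
```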