# Michael Data

See Notes on Logistic Regression by Charles Elkan.

##### Logistic Regression

Considered discriminative: as opposed to generative models, which model $p(x \mid y)$, logistic regression computes $p(y \mid x)$ directly.

Very widely used. The linear function is the core of the probability model, and the logistic function is a transformation that enforces the constraints of a valid probability.

Logistic Model: $p(y = 1 \mid x) = \sigma(z)$, where $\sigma(z) = \frac{1}{1 + e^{-z}}$ and $z = \alpha + \sum_j \beta_j x_j$ with weights $\beta_j$.
The argument $z$ can be interpreted as a linear weighted sum of the inputs.

This defines a linear decision boundary. The model can also be written as $\log \frac{p}{1 - p} = \alpha + \sum_j \beta_j x_j$, where the left-hand side is known as the log-odds function. It converts the result from the range $(0, 1)$ to the range $(-\infty, +\infty)$.
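A minimal sketch of the two views of the model: the sigmoid maps the unbounded linear score to a probability, and the log-odds inverts it (the function names here are my own, not from the notes):

```python
import numpy as np

def sigmoid(z):
    """Logistic function: maps (-inf, +inf) to (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def log_odds(p):
    """Inverse of the sigmoid: maps (0, 1) back to (-inf, +inf)."""
    return np.log(p / (1.0 - p))

z = 0.7
p = sigmoid(z)
# log_odds(sigmoid(z)) recovers z (up to floating-point error)
assert abs(log_odds(p) - z) < 1e-9
```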

This is the same general form as a Gaussian classification model using equal covariances. Unlike the Gaussian model, which assumes a model for each class, the weights of the logistic model are unconstrained and may move freely. A perceptron classification model also uses a linear decision function, but with a single threshold. In these ways, the logistic classifier is a generalization of perceptrons and Gaussian classifiers.

#### Log-Loss

The “log-loss function” assigns to each example the negative log-probability of its true class: $\text{log-loss}(y, p) = -\left[\, y \log p + (1 - y) \log (1 - p) \,\right]$. This is considered a more principled loss function than MSE, but doesn't always perform differently in practice.

##### Learning the Weights

Let $p_i = \sigma(\beta \cdot x_i)$. (The intercept $\alpha$ has been absorbed into the $x$s by adding a constant feature $x_0 = 1$.)

Likelihood: $L(\beta) = \prod_i p_i^{y_i} (1 - p_i)^{1 - y_i}$

Log-Likelihood: $\mathrm{LL}(\beta) = \sum_i y_i \log p_i + (1 - y_i) \log (1 - p_i)$

Now the goal is to maximize this Log-Likelihood with respect to the weights. There is typically no closed-form solution for the weights. This function is concave, which means there is a single global maximum. Solve it using an iterative gradient-based search:

$\beta_j \leftarrow \beta_j + \eta \frac{\partial \mathrm{LL}}{\partial \beta_j} = \beta_j + \eta \sum_i (y_i - p_i)\, x_{ij}$
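The gradient-based search above can be sketched as batch gradient ascent on the log-likelihood (the data, step size, and iteration count here are illustrative assumptions, not from the notes):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_logistic(X, y, lr=0.1, n_iters=1000):
    """Maximize the log-likelihood by batch gradient ascent.
    X: (n, d) design matrix with a constant column for the intercept.
    y: (n,) labels in {0, 1}."""
    beta = np.zeros(X.shape[1])
    for _ in range(n_iters):
        p = sigmoid(X @ beta)
        grad = X.T @ (y - p)        # gradient of the log-likelihood
        beta += lr * grad / len(y)  # averaged step for stability
    return beta

# Toy 1-D data; first column of ones is the absorbed intercept:
X = np.array([[1.0, -2.0], [1.0, -1.0], [1.0, 1.0], [1.0, 2.0]])
y = np.array([0.0, 0.0, 1.0, 1.0])
beta = fit_logistic(X, y)
preds = (sigmoid(X @ beta) > 0.5).astype(float)
```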

##### Non-linearity in the Inputs

Replace $x$ by mapping it to a non-linear feature space: the features are replaced by some function of them, $x \mapsto \phi(x)$. Logistic regression then learns a linear model in this space, which may be a non-linear model in the original feature space. This method increases the opportunity for overfitting. You typically wouldn't do this in high dimensions, but it can help when the data are not linearly separable.
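A small sketch of such a mapping, assuming a quadratic feature map $\phi(x) = (1, x, x^2)$ (my choice for illustration): points near the origin are not linearly separable from far points in $x$, but they are in the mapped space.

```python
import numpy as np

def quadratic_features(x):
    """Map a 1-D input to the feature vector (1, x, x^2)."""
    x = np.asarray(x, dtype=float)
    return np.column_stack([np.ones_like(x), x, x ** 2])

# Class 1 near the origin, class 0 far away -- not separable in x alone:
x = np.array([-3.0, -2.5, -0.5, 0.0, 0.5, 2.5, 3.0])
y = np.array([0, 0, 1, 1, 1, 0, 0])
Phi = quadratic_features(x)

# Hypothetical weights giving the score 1 - x^2, positive iff |x| < 1,
# i.e. a linear boundary in feature space, non-linear in x:
w = np.array([1.0, 0.0, -1.0])
preds = (Phi @ w > 0).astype(int)
```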

##### Regularization and Priors

Want to learn a general model which will be effective on future/unseen data.
In order to avoid overfitting to the training data, want to have some kind of penalty for unnecessarily large weights.

#### L2 Regularization

a.k.a. ridge regression.
Instead of maximizing the log-likelihood, maximize the penalized objective $\mathrm{LL}(\beta) - \lambda \sum_j \beta_j^2$. In this case, weights “have to justify their existence in the model”: the $\lambda$ term pressures weights to be as small as they can. $\lambda$ can be set through cross-validation.
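The penalized objective only changes the gradient by a shrinkage term. A sketch, assuming the same toy data as before and penalizing the intercept along with the other weights for simplicity (in practice the intercept is often left unpenalized):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_ridge_logistic(X, y, lam=0.1, lr=0.1, n_iters=2000):
    """Gradient ascent on mean-LL(beta) - lam * sum(beta_j^2)."""
    beta = np.zeros(X.shape[1])
    for _ in range(n_iters):
        p = sigmoid(X @ beta)
        # average LL gradient minus the gradient of the L2 penalty
        grad = X.T @ (y - p) / len(y) - 2.0 * lam * beta
        beta += lr * grad
    return beta

X = np.array([[1.0, -2.0], [1.0, -1.0], [1.0, 1.0], [1.0, 2.0]])
y = np.array([0.0, 0.0, 1.0, 1.0])
b_reg = fit_ridge_logistic(X, y, lam=1.0)
b_unreg = fit_ridge_logistic(X, y, lam=0.0)
# Regularization shrinks the weights toward zero.
```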

#### L1 Regularization

a.k.a. the “Lasso method”: maximize $\mathrm{LL}(\beta) - \lambda \sum_j |\beta_j|$.
More inclined to drive weights to zero than L2. By identifying a smaller set of predictors, it can aid in interpreting the weights.

#### Bayesian

MAP methods: now the penalty term is basically a Bayesian prior on the weights. L2 regularization corresponds to a Gaussian prior with mean zero; L1 regularization corresponds to a Laplacian prior with mean zero.
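The correspondence can be sketched by maximizing the log-posterior instead of the log-likelihood (assuming an i.i.d. prior over the weights; $\sigma^2$ and $b$ below denote the prior's scale parameters):

```latex
\hat{\beta}_{\text{MAP}}
  = \arg\max_{\beta}\; \log p(y \mid X, \beta) + \log p(\beta)
```

With a Gaussian prior $p(\beta_j) \propto e^{-\beta_j^2 / 2\sigma^2}$, the term $\log p(\beta)$ contributes the penalty $-\frac{1}{2\sigma^2} \sum_j \beta_j^2$ (L2 with $\lambda = \frac{1}{2\sigma^2}$); with a Laplacian prior $p(\beta_j) \propto e^{-|\beta_j| / b}$, it contributes $-\frac{1}{b} \sum_j |\beta_j|$ (L1 with $\lambda = \frac{1}{b}$).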

These methods don't average over weights, which could be beneficial for interpreting the weights.

##### Logistic Regression Classification

Say we have a binary classification problem: $y \in \{0, 1\}$. Can train a classifier using regression techniques and MSE as the loss function.

#### 1-Dimensional Example

As $\beta x \to +\infty$, $\sigma(\beta x) \to 1$.
As $\beta x \to -\infty$, $\sigma(\beta x) \to 0$.

#### Multiclass Logistic Regression

K classes, where $y \in \{1, \dots, K\}$. Parameters: $K$ weight vectors $\beta_1, \dots, \beta_K$, each of dimension $d$. Class probabilities come from the softmax function: $p(y = k \mid x) = \frac{e^{\beta_k \cdot x}}{\sum_{k'=1}^{K} e^{\beta_{k'} \cdot x}}$.

Learning algorithm: straightforward extensions of the binary case, but now there are additional subscripts.

This works better than trying to independently learn boundaries between each pair of classes.
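A minimal sketch of the multiclass prediction step, assuming hypothetical weight vectors for $K = 3$ classes (the weights and input here are made up for illustration):

```python
import numpy as np

def softmax(Z):
    """Row-wise softmax: turns K scores into K probabilities."""
    Z = Z - Z.max(axis=1, keepdims=True)  # shift for numerical stability
    e = np.exp(Z)
    return e / e.sum(axis=1, keepdims=True)

# Hypothetical K=3 weight vectors (one row per class) for inputs with an
# intercept feature plus two real features:
W = np.array([[ 0.5,  1.0, -1.0],
              [ 0.0, -1.0,  1.0],
              [-0.5,  0.0,  0.0]])
x = np.array([1.0, 2.0, -1.0])   # intercept feature first

p = softmax((W @ x).reshape(1, -1))[0]  # probabilities over the K classes
pred = int(np.argmax(p))                # predicted class index
```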