Gradient methods are commonly used for expectation maximization, logistic regression, and other types of supervised learning.

Given a parameterized error function $E(\theta)$, we want to learn the setting of the parameters $\theta$ that minimizes the error.

First-order methods use only the first derivative (gradient) of the error function; second-order methods also use second derivatives. For a large number of parameters, second-order methods are much more expensive per iteration, so in practice first-order methods are often preferred even though they take more iterations.

The basic update is $\theta \leftarrow \theta - \eta \nabla E(\theta)$, where $\eta$ is a step size and $\nabla E(\theta)$ is the gradient of the error with respect to the parameters.

Example data: $(x_i, y_i)$ for $i = 1, \dots, N$.

**Batch gradient descent**: compute the gradient over all data points, then update the parameters.
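As a minimal sketch (assuming a squared-error objective, which matches the least-squares setting below), batch gradient descent looks like:

```python
import numpy as np

def batch_gradient_descent(X, y, eta=0.1, n_iters=1000):
    """Minimize E(theta) = ||X theta - y||^2 / (2N) by full-batch gradient descent."""
    N, d = X.shape
    theta = np.zeros(d)
    for _ in range(n_iters):
        grad = X.T @ (X @ theta - y) / N   # gradient over ALL data points
        theta -= eta * grad                # then update the parameters
    return theta

# Tiny example: data generated by y = 2x, so theta should approach 2.
X = np.array([[1.0], [2.0], [3.0]])
y = np.array([2.0, 4.0, 6.0])
theta = batch_gradient_descent(X, y)
```

The function and variable names here are illustrative, not from the original notes.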

Basic idea: compute the gradient on a small number of data points (**mini-batch**) at a time, or one at a time.

- * Update: $\theta \leftarrow \theta - \eta \nabla E_i(\theta)$
- * for a single data point (or mini-batch) $i$
- * Can randomly select a data point at each iteration (with replacement)
- * Can shuffle all the data, then iterate through the shuffled list

This results in cheap but noisy updates, and can converge faster than batch gradient descent for large $N$. In practice, stochastic gradient descent is often faster than the batch method, and may converge before even passing through the data points once.
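A minimal SGD sketch on the same squared-error objective, using the shuffle-then-iterate scheme (names are illustrative):

```python
import numpy as np

def sgd(X, y, eta=0.05, n_epochs=50, seed=0):
    """Stochastic gradient descent on squared error, one data point at a time."""
    rng = np.random.default_rng(seed)
    N, d = X.shape
    theta = np.zeros(d)
    for _ in range(n_epochs):
        for i in rng.permutation(N):                  # shuffle, then iterate
            grad_i = (X[i] @ theta - y[i]) * X[i]     # cheap, noisy single-point gradient
            theta -= eta * grad_i
    return theta

X = np.array([[1.0], [2.0], [3.0]])
y = np.array([2.0, 4.0, 6.0])   # exactly y = 2x, so theta should approach 2
theta = sgd(X, y)
```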

With sparse data, we can update based only on the features that are present in the example. So if the observations have on average $k$ non-zero features out of $d$ features total, an update becomes $O(k)$ instead of $O(d)$.
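One way to realize this (a sketch, assuming examples stored as `{feature_index: value}` dicts; the representation is a choice of this example, not prescribed by the notes):

```python
def sparse_sgd_step(theta, x_sparse, y_i, eta=0.1):
    """One SGD step on squared error where x_sparse is a dict {feature_index: value}.
    Cost is O(k) in the number k of non-zero features, not O(d)."""
    # The prediction touches only the features present in the example.
    pred = sum(theta[j] * v for j, v in x_sparse.items())
    err = pred - y_i
    for j, v in x_sparse.items():          # update only the non-zero coordinates
        theta[j] -= eta * err * v
    return theta

theta = [0.0] * 1000                       # d = 1000 total features
theta = sparse_sgd_step(theta, {3: 1.0, 42: 2.0}, y_i=5.0)  # k = 2 features touched
```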

- * Decrease step size on a schedule over time

Criterion for “best”: the convention is to use the “least squares” method.

Point estimator for $\theta$ is:

- * $\hat{\theta} = \arg\min_\theta \sum_{i=1}^N (y_i - \theta^\top x_i)^2$

The actual estimate using real data is then

- * $\hat{\theta} = (X^\top X)^{-1} X^\top y$

- * obtained by solving the normal equations
- * $X^\top X \,\theta = X^\top y$
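The closed-form least-squares estimate $\hat{\theta} = (X^\top X)^{-1} X^\top y$ can be computed directly (a minimal sketch; solving the normal equations is preferred over an explicit inverse for numerical stability):

```python
import numpy as np

def least_squares(X, y):
    """Least-squares point estimate: solve the normal equations X^T X theta = X^T y."""
    return np.linalg.solve(X.T @ X, X.T @ y)

X = np.array([[1.0, 1.0], [1.0, 2.0], [1.0, 3.0]])  # column of ones = intercept
y = np.array([3.0, 5.0, 7.0])                        # generated by y = 1 + 2x
theta_hat = least_squares(X, y)
```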

Newton-Raphson is a general optimization scheme, and a classic way to solve logistic regression. We want to use second-order derivatives in the iterative update.

Weight update: $w^{\text{new}} = w^{\text{old}} - H^{-1} g$ (derived by second-order Taylor expansion)

$H$ is the [http://en.wikipedia.org/wiki/Hessian_Matrix Hessian matrix] of second-order partial derivatives with respect to the weights, evaluated at $w^{\text{old}}$.

$g$ is a vector of partial derivatives evaluated at $w^{\text{old}}$: $g = \nabla E(w^{\text{old}})$.

This is similar to **gradient ascent**: $w^{\text{new}} = w^{\text{old}} + \eta \nabla \ell(w^{\text{old}})$. Instead of the gradient ascent step size, we're using the Hessian to jump straight toward an optimum.

- * $g = X^\top (\hat{y} - y)$ and $H = X^\top S X$, where $S = \mathrm{diag}\big(\hat{y}_i (1 - \hat{y}_i)\big)$
- * $y$ is the column-vector of true class labels in the form of zeros and ones (binary logistic regression)
- * $\hat{y}_i = \sigma(w^\top x_i)$ are our predictions
- * $X$ is a “stack” of transposed input vectors: row $i$ is $x_i^\top$.

- Initialize the weights
- Compute $w^{\text{new}}$ using the Newton-Raphson weight update function
- Computing the gradient and Hessian at the current weights

- Check for convergence (are we at the peak?)
- Is the change to the weights very small relative to the data scale?
- If not, return to step 2
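The steps above can be sketched as a minimal numpy implementation for binary logistic regression, assuming the gradient $X^\top(\hat{y} - y)$ and Hessian $X^\top S X$ given earlier (function names are illustrative):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def newton_logistic(X, y, n_iters=20, tol=1e-8):
    """Newton-Raphson for binary logistic regression:
    w <- w - H^{-1} g, with g = X^T (mu - y) and H = X^T S X, S = diag(mu (1 - mu))."""
    N, d = X.shape
    w = np.zeros(d)                        # 1. initialize the weights
    for _ in range(n_iters):
        mu = sigmoid(X @ w)                # predictions at current weights
        g = X.T @ (mu - y)                 # gradient at current weights
        S = mu * (1.0 - mu)
        H = X.T @ (S[:, None] * X)         # Hessian at current weights
        step = np.linalg.solve(H, g)
        w -= step                          # 2. Newton-Raphson weight update
        if np.max(np.abs(step)) < tol:     # 3. converged? (change in weights very small)
            break
    return w

# Tiny non-separable example with an intercept column (separable data would
# drive the weights to infinity, so the MLE here is finite).
X = np.column_stack([np.ones(5), np.array([-2.0, -1.0, 0.0, 1.0, 2.0])])
y = np.array([0.0, 1.0, 0.0, 1.0, 1.0])
w = newton_logistic(X, y)
```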

For $d$ dimensions on $N$ data points, each iteration of the update is $O(d^3 + Nd^2)$:

- * $O(d^3)$ for the matrix inversion
- * $O(Nd^2)$ to form the Hessian, and $O(Nd)$ for the gradient

In high dimensions, this can be mitigated somewhat by **regularizing**: use $E(w) + \frac{\lambda}{2}\lVert w \rVert^2$, which adds $\lambda I$ to the Hessian and keeps it well-conditioned.

This is also related to statistical linear regression (a.k.a. weighted least squares, a.k.a. iteratively reweighted least squares), in that a weighted linear regression problem is solved at each iteration.

$w^{\text{new}} = (X^\top S X)^{-1} X^\top S z$, where $z = X w^{\text{old}} + S^{-1}(y - \hat{y})$ is the working response.
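A sketch of one iteratively reweighted least squares step in this working-response form; by the algebra above it should coincide exactly with a single Newton-Raphson step (names are illustrative):

```python
import numpy as np

def irls_step(X, y, w):
    """One IRLS step for logistic regression: solve the weighted least squares
    problem (X^T S X) w_new = X^T S z with working response z = X w + S^{-1}(y - mu)."""
    mu = 1.0 / (1.0 + np.exp(-(X @ w)))
    S = mu * (1.0 - mu)                    # per-example weights
    z = X @ w + (y - mu) / S               # working response
    return np.linalg.solve(X.T @ (S[:, None] * X), X.T @ (S * z))

X = np.column_stack([np.ones(4), np.array([-1.0, 0.0, 1.0, 2.0])])
y = np.array([0.0, 1.0, 0.0, 1.0])
w0 = np.zeros(2)
w1 = irls_step(X, y, w0)
```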

See notes in [http://web4.cs.ucl.ac.uk/staff/D.Barber/pmwiki/pmwiki.php?n=Brml.Online the Barber text] and [http://cseweb.ucsd.edu/~elkan/250B/logreg.pdf notes by Charles Elkan]

It can also be useful to normalize features to have mean 0 and standard deviation 1.