Gradient methods are commonly used for logistic regression, expectation maximization, and many other learning problems.

Given a parameterized error function, we want to learn the parameter settings that minimize the error.

First-order methods use only the first derivative (gradient) of the error function. With a large number of parameters, second-order methods are often slower in practice than first-order methods, because each second-order iteration is much more expensive.

$\underline{\theta}^{new} = \underline{\theta}^{current} - \alpha \nabla MSE(\underline{\theta}^{current})$, where $\alpha$ is the step size and $\nabla$ denotes the gradient with respect to $\underline{\theta}$.
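The update rule can be sketched for a linear model under mean-squared error. A minimal numpy sketch; the function name and toy data are illustrative, not from the notes:

```python
import numpy as np

def mse_gradient_descent(X, y, alpha=0.1, iters=2000):
    """Minimize MSE(theta) = mean((X @ theta - y)^2) by repeatedly
    stepping against the gradient: theta <- theta - alpha * grad."""
    theta = np.zeros(X.shape[1])
    for _ in range(iters):
        residual = X @ theta - y               # predictions minus targets
        grad = 2.0 * X.T @ residual / len(y)   # gradient of the MSE
        theta = theta - alpha * grad           # the update rule above
    return theta

# Noise-free line y = 1 + 2x: gradient descent should recover (1, 2).
x = np.linspace(0, 1, 50)
X = np.column_stack([np.ones_like(x), x])      # intercept column
y = 1.0 + 2.0 * x
theta = mse_gradient_descent(X, y)
```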

Example data: $Y_{j i} | x_j$ for $i = 1\dots n$.

Batch gradient descent: compute the gradient $\nabla_\theta E$ over all data points, then update the parameters.

Basic idea of stochastic gradient descent: compute the gradient on a small number of data points (**mini-batch**) at a time, or even one at a time.

- * $\underline{\theta}^{new} = \underline{\theta}^{old} - \epsilon (f_i-c_i)\underline{x}_i$
- * $\beta_j^{new} = \beta_j^{old} - \gamma(p_i - c_i)x_{ij}$ for $j = 1, \dots, d$
- * Can randomly select a data point at each iteration (with replacement)
- * Can shuffle all the data, then iterate through the shuffled list

This results in cheap but noisy updates, yet it can converge faster than batch gradient descent for large $N$. In practice, stochastic gradient descent is often faster than the batch method, and it may converge before even a single pass through the data.
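The per-example update $\beta_j \leftarrow \beta_j - \gamma(p_i - c_i)x_{ij}$ above, with shuffling each epoch, can be sketched for binary logistic regression. A minimal numpy sketch; `sgd_logistic`, the fixed step size, and the toy data are assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sgd_logistic(X, c, gamma=0.5, epochs=20):
    """One-example-at-a-time updates beta <- beta - gamma*(p_i - c_i)*x_i,
    visiting the shuffled data once per epoch."""
    beta = np.zeros(X.shape[1])
    for _ in range(epochs):
        for i in rng.permutation(len(c)):        # shuffle, then iterate
            p_i = sigmoid(X[i] @ beta)           # predicted P(c_i = 1)
            beta -= gamma * (p_i - c[i]) * X[i]  # cheap, noisy update
    return beta

# Toy separable labels: c = 1 exactly when x > 0.
x = rng.normal(size=200)
X = np.column_stack([np.ones_like(x), x])        # intercept column
c = (x > 0).astype(float)
beta = sgd_logistic(X, c)
```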

With sparse data, we can update based only on the features that are present in the example. So if the observations have on average $f$ non-zero features out of $d$ features total, an update costs $O(f)$ instead of $O(d)$.

- * To ensure convergence, decrease the step size on a schedule over time
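The sparse-update idea above can be sketched with dict-based feature vectors. A minimal sketch; `sparse_sgd_step` and the toy numbers are illustrative. Here `error` stands for the residual $(p_i - c_i)$ in the update, and a decaying `gamma` would implement the step-size schedule:

```python
from collections import defaultdict

def sparse_sgd_step(beta, example, error, gamma):
    """Update only the f features present in the example: O(f), not O(d).
    `example` maps feature index -> value; absent features have x_ij = 0,
    so their gradient terms vanish and beta_j is left untouched."""
    for j, x_ij in example.items():
        beta[j] -= gamma * error * x_ij
    return beta

beta = defaultdict(float)          # implicit zeros over all d features
example = {3: 1.0, 17: 2.0}        # only f = 2 non-zero features
beta = sparse_sgd_step(beta, example, error=0.5, gamma=0.1)
```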

Criterion for “best”: the convention is to use the least-squares method.

$SSE(\beta_0,\beta_1) = \sum_{i = 1}^n \left( Y_i - (\beta_0 + \beta_1 x_i)\right)^2 $

$(\hat \beta_0,\hat \beta_1) = \arg\min_{(\beta_0,\beta_1)} SSE(\beta_0,\beta_1) $

$(Y|x) \sim \mathcal{N}\left(\mu(x) , \sigma^2 (x)\right)$, with $\mu(x_i) = \beta_0 + \beta_1 x_i$ and constant variance $\sigma^2(x_i) = \sigma^2$.
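For simple linear regression the SSE minimizers have the standard closed form $\hat\beta_1 = \sum_i (x_i - \bar x)(y_i - \bar y) / \sum_i (x_i - \bar x)^2$ and $\hat\beta_0 = \bar y - \hat\beta_1 \bar x$. A minimal numpy sketch; the function name and toy data are illustrative:

```python
import numpy as np

def least_squares_fit(x, y):
    """Closed-form minimizers of SSE(beta0, beta1) for simple regression."""
    xbar, ybar = x.mean(), y.mean()
    beta1 = np.sum((x - xbar) * (y - ybar)) / np.sum((x - xbar) ** 2)
    beta0 = ybar - beta1 * xbar
    return beta0, beta1

# Exact line: SSE is driven to zero, recovering beta0 = -1, beta1 = 3.
x = np.array([0.0, 1.0, 2.0, 3.0])
y = -1.0 + 3.0 * x
beta0, beta1 = least_squares_fit(x, y)
```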

The point estimator for $\mu(x_1)$ is:

- * $\bar Y_1 = \hat \mu (x_1) = \frac{1}{n} \sum_{i=1}^n Y_{1 i} $

The actual estimate using real data is then

- * $\bar y_1 = \frac{1}{n} \sum_{i = 1}^n y_{1 i}$

$Y_i = \mu(x_i) + \epsilon_i = \beta_0 + \beta_1 x_i + \epsilon_i$

- * $\epsilon_i \overset{iid}{\sim} \mathcal{N} (0,\sigma^2)$
- * $Y_i|x_i \sim \mathcal{N}(\mu(x_i) = \beta_0 + \beta_1 x_i,\sigma^2)$

Newton-Raphson is a general optimization scheme and a classic way to fit logistic regression. It uses second-order derivatives in the iterative update.

Weight update: $w_{new} = w_{old} - H^{-1}_w \nabla l(w_{old})$ (derived from a second-order Taylor expansion of $l$ around $w_{old}$)

$H_w$ is the [http://en.wikipedia.org/wiki/Hessian_Matrix Hessian matrix] of second-order partial derivatives with respect to the weights, evaluated at $w$: $H_{i,j} = \frac{\partial^2}{\partial w_i \partial w_j} l(w)$

$\nabla l(w)$ is a $d \times 1$ vector of partial derivatives evaluated at $w$: $\nabla_i l(w) = \frac{\partial}{\partial w_i} l(w)$.

This is similar to **gradient ascent**: $w_{new} = w_{old} + \epsilon \nabla l(w)$. Instead of a fixed gradient-ascent step size, the inverse Hessian scales the step to jump straight to the optimum of the local quadratic approximation.

$\nabla l(w) = \sum_{i = 1}^N (c_i - f_i) x_i = X^T (c - f)$, where:

- * $f_i = f(x_i ; w) = p(c_i = 1 \mid x_i, w)$
- * $f$ is the column vector of our predictions $f_i$
- * $c$ is the column vector of true class labels, coded as zeros and ones (binary logistic regression)
- * $X$ is the $N \times d$ design matrix, a “stack” of transposed input vectors $x_i^T$

<latex>\begin{align*}
H &= -\sum_{i=1}^N x_i x_i^T f_i (1 - f_i) \\
&= - X^T R X
\end{align*}</latex>

where $f_i = f(x_i ; w)$ and $R$ is the diagonal matrix with $R_{ii} = f_i(1-f_i)$.

1. Initialize the weights
2. Compute $w_{new}$ using the Newton-Raphson weight update function, by computing the gradient and Hessian at the current weights:
   - $w_{new} = w_{old} - (X^T R X)^{-1} X^T (f - c)$
3. Check for convergence (are we at the peak?): is the change to the weights very small relative to the data scale? If not, return to step 2.
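The steps above can be sketched as a minimal numpy implementation; `newton_logistic` and the toy data are assumptions, not the notes' code. Noisy labels are used so the maximum-likelihood optimum stays finite:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def newton_logistic(X, c, tol=1e-8, max_iter=50):
    """Newton-Raphson for binary logistic regression:
    w_new = w_old - (X^T R X)^{-1} X^T (f - c), with R_ii = f_i (1 - f_i)."""
    w = np.zeros(X.shape[1])                 # 1. initialize the weights
    for _ in range(max_iter):
        f = sigmoid(X @ w)                   # current predictions
        R = np.diag(f * (1.0 - f))           # diagonal weight matrix
        step = np.linalg.solve(X.T @ R @ X, X.T @ (f - c))
        w = w - step                         # 2. Newton-Raphson update
        if np.max(np.abs(step)) < tol:       # 3. converged?
            break
    return w

# Toy data: labels depend on x plus noise, so the classes overlap.
rng = np.random.default_rng(1)
x = rng.normal(size=300)
X = np.column_stack([np.ones_like(x), x])    # intercept column
c = (x + rng.normal(scale=1.0, size=300) > 0).astype(float)
w = newton_logistic(X, c)
```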

For $d$ dimensions and $N$ data points, each iteration of the update costs $O(d^3 + Nd^2)$:

- * $O(d^3)$ for the matrix inversion
- * $O(Nd^2)$ for forming the Hessian $X^T R X$
- * $O(Nd)$ for the gradient

In high dimensions, this cost can be mitigated somewhat by regularizing: use $(X^T R X + \alpha I)^{-1}$ in place of $(X^T R X)^{-1}$.

This is also related to statistical linear regression (a.k.a. weighted least squares, a.k.a. iteratively reweighted least squares), in that a weighted linear regression problem is solved at each iteration: $w_{new} = w_{old} - (X^T R X)^{-1} X^T (f - c) = (X^T R X)^{-1} X^T R z$, where $z = X w_{old} - R^{-1}(f-c)$.
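The equivalence of the two forms of the update can be checked numerically. A minimal numpy sketch; the random toy data and variable names are illustrative:

```python
import numpy as np

rng = np.random.default_rng(2)
X = np.column_stack([np.ones(20), rng.normal(size=20)])
c = rng.integers(0, 2, size=20).astype(float)
w_old = rng.normal(size=2)

f = 1.0 / (1.0 + np.exp(-(X @ w_old)))   # current predictions
R = np.diag(f * (1.0 - f))               # R_ii = f_i (1 - f_i)

# Newton form: w_old - (X^T R X)^{-1} X^T (f - c)
newton = w_old - np.linalg.solve(X.T @ R @ X, X.T @ (f - c))

# IRLS form: weighted least squares against targets z
z = X @ w_old - np.linalg.solve(R, f - c)
irls = np.linalg.solve(X.T @ R @ X, X.T @ R @ z)
```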

See notes in [http://web4.cs.ucl.ac.uk/staff/D.Barber/pmwiki/pmwiki.php?n=Brml.Online the Barber text] and [http://cseweb.ucsd.edu/~elkan/250B/logreg.pdf notes by Charles Elkan]

It can also be useful to normalize features to have mean 0 and standard deviation 1.