MLE is a form of learning from data based on the likelihood function. It can be done in closed form for simple models, but in general requires numerical optimization (e.g. gradient methods).

What does it have to do with Gibbs Sampling?

Consider a model with parameters $\theta$ and a likelihood term of the form $p(D \mid \theta)$, with observations in data $D = \{x_1, \ldots, x_n\}$ where the $x_i$ are binary variables.

The **likelihood** is the probability of the observed data. Can be written $L(\theta)$ or $p(D \mid \theta)$.

The basic assumption of **MLE** is that when $p(D \mid \theta_1) > p(D \mid \theta_2)$, the model defined by $\theta_1$ is probably more accurate. Choosing the parameter $\hat{\theta} = \arg\max_\theta p(D \mid \theta)$ is **maximum likelihood** estimation.

Because the likelihood is a product of probabilities, it tends to be very small. For numerical and analytical reasons, it is common to work with the log-likelihood. Since the log is monotone increasing, max/min of the log-likelihood produces the same result as max/min of the likelihood.
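A quick numerical illustration (the Bernoulli probabilities here are made up, not from the notes): the raw product of many small probabilities underflows in double precision, while the log-likelihood stays perfectly usable.

```python
import math

# Hypothetical example: 2,000 observations, each with probability 0.3
# under the model. The raw likelihood underflows to 0.0 in double
# precision, while the log-likelihood is a moderate negative number.
probs = [0.3] * 2000

likelihood = 1.0
for p in probs:
    likelihood *= p          # product of many small numbers -> underflow

log_likelihood = sum(math.log(p) for p in probs)

print(likelihood)        # 0.0  (underflow)
print(log_likelihood)    # about -2407.9
```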

`D`

: data set $D = \{x^{(1)}, \ldots, x^{(n)}\}$ where each $x^{(i)}$ can take binary, categorical, or real-valued values of a random variable. Each $x^{(i)}$ is of dimension $d$, so there are $n$ samples of length $d$ in an $n \times d$ matrix of data.

- How are the samples (rows) related? Independence?
- How are the data features (columns) related?

$\hat{\theta}_{ML} = \arg\max_\theta p(D \mid \theta)$ is the maximum likelihood estimate of $\theta$.

Can assume a conditional independence model, with parameter $\theta$:

- The $x^{(i)}$ are mutually independent conditioned on $\theta$, so $p(D \mid \theta) = \prod_{i=1}^{n} p(x^{(i)} \mid \theta)$.

Univariate Gaussian distribution, parameterized by $\theta = (\mu, \sigma^2)$.

Say $\sigma^2$ is known. Want to maximize $p(D \mid \mu) = \prod_{i=1}^{n} \frac{1}{\sqrt{2\pi\sigma^2}} \exp\!\left(-\frac{(x_i - \mu)^2}{2\sigma^2}\right)$:

Take logs, drop **constant** terms that don't depend on $\mu$.

Maximizing the log-likelihood is, in this case, equivalent to minimizing the squared error $\sum_{i=1}^{n} (x_i - \mu)^2$ of a model over the data. Setting the derivative to zero gives $\hat{\mu}_{ML} = \frac{1}{n}\sum_{i=1}^{n} x_i$: the maximum likelihood estimate is the observed (sample) mean.
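A minimal sketch of this, using made-up data from a Gaussian with known $\sigma$: the sample mean maximizes the log-likelihood against nearby candidate values of $\mu$.

```python
import math
import random

random.seed(0)
# Hypothetical data: 500 draws from N(mu=2, sigma=1); sigma assumed known.
data = [random.gauss(2.0, 1.0) for _ in range(500)]
sigma = 1.0

def log_likelihood(mu):
    # Gaussian log-likelihood with known sigma; the constant term
    # -(n/2) * log(2*pi*sigma^2) is dropped since it doesn't depend on mu.
    return sum(-(x - mu) ** 2 / (2 * sigma ** 2) for x in data)

mu_hat = sum(data) / len(data)   # sample mean = MLE

# The sample mean beats nearby candidate values of mu:
for candidate in (mu_hat - 0.1, mu_hat + 0.1):
    assert log_likelihood(mu_hat) > log_likelihood(candidate)
print(round(mu_hat, 2))   # close to 2.0
```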

Say $\sigma^2$ is unknown. Now $\theta$ includes the mean and covariance.

Let $x$ be a $d$-dimensional vector. Assume $x \sim N(\mu, \Sigma)$, where $\mu$ and $\Sigma$ are unknown.

- Data samples are independent
- Features correlated according to $\Sigma$.

In this case it is easier to work with the log-likelihood.
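A sketch of the multivariate case on synthetic data (the true mean and covariance below are made up for illustration): the MLE is the sample mean and the $1/n$ sample covariance.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical ground truth for a 2-dimensional Gaussian.
true_mu = np.array([1.0, -1.0])
true_Sigma = np.array([[1.0, 0.5],
                       [0.5, 2.0]])
X = rng.multivariate_normal(true_mu, true_Sigma, size=1000)

# MLE for a multivariate Gaussian: sample mean and the (biased, 1/n)
# sample covariance.
mu_hat = X.mean(axis=0)
Sigma_hat = (X - mu_hat).T @ (X - mu_hat) / len(X)

print(mu_hat)      # close to [1, -1]
print(Sigma_hat)   # close to [[1, 0.5], [0.5, 2]]
```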

Multinomial model:

- $p(x = k \mid \theta) = \theta_k$,
- where $\theta_k \ge 0$ for $k = 1, \ldots, K$,
- and $\sum_{k=1}^{K} \theta_k = 1$.

Sufficient statistics: $n_k$ = the number of times result $k$ appears in the data.

Plug this value of $n_k$ into the sum $\log p(D \mid \theta) = \sum_{k=1}^{K} n_k \log \theta_k$; maximizing subject to $\sum_k \theta_k = 1$ gives $\hat{\theta}_k = n_k / n$.
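A minimal sketch with hypothetical categorical data, computing the sufficient statistics $n_k$ and the MLE $\hat{\theta}_k = n_k / n$:

```python
from collections import Counter

# Hypothetical data: categorical observations with K = 3 outcomes.
data = ["a", "b", "a", "c", "a", "b", "a", "a", "c", "b"]
n = len(data)

counts = Counter(data)                            # sufficient statistics n_k
theta_hat = {k: counts[k] / n for k in counts}    # MLE: n_k / n

print(theta_hat)   # {'a': 0.5, 'b': 0.3, 'c': 0.2}
```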

$x$ is uniformly distributed with lower limit $0$ and upper limit $\theta$, where $\theta > 0$:

$p(x \mid \theta) = 1/\theta$ for $0 \le x \le \theta$, and $p(x \mid \theta) = 0$ else.

Given data set $D = \{x_1, \ldots, x_n\}$, we know that $p(D \mid \theta) = 1/\theta^n$ and $\theta \ge \max_i x_i$.

This likelihood increases as $\theta$ decreases. Because $\theta$ is bounded below by $\max_i x_i$, the maximum likelihood estimator for a uniform distribution is then $\hat{\theta} = \max_i x_i$.
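A tiny sketch with made-up data assumed to come from Uniform$(0, \theta)$:

```python
# Hypothetical data assumed drawn from Uniform(0, theta).
data = [0.3, 2.7, 1.1, 0.9, 2.2]

# The likelihood (1/theta)^n decreases in theta, but theta must be at
# least max(data) for every point to have nonzero density, so the MLE
# is the sample maximum.
theta_hat = max(data)
print(theta_hat)   # 2.7
```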

$x$ = $d$-dimensional feature vector

$c$ = class variable (categorical); $c \in \{1, \ldots, m\}$

Training data: $D = \{(x^{(i)}, c^{(i)})\}$ where $x^{(i)} \in \mathbb{R}^d$ and $c^{(i)} \in \{1, \ldots, m\}$, for $i = 1, \ldots, n$.

Classification problem: Learn or estimate $p(c \mid x)$ for classification of new $x$.

Assume that the $(x^{(i)}, c^{(i)})$ pairs are conditionally independent given model parameters $\theta$.

By separating the log-likelihood into multiple terms, each dependent on one unknown variable, each term can be maximized or minimized independently.

Solution for the class prior $p(c = k)$ will be of the form $\hat{p}(c = k) = n_k / n$, where $n_k$ is the number of training examples with class $k$.

Generalizations:

For $p(x \mid c)$, for Naive Bayes we assume $p(x \mid c = k) = \prod_{j=1}^{d} p(x_j \mid c = k)$; e.g. all $d$ densities could be Gaussian, so each class-conditional density is the product of 1-dimensional density functions.
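A self-contained Gaussian Naive Bayes sketch on hypothetical toy data (the names `fit_gaussian_nb` and `predict` are invented here): class priors are the counts $n_k/n$, and each feature gets its own 1-D Gaussian per class.

```python
import math

def fit_gaussian_nb(X, y):
    """Fit class priors n_k/n and per-feature Gaussian parameters.

    Minimal sketch: each feature is modeled as an independent 1-D
    Gaussian within each class (the Naive Bayes assumption)."""
    params = {}
    n = len(y)
    for c in set(y):
        rows = [x for x, label in zip(X, y) if label == c]
        prior = len(rows) / n
        means = [sum(col) / len(rows) for col in zip(*rows)]
        vars_ = [sum((v - m) ** 2 for v in col) / len(rows)
                 for col, m in zip(zip(*rows), means)]
        params[c] = (prior, means, vars_)
    return params

def predict(params, x):
    # Pick the class maximizing log p(c) + sum_j log p(x_j | c).
    def score(c):
        prior, means, vars_ = params[c]
        s = math.log(prior)
        for xj, m, v in zip(x, means, vars_):
            s += -0.5 * math.log(2 * math.pi * v) - (xj - m) ** 2 / (2 * v)
        return s
    return max(params, key=score)

# Hypothetical toy data: two well-separated 2-D classes.
X = [[1.0, 1.2], [0.8, 1.0], [1.1, 0.9],
     [5.0, 5.2], [4.8, 5.1], [5.2, 4.9]]
y = [0, 0, 0, 1, 1, 1]
model = fit_gaussian_nb(X, y)
print(predict(model, [1.0, 1.0]))   # 0
print(predict(model, [5.0, 5.0]))   # 1
```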

Let $q(x)$ be the **data generating distribution**, a.k.a. the **data-generating function**, and let the data $D$ be independent and identically distributed observations from $q$.

Let $\{p(x \mid \theta)\}$ be our model family, which might not include $q$.

The [http://en.wikipedia.org/wiki/Kullback-Leibler_Distance Kullback-Leibler Divergence] $KL(q \,\|\, p_\theta) = \int q(x) \log \frac{q(x)}{p(x \mid \theta)} \, dx \ge 0$, with equality iff $p(x \mid \theta) = q(x)$.

Let $\theta^* = \arg\min_\theta KL(q \,\|\, p_\theta)$.

Given a random (IID) sample set $D = \{x_1, \ldots, x_n\}$ from $q(x)$.

Define $\hat{\theta}_n = \arg\max_\theta \frac{1}{n} \sum_{i=1}^{n} \log p(x_i \mid \theta)$, the maximum likelihood estimate from $n$ samples.

By the law of large numbers, $\frac{1}{n} \sum_i \log p(x_i \mid \theta) \to E_q[\log p(x \mid \theta)]$, so $\hat{\theta}_n \to \theta^*$ as $n \to \infty$.

In general, if we have a “model mis-specification” ($q$ not in the model family), $p(x \mid \hat{\theta}_n)$ gets as close as possible in a KL-sense to $q$ as $n \to \infty$.
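A sketch of mis-specification (the mixture and sample size below are made up for illustration): data drawn from a two-component mixture, fit with a single Gaussian. The MLE converges to the KL-closest Gaussian, whose mean and variance match those of $q$.

```python
import random

random.seed(0)
# Hypothetical mis-specified setting: q is an equal mixture of
# N(-2, 1) and N(2, 1), but our model family is a single Gaussian
# N(mu, sigma^2), which does not contain q.
data = [random.gauss(-2.0, 1.0) if random.random() < 0.5
        else random.gauss(2.0, 1.0)
        for _ in range(20000)]

# Gaussian MLE: sample mean and (1/n) sample variance. These converge
# to the moments of q: E_q[x] = 0 and Var_q[x] = 0.5*(1+4) + 0.5*(1+4) = 5.
mu_hat = sum(data) / len(data)
var_hat = sum((x - mu_hat) ** 2 for x in data) / len(data)

print(round(mu_hat, 2))    # about 0.0
print(round(var_hat, 2))   # about 5.0
```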

Max-likelihood is a **point estimate** given observed data. It puts very high emphasis on the data by dismissing priors. This can be a bad representation when the data is unreliable, and it may assign zero probability to outcomes not seen in the data, effectively excluding future observations.
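A small illustration of that drawback (hypothetical coin data): the multinomial MLE assigns probability zero to an outcome that was never observed.

```python
from collections import Counter

# Hypothetical small sample: a coin observed only 3 times, all heads.
data = ["heads", "heads", "heads"]
counts = Counter(data)          # Counter returns 0 for unseen keys
n = len(data)

theta_hat = {k: counts[k] / n for k in ["heads", "tails"]}
print(theta_hat["heads"])   # 1.0
print(theta_hat["tails"])   # 0.0 -- any future "tails" gets probability 0
```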