# Michael Data

##### Maximum Likelihood Estimation

MLE is a form of learning from data based on maximizing the likelihood function. It can be done in closed form for simple models, but in general requires numerical optimization (e.g. gradient methods).

What does it have to do with Gibbs Sampling?

##### Likelihood

Consider a model with parameters $\theta$ and a likelihood term of the form $p(x \mid \theta)$, and observations in a data set $D = \{x_1, \ldots, x_N\}$ where the $x_i$ are binary variables.

The likelihood is the probability of the observed data, viewed as a function of the parameters. It can be written $L(\theta)$ or $p(D \mid \theta)$.

The basic assumption of MLE is that when $L(\theta_1) > L(\theta_2)$, the model defined by $\theta_1$ is probably more accurate. Choosing the parameter that maximizes the likelihood is maximum likelihood estimation.

Because the likelihood is a product of probabilities, it tends to be very small. For numerical and analytical reasons, it is common to work with the log-likelihood. Max/min of the log-likelihood produces the same result as max/min of the likelihood.
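A quick numerical illustration (a hypothetical sketch, not from the original notes) of the underflow point: a product of many probabilities vanishes in floating point, while the sum of logs stays well-behaved.

```python
import numpy as np

# 2000 observations, each with probability 0.5: the raw likelihood underflows.
probs = np.full(2000, 0.5)
likelihood = np.prod(probs)              # 0.5**2000 ~ 1e-602 underflows to 0.0
log_likelihood = np.sum(np.log(probs))   # -2000 * log(2), perfectly representable

print(likelihood, log_likelihood)
```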

##### Data Notation

$D = \{x_1, \ldots, x_N\}$ : data set where each $x_i$ can take binary, categorical, or real values of a random variable. Each $x_i$ is of dimension $d$, so there are $N$ samples of length $d$ in an $N \times d$ matrix of data.

* How are the samples related? Independence?
* How are the data features related?

$\hat{\theta} = \arg\max_\theta L(\theta)$ is the maximum likelihood estimate of $\theta$.

#### Coin Tossing

Can assume a conditional independence model, with parameter $\theta = p(\text{heads})$.

* The $x_i$ are mutually independent conditioned on $\theta$.
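A minimal sketch of the coin-tossing MLE on simulated flips (the simulation values are illustrative assumptions): setting the derivative of the log-likelihood $n_H \log\theta + n_T \log(1-\theta)$ to zero gives $\hat{\theta} = n_H / N$, the observed fraction of heads.

```python
import numpy as np

rng = np.random.default_rng(0)
true_theta = 0.7                          # assumed "true" coin bias for the demo
x = rng.random(10_000) < true_theta       # simulated flips (True = heads)

# Closed-form MLE: fraction of heads in the data.
theta_hat = x.mean()
print(theta_hat)
```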

#### Univariate Gaussian Example

Univariate Gaussian distribution, parameterized by $\theta = (\mu, \sigma^2)$.

### Known Variance

Say $\sigma^2$ is known. Want to maximize $L(\mu)$:

Take logs and drop constant terms that don't depend on $\mu$; up to constants the log-likelihood is $-\frac{1}{2\sigma^2}\sum_{i=1}^N (x_i - \mu)^2$.
Maximizing the log-likelihood is, in this case, equivalent to minimizing the squared error of a model over the data. Setting the derivative to zero gives $\hat{\mu} = \frac{1}{N}\sum_{i=1}^N x_i$: the maximum likelihood estimate is the observed mean.
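A sketch checking this numerically on simulated data (the distribution parameters are illustrative assumptions): the sample mean attains the largest log-likelihood among candidate values of $\mu$.

```python
import numpy as np

rng = np.random.default_rng(1)
sigma = 2.0                                   # known standard deviation
x = rng.normal(loc=5.0, scale=sigma, size=5_000)

def log_lik(mu):
    # Gaussian log-likelihood up to the constant -(N/2) * log(2*pi*sigma^2)
    return -np.sum((x - mu) ** 2) / (2 * sigma ** 2)

mu_hat = x.mean()                             # closed-form MLE
print(mu_hat, log_lik(mu_hat))
```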

### Unknown Variance

Say $\sigma^2$ is unknown. Now $\theta$ includes the mean and covariance.
$x$ is a $d$-dimensional vector. Assume $x \sim \mathcal{N}(\mu, \Sigma)$, where $\mu, \Sigma$ are unknown.

* Data samples are independent.
* Features correlated according to $\Sigma$.

In this case it is easier to work with the log-likelihood.
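The resulting closed-form MLE is the sample mean $\hat{\mu} = \frac{1}{N}\sum_i x_i$ and the sample covariance $\hat{\Sigma} = \frac{1}{N}\sum_i (x_i - \hat{\mu})(x_i - \hat{\mu})^T$ (normalized by $N$, not $N-1$). A sketch on simulated data (the generating parameters are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(2)
mu_true = np.array([1.0, -2.0])
cov_true = np.array([[2.0, 0.6], [0.6, 1.0]])
X = rng.multivariate_normal(mu_true, cov_true, size=20_000)  # N x d data matrix

# MLE: sample mean, and covariance normalized by N (not N-1).
mu_hat = X.mean(axis=0)
centered = X - mu_hat
cov_hat = centered.T @ centered / len(X)
print(mu_hat, cov_hat)
```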

#### Multinomial Example

Multinomial model:

* $x \in \{1, \ldots, K\}$
* where $p(x = k) = \theta_k$ for $k = 1, \ldots, K$
* and $\sum_{k=1}^K \theta_k = 1$

Sufficient statistics: $n_k$ = the number of times result $k$ appears in the data.

Plug this value of $n_k$ into the sum $\sum_{k=1}^K n_k \log \theta_k$; maximizing subject to $\sum_k \theta_k = 1$ (e.g. with a Lagrange multiplier) gives $\hat{\theta}_k = n_k / N$.
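A sketch of the multinomial MLE on simulated draws (the true probabilities below are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(3)
K = 4
theta_true = np.array([0.1, 0.2, 0.3, 0.4])
x = rng.choice(K, size=50_000, p=theta_true)

# Sufficient statistics: counts n_k of each outcome.
n = np.bincount(x, minlength=K)
# Maximizing sum_k n_k log(theta_k) subject to sum_k theta_k = 1
# gives theta_hat_k = n_k / N.
theta_hat = n / n.sum()
print(theta_hat)
```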

#### Uniform Example

$x$ is uniformly distributed with lower limit $a$ and upper limit $b$, where $a < b$.

$p(x \mid a, b) = \frac{1}{b-a}$ for $a \le x \le b$, else $0$.

Given data set $D$, we know that $a \le \min_i x_i$ and $b \ge \max_i x_i$.

The likelihood $\left(\frac{1}{b-a}\right)^N$ increases as $a$ increases (for fixed $b$). Because $a$ is bounded above by $\min_i x_i$, the maximum likelihood estimator for $a$ is then $\hat{a} = \min_i x_i$, and similarly $\hat{b} = \max_i x_i$.
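A sketch of the uniform MLE on simulated data (the interval endpoints are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(4)
a_true, b_true = 2.0, 7.0
x = rng.uniform(a_true, b_true, size=10_000)

# The likelihood (1/(b-a))^N grows as the interval shrinks, but the interval
# must contain every observation, so the MLE clamps to the data range.
a_hat, b_hat = x.min(), x.max()
print(a_hat, b_hat)
```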

##### Naive Bayes MLE Classification Example

$x$ = $d$-dimensional feature vector
$c$ = class variable (categorical); $c \in \{1, \ldots, K\}$

Training data: $D = \{(x_i, c_i)\}$ where $x_i \in \mathbb{R}^d$ and $c_i \in \{1, \ldots, K\}$, for $i = 1, \ldots, N$.

Classification problem: Learn or estimate $p(c \mid x)$ for classification of new $x$.

Assume that the $(x_i, c_i)$ pairs are conditionally independent given model parameters $\theta$.

By separating the log-likelihood into multiple terms, each dependent on one unknown parameter, each term can be maximized or minimized independently.

Solution for $p(c = k)$ will be of the form $\hat{p}(c = k) = n_k / N$ where $n_k$ = number of training examples in class $k$.

Generalizations:
For $p(x \mid c)$, for Naive Bayes we assume $p(x \mid c = k) = \prod_{j=1}^d p(x_j \mid c = k)$. E.g. all $d$ densities could be Gaussian; this is the product of 1-dimensional density functions.
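A minimal Gaussian naive Bayes sketch on synthetic two-class data (all data-generating values are assumptions for illustration): the class prior, and each per-class per-feature mean and variance, are fit by the 1-d MLEs above, then combined under the independence assumption.

```python
import numpy as np

rng = np.random.default_rng(5)
# Hypothetical synthetic data: two classes, two Gaussian features each.
X0 = rng.normal([0.0, 0.0], 1.0, size=(500, 2))
X1 = rng.normal([3.0, 3.0], 1.0, size=(500, 2))
X = np.vstack([X0, X1])
c = np.array([0] * 500 + [1] * 500)

# MLE for the class prior: n_k / N.
priors = np.array([(c == k).mean() for k in (0, 1)])
# MLE per class and per feature: 1-d Gaussian mean and variance.
means = np.array([X[c == k].mean(axis=0) for k in (0, 1)])
vars_ = np.array([X[c == k].var(axis=0) for k in (0, 1)])

def predict(x):
    # log p(c=k) + sum_j log p(x_j | c=k) under the independence assumption
    log_post = np.log(priors) - 0.5 * np.sum(
        np.log(2 * np.pi * vars_) + (x - means) ** 2 / vars_, axis=1)
    return int(np.argmax(log_post))

print(predict(np.array([0.2, -0.1])), predict(np.array([2.9, 3.2])))
```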

##### Theoretical Properties of Maximum Likelihood

Let $q(x)$ be the data-generating distribution, a.k.a. data-generating function, and let data $D$ be independent and identically distributed observations from $q$.
Let $p(x \mid \theta)$ be our model, which might not include $q$.

The [Kullback-Leibler divergence](http://en.wikipedia.org/wiki/Kullback-Leibler_Distance) is $KL(q \| p) = \sum_x q(x) \log \frac{q(x)}{p(x \mid \theta)} \ge 0$, with equality iff $p(x \mid \theta) = q(x)$.

Let $\theta^* = \arg\max_\theta E_q[\log p(x \mid \theta)]$, the parameter minimizing $KL(q \| p)$.
Given a random (IID) sample set $D = \{x_1, \ldots, x_N\}$ from $q(x)$.
Define $\hat{\theta}_N = \arg\max_\theta \frac{1}{N} \sum_{i=1}^N \log p(x_i \mid \theta)$.
Then $\hat{\theta}_N \to \theta^*$ as $N \to \infty$.

In general, if we have a "model mis-specification" ($q$ is not in the model family), $p(x \mid \hat{\theta}_N)$ gets as close as possible in a KL sense to $q(x)$ as $N \to \infty$.
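A sketch of mis-specification (the mixture below is an illustrative choice of $q$ outside the single-Gaussian model family): the Gaussian MLE converges to the KL-closest Gaussian, which matches $q$'s mean and variance.

```python
import numpy as np

rng = np.random.default_rng(6)
# q: a two-component mixture, NOT in the single-Gaussian model family.
z = rng.random(100_000) < 0.5
x = np.where(z, rng.normal(-2, 1, 100_000), rng.normal(2, 1, 100_000))

# Gaussian MLE approaches the KL-closest Gaussian: mean 0 and
# variance 0.5*(1 + 4) + 0.5*(1 + 4) = 5 for this mixture.
mu_hat, var_hat = x.mean(), x.var()
print(mu_hat, var_hat)
```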

##### Issues

Max-likelihood is a point estimate given observed data. It puts very high emphasis on the data by ignoring priors. This can be a bad representation when the data is unreliable, and it may assign zero probability to future observations (e.g. the uniform model above gives probability zero to any $x$ outside $[\hat{a}, \hat{b}]$).