Michael Data

Machine Learning is a class of techniques for learning a model of data, frequently used in data mining and business intelligence.

Learning refers to learning the parameters of a model. It is also referred to as inference in Bayesian methods. Once a model has been learned, it might be used for predictive modeling or exploratory data analysis.

In a generative model, there is also usually a way to “go back” and generate or simulate data by fixing the parameters.

From a Bayesian framework, we would want to estimate something like $p(\theta \mid D) \propto p(D \mid \theta)\,p(\theta)$.
The prior probability over $\theta$ can be controversial, so, ignoring it or using a flat prior, the likelihood alone is typically used to carry out the parameter estimation.

Conditional Likelihood for data $D = \{(x_i, y_i)\}_{i=1}^{n}$, with feature vectors $x_i$ and labels $y_i$.
Can write the likelihood as $\prod_{i=1}^{n} p(y_i \mid x_i, \theta)$ using conditional independence. This is a conditional model, but it hasn't defined the actual probabilities yet. The form most commonly used in machine learning is the logistic function: $p(y = 1 \mid x, \theta) = \frac{1}{1 + e^{-\theta^T x}}$. It defines a linear decision boundary. This is logistic regression, an example of a generalized linear model.
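As a minimal sketch of the logistic model (the weight vector and input points here are made-up values, and only NumPy is assumed):

```python
import numpy as np

def sigmoid(z):
    """Logistic function: maps a real-valued score to (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def predict_proba(theta, x):
    """p(y = 1 | x, theta) for logistic regression."""
    return sigmoid(np.dot(theta, x))

# Hypothetical weights; the decision boundary is the hyperplane theta . x = 0.
theta = np.array([1.0, -2.0])
p_on_boundary = predict_proba(theta, np.array([2.0, 1.0]))  # theta . x = 0, so p = 0.5
p_positive = predict_proba(theta, np.array([4.0, 1.0]))     # theta . x = 2 > 0, so p > 0.5
```

Points on the boundary get probability exactly 0.5; the probability grows smoothly toward 1 on the positive side of the hyperplane.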

Generative Approach

<Merge with Classification>

Learn a model for the joint distribution $p(x, c)$ to predict the class label $c$ of a new vector $x$.
We can compute $p(c \mid x)$ using Bayes' rule: $p(c \mid x) = \frac{p(x \mid c)\,p(c)}{p(x)}$.

Key Points

• Learn a model for how the $x$s are distributed for each class
  • i.e. uses parameters $\theta_c$ for class $c$
  • requires a distributional/parametric assumption
  • e.g. a multivariate Gaussian model
• Also have to learn the class prior values $p(c)$, though this is easy
• The likelihood decomposes into separate optimization problems if the $\theta_c$s are unrelated
• This approach is theoretically optimal if:
  • the distributional assumptions are correct
  • we can learn the true/optimal parameters
• Predict using Bayes' rule

Gaussian Example

Need to learn $K$ sets of parameters $(\mu_c, \Sigma_c)$, for $c = 1, \dots, K$.
There is sensitivity to the Gaussian assumption. With $O(d^2)$ covariance parameters per class, this can scale poorly as $d$ increases. In practice, in high-dimensional problems you can assume that the $\Sigma_c$s are diagonal.
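A minimal sketch of this generative classifier under the diagonal-covariance simplification (the function names and the toy data are invented for illustration; only NumPy is assumed):

```python
import numpy as np

def fit(X, y):
    """Per-class mean, diagonal variance, and class prior p(c)."""
    params = {}
    for c in np.unique(y):
        Xc = X[y == c]
        # Small floor on the variance to avoid division by zero.
        params[c] = (Xc.mean(axis=0), Xc.var(axis=0) + 1e-9, len(Xc) / len(X))
    return params

def log_joint(x, mu, var, prior):
    """log p(x | c) + log p(c) for a diagonal Gaussian class model."""
    ll = -0.5 * np.sum(np.log(2 * np.pi * var) + (x - mu) ** 2 / var)
    return ll + np.log(prior)

def predict(params, x):
    """Bayes' rule: argmax_c p(x | c) p(c); p(x) is the same for every c."""
    return max(params, key=lambda c: log_joint(x, *params[c]))

# Two well-separated toy classes.
X = np.array([[0.0, 0.0], [0.2, 0.1], [5.0, 5.0], [5.2, 4.9]])
y = np.array([0, 0, 1, 1])
model = fit(X, y)
```

Note that the likelihood decomposes exactly as the key points say: each class's $(\mu_c, \Sigma_c)$ is estimated from that class's rows alone.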

Naive Bayes Example

In Naive Bayes Classification, you model $p(x \mid c) = \prod_{j=1}^{d} p(x_j \mid c)$, i.e. the features are assumed conditionally independent given the class.
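The factorization in a few lines for binary features (the word-occurrence interpretation and the per-feature probabilities below are hypothetical values for illustration):

```python
import numpy as np

def naive_bayes_log_lik(x, theta_c):
    """log p(x | c) = sum_j log p(x_j | c) for binary features x,
    where theta_c[j] = p(x_j = 1 | c), under the naive independence assumption."""
    x = np.asarray(x)
    theta_c = np.asarray(theta_c)
    return np.sum(x * np.log(theta_c) + (1 - x) * np.log(1 - theta_c))

# Hypothetical per-feature (e.g. word-occurrence) probabilities for one class.
theta_c = np.array([0.9, 0.1, 0.5])
ll = naive_bayes_log_lik([1, 0, 1], theta_c)  # log(0.9) + log(0.9) + log(0.5)
```

Working in log space avoids underflow when $d$ is large, since the product of many small probabilities quickly becomes numerically zero.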

Markov Model Example

With sequence data, one can do classification by learning a Markov model for each class.
• $x_t$ is the state at position $t$ in the sequence
• $a_{ij} = p(x_{t+1} = j \mid x_t = i)$ is the transition probability from state $i$ to state $j$.

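A sketch of sequence classification with one first-order Markov model per class (the two transition matrices, the class names, and the initial distribution are invented for illustration):

```python
import numpy as np

def seq_log_lik(seq, init, A):
    """log p(seq) under a first-order Markov model with initial
    distribution init and transition matrix A[i, j] = p(j | i)."""
    ll = np.log(init[seq[0]])
    for i, j in zip(seq[:-1], seq[1:]):
        ll += np.log(A[i, j])
    return ll

def classify(seq, models):
    """Pick the class whose Markov model gives the sequence the highest likelihood."""
    return max(models, key=lambda c: seq_log_lik(seq, *models[c]))

init = np.array([0.5, 0.5])
models = {
    "sticky": (init, np.array([[0.9, 0.1], [0.1, 0.9]])),  # tends to stay in a state
    "jumpy":  (init, np.array([[0.1, 0.9], [0.9, 0.1]])),  # tends to switch states
}
label = classify([0, 0, 0, 1, 1], models)  # mostly self-transitions
```

In practice each class's transition matrix would be estimated from the training sequences of that class (normalized transition counts, possibly smoothed).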
Bayesian Estimation

Treat the parameters $\theta$ as random variables.
In particular, before observing any data, there is the prior $p(\theta)$, a prior density for $\theta$.

As more data is gathered, the role of the prior is reduced; it is more influential when data is limited. $p(\theta \mid D) \propto p(D \mid \theta)\,p(\theta)$ is known as the posterior density. In comparison, Maximum Likelihood Estimation uses the likelihood $p(D \mid \theta)$ alone.

Gaussian Fish Example

$\mu$ = mean weight of fish in a lake, assuming a Gaussian likelihood. The prior is $\mu \sim N(\mu_0, \sigma_0^2)$, where $\mu_0$ is the mean of the prior and $\sigma_0^2$ is the variance of the prior.

Bernoulli Parameter Example

$x_i \in \{0, 1\}$, $p(x_i = 1 \mid \theta) = \theta$, $D = \{x_1, \dots, x_n\}$. With $r = \sum_i x_i$ successes in $n$ trials, the likelihood is $p(D \mid \theta) = \theta^{r} (1 - \theta)^{n - r}$. A common choice for a prior on $\theta$ is the Beta density: $p(\theta) = \mathrm{Beta}(\theta; \alpha, \beta) \propto \theta^{\alpha - 1} (1 - \theta)^{\beta - 1}$.

Properties of the Beta Prior and the Posterior

The posterior is also in Beta form due to conjugacy: $p(\theta \mid D) \propto \theta^{\alpha + r - 1} (1 - \theta)^{\beta + n - r - 1}$, i.e. $\mathrm{Beta}(\alpha + r,\, \beta + n - r)$. The MPE (posterior mode) effectively smooths the maximum likelihood estimate; the larger $\alpha$ and $\beta$ are, the more smoothing. $\alpha$ and $\beta$ are referred to as pseudo-counts because they effectively count the successes and trials of previous (imaginary) experiments: $\alpha$ is the pseudo-count for the number of successes, and $\alpha + \beta$ is the pseudo-count for the prior number of trials.

As $n \to \infty$, the MPE for $\theta$ approaches the maximum likelihood estimate $r/n$. The variance of $p(\theta \mid D)$ goes to $0$ as $n \to \infty$.
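The Beta-Bernoulli update takes only a few lines (the prior and data counts below are arbitrary numbers for illustration):

```python
# Beta-Bernoulli conjugate update: prior Beta(alpha, beta),
# data with r successes out of n trials.
alpha, beta = 2.0, 2.0  # pseudo-counts (arbitrary prior choice)
n, r = 10, 7

alpha_post = alpha + r       # posterior is Beta(alpha + r, beta + n - r)
beta_post = beta + n - r

post_mean = alpha_post / (alpha_post + beta_post)             # E[theta | D] = 9/14
post_mode = (alpha_post - 1) / (alpha_post + beta_post - 2)   # MPE = 8/12
mle = r / n                                                   # 0.7
# The MPE (0.667) sits between the MLE (0.7) and the prior mean (0.5):
# the pseudo-counts pull ("smooth") the maximum likelihood estimate toward the prior.
```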

Conjugacy

The posterior density is the same form as the prior when using a conjugate prior. In this case, the Beta is a conjugate prior to the Bernoulli.

Multinomial Parameter Example

$x_i \in \{1, \dots, K\}$, $p(x_i = k \mid \theta) = \theta_k$, $\sum_{k=1}^{K} \theta_k = 1$, $D = \{x_1, \dots, x_n\}$.
e.g. $x_i$ = occurrence of a word in a document, $K$ = number of unique words. $r_k$ = number of $x_i$'s taking value $k$ in $D$. The likelihood is $p(D \mid \theta) = \prod_{k=1}^{K} \theta_k^{r_k}$, and the maximum likelihood estimate is $\hat{\theta}_k = r_k / n$. The maximum likelihood estimate in this case may require smoothing if we don't want it assigning zero probabilities.

A conjugate prior to the multinomial is the Dirichlet distribution. This is a generalization of the Beta to higher dimensions. Its parameters $\alpha_1, \dots, \alpha_K$ are directly analogous to the Beta prior parameters in the Bernoulli case.
e.g. for text, the $\alpha_k$s could be proportional to the frequency of words in English text.

The posterior density will have the form $p(\theta \mid D) \propto \prod_{k=1}^{K} \theta_k^{\alpha_k + r_k - 1}$, i.e. $\mathrm{Dirichlet}(\alpha_1 + r_1, \dots, \alpha_K + r_K)$, where $r_k$ is the number of observations taking value $k$.

The prior mean is $E[\theta_k] = \alpha_k / \sum_{j} \alpha_j$.
The posterior mean is $E[\theta_k \mid D] = (\alpha_k + r_k) / (n + \sum_{j} \alpha_j)$.
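The Dirichlet-multinomial update as a short sketch (the uniform choice $\alpha_k = 1$, i.e. Laplace smoothing, and the counts are arbitrary):

```python
import numpy as np

alpha = np.array([1.0, 1.0, 1.0])  # Dirichlet pseudo-counts (Laplace smoothing)
r = np.array([2, 1, 0])            # observed counts r_k, so n = 3

prior_mean = alpha / alpha.sum()             # E[theta_k] = alpha_k / sum_j alpha_j
post_mean = (alpha + r) / (alpha + r).sum()  # E[theta_k | D] = (alpha_k + r_k) / (n + sum_j alpha_j)
mle = r / r.sum()                            # assigns zero probability to the unseen value
# post_mean = [1/2, 1/3, 1/6]: every category gets nonzero probability.
```

This is exactly the smoothing the multinomial MLE needs: the unseen value $k = 3$ has $\hat{\theta}_3 = 0$ under the MLE but positive posterior mean.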

Gaussian Parameter Example

Common in tracking problems: assume that the movement of the object has some Gaussian noise to it. $x_i \sim N(\theta, \sigma^2)$, $D = \{x_1, \dots, x_n\}$, assuming $\sigma^2$ is known, and the $x_i$s are conditionally independent given $\theta$.

Known Variance

The conjugate prior is Gaussian: $\theta \sim N(\mu_0, \sigma_0^2)$, where $\mu_0$ is the mean of the prior, and $\sigma_0^2$ represents uncertainty about the prior.

Posterior

The posterior is also Gaussian, $p(\theta \mid D) = N(\mu_n, \sigma_n^2)$, with
$\frac{1}{\sigma_n^2} = \frac{1}{\sigma_0^2} + \frac{n}{\sigma^2}$ and $\mu_n = \sigma_n^2 \left( \frac{\mu_0}{\sigma_0^2} + \frac{n \bar{x}}{\sigma^2} \right)$,
a precision-weighted average of the prior mean $\mu_0$ and the sample mean $\bar{x}$.
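A sketch of the known-variance Gaussian update (the prior, noise variance, and observations are invented numbers):

```python
import numpy as np

# Prior: theta ~ N(mu0, tau0sq).  Likelihood: x_i ~ N(theta, sigmasq), sigma^2 known.
mu0, tau0sq = 0.0, 1.0
sigmasq = 1.0
x = np.array([1.0, 1.0, 1.0, 1.0])
n, xbar = len(x), x.mean()

# Precisions (1/variance) add; the posterior mean is a precision-weighted average
# of the prior mean and the sample mean.
post_prec = 1.0 / tau0sq + n / sigmasq
post_var = 1.0 / post_prec
post_mean = (mu0 / tau0sq + n * xbar / sigmasq) * post_var
# post_mean = 0.8, post_var = 0.2: pulled from the prior mean 0 toward xbar = 1,
# and more certain than either the prior or a single observation.
```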