**Predictive Densities / Distributions**

This is a model for predicting future observations, as opposed to the classification functions in predictive modeling.

Consider $K$ possible models $M_1, \dots, M_K$, each with its own parameters $\theta_k$.

Because the models have different numbers of parameters, we can't use the training-data likelihood (e.g., maximum likelihood) to choose among them: the more complex models have more degrees of freedom and can always fit the training data at least as well.

The marginal likelihood $p(D \mid M_k) = \int p(D \mid \theta_k, M_k)\, p(\theta_k \mid M_k)\, d\theta_k$ often cannot be computed in closed form, so it must be approximated, and the approximations can be unreliable. This makes it difficult to use in practice.
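As a concrete illustration (a hypothetical example, not from the notes), the Beta–Bernoulli coin-flip model is one of the rare cases where the marginal likelihood *is* available in closed form, which lets us compare it against a naive Monte Carlo approximation:

```python
import math
import numpy as np

# Hypothetical example: Bernoulli data with a Beta(1, 1) prior on the
# heads probability theta. The marginal likelihood is closed-form here:
#   p(D) = B(1 + heads, 1 + tails) / B(1, 1)

heads, tails = 14, 6
n = heads + tails

# Exact marginal likelihood via log-Beta (B(1, 1) = 1).
log_exact = (math.lgamma(1 + heads) + math.lgamma(1 + tails)
             - math.lgamma(2 + n))
exact = math.exp(log_exact)

# Naive Monte Carlo: average the likelihood over draws from the prior.
rng = np.random.default_rng(0)
theta = rng.uniform(0.0, 1.0, size=100_000)   # Beta(1, 1) = Uniform(0, 1)
mc = np.mean(theta**heads * (1 - theta)**tails)

print(exact, mc)
```

In this one-dimensional problem the Monte Carlo estimate is close to the exact value, but in higher dimensions prior samples rarely land in the high-likelihood region, and the estimator's variance blows up — the unreliability the notes refer to.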

e.g. **Bayesian Information Criterion** (**BIC**): maximize the log-likelihood of the training data under the model, minus a penalty term for complexity.

Objective function:

$$\mathrm{BIC}_k = \log p(D \mid \hat\theta_k, M_k) - \frac{d_k}{2} \log N,$$

where $\hat\theta_k$ is the maximum-likelihood estimate, $d_k$ is the number of free parameters of model $k$, and $N$ is the number of training examples.

This is a good approximation to the marginal likelihood for certain types of models, but in general it is only a heuristic.

Select the model that maximizes $\mathrm{BIC}_k$ over $k$. The penalty can be $\frac{d_k}{2} \log N$ (BIC) or $d_k$ (AIC), etc.
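A minimal sketch of BIC-based selection, on a hypothetical polynomial-regression problem (the data, degrees, and the convention of counting the noise variance as a parameter are assumptions for illustration):

```python
import numpy as np

# Hypothetical example: choose a polynomial degree by BIC. Data are
# simulated from a quadratic; higher-degree models fit the training
# data at least as well, but BIC's penalty offsets the gain.

rng = np.random.default_rng(1)
N = 100
x = rng.uniform(-2.0, 2.0, size=N)
y = 1.0 + 2.0 * x - 1.5 * x**2 + rng.normal(0.0, 0.5, size=N)

def bic(degree):
    coeffs = np.polyfit(x, y, degree)        # maximum-likelihood fit
    resid = y - np.polyval(coeffs, x)
    sigma2 = np.mean(resid**2)               # MLE of the noise variance
    loglik = -N / 2 * (np.log(2 * np.pi * sigma2) + 1)
    d = degree + 2                           # coefficients + noise variance
    return loglik - d / 2 * np.log(N)        # penalized log-likelihood

scores = {k: bic(k) for k in range(1, 6)}
best = max(scores, key=scores.get)
print(scores, best)
```

The unpenalized log-likelihood never decreases as the degree grows, which is exactly why the penalty is needed to select over $k$.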

Can also use an average over the parameters with respect to the posterior $p(\theta \mid D)$, known as the **Bayesian predictive density**. This has the form $p(x_{\text{new}} \mid D) = \int p(x_{\text{new}} \mid \theta)\, p(\theta \mid D)\, d\theta$.
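For conjugate models this integral is tractable. A hypothetical Beta–Bernoulli sketch (the prior and counts are assumptions for illustration), where the predictive probability collapses to the posterior mean of $\theta$:

```python
import numpy as np

# Hypothetical conjugate example: Bernoulli data with a Beta(a, b)
# prior. The posterior is Beta(a + heads, b + tails), and the
# predictive integral has a closed form:
#   p(x_new = 1 | D) = E[theta | D] = (a + heads) / (a + b + n)

a, b = 1.0, 1.0
heads, tails = 14, 6
n = heads + tails

predictive = (a + heads) / (a + b + n)       # closed-form integral

# Sanity check: Monte Carlo average over the posterior p(theta | D).
rng = np.random.default_rng(0)
theta = rng.beta(a + heads, b + tails, size=100_000)
mc = np.mean(theta)                          # average of p(x_new = 1 | theta)

print(predictive, mc)
```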

“How well are we predicting new data?” This criterion is widely used in text prediction. It avoids the difficulties of computing the marginal likelihood and the heuristic assumptions above. The issue here is having enough data to support it, since some data must be held out for evaluation.
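A hypothetical text-flavored sketch of held-out evaluation (the strings and the character-level model are assumptions for illustration): score a model by its average log predictive density on held-out text. The plug-in MLE assigns probability zero to unseen symbols, while the Bayesian predictive under a uniform Dirichlet prior (equivalent to add-one smoothing) stays finite:

```python
import math
from collections import Counter

# Hypothetical example: character-level models scored on held-out text.
train = "the cat sat on the mat"
heldout = "the dog sat"
vocab = sorted(set(train + heldout))
counts = Counter(train)
n = len(train)
V = len(vocab)

def avg_log_pred(prob):
    # Average log predictive density over the held-out characters.
    return sum(math.log(prob(c)) for c in heldout) / len(heldout)

# Bayesian predictive under a uniform Dirichlet prior:
#   p(c | D) = (count(c) + 1) / (n + V)
bayes = avg_log_pred(lambda c: (counts[c] + 1) / (n + V))

# Plug-in MLE: 'd' never appears in training, so it gets probability 0
# and the held-out log predictive would be -infinity.
mle_probs = {c: counts[c] / n for c in vocab}
print(bayes)
print(mle_probs.get("d", 0.0))
```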

This is not a “fully Bayesian” technique, because we aren't computing a posterior or marginal likelihood over models, and we are not learning from the test data.