Say we have $K$ models $m_1, m_2, \dots, m_K$, where model $m_k$ has parameters $\theta_k$. For example, model 1 is exponential, model 2 is Gaussian, model 3 is a mixture of Gaussians (MOG), etc.

Choosing the maximum-likelihood model is a simple recipe for overfitting. Models with more parameters will tend to fit the training data more closely, so the issue of model complexity has to be addressed. One way to do so is to use some held-out test data to choose between models and their parameters.

Rather than just choosing a single most likely model, we want to compute the posterior over models, $P(m_k | D_{train})$.
$$P(m_k | D_{train}) \propto p(D_{train}|m_k)\,p(m_k)$$
In practice, the prior over models can be very controversial, so the prior tends to be kept pretty weak. As a result, the likelihood portion $p(D_{train}|m_k)$ tends to dominate. It is obtained by integrating out the model's parameters:
$$p(D_{train}|m_k) = \int p(D_{train},\theta|m_k)\, d\theta$$
$$ = \int p(D_{train}|\theta,m_k)\, p(\theta|m_k)\, d\theta$$
$$ = \int \text{Likelihood}(\theta) \cdot \text{Prior}(\theta)\, d\theta $$
This is called the **marginal likelihood**, and can be used to select a model by choosing the one with the largest marginal likelihood.
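
As a small illustration of selecting by marginal likelihood, the sketch below compares two hypothetical models for a sequence of coin flips: a fair coin with no free parameter, and a Bernoulli model with a Beta(1, 1) prior on the heads probability. The models, the function names, and the counts are assumptions chosen because both integrals have closed forms; they are not part of the notes above.

```python
import math

def log_beta(a, b):
    """Log of the Beta function B(a, b), via log-gamma for numerical stability."""
    return math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)

def log_marginal_fair(heads, tails):
    """Fair-coin model: theta is fixed at 0.5, so there is nothing to integrate."""
    return (heads + tails) * math.log(0.5)

def log_marginal_beta_bernoulli(heads, tails, a=1.0, b=1.0):
    """Bernoulli likelihood with a Beta(a, b) prior on theta.

    The integral of theta^heads * (1 - theta)^tails * Beta(theta; a, b)
    over theta equals B(a + heads, b + tails) / B(a, b).
    """
    return log_beta(a + heads, b + tails) - log_beta(a, b)

# Balanced data: the simpler fair-coin model has the larger marginal likelihood.
balanced = (log_marginal_fair(5, 5), log_marginal_beta_bernoulli(5, 5))

# Skewed data: the flexible model has the larger marginal likelihood.
skewed = (log_marginal_fair(9, 1), log_marginal_beta_bernoulli(9, 1))

print("5 heads, 5 tails:", balanced)
print("9 heads, 1 tail: ", skewed)
```

Note that the flexible model does not win automatically: averaging the likelihood over the prior penalizes it when the data are equally well explained by the simpler model.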

This can be very difficult to compute or estimate if it is not available in closed form. As the dimension of $\theta$ grows, the integral becomes increasingly expensive to approximate.
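
When no closed form is available, one simple (and often inefficient) option is to draw parameters from the prior and average the likelihood. The sketch below does this for an assumed coin-flip model with a Uniform(0, 1) prior on the heads probability, where the exact answer is known and can be used as a check. The function name `mc_marginal_likelihood`, the sample size, and the data (9 heads, 1 tail) are illustrative assumptions.

```python
import math
import random

def mc_marginal_likelihood(heads, tails, n_samples=200_000, seed=0):
    """Estimate the integral of Likelihood(theta) * Prior(theta) d(theta)
    by averaging the likelihood over draws of theta from the prior."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_samples):
        theta = rng.random()  # one draw from the Uniform(0, 1) prior
        total += theta ** heads * (1.0 - theta) ** tails
    return total / n_samples

# Exact value for this model: B(heads + 1, tails + 1) = B(10, 2) = 1/110.
exact = math.exp(math.lgamma(10) + math.lgamma(2) - math.lgamma(12))
estimate = mc_marginal_likelihood(9, 1)

print("exact:   ", exact)
print("estimate:", estimate)
```

This works tolerably for a one-dimensional $\theta$, but in higher dimensions most prior draws land where the likelihood is negligible, so far more samples are needed for the same accuracy.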

$p(x_{new}|D_{train}) = \sum_{k=1}^K p(x_{new}|m_k,D_{train})\, p(m_k|D_{train})$: each term is (the predictive density under model $m_k$) $\times$ (the posterior probability of that model, which is proportional to its marginal likelihood times its prior).

<latex>\begin{align*}
p(x|D) &= \sum_{k=1}^K p(x|D,M_k)\, p(M_k|D) \\
p(x|D,M_k) &= \int p(x|\theta_k,M_k)\, p(\theta_k | M_k,D)\, d\theta_k && \text{the predictive density for the $k$th model $M_k$} \\
p(M_k|D) &= \frac{p(D|M_k)\,p(M_k)}{\sum_{j=1}^K p(D|M_j)\,p(M_j)}
\end{align*}</latex>

- $p(D|M_k)$ is the **marginal likelihood**
- $p(M_k|D)$ are the **Bayesian model weights**
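
The weights and the averaged prediction can be sketched in a few lines. The example below assumes two hypothetical coin-flip models evaluated on 9 heads and 1 tail: a fair coin, with marginal likelihood $0.5^{10}$, and a Bernoulli model with a uniform prior on the heads probability, with marginal likelihood $1/110$. The function names, the equal model priors, and the data are illustrative assumptions.

```python
import math

def model_posteriors(log_marginals, priors):
    """Bayesian model weights p(M_k | D) from log p(D | M_k) and p(M_k).

    Works in log space and subtracts the maximum before exponentiating,
    so very small marginal likelihoods do not underflow to zero.
    """
    log_joint = [lm + math.log(p) for lm, p in zip(log_marginals, priors)]
    shift = max(log_joint)
    unnormalized = [math.exp(v - shift) for v in log_joint]
    total = sum(unnormalized)
    return [u / total for u in unnormalized]

def bma_predictive(per_model_predictive, weights):
    """p(x | D) as the weighted sum of the per-model predictive densities."""
    return sum(w * p for w, p in zip(weights, per_model_predictive))

# log p(D | M_k) for the two assumed models, given 9 heads and 1 tail.
log_marginals = [10 * math.log(0.5), math.log(1.0 / 110.0)]
priors = [0.5, 0.5]  # equal prior probability for each model (an assumption)

weights = model_posteriors(log_marginals, priors)

# Per-model probability that the next flip is heads:
# the fair coin gives 0.5; the uniform-prior model gives (9 + 1) / (10 + 2).
p_heads_per_model = [0.5, 10.0 / 12.0]
p_heads = bma_predictive(p_heads_per_model, weights)

print("model weights:", weights)
print("p(next flip is heads | D):", p_heads)
```

The averaged prediction lies between the two per-model predictions and is pulled toward the model with the larger weight.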

In practice, Bayesian model averaging doesn't work very well compared to other ensemble methods.