Rather than exploring the entire space of all variables, try to get a good representative **sample**. A sample is an assignment of values to all variables.

Univariate samples are relatively simple to draw. Multivariate situations are more difficult.

[http://en.wikipedia.org/wiki/Monte_Carlo_method Monte Carlo methods] estimate the expectation of a function $f(X)$.

- Simulate a set of $n$ samples $X^1, \dots, X^n$ from the underlying distribution
  - The underlying distribution should be induced by a known probabilistic network
- Compute the arithmetic average of $f(X)$ over all samples to produce the **sample mean**
  - $\hat{\mu}(f) = \frac{1}{n}\sum_{i=1}^n f(X^i)$

Since the sample mean is a function of the samples, it is itself a random variable with a distribution distinct from that of $f(X)$. The expectation of the sample mean is the mean of $f$, and the variance of the sample mean is $\frac{\sigma^2}{n}$, where $\sigma^2$ is the variance of $f$.
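A minimal sketch of the sample-mean estimator above (the Bernoulli distribution and the choice $f(x) = x$ are illustrative, not from the source):

```python
import random

def monte_carlo_mean(sample, f, n=10_000):
    """Estimate E[f(X)] as the arithmetic average of f over n samples."""
    return sum(f(sample()) for _ in range(n)) / n

random.seed(0)
# Illustrative example: X ~ Bernoulli(0.7) and f(x) = x, so E[f(X)] = 0.7.
est = monte_carlo_mean(lambda: 1 if random.random() < 0.7 else 0, lambda x: x)
```

With $n = 10{,}000$ samples the standard deviation of the estimate is about $\sqrt{0.21/10^4} \approx 0.005$, matching the $\frac{\sigma^2}{n}$ variance stated above.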

Also called **logic sampling**. Estimates a probability $Pr(\alpha)$ by examining samples; each full sample is independent of the others. It uses Monte Carlo simulation to estimate a mean, then uses that mean as an estimate of $Pr(\alpha)$.

Uses a direct sampling function which maps variable instantiations to binary output. The sampling function returns 1 if an event is true, and 0 otherwise.

No evidence:

- order nodes in topological order
- sample variables one by one, conditioning each variable on the previously sampled values of its parents
- take mean of all samples
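The steps above can be sketched for a toy two-node network $A \rightarrow B$ (the CPT values are illustrative assumptions, not from the source):

```python
import random

def forward_sample():
    """Sample the toy network A -> B in topological order."""
    a = random.random() < 0.6                 # P(A=1) = 0.6 (illustrative)
    p_b = 0.9 if a else 0.2                   # P(B=1 | A) (illustrative)
    b = random.random() < p_b
    return {"A": a, "B": b}

random.seed(0)
samples = [forward_sample() for _ in range(20_000)]
# The sample mean of the indicator for B=1 estimates
# P(B=1) = 0.6*0.9 + 0.4*0.2 = 0.62.
p_b1 = sum(s["B"] for s in samples) / len(samples)
```

Sampling in topological order guarantees that every parent is instantiated before its child is sampled.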

With evidence $\beta$, we want to estimate a conditional probability $Pr(\alpha|\beta)$. This is called **rejection sampling** because it effectively "rejects" (ignores) samples in which the evidence isn't true.

- Use direct sampling to get $n$ samples
- Estimate $Pr(\alpha|\beta) = \frac{Pr(\alpha,\beta)}{Pr(\beta)}$ as the ratio of the number of samples in which both $\alpha$ and $\beta$ are true to the number of samples in which the evidence $\beta$ is true.

Its variance equals $\frac{P(\alpha|\beta)P(\neg \alpha|\beta)}{n P(\beta)}$. So rejection sampling works best when the probability of the evidence is larger.
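A sketch of rejection sampling on the same kind of toy network $A \rightarrow B$, estimating $P(A{=}1 \mid B{=}1)$ (CPT values are illustrative assumptions):

```python
import random

def sample_ab():
    """One full sample of the toy network A -> B (illustrative CPTs)."""
    a = random.random() < 0.6                  # P(A=1) = 0.6
    b = random.random() < (0.9 if a else 0.2)  # P(B=1 | A)
    return a, b

random.seed(1)
samples = [sample_ab() for _ in range(50_000)]
both = sum(1 for a, b in samples if a and b)   # alpha and evidence both true
evid = sum(1 for a, b in samples if b)         # evidence true
# Ratio estimate: P(A=1, B=1) / P(B=1) = 0.54 / 0.62, about 0.871.
p_a_given_b = both / evid
```

Note that only the `evid` samples contribute information; if $P(\beta)$ were small, most samples would be wasted, which is exactly the variance problem stated above.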

Rather than the direct sampling function, use the **importance sampling function**:
$\tilde \alpha (x) = P(x) / Q(x)$ when $\alpha$ is true at instantiation $x$, and $0$ otherwise.

Generation of random samples from an easy **proposal distribution** Q. Works better than direct sampling for estimating probabilities of rare events or when estimating conditional probabilities.

Also called the **importance distribution**.
The proposal distribution must be positive wherever the target distribution is positive.
Convergence is faster when the proposal distribution Q is closer to the target distribution. Bucket elimination can also help the sampling process.

$P(e) = \sum_z P(z,e) = \sum_z \frac{P(z,e)}{Q(z)} Q(z) = E_Q\left[\frac{P(z,e)}{Q(z)}\right] = E_Q[w(z)]$, where $w(z) = \frac{P(z,e)}{Q(z)}$ is the importance weight of sample $z$.

Convergence is guaranteed by the law of large numbers. The estimator is unbiased, and its variance can be computed as a function of $E_Q(w)$ and the number of samples $T$.
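A minimal sketch of the identity $P(e) = E_Q[w(z)]$: draw $z \sim Q$ and average the weights $w(z) = P(z,e)/Q(z)$. The numbers below are illustrative assumptions, chosen so the true answer is $P(e) = 0.05$:

```python
import random

# Toy joint P(z, e) with the evidence e fixed (illustrative numbers):
# P(z=0, e) = 0.02 and P(z=1, e) = 0.03, so P(e) = 0.05.
p_joint = {0: 0.02, 1: 0.03}
q = {0: 0.5, 1: 0.5}          # easy (uniform) proposal distribution over z

random.seed(0)
n = 100_000
total = 0.0
for _ in range(n):
    z = 0 if random.random() < q[0] else 1   # draw z ~ Q
    total += p_joint[z] / q[z]               # importance weight w(z)
p_e = total / n                              # sample mean of w approximates E_Q[w(z)] = P(e)
```

Because every sample contributes a weight rather than being rejected, no work is wasted even when $P(e)$ is small.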

Importance sampling can also apply to marginals given evidence. In this case, convergence relies on the weak law of large numbers, and the sampler will be biased.

Choosing a very “cheap” Q distribution can make each sample faster to compute, at the cost of a higher rejection rate.

Adaptive importance sampling methods will update Q after sampling.
Use an **estimator**: a function of the samples that produces an estimate of the unknown parameter of the sampling distribution.

$P(X) = (0.3, 0.7)$, with $g(x) = 40$ if $x = 0$ and $g(x) = 50$ if $x = 1$. Estimate $E_P[g(X)] = 40 \cdot 0.3 + 50 \cdot 0.7 = 47$.
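The worked example above can be checked with a sample-mean estimator (the distribution and $g$ come from the example itself; the sample size is an arbitrary choice):

```python
import random

# From the example: P(X=0) = 0.3, P(X=1) = 0.7, g(0) = 40, g(1) = 50.
# The true expectation is E_P[g(X)] = 40*0.3 + 50*0.7 = 47.
g = {0: 40, 1: 50}

random.seed(0)
n = 50_000
est = sum(g[1 if random.random() < 0.7 else 0] for _ in range(n)) / n
```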

Transforms a Bayesian network into a **likelihood weighting network**. Instead of rejecting mismatched samples, fix the evidence variables and weight each sample by the likelihood of the evidence. Generates samples quickly, and converges to the exact posterior marginals. Has variance no greater than that of direct sampling.

Sample non-evidence variables in topological order.
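A sketch of likelihood weighting on a toy network $A \rightarrow B$ with evidence $B{=}1$ (the CPT values are illustrative assumptions). Only the non-evidence variable $A$ is sampled; each sample carries a weight equal to the likelihood of the evidence:

```python
import random

# Toy network A -> B with evidence B = 1 (illustrative CPTs).
random.seed(0)
num, den = 0.0, 0.0
for _ in range(50_000):
    a = random.random() < 0.6          # sample A from its prior, P(A=1) = 0.6
    w = 0.9 if a else 0.2              # weight = likelihood of evidence, P(B=1 | A)
    den += w
    if a:
        num += w
# Weighted ratio estimates P(A=1 | B=1) = 0.54 / 0.62, about 0.871.
p_a_given_b = num / den
```

Unlike rejection sampling, no sample is discarded: samples that make the evidence unlikely simply receive a small weight.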

Iterative Join Graph Propagation sampling. Yields better approximations of $P(X|E)$ than mini-bucket elimination. Output is the same as mini-bucket clusters. Currently (2013) the best-performing importance sampling scheme.

Update the proposal distribution Q while sampling. This is a well-studied task.

[http://en.wikipedia.org/wiki/Markov_chain_Monte_Carlo Markov Chain Monte Carlo] applies to irreducible and recurrent Markov chains. Each sample depends on the previous sample, so convergence to the target distribution may be slow. The convergence speed (the "mixing rate") is an important topic of research.

Irreducible means there is a nonzero probability of reaching any state from any other state in a finite number of steps.

Recurrent means the chain returns to any state with probability 1. Positive-recurrent and aperiodic chains are ergodic.

**Burn-in** is an initial sampling period in which samples are typically thrown away.

Gibbs Sampling is one common use of MCMC.
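A sketch of Gibbs sampling for two binary variables, resampling each variable from its conditional given the other and discarding burn-in samples as described above. The conditionals are illustrative assumptions, chosen to be consistent with a joint where $P(A{=}1) = 0.6$:

```python
import random

# Illustrative conditionals, consistent with a joint where P(A=1) = 0.6,
# P(B=1|A=1) = 0.9, and P(B=1|A=0) = 0.2.
p_a_given_b = {1: 0.54 / 0.62, 0: 0.06 / 0.38}   # P(A=1 | B)
p_b_given_a = {1: 0.9, 0: 0.2}                   # P(B=1 | A)

random.seed(0)
a, b = 0, 0
burn_in, n = 1_000, 100_000
count_a = 0
for t in range(burn_in + n):
    a = 1 if random.random() < p_a_given_b[b] else 0   # resample A | B
    b = 1 if random.random() < p_b_given_a[a] else 0   # resample B | A
    if t >= burn_in:                                   # discard burn-in samples
        count_a += a
p_a = count_a / n          # estimates the marginal P(A=1), about 0.6
```

Because each sample depends on the previous one, the effective sample size is smaller than $n$; this is the mixing-rate issue noted above.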

Use cutset conditioning to reduce the width of the graph…