Michael Data

Interpretation of Probability

$P(a)$ or $P(a|b)$
* $a$ is a proposition that is true or false
* Frequentist or Bayesian interpretation of the value?
Frequentist: $P(a)$ is the relative frequency with which $a$ occurs in repeated trials
Bayesian: $P(a)$ is a degree of belief of an “agent” that proposition $a$ is true

Properties of Random Variables

Values of the variable are mutually exclusive and exhaustive.


The variable $A$ takes discrete values, $a \in \{a_1, \dots, a_m\}$. A discrete distribution can be represented as a table with one probability per value. Alternatively, the distribution can be represented by some function of the values.


The variable takes continuous values, $a \in \mathbb{R}$. The distribution is typically represented with a density function. The height of the density curve has no upper bound: $p(x)$ may exceed 1, as long as the total area under the curve is 1. A density function satisfies:

# $p(x) \geq 0$
# $\int_{-\infty}^{\infty} p(x) \, dx = 1$
# $P(l \leq x \leq r) = \int_l^r p(x) \, dx$
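
As a rough numerical check of the density properties listed above, the sketch below uses a uniform density on $[0, 0.5]$. This density is an illustrative choice rather than something from these notes, and the helper names `uniform_density` and `integrate` are hypothetical. The density has height 2 on its support, yet its total area is still 1.

```python
def uniform_density(x, low=0.0, high=0.5):
    """Density of a uniform distribution on [low, high] (assumed example)."""
    return 1.0 / (high - low) if low <= x <= high else 0.0


def integrate(f, left, right, steps=100_000):
    """Midpoint-rule approximation of the integral of f over [left, right]."""
    width = (right - left) / steps
    return sum(f(left + (k + 0.5) * width) for k in range(steps)) * width


# The density height exceeds 1 inside the support.
peak = uniform_density(0.25)

# The total area under the density is (approximately) 1.
total = integrate(uniform_density, -1.0, 1.0)

# The probability of an interval is the area under the density over it.
prob = integrate(uniform_density, 0.1, 0.3)

print(peak, round(total, 6), round(prob, 6))
```

The interval probability here is $2 \times 0.2 = 0.4$, which the numerical integral reproduces up to rounding.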

Sets of Random Variables

e.g. consider discrete random variables A,B,C. Each takes m different values, with $m > 2$.

$P(a,b,c)$ is a joint distribution. It can be represented as a table with $m^3$ probabilities, one for each combination of values. $\sum_{a,b,c} P(a,b,c) = 1$.
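
A minimal sketch of such a table, assuming $m = 3$ and randomly generated probabilities (both are illustrative assumptions). The table is stored as a dictionary keyed by the tuple $(a, b, c)$.

```python
import itertools
import random

m = 3  # number of values per variable (assumed for illustration)
rng = random.Random(0)  # fixed seed so the table is reproducible

# One unnormalized weight per combination (a, b, c), then normalize to sum to 1.
weights = {abc: rng.random() for abc in itertools.product(range(m), repeat=3)}
normalizer = sum(weights.values())
joint = {abc: w / normalizer for abc, w in weights.items()}

print(len(joint))           # number of table entries, m ** 3
print(sum(joint.values()))  # total probability, 1 up to rounding
```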

Conditional Distributions

e.g. $P(a | b_j,c_k)$ can now be represented as a table of $m$ values, because the values of $B$ and $C$ have been fixed. The conditioned variables are either known or assumed to have those values. $\sum_{i = 1}^m p(a_i|b_j,c_k) = 1$.

If we were instead to plot $p(a|b)$ for a fixed $a$ across different values of $b$, the result is not a distribution over $b$ and does not have to integrate or sum to 1.
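
The difference between summing over the conditioned variable and summing over the conditioning variable can be seen in a small sketch. The two-variable table and the helper `conditional` are hypothetical and chosen only for illustration.

```python
# Hypothetical joint table p(a, b): a takes 3 values, b takes 2.
joint = {
    (0, 0): 0.10, (0, 1): 0.30,
    (1, 0): 0.20, (1, 1): 0.10,
    (2, 0): 0.10, (2, 1): 0.20,
}


def conditional(joint, a, b):
    """Return p(a | b) = p(a, b) / p(b), with p(b) summed from the joint."""
    p_b = sum(p for (_, b_val), p in joint.items() if b_val == b)
    return joint[(a, b)] / p_b


# Fixing b and summing over a gives 1: p(. | b) is a distribution over a.
sum_over_a = sum(conditional(joint, a, 0) for a in range(3))

# Fixing a and summing over b need not give 1.
sum_over_b = sum(conditional(joint, 0, b) for b in range(2))

print(round(sum_over_a, 6), round(sum_over_b, 6))
```

With these numbers, $p(a=0|b=0) = 0.25$ and $p(a=0|b=1) = 0.5$, so the sum over $b$ is 0.75 rather than 1.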

Factorization or Chain Rule

How do we get from a joint distribution to conditional distributions?

$p(a,b) = p(a|b) p(b)$

$p(a,b,c) = p(a|b,c) p(b,c) = p(a|b,c) p(b|c) p(c)$

The ordering of the variables doesn't matter; any ordering gives a valid factorization.
$p(a,b,c) = p(b|a,c) p(a,c) = p(b|a,c) p(c|a) p(a)$
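
As a sanity check, the sketch below rebuilds every entry of a random joint table from the product of conditionals, using the two orderings written above. The helpers `marginal` and `chain_product`, the seed, and $m = 3$ are assumptions made for illustration. Variables are indexed 0, 1, 2 for $a$, $b$, $c$.

```python
import itertools
import random


def marginal(joint, keep):
    """Sum the joint over every variable whose index is not in keep."""
    out = {}
    for assignment, p in joint.items():
        key = tuple(assignment[i] for i in keep)
        out[key] = out.get(key, 0.0) + p
    return out


def chain_product(joint, assignment, order):
    """Multiply the conditionals p(x | later variables) along the given order."""
    product = 1.0
    for pos in range(len(order)):
        numer_vars = tuple(sorted(order[pos:]))
        denom_vars = tuple(sorted(order[pos + 1:]))
        numer = marginal(joint, numer_vars)[tuple(assignment[i] for i in numer_vars)]
        denom = 1.0
        if denom_vars:
            denom = marginal(joint, denom_vars)[tuple(assignment[i] for i in denom_vars)]
        product *= numer / denom  # one conditional factor
    return product


m = 3
rng = random.Random(0)
weights = {abc: rng.random() for abc in itertools.product(range(m), repeat=3)}
total = sum(weights.values())
joint = {abc: w / total for abc, w in weights.items()}

# Order (0, 1, 2) is p(a|b,c) p(b|c) p(c); order (1, 2, 0) is p(b|a,c) p(c|a) p(a).
max_error = max(
    abs(chain_product(joint, abc, order) - p)
    for abc, p in joint.items()
    for order in [(0, 1, 2), (1, 2, 0)]
)
print(max_error)  # tiny, floating-point rounding only
```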

Deriving Bayes' Rule

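Factorization provides an easy derivation of Bayes' rule.
Suppose we want an expression for $p(a|b,c)$. Factorize the joint distribution in two different ways, then divide both sides by $p(b,c)$: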

$p(a,b,c) = p(a|b,c) p(b,c) = p(b,c | a) p(a)$

$p(a|b,c) = \frac{p(b,c|a) p(a)}{p(b,c)}$
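
A small numerical sketch of this formula, for one fixed observed pair $(b, c)$. The prior and likelihood values are made up for illustration. The denominator $p(b,c)$ is computed by summing the numerator over all values of $a$.

```python
# Hypothetical prior p(a) over two values of a.
prior = {0: 0.8, 1: 0.2}

# Hypothetical likelihood p(b, c | a) for one fixed observed pair (b, c).
likelihood = {0: 0.1, 1: 0.6}

# Denominator p(b, c), obtained by summing the numerator over a.
evidence = sum(likelihood[a] * prior[a] for a in prior)

# Bayes' rule: p(a | b, c) = p(b, c | a) p(a) / p(b, c).
posterior = {a: likelihood[a] * prior[a] / evidence for a in prior}

print(round(evidence, 6), {a: round(p, 6) for a, p in posterior.items()})
```

Although $a = 1$ has prior probability 0.2, the observation is six times more likely under $a = 1$, so the posterior shifts to 0.6.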

Law of Total Probability

How can we get from a joint distribution to unconditional a.k.a. marginal distributions?
For discrete random variables, sum out the extra variables:
$p(a) = \sum_{b} p(a,b)$

$p(a) = \sum_{b,c} p(a,b,c)$

$p(b,c) = \sum_{a} p(a,b,c)$

$p(a|c) = \sum_{b} p(a,b|c)$

For continuous random variables, integrate out the extra variables.
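
For the discrete case, the sums above translate directly into code. This is a sketch on a randomly generated joint table with $m = 3$; the table, seed, and variable names are illustrative assumptions.

```python
import itertools
import random

m = 3
vals = range(m)
rng = random.Random(1)

# Random joint table p(a, b, c), normalized to sum to 1.
weights = {abc: rng.random() for abc in itertools.product(vals, repeat=3)}
total = sum(weights.values())
joint = {abc: w / total for abc, w in weights.items()}

# p(a) = sum over b, c of p(a, b, c)
p_a = {a: sum(joint[(a, b, c)] for b in vals for c in vals) for a in vals}

# p(b, c) = sum over a of p(a, b, c)
p_bc = {(b, c): sum(joint[(a, b, c)] for a in vals) for b in vals for c in vals}

# p(c) = sum over b of p(b, c)
p_c = {c: sum(p_bc[(b, c)] for b in vals) for c in vals}

# p(a | c) = sum over b of p(a, b | c), where p(a, b | c) = p(a, b, c) / p(c)
p_a_given_c = {
    (a, c): sum(joint[(a, b, c)] / p_c[c] for b in vals)
    for a in vals
    for c in vals
}

print(sum(p_a.values()), sum(p_bc.values()))
```

Each marginal is itself a distribution, and for every fixed $c$ the values $p(a|c)$ sum to 1 over $a$.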

Conditional Independence

A conditional independence assumption allows reduction of the amount of information required when storing or working with a joint distribution.

e.g. for discrete random variables A,B,C,D which each take m values:
By factorization, $p(a,b,c,d) = p(a|b,c,d) p(b|c,d) p(c|d) p(d)$
The first factor alone, $p(a|b,c,d)$, is a table of $m^4$ terms!

We can assume that A is conditionally independent of C and D given B: $p(a|b,c,d) = p(a|b)$

Now we factorize like this: $p(a,b,c,d) = p(a|b) p(b|c,d) p(c|d) p(d)$
Now the largest table in the factorized representation, $p(b|c,d)$, has only $m^3$ terms.
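
The savings can be counted directly. In the sketch below, each conditional table has $m$ entries for the variable itself times $m$ for each conditioning variable. The helper `table_sizes` is hypothetical and $m = 10$ is an arbitrary choice.

```python
def table_sizes(m, parent_counts):
    """Size of each conditional table, given its number of conditioning variables."""
    return [m ** (parents + 1) for parents in parent_counts]


m = 10  # assumed number of values per variable

# p(a|b,c,d) p(b|c,d) p(c|d) p(d): A has 3 conditioning variables, B 2, C 1, D 0.
full = table_sizes(m, [3, 2, 1, 0])

# p(a|b) p(b|c,d) p(c|d) p(d): A now conditions on B only.
reduced = table_sizes(m, [1, 2, 1, 0])

print(max(full), sum(full))        # 10000 11110
print(max(reduced), sum(reduced))  # 1000 1210
```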

Genetics Example

G = maternal grandmother's genes
M = mother's genes
Y = your genes

Y and G are conditionally independent, given M.

$p(Y|M,G) = p(Y|M)$
$G \to M \to Y$
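
The chain structure can be checked numerically. The sketch below treats each "gene" as a binary variable with made-up probabilities, which is a strong simplification used only for illustration. It builds the joint from $p(G)\,p(M|G)\,p(Y|M)$ and confirms that $p(Y|M,G) = p(Y|M)$, while $Y$ and $G$ remain dependent when $M$ is not given.

```python
import itertools

# Hypothetical binary variables; all numbers below are made up.
p_g = {0: 0.7, 1: 0.3}
p_m_given_g = {0: {0: 0.9, 1: 0.1}, 1: {0: 0.4, 1: 0.6}}  # indexed [g][m]
p_y_given_m = {0: {0: 0.8, 1: 0.2}, 1: {0: 0.3, 1: 0.7}}  # indexed [m][y]

# Joint table keyed by (g, m, y), built from the chain G -> M -> Y.
joint = {
    (g, m, y): p_g[g] * p_m_given_g[g][m] * p_y_given_m[m][y]
    for g, m, y in itertools.product((0, 1), repeat=3)
}


def y_given_m_g(y, m, g):
    """p(y | m, g) computed from the joint table."""
    return joint[(g, m, y)] / sum(joint[(g, m, y2)] for y2 in (0, 1))


def y_given_m(y, m):
    """p(y | m) computed from the joint table by summing out g."""
    numer = sum(joint[(g, m, y)] for g in (0, 1))
    denom = sum(joint[(g, m, y2)] for g in (0, 1) for y2 in (0, 1))
    return numer / denom


def y_given_g(y, g):
    """p(y | g) computed from the joint table by summing out m."""
    numer = sum(joint[(g, m, y)] for m in (0, 1))
    denom = sum(joint[(g, m, y2)] for m in (0, 1) for y2 in (0, 1))
    return numer / denom


# Largest difference between p(y | m, g) and p(y | m) over all entries.
max_gap = max(abs(y_given_m_g(y, m, g) - y_given_m(y, m)) for g, m, y in joint)

print(max_gap)                            # rounding error only
print(y_given_g(1, 0), y_given_g(1, 1))   # differ: Y depends on G without M
```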

Weather Example

I = Irvine temperature
B = Beijing temperature
M = month of year

I and B are conditionally independent, given M: $p(I|B,M) = p(I|M)$.

Linear Correlation

$\displaystyle \rho_{ij} = \frac{\mathrm{cov}(i,j)}{\sigma_i \sigma_j}$
This is a scaled covariance, and useful…
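
A minimal sketch of the formula above on made-up data. It uses the population form, dividing by $n$, which is an assumption; the sample form dividing by $n - 1$ gives the same ratio. The function name `pearson` is hypothetical. Rescaling one variable changes the covariance but leaves $\rho$ unchanged, which is the sense in which it is a scaled covariance.

```python
import math


def pearson(xs, ys):
    """Correlation: covariance divided by the product of standard deviations."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / n
    std_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs) / n)
    std_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys) / n)
    return cov / (std_x * std_y)


xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.0, 1.0, 4.0, 3.0, 6.0]  # made-up data

rho = pearson(xs, ys)

# Changing the units of y multiplies the covariance by 100 but not rho.
rho_rescaled = pearson(xs, [100.0 * y for y in ys])

print(round(rho, 3), round(rho_rescaled, 3))
```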