Michael Data

Tricks

Trace Trick
The trace is the sum of the diagonals of a matrix. $tr(A) = \sum_i A_{i,i}$
The trace operator has the cyclic permutation property: $tr(x^T A x) = tr(A x^T x) = tr(x^T x A)$

Dealing with Mahalanobis terms
Expand out $(x-\mu)^T\Sigma^{-1}(x-\mu)$ into $(x^T\Sigma^{-1}x) - (x^T\Sigma^{-1}\mu) - (\mu^T\Sigma^{-1}x) + (\mu^T\Sigma^{-1}\mu) $

Glossary

sufficient statistics = A representation that preserves all information of original data. “The minimum data needed to use a model”

Interpretation of Probability
$P(a)$ or $P(a|b)$
* $a$ is a proposition that is true or false
* Frequentist or Bayesian interpretation of the value?
Frequentist: $P(a)$ is the relative frequency with which $a$ occurs in repeated trials
Bayesian: $P(a)$ is a degree of belief of an “agent” that proposition $a$ is true

Mixing Discrete and Real-valued Variables
(See problem 8 of HW1)

$C$ is binary, e.g. (0=Healthy,1=Flu)
$X$ is real valued, e.g. temperature of a patient

Can think of:
* $P(X|C)P(C)$
Conditioned on binary value of C, this is a mixture of two continuous distributions
* $P(C|X)P(X)$
Conditioned on continuous value X, this is a probability across the categories of C

Multivariate Models
(Text section 2.5 and note set 2)
$X$ is a d-by-1 vector of real values. $\mu$ is the expected value vector for each $X$.
$\mu_i = \sigma X_i P(X_i)$

Covariance

  • $\sigma_{ij}$ = covariance between $x_i$ and $x_j$
    • = $E_{p(x_i,x_j)}[(x_i - \mu_i)(x_j - mu_j)]$
  • $\Sigma$ is the covariance matrix. It is symmetric: $\Sigma_{ij} = \Sigma_{ji}$

Linear Correlation

  • $\rho_{ij} = \frac{cov(i,j)}{\sigma_i \sigma_j}$

Limits of Correlation Measures: These are summaries, and fail to capture important aspects in certain circumstances. e.g. a square toroidal relationship between two variables.

Gaussian Multivariate Density Function

Characterized by parameters $\mu$ and $\Sigma$, the mean and covariance matrix.

  • For $d$ dimensions, the number of parameters needed grows as $d^2$ due to the covariance matrix.
  • The Gaussian assumes only pairwise linear dependencies.

$\displaystyle P(x) = \frac{1}{(2\pi)^{d/2}|\Sigma|^{1/2}} e^{-\frac{1}{2}[(x-\mu)^t \Sigma (x-\mu)]} = \frac{1}{C} e^{-\frac{1}{2} d(x,\mu)}$

This $d(x,\mu)$ is effectively a distance function between the target and the mean. As the distance increases, the exponential component of the equation becomes a smaller and smaller value.

$\displaystyle
d(x_1,x_2) = 
\begin{pmatrix}
 x_1 & x_2 \\
\end{pmatrix}
\begin{pmatrix}
 1 & \rho \\
 \rho & 1
\end{pmatrix}
\begin{pmatrix}
 x_1 \\
 x_2
\end{pmatrix}
=
\frac{x_1^2}{\sigma_1^2} + \frac{x_2^2}{\sigma_2^2}
\arrowleft$

=Note Set 2=
Sections 10.1-10.2 in the text

Conditional Independence

More general than mutual independence.
$a,b$ are conditionally independent given $c$ iff $(a,b|c) = p(a|c)p(b|c)$.
Equivalently:

$$P(a|b,c) = p(a|c) , P(b|a,c) = p(b|c)$$

Marginal Independence

$\Rightarrow p(a,b) = p(a)p(b)$

$$p(a|b) = p(a) , p(b|a) = p(b)$$

Isn't in general implied by marginal independence.
Learning one value doesn't change distribution over the other. This is frequently an assumption used to build models.

Markov Assumption

First-order: Given the previous entry in a sequence, the next entry is independent on all previous parts of the sequence.

$$P(X_t|X_{t-1}) = P(X_t|X_{t-1},X_{t-2},\dots$$

For higher orders, $P(X_t)$ will depend on more than one other variable.

Naive Bayes Assumption

$C$: random variable with values $c_1\dotsc_m$ is a “class variable”.
With $d$ real-valued features $X = (x_1,x_2,\dots x_d)$

Want to infer $(C|X)$.
Using Bayes' Rule over all values of $C$: $P(C|X) \alpha P(X|C)P(C) $
Don't worry about normalization term (evidence) because the most likely $C$ can be found using these proportions.

This model tends to be too overconfident, but it tends to do well for ranking or comparing classes.

Graphical Models

“Directed graphical models” were invented in genetics in the 1920s, then were revived in the 1970s as “Bayesian Networks”. They are a systematic way to represent and compute sets of random variables and associated conditional independence assumptions.

Associate each random variable with a node in a directed graph. A directed edge is drawn from $x_i$ to $x_j$ when $x_j$ depends directly on $x_i$. There are no directed cycles allowed.

$$p(x_1,x_2,\dots) = \Pi{p(x_j|x_{j-1}}$$

Markov model: $p(x_1,x_2,\dots) = \Pi_{d=2}\left[p(x_j|parents(x_j))\right]p(x_1)$

Naive Bayes model: $p(x_1,x_2,\dots|C) = \Pi{p(x_j|C}p(x_1)$

Computing in Graphical Models

$p(d|a) = ?$ when $a \to b \to c \to d$
First expand the full joint: $p(d|a) = \sum_c\sum_b p(d,c,b|a)$
Then factor the join: $ p(d|a) = \sum_c\sum_b p(d|c,b,a) p(c|b,a) p(b|a) $
Then can simplify the conditionals by C.I. from the graph: $ p(d|a) = \sum_c p(d|c) \sum_b p(c|b) p(b|a) $
Exploiting the independence assumption allows for more efficient computing over these sums.

Maximum Likelihood Estimation

Maximum Likelihood Estimation

Check out Murphy text
* page 99 for example of MLE of Multivariate Normal

Note Set 3

Example 6 in describes a mixture model where the MLE solution is not obvious.

Note Set 4
Bayesian Parameter Estimation

Treat $\theta$ as a random variable. $p(\theta)$ = prior distribution/density on $\theta$ before considering any data. $p(D|\theta)$ = Likelihood.

$$ p(\theta|D) = \frac{p(D|\theta)p(\theta)}{p(D)} \alpha p(D|\theta)p(\theta) $$

So using a Binomial model with a Beta prior…

$$ = P(D|\theta) p(\theta) = \theta^r (1-\theta)^{1-r} * \theta^{\alpha-1} (1-\theta)^{\beta - 1} $$

$$ = \theta^{r + \alpha - 1} (1 - \theta)^{N - r + \beta - 1} = Beta(\alpha', \beta')$$

where $\alpha' = r + \alpha$ and $\beta' = N - r + \beta$.

In this case, the posterior mean = $E_{p(\theta|D)} [\theta] = \frac{r + \alpha}{N + \alpha + \beta}$. This means that if $\alpha = 1, \beta = 1$, or there is a “flat prior” and the posterior mean equals the mode of $p(\theta|D)$ and $p(\theta|D)$ goes to the MLE.

Sequential Learning in a Bayesian Framework

Two data sets D1 and D2 which have IID likelihood and parameters $\theta$.

Posterior is proportional to the Likelihood * Prior.

$$p(\theta|D)_1,D_2) \propto p(D_1,D_2 | \theta) p(\theta) $$

$$ = p(D_2| D_1, \theta) p(D_1|\theta) p(\theta) $$

Under the IID data assumption:

$$ = p(D_2| \theta) p(D_1|theta) p(\theta) $$

Where Naive Bayes refers to conditionally independent variables, the data samples are assumed to be conditionally independent here. In this way, new data can be evaluated sequentially.

Beta-Binomial Example

Let D1 be one success out of 4 trials and D2 is 5 successes out of 8 trials.
$p(\theta) = Beta(\alpha,\beta) \to Beta(1,1)$
$D_1 = p(\theta|D_1) = Beta(\alpha + r, \beta + N - r) \to Beta(2,4)$
$D_2 = p(\theta|D_1,D_2) = Beta(\alpha + r + r_2, \beta + N - r + N_2 - r_2) \to Beta(7,7)$

Predictive Densities

“Posterior Predictive Densities” in the textbook.
Given some training data $D_{train} = {x_1 \dots x_N}$, $p(D_{train}|\theta) \to $ Likelihood, assuming $x_i$ are conditionally independent given $\theta$.
Prior: $P(\theta|\pi)$
$\to p(\theta|D_{train}$ result is the posterior.

However, want to know $P(x_{new}|D_{train})$. Using the Law of Total Probability: $ = \int p(x_{new}, \theta | D_{train}) d\theta$

$$= \int p(x_{new} | \theta, x_1 \dots x_N) p(\theta|x_1 \dots x_N) d\theta$$

$$= \int p(x_{new} | \theta) p(\theta|x_1 \dots x_N) d\theta$$

= (prediction for x_new if \theta were the true value) (How likely the \theta value is, given the data)

This is in contrast to something like $P(x_{new}|\theta_{ML})$, because now we are using the entire posterior rather than only its mode.

Might not do this if the integral is too difficult, or if the data leads the prior to already be very “peaked” and well concentrated.

Gaussian Example

$\mathcal{N}(mu,\sigma^2$ with unknown mean and known variance. $p(\mu) \sim \mathcal{N}(\mu_0,\sigma^2)$.

To make predictions about new observations: $p(x_{new}|D_{train}) = \int p(x_{new}|\mu,\sigma^2) p(\mu|D_{train}) du $

$$ = \int \mathcal{N}(x_{new}| \mu, \sigma^2) \mathcal{N}(\mu | \mu_n, \sigma_n^2) du $$

$$ = \int \mathcal{N}(x_{new}| \mu_n, \sigma^2 + \sigma_n^2) du $$

The variance of this is the sum of the two sources of variance: one variance is known and based on the observed data. The other is the variance from the prior over the unknown mean of the data.

Bayesian Model Selection

Bayesian Model Selection
Check out
* Murphy text
section 5.3
* Barber Text available on class webpage.

== Midterm ==
Was around here.

==Classification==
Classification
===Logistic Regression==
Logistic Regression
Check out
* Murphy text p.245-251 on Newton-Raphson and related methods
* Murphy text p.261-266 on Stochastic gradient descent

===Neural Networks===
Neural Networks
Check out
* Murphy text
chapter 16.5
** chapter 28 on deep learning

Unsupervised Learning

Unsupervised Learning
* Expectation Maximization
Check out Murphy text
* p337-356 on Finite Mixture Models
* p363-365 on EM growth

Mixture Models

Mixture Models

Hidden Markov Models
* see p603-629 in the text, or chapter 17
* see p661 in the text for CRFs

Gibbs Sampling
* Check out collapsed method as Dirichlet mixture of Gaussians

Linear Regression

Linear Regression
In the Murphy text:
* See section 7.3
* See section 7.5 about fitting and time complexity
* See somewhere about the Bias-Variance tradeoff