Michael Data

Hidden Markov Models are a technique for dealing with sequence data.


HMMs are used for sequential data, but they can also be applied to continuous-time data by discretizing the time axis; for example, the frequency content within a window can serve as an observation.


The order of the model is the number of preceding states each state depends on; a standard HMM is first-order.

Observed Features

$x$ is a $d$-dimensional vector from $D=\{\underline{x}_1,\dots,\underline{x}_T\}$, and each observation depends on a hidden state defined within a discrete state space of size $K$. We assume that $x$ is conditionally independent of all other variables given its state. We also assume the states form a Markov chain.

$z_t$ is the state at time $t$, with $z_t \in \{1,\dots,K\}$.

States are assumed to have some type of transition model between them. Each state also requires an observation model.

The full joint is then $p(x_{1\dots T},z_{1\dots T}) = \prod_{t=1}^T p(x_t|z_t)\, p(z_t|z_{t-1})$, with the convention that $p(z_1|z_0)$ denotes the initial state distribution $p(z_1)$.


An HMM is parameterized by:

  • $K$ emission densities (emission distributions) $p(x_t|z_t=k,\theta_k)$
  • e.g. Gaussian: $\theta_k = \{\mu_k, \Sigma_k\}$
  • $A$, a $K$-by-$K$ Markov transition matrix
  • $A_{m,k}$ = entry for the transition from $m$ to $k$ = $P(z_t=k|z_{t-1}=m)$
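A minimal sketch of these ingredients with made-up numbers, evaluating the factorized joint $p(x_{1\dots T}, z_{1\dots T})$ in log-space (the model values and the `gauss`/`joint_log_prob` helpers are illustrative, not from the text):

```python
import numpy as np

# Hypothetical toy HMM: K = 2 states, 1-D Gaussian emissions.
K = 2
A = np.array([[0.9, 0.1],
              [0.2, 0.8]])          # A[m, k] = P(z_t = k | z_{t-1} = m)
pi = np.array([0.5, 0.5])           # initial state distribution p(z_1)
mu = np.array([0.0, 5.0])           # per-state emission means
sigma = np.array([1.0, 1.0])        # per-state emission std devs

def gauss(xt, k):
    """Emission density p(x_t | z_t = k) for a 1-D Gaussian."""
    return np.exp(-0.5 * ((xt - mu[k]) / sigma[k]) ** 2) / (sigma[k] * np.sqrt(2 * np.pi))

def joint_log_prob(x, z):
    """log p(x_{1..T}, z_{1..T}) under the factorization above."""
    lp = np.log(pi[z[0]]) + np.log(gauss(x[0], z[0]))
    for t in range(1, len(x)):
        lp += np.log(A[z[t - 1], z[t]]) + np.log(gauss(x[t], z[t]))
    return lp
```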

How to select $K$? It depends on the problem; there may be no “true” $K$ at all.


HMMs have been the industry standard in speech recognition for decades and remain close to the state of the art there. They are also used in text processing and bioinformatics.

Likelihood of Data

Given $\theta$ = the $K$ emission density parameters and the transition matrix,

    *   L(\theta) = p(D|\theta) &= p(x_{1\dots T}|\theta)
\\ &= \sum_{z_{1\dots T}} p(x_{1\dots T}, z_{1\dots T}|\theta) & \text{by LTP}

This sum is intractable to compute directly: it has $O(K^T)$ terms. Instead, use the graphical-model factorization.

    *   \alpha_T(j) &= p(z_{T}=j,x_{1\dots T}) & \text{for $j=1\dots K$}
\\ &= \sum_{i} p(z_{T}=j,z_{T-1}=i,x_{1\dots T}) & \text{by LTP}
\\ &= \sum_i p(x_T|z_T = j)\, p(z_T = j|z_{T-1} = i)\, p(z_{T-1}=i,x_{1\dots T-1}) & \text{by factorization}
\\ &= \sum_i p(x_T|z_T = j)\, p(z_T = j|z_{T-1} = i)\, \alpha_{T-1}(i)

Given $\alpha_{T-1}(i)$ for $i=1\dots K$, we can compute $\alpha_T(j)$ for all $j$ in time $O\left(K^2 + K f(d)\right)$.

  • $K^2$ for the transition matrix: the sum runs over all $i,j$ pairs
  • $f(d)$ is the cost of one emission-likelihood evaluation, $O(d^2)$ for Gaussian models

This is a nice recursive form that allows computing the likelihood in time $O\left(TK^2 + TK f(d)\right)$. This is known as the Forward Algorithm, the first step in the Forward-Backward Algorithm.
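A sketch of the Forward Algorithm under these assumptions, unscaled for clarity and with a caller-supplied emission function (the `forward` name and its interface are this note's invention; see the numerical-issues section for why log-space is used in practice):

```python
import numpy as np

def forward(x, pi, A, emission):
    """Forward algorithm: alpha[t, j] = p(z_t = j, x_{1..t}).

    pi is the initial state distribution, A the K-by-K transition matrix,
    and emission(x_t) returns a length-K vector of p(x_t | z_t = k).
    """
    T, K = len(x), len(pi)
    alpha = np.zeros((T, K))
    alpha[0] = pi * emission(x[0])
    for t in range(1, T):
        # O(K^2) transition sum plus O(K f(d)) emission weighting per step
        alpha[t] = emission(x[t]) * (alpha[t - 1] @ A)
    return alpha
```

The likelihood is then `alpha[-1].sum()`, since marginalizing $\alpha_T(j)$ over $j$ gives $p(x_{1\dots T})$.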

    *   p(z_t=j|x_{1\dots T}) &\propto p(z_t =j,x_{1\dots t},x_{t+1\dots T})
\\ &= p(x_{t+1\dots T}|z_t =j,x_{1\dots t})\, p(z_t=j , x_{1\dots t}) & \text{by factorization}
\\ &= p(x_{t+1\dots T}|z_t =j)\, p(z_t=j , x_{1\dots t}) & \text{by C.I.}
\\ &= p(x_{t+1\dots T}|z_t =j)\, \alpha_t(j)

Define $\beta_t(j) = p(x_{t+1 \dots T}|z_t = j)$, so that $p(z_t = j | x_{1 \dots T}) \propto \alpha_t(j)\, \beta_t(j)$. The $\beta_t(j)$s can also be computed recursively in time $O\left(TK^2 + TK f(d)\right)$.

To compute $p(z_t=j|x_{1\dots T})$ for all $t,j$,

  1. compute $\alpha_t(j)$s in the forward step
  2. compute $\beta_t(j)$s in the backward step
  3. $p(z_t=j|x_{1\dots T}) = \frac{\alpha_t(j) \beta_t(j)}{\sum_k \alpha_t(k) \beta_t(k)}$

(This is similar to the E-step of EM)
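The three steps above can be sketched for a toy discrete-emission HMM (all the numbers here are made up for illustration):

```python
import numpy as np

# Toy discrete-emission HMM.
pi = np.array([0.6, 0.4])
A = np.array([[0.7, 0.3],
              [0.4, 0.6]])
B = np.array([[0.9, 0.1],
              [0.2, 0.8]])          # B[k, s] = p(x_t = s | z_t = k)
x = [0, 1, 1]
T, K = len(x), 2

# 1. Forward step: alpha[t, j] = p(z_t = j, x_{1..t})
alpha = np.zeros((T, K))
alpha[0] = pi * B[:, x[0]]
for t in range(1, T):
    alpha[t] = B[:, x[t]] * (alpha[t - 1] @ A)

# 2. Backward step: beta[t, j] = p(x_{t+1..T} | z_t = j), with beta[T-1] = 1
beta = np.ones((T, K))
for t in range(T - 2, -1, -1):
    beta[t] = A @ (B[:, x[t + 1]] * beta[t + 1])

# 3. Normalize: gamma[t, j] = p(z_t = j | x_{1..T})
gamma = alpha * beta
gamma /= gamma.sum(axis=1, keepdims=True)
```

A useful sanity check: $\sum_j \alpha_t(j)\beta_t(j) = p(x_{1\dots T})$ for every $t$, so the row sums of `alpha * beta` should all agree.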

$\alpha_T(d) = p(z_T=d, x_{1 \dots T})$
Can compute $\alpha_T$ from $\alpha_{T-1}$ in time $O\left(K^2 + Kf(d)\right)$.
$p(z_T = d | x_{1 \dots T}) = \frac{\alpha_T(d)}{\sum_{k=1}^K \alpha_T(k)}$

$w_t(d) = p(z_t = d | x_{1 \dots T}) \propto \alpha_t(d) \beta_t(d)$

In total, to compute $\alpha_t,\beta_t$ for $t= 1\dots T$ is in $O(TK^2 + TKf(d))$. This is the forward-backward algorithm.

Forward-Backward Algorithm

The typical method for doing inference with an HMM is the Forward-Backward Algorithm.

$P(z_{t} = j | x_{1\dots T}) \propto P(z_{t}=j,x_{1\dots T})$
$ = P(z_{t}=j, x_{1\dots t}, x_{t+1 \dots T})$
$ = P(x_{t+1 \dots T}| z_{t}=j, x_{1 \dots t})\, P(z_{t}=j, x_{1 \dots t})$
Want to compute these two terms.

Forward Step

Consider the second term:
    *   P(z_{t}=j, x_{1 \dots t}) &= \sum_{z_{t-1}} p(z_{t}=j, z_{t-1}, x_{1 \dots t})
\\ &= \sum_{z_{t-1}} p(x_{t} | z_{t}=j)\, p(z_{t}=j | z_{t-1})\, p(z_{t-1} , x_{1 \dots t-1})

Then you can recurse by using the result of the previous time-step, the transition matrix, and the current evidence to compute the term for the current step.

This is $O(K^2)$ work per time-step to obtain all $K$ values. This is known as the forward step in the forward-backward algorithm.

Backward Step

Consider the first term:
    *   P(x_{t+1 \dots T}| z_{t}=j) &= \sum_{z_{t+1}} p(x_{t+1\dots T}, z_{t+1} | z_{t}=j)
\\ &= \sum_{z_{t+1}} p(x_{t+1 \dots T} | z_{t+1}, z_{t}=j)\, p(z_{t+1} | z_{t}=j)
\\ &= \sum_{z_{t+1}} p(x_{t+1 \dots T} | z_{t+1})\, p(z_{t+1} | z_{t}=j) & \text{by C.I.}
\\ &= \sum_{z_{t+1}} p(x_{t+2 \dots T} | z_{t+1})\, p(x_{t+1} | z_{t+1})\, p(z_{t+1} | z_{t}=j)
Now this is another recursive solution which can be computed in $O(K)$ for fixed $j$, $O(K^2)$ for $j=1\dots K$.
This is known as the backward step in the forward-backward algorithm.

Learning a Model

Gibbs Sampling

Gibbs sampling can be used, but EM is generally faster.

Viterbi Algorithm

The Viterbi algorithm computes the single most probable state sequence, and is sometimes used as an approximation to the forward-backward algorithm.
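A minimal Viterbi sketch for a discrete-emission HMM, using max-product dynamic programming in log-space (the `viterbi` name and interface are this note's invention):

```python
import numpy as np

def viterbi(x, pi, A, emission):
    """Most probable state path argmax_z p(z_{1..T}, x_{1..T}).

    emission(x_t) returns a length-K vector of p(x_t | z_t = k).
    """
    T, K = len(x), len(pi)
    delta = np.log(pi) + np.log(emission(x[0]))      # best log-prob ending in each state
    back = np.zeros((T, K), dtype=int)               # backpointers
    for t in range(1, T):
        scores = delta[:, None] + np.log(A)          # scores[prev, cur]
        back[t] = scores.argmax(axis=0)
        delta = scores.max(axis=0) + np.log(emission(x[t]))
    # Backtrack from the best final state.
    path = [int(delta.argmax())]
    for t in range(T - 1, 0, -1):
        path.append(int(back[t, path[-1]]))
    return path[::-1]
```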

Using EM

In the E-step, we want to compute $p(z_t=j|x_{1\dots T}) = w_t(j)$, a T-by-K matrix of membership weights.
Do this using the F-B algorithm with current parameters $\theta$.
Also need:

  • $E(N_j)$, the expected number of times in state $j$: $E(N_j) = \sum_{t = 1}^T w_t(j)$.
  • $E(N_{ij})$, the expected number of transitions from $i$ to $j$.

    *   E(N_{ij}) &= \sum_{t=1}^{T-1} p(z_t=i,z_{t+1}=j|x_{1\dots T})
\\ p(z_t=i,z_{t+1}=j,x_{1\dots T}) &= p(x_{t+2\dots T}|z_{t+1}=j)\, p(x_{t+1}|z_{t+1}=j)\, p(z_{t+1}=j|z_t=i)\, p(z_t=i,x_{1\dots t})
\\ &= \beta_{t+1}(j)\, p(x_{t+1}|z_{t+1}=j)\, A_{ij}\, \alpha_t(i)
\\ &\equiv \Phi_t(i,j)
\\ p(z_t=i,z_{t+1}=j|x_{1\dots T}) &= \frac{\Phi_t(i,j)}{\sum_{k_1,k_2} \Phi_t(k_1,k_2)}
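These E-step quantities can be sketched for a toy discrete-emission HMM (all numbers are illustrative; `xi` holds the normalized pairwise posteriors $\Phi_t(i,j)/\sum\Phi_t$):

```python
import numpy as np

# Toy discrete-emission HMM used to illustrate the E-step quantities.
pi = np.array([0.6, 0.4])
A = np.array([[0.7, 0.3],
              [0.4, 0.6]])
B = np.array([[0.9, 0.1],
              [0.2, 0.8]])          # B[k, s] = p(x_t = s | z_t = k)
x = [0, 1, 1, 0]
T, K = len(x), 2

# Forward and backward passes (unscaled, for clarity).
alpha = np.zeros((T, K))
alpha[0] = pi * B[:, x[0]]
for t in range(1, T):
    alpha[t] = B[:, x[t]] * (alpha[t - 1] @ A)
beta = np.ones((T, K))
for t in range(T - 2, -1, -1):
    beta[t] = A @ (B[:, x[t + 1]] * beta[t + 1])

# Pairwise posteriors: Phi_t(i,j) = alpha_t(i) A_ij p(x_{t+1}|z_{t+1}=j) beta_{t+1}(j)
xi = np.zeros((T - 1, K, K))
for t in range(T - 1):
    phi = alpha[t][:, None] * A * (B[:, x[t + 1]] * beta[t + 1])[None, :]
    xi[t] = phi / phi.sum()

E_Nij = xi.sum(axis=0)              # expected transition counts E(N_ij)
gamma = alpha * beta                # singleton posteriors w_t(j)
gamma /= gamma.sum(axis=1, keepdims=True)
```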

Now in the M-step,

  • $\hat a_{ij} = \frac{E(N_{ij})}{E(N_i)}$, normalizing over the transitions out of state $i$
  • for a MAP-type estimate, may add a smoothing constant here
  • $\hat\theta_k$ = emission distribution parameters for each state
  • computed in a similar manner to the M-step of EM for finite mixture models, weighting each $x_t$ by $w_t(k)$
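A sketch of both M-step updates with made-up expected counts and membership weights (the smoothing constant plays the role of a Dirichlet pseudocount for the MAP-type estimate):

```python
import numpy as np

# Transition update from expected counts (made-up numbers).
E_Nij = np.array([[8.2, 1.8],
                  [2.1, 5.9]])      # E(N_ij): expected i -> j transitions
smooth = 1.0                        # MAP-style smoothing constant
A_hat = (E_Nij + smooth) / (E_Nij + smooth).sum(axis=1, keepdims=True)

# Weighted Gaussian mean update, as in mixture-model EM:
#   mu_k = sum_t w_t(k) x_t / sum_t w_t(k)
w = np.array([[0.9, 0.1],
              [0.2, 0.8],
              [0.7, 0.3]])          # membership weights w_t(k) (made-up)
x = np.array([0.1, 4.8, 0.3])
mu_hat = (w * x[:, None]).sum(axis=0) / w.sum(axis=0)
```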

Supervised Case

$z_t$ is known for some training data, but not at prediction time.

  • For $\hat \theta_k$, group the $x_t$s by their $z_t$ values and do standard ML or MAP estimation.
  • For $\hat A$, $\hat a_{ij} = \frac{n_{ij}}{n_i}$ (the ML estimate), where $n_{ij}$ counts the observed transitions from $i$ to $j$.

At prediction time, compute $p(z_t=j | x_{1\dots T})$ using the forward-backward algorithm.
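A sketch of the supervised estimates on a made-up labeled sequence (Gaussian emissions, so $\hat\theta_k$ reduces to per-state sample statistics):

```python
import numpy as np

# Made-up labeled training sequence.
z = [0, 0, 1, 1, 1, 0, 1]                    # known states
x = np.array([0.2, -0.1, 5.1, 4.9, 5.3, 0.0, 5.0])
K = 2

# ML transition estimate: a_hat_ij = n_ij / n_i, counting observed transitions.
n_ij = np.zeros((K, K))
for a, b in zip(z[:-1], z[1:]):
    n_ij[a, b] += 1
A_hat = n_ij / n_ij.sum(axis=1, keepdims=True)

# ML emission estimate: group x_t by z_t and take per-state means.
mu_hat = np.array([x[np.array(z) == k].mean() for k in range(K)])
```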

Unsupervised Case

$z_t$ is unknown during training. $x_t$s are known.
Learning: Use the Expectation Maximization algorithm.

  1. In the E-step, given current/fixed parameters, compute $p(z_t=j|x_{1\dots T})$ for $j=1\dots K$ and $t=1\dots T$.
       • Get this from the F-B algorithm
       • Directly analogous to the membership weights of a mixture model
       • $O(K^2 T)$ per iteration
  2. In the M-step, weight the parameter estimates by the membership weights from the E-step.
       • $O(K^2 T)$ per iteration


Autoregressive HMMs also contain dependencies directly between the observed variables. They are a bit more complicated, but have been found useful in time-series analysis.

In a Markov model, run lengths are geometric: each state has a self-transition probability, so the distribution over state durations is geometric. A semi-Markov extension can create non-geometric run lengths: when a state transition occurs, a run length is drawn from a duration distribution. The disadvantage is that the model is no longer Markov, and the forward-backward algorithm becomes $O(T K^2 L_{max})$, where $L_{max}$ can be as large as $T$.

Conditional Random Fields: Model $p(z_t | x_{1\dots T}, y)$ rather than $p(z_t , x_{1\dots T},y)$, where $y$ represents some set of non-local non-time-dependent features. Widely used in language modeling where e.g. $y$ is sentence length. These are like Markov random fields in two dimensions. You usually need labeled data to train.

HMM structures can also use continuous state variables, e.g. the vector $z_t$ might include velocity values in addition to position. When the transition and emission models are linear-Gaussian, this is known as a linear dynamical system, and the equivalent of the forward-backward algorithm is the Kalman filter.

In computer vision, there is typically a 2-dimensional array of hidden states representing a scene. These are referred to as Markov Random Fields.

Numerical Issues

It is common to have underflow or overflow problems when computing the chain of likelihoods. Log-space is typically used, and a constant scale may also be carefully maintained in order to avoid these numerical problems.
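For example, the forward recursion can be run entirely in log-space, replacing the matrix-vector product with a log-sum-exp over the previous state (a sketch using only NumPy; `forward_log` and its interface are this note's invention):

```python
import numpy as np

def forward_log(x, log_pi, log_A, log_emission):
    """Forward recursion in log-space to avoid underflow on long sequences.

    log_emission(x_t) returns a length-K vector of log p(x_t | z_t = k).
    """
    T, K = len(x), len(log_pi)
    log_alpha = np.zeros((T, K))
    log_alpha[0] = log_pi + log_emission(x[0])
    for t in range(1, T):
        # log-sum-exp over the previous state replaces the matrix-vector product
        log_alpha[t] = log_emission(x[t]) + np.logaddexp.reduce(
            log_alpha[t - 1][:, None] + log_A, axis=0)
    return log_alpha
```

Exponentiating only at the very end (or keeping everything in logs) keeps each intermediate quantity in a representable range even when $p(x_{1\dots T})$ itself would underflow.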