Hidden Markov Models (HMMs) are a technique for modeling sequence data.

HMMs are designed for sequential data, but they can also be applied to continuous signals by discretizing the time axis into windows; for example, the frequency content within a window can serve as an observation.

The **order** is the number of previous states each node depends on; a standard HMM is first-order.

Each observation $\underline{x}_t$ is a $d$-dimensional vector from $D=\{\underline{x}_1,\dots,\underline{x}_T\}$, and each observation depends on a state defined within a discrete state space of size $K$. The state is hidden. We assume that $\underline{x}_t$ is conditionally independent of all other variables given its state. We also assume the states form a Markov chain.

$z_t$ is the state at time $t$, with $z_t \in \{1,\dots,K\}$.

States are assumed to have some type of `transition model` between them. Each state also requires an `observation model`.

The full joint is then $p(x_{1\dots T},z_{1\dots T}) = \prod_{t=1}^T p(x_t|z_t)\, p(z_t|z_{t-1})$, where the $t=1$ factor $p(z_1|z_0)$ is read as the initial state distribution $p(z_1)$.

The model parameters are:

- $K$ `emission densities` (or `emission distributions`) $p(x_t|z_t=k,\theta_k)$, e.g. Gaussian: $\theta_k = (\mu_k, \Sigma_k)$
- $A$, a K-by-K Markov transition matrix
  - $A_{m,k}$ = entry for the transition from $m$ to $k$ = $P(z_t=k|z_{t-1}=m)$
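As a concrete sketch of this parameterization (Python/NumPy is used throughout these notes; all parameter values here are arbitrary), the following samples a sequence from the joint $\prod_t p(x_t|z_t)\, p(z_t|z_{t-1})$ with Gaussian emissions:

```python
import numpy as np

rng = np.random.default_rng(0)
K, d, T = 3, 2, 100                      # states, observation dim, sequence length

pi = np.full(K, 1.0 / K)                 # initial state distribution p(z_1)
A = rng.dirichlet(np.ones(K), size=K)    # K-by-K transition matrix, rows sum to 1
mu = rng.normal(size=(K, d))             # emission means mu_k
Sigma = np.stack([np.eye(d)] * K)        # emission covariances Sigma_k

# z_1 ~ pi, z_t | z_{t-1} ~ A[z_{t-1}], x_t | z_t = k ~ N(mu_k, Sigma_k)
z = np.empty(T, dtype=int)
x = np.empty((T, d))
z[0] = rng.choice(K, p=pi)
x[0] = rng.multivariate_normal(mu[z[0]], Sigma[z[0]])
for t in range(1, T):
    z[t] = rng.choice(K, p=A[z[t - 1]])
    x[t] = rng.multivariate_normal(mu[z[t]], Sigma[z[t]])
```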

How to select $K$? It depends on the problem; there may not be a “true” $K$ at all.

HMMs have been the industry standard in speech recognition for decades and remain close to the state of the art there. They are also used in text processing and bioinformatics.

Given $\theta$ = the $K$ emission density parameters and the transition matrix $A$:

<latex>\begin{align*}
L(\theta) = p(D|\theta) &= p(x_{1\dots T}|\theta) \\
&= \sum_{z_{1\dots T}} p(x_{1\dots T}, z_{1\dots T}|\theta) & \text{by LTP}
\end{align*}</latex>

This sum is intractable to compute directly: it has $O(K^T)$ terms. Instead, use the graphical model factorization.

<latex>\begin{align*}
\alpha_T(j) &= p(z_T=j, x_{1\dots T}) & \text{for $j=1\dots K$} \\
&= \sum_{i} p(z_T=j, z_{T-1}=i, x_{1\dots T}) & \text{by LTP} \\
&= \sum_i p(x_T|z_T = j)\, p(z_T = j|z_{T-1} = i)\, p(z_{T-1}=i, x_{1\dots T-1}) & \text{by factorization} \\
&= \sum_i p(x_T|z_T = j)\, p(z_T = j|z_{T-1} = i)\, \alpha_{T-1}(i)
\end{align*}</latex>

Given $\alpha_{T-1}(i)$ for $i=1\dots K$, we can compute $\alpha_T(j)$ for all $j$ in time $O\left(K^2 + K f(d)\right)$:

- $K^2$ for the transition part, which sums over all $i,j$ pairs
- $f(d)$ is the cost of one emission likelihood evaluation, $O(d^2)$ for Gaussian models

This is a nice recursive form that allows computing the full likelihood $p(x_{1\dots T}|\theta) = \sum_j \alpha_T(j)$ in time $O\left(TK^2 + TK f(d)\right)$. This is known as the **Forward Algorithm**, the first step in the Forward-Backward Algorithm.
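A minimal sketch of this recursion, assuming the emission likelihoods $p(x_t|z_t=k)$ have been precomputed into a T-by-K matrix `lik` (that is where the $f(d)$ work goes). It is unnormalized, so it will underflow on long sequences; see the scaling note at the end of these notes:

```python
import numpy as np

def forward(pi, A, lik):
    """alpha[t, j] = p(z_t = j, x_{1..t}); pi is p(z_1), A is K-by-K."""
    T, K = lik.shape
    alpha = np.empty((T, K))
    alpha[0] = pi * lik[0]
    for t in range(1, T):
        # O(K^2) per step: alpha[t, j] = lik[t, j] * sum_i alpha[t-1, i] * A[i, j]
        alpha[t] = lik[t] * (alpha[t - 1] @ A)
    return alpha   # full likelihood: p(x_{1..T}) = alpha[-1].sum()
```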

<latex>\begin{align*}
p(z_t=j|x_{1\dots T}) &\propto p(z_t=j, x_{1\dots t}, x_{t+1\dots T}) \\
&= p(x_{t+1\dots T}|z_t=j, x_{1\dots t})\, p(z_t=j, x_{1\dots t}) & \text{by factorization} \\
&= p(x_{t+1\dots T}|z_t=j)\, p(z_t=j, x_{1\dots t}) & \text{by C.I.} \\
&= p(x_{t+1\dots T}|z_t=j)\, \alpha_t(j)
\end{align*}</latex>

Define $\beta_t(j) = p(x_{t+1\dots T}|z_t = j)$; then $p(z_t = j | x_{1 \dots T}) \propto \alpha_t(j)\, \beta_t(j)$. The $\beta_t(j)$s can also be computed recursively in time $O\left(TK^2 + TK f(d)\right)$.

To compute $p(z_t=j|x_{1\dots T})$ for all $t,j$,

- compute $\alpha_t(j)$s in the forward step
- compute $\beta_t(j)$s in the backward step
- $p(z_t=j|x_{1\dots T}) = \frac{\alpha_t(j) \beta_t(j)}{\sum_k \alpha_t(k) \beta_t(k)}$

(This is similar to the E-step of EM)

$\alpha_T(j) = p(z_T=j, x_{1 \dots T})$. We can compute $\alpha_T$ from $\alpha_{T-1}$ in time $O\left(K^2 + Kf(d)\right)$, and $p(z_T = j | x_{1 \dots T}) = \frac{\alpha_T(j)}{\sum_{k=1}^K \alpha_T(k)}$.

$w_t(j) = p(z_t = j | x_{1 \dots T}) \propto \alpha_t(j)\, \beta_t(j)$

In total, computing $\alpha_t, \beta_t$ for $t = 1\dots T$ is $O(TK^2 + TKf(d))$. This is the forward-backward algorithm.

The typical method for doing inference with an HMM is the **Forward-Backward Algorithm**.

$P(z_{t} = j | x_{1\dots T}) \propto P(z_{t}=j, x_{1\dots T}) = P(z_{t}=j, x_{1\dots t}, x_{t+1 \dots T}) = P(x_{t+1 \dots T}| z_{t}=j, x_{1 \dots t})\, P(z_{t}=j, x_{1 \dots t})$. We want to compute these last two terms.

Consider the second term:

<latex>\begin{align*}
P(z_{t}=j, x_{1 \dots t}) &= \sum_{z_{t-1}} p(z_{t}=j, z_{t-1}, x_{1 \dots t}) \\
&= \sum_{z_{t-1}} p(x_{t} | z_{t}=j)\, p(z_{t}=j | z_{t-1})\, p(z_{t-1}, x_{1 \dots t-1})
\end{align*}</latex>

Then you can recurse, using the result of the previous time-step, the transition matrix, and the current `evidence` to compute the term for the current step.

This is $O(K^2)$ work per time-step to cover all $K$ values. This is known as the **forward step** in the **forward-backward algorithm**.

Consider the first term:

<latex>\begin{align*}
P(x_{t+1 \dots T}| z_{t}=j) &= \sum_{z_{t+1}} p(x_{t+1\dots T}, z_{t+1} | z_{t}=j) \\
&= \sum_{z_{t+1}} p(x_{t+1 \dots T} | z_{t+1}, z_{t}=j)\, p(z_{t+1} | z_{t}=j) \\
&= \sum_{z_{t+1}} p(x_{t+1 \dots T} | z_{t+1})\, p(z_{t+1} | z_{t}=j) & \text{by C.I.} \\
&= \sum_{z_{t+1}} p(x_{t+2 \dots T} | z_{t+1})\, p(x_{t+1} | z_{t+1})\, p(z_{t+1} | z_{t}=j)
\end{align*}</latex>
Now this is another recursive solution which can be computed in $O(K)$ for fixed $j$, $O(K^2)$ for $j=1\dots K$.
This is known as the **backward step** in the **forward-backward algorithm**.
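A matching sketch of the backward recursion, under the same conventions as the forward sketch above (precomputed T-by-K emission likelihood matrix `lik`, unnormalized):

```python
import numpy as np

def backward(A, lik):
    """beta[t, j] = p(x_{t+1..T} | z_t = j); A is K-by-K, lik is T-by-K."""
    T, K = lik.shape
    beta = np.empty((T, K))
    beta[-1] = 1.0                        # base case: empty product at t = T
    for t in range(T - 2, -1, -1):
        # beta[t, j] = sum_k A[j, k] * p(x_{t+1} | z_{t+1}=k) * beta[t+1, k]
        beta[t] = A @ (lik[t + 1] * beta[t + 1])
    return beta
```

The smoothed marginals are then `w = alpha * beta` normalized along each row, matching $p(z_t=j|x_{1\dots T}) = \frac{\alpha_t(j) \beta_t(j)}{\sum_k \alpha_t(k) \beta_t(k)}$.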

Gibbs sampling may be used for learning, but EM is generally faster. Sampling is sometimes used as an approximation to the forward-backward algorithm.

In the E-step, we want to compute $p(z_t=j|x_{1\dots T}) = w_t(j)$, a T-by-K membership weights matrix. Do this using the F-B algorithm with the current parameters $\theta$. We also need:

- $E(N_j)$, the expected number of times in state $j$: $E(N_j) = \sum_{t = 1}^T w_t(j)$
- $E(N_{ij})$, the expected number of transitions from $i$ to $j$:

<latex>\begin{align*}
E(N_{ij}) &= \sum_{t=1}^{T-1} p(z_t=i, z_{t+1}=j|x_{1\dots T}) \\
p(z_t=i, z_{t+1}=j, x_{1\dots T}) &= p(x_{t+2\dots T}|z_{t+1}=j)\, p(x_{t+1}|z_{t+1}=j)\, p(z_{t+1}=j|z_t=i)\, p(z_t=i, x_{1\dots t}) \\
&= \beta_{t+1}(j)\, p(x_{t+1}|z_{t+1}=j)\, A_{ij}\, \alpha_t(i) \\
&\equiv \Phi_t(i,j) \\
p(z_t=i, z_{t+1}=j|x_{1\dots T}) &= \frac{\Phi_t(i,j)}{\sum_{k_1,k_2} \Phi_t(k_1,k_2)}
\end{align*}</latex>
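A sketch of computing these pairwise posteriors with NumPy, assuming `alpha`, `beta`, and `lik` from the forward and backward sketches above:

```python
def expected_transitions(alpha, beta, A, lik):
    """xi[t, i, j] = p(z_t = i, z_{t+1} = j | x_{1..T}) for t = 1..T-1."""
    # Phi_t(i, j) = alpha_t(i) * A_ij * p(x_{t+1} | z_{t+1} = j) * beta_{t+1}(j)
    xi = alpha[:-1, :, None] * A[None, :, :] * (lik[1:] * beta[1:])[:, None, :]
    xi /= xi.sum(axis=(1, 2), keepdims=True)  # normalize each time slice
    return xi                                  # E(N_ij) = xi.sum(axis=0)
```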

Now in the M-step,

- $\hat a_{ij} = \frac{E(N_{ij})}{E(N_i)}$
  - for a MAP-type estimate, one may add some kind of smoothing constant here
- $\theta_k$ = emission distribution parameters for each state
  - computed in a similar manner to the M-step of EM for finite mixture models, weighting each $x_t$ by $w_t(k)$; see the sketch after this list
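A sketch of these M-step updates for Gaussian emissions, assuming the T-by-K weight matrix `w` and the `xi` array from the E-step sketch above:

```python
import numpy as np

def m_step(x, w, xi):
    """ML updates: transition matrix from expected counts; Gaussian emission
    parameters as weighted means/covariances, as in EM for mixture models."""
    E_Nij = xi.sum(axis=0)                           # expected i -> j transitions
    A_hat = E_Nij / E_Nij.sum(axis=1, keepdims=True)
    Nk = w.sum(axis=0)                               # expected time in each state
    mu_hat = (w.T @ x) / Nk[:, None]
    diff = x[None, :, :] - mu_hat[:, None, :]        # (K, T, d) residuals
    Sigma_hat = np.einsum('kt,ktd,kte->kde', w.T, diff, diff) / Nk[:, None, None]
    return A_hat, mu_hat, Sigma_hat
```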

$z_t$ is known for some training data, but not at prediction time. Learning:

- For $\hat \theta_k$, group the $x_t$s by their $z_t$ values and do standard ML or MAP estimation.
- For $\hat A$: $\hat a_{ij} = \frac{n_{ij}}{n_i}$ (the ML estimate), where $n_{ij}$ counts observed $i \to j$ transitions and $n_i$ counts visits to state $i$; see the sketch below.
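A small sketch of this fully observed case, counting transitions in a labeled state sequence `z` (the `smoothing` argument is the MAP-style constant mentioned above for the EM case):

```python
import numpy as np

def fit_transitions_supervised(z, K, smoothing=0.0):
    """ML estimate a_hat[i, j] = n_ij / n_i from an observed state sequence."""
    counts = np.full((K, K), smoothing)   # smoothing > 0 avoids zero counts
    for i, j in zip(z[:-1], z[1:]):
        counts[i, j] += 1
    return counts / counts.sum(axis=1, keepdims=True)
```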

Prediction: compute $p(z_{t}=j | x_{1\dots T})$ using the forward-backward algorithm.

$z_t$ is unknown during training; the $x_t$s are known. Learning: use the Expectation-Maximization (EM) algorithm.

- In the E-step, given current/fixed parameters, compute $p(z_t=j|x_{1\dots T})$ for $j=1\dots K$ and $t=1\dots T$
  - Get this from the F-B algorithm
  - Directly analogous to membership weights of a mixture model
  - $O(K^2 T)$ per iteration
- In the M-step, weight the parameter estimates by the membership weights from the E-step
  - $O(K^2 T)$ per iteration
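Putting the pieces together, one possible EM (Baum-Welch) loop built from the `forward`, `backward`, `expected_transitions`, and `m_step` sketches above; Gaussian emissions, with convergence checks and the numerical scaling discussed at the end omitted for brevity:

```python
import numpy as np
from scipy.stats import multivariate_normal

def em_hmm(x, pi, A, mu, Sigma, n_iter=50):
    K = len(pi)
    for _ in range(n_iter):
        # E-step: O(K^2 T) forward-backward pass under the current parameters
        lik = np.column_stack(
            [multivariate_normal.pdf(x, mu[k], Sigma[k]) for k in range(K)])
        alpha, beta = forward(pi, A, lik), backward(A, lik)
        w = alpha * beta
        w /= w.sum(axis=1, keepdims=True)          # membership weights w_t(j)
        xi = expected_transitions(alpha, beta, A, lik)
        # M-step: re-estimate parameters from the expected counts
        A, mu, Sigma = m_step(x, w, xi)
        pi = w[0]                                  # initial state distribution
    return pi, A, mu, Sigma
```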

Autoregressive HMMs also contain dependencies directly between observed variables (e.g. $x_t$ depends on $x_{t-1}$ as well as on $z_t$). They are a bit more complicated, but have been found useful in time-series analysis.

In a Markov model, run-lengths are geometric: because each state has a self-transition probability, the distribution over state durations is geometric. A semi-Markov extension can create non-geometric run-lengths: when a state transition occurs, a run-length is drawn from a duration distribution in the model. The disadvantage is that the model is no longer Markov, and the forward-backward algorithm becomes $O(T K^2 L_{max})$, where $L_{max}$ can be as large as $T$.
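To see why the run-lengths are geometric: staying in state $k$ for exactly $\ell$ steps requires $\ell - 1$ self-transitions followed by one exit, so

<latex>\begin{align*}
p(\text{duration} = \ell) = a_{kk}^{\ell-1}\,(1 - a_{kk})
\end{align*}</latex>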

Conditional Random Fields: model $p(z_t | x_{1\dots T}, y)$ rather than $p(z_t, x_{1\dots T}, y)$, where $y$ represents some set of non-local, non-time-dependent features. They are widely used in language modeling, where e.g. $y$ is the sentence length. These are like Markov random fields in two dimensions. You usually need labeled data to train them.

HMM structures can also have continuous state variables, e.g. the vector $z_t$ now includes velocity values in addition to position. When the transition and observation models are linear with Gaussian noise, this is known as a **linear dynamical system**, and the equivalent of the forward-backward algorithm is the Kalman filter.
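In one standard notation (assumed here, not from the original notes), a linear dynamical system has linear-Gaussian transition and observation models:

<latex>\begin{align*}
z_t &= F z_{t-1} + \epsilon_t, & \epsilon_t &\sim \mathcal{N}(0, Q) \\
x_t &= H z_t + \delta_t, & \delta_t &\sim \mathcal{N}(0, R)
\end{align*}</latex>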

In computer vision, there is typically a 2-dimensional array of hidden states representing a scene. These are referred to as Markov Random Fields.

It is common to have underflow or overflow problems when computing the chain of likelihoods. Log-space is typically used, and a running scale constant may also be carefully maintained to avoid these numerical problems.
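For example, here is the forward sketch from above with per-step scaling; each $\hat\alpha_t$ is normalized to sum to one and the log-likelihood is accumulated from the scale constants, so nothing underflows:

```python
import numpy as np

def forward_scaled(pi, A, lik):
    """Scaled forward pass: returns normalized alphas and log p(x_{1..T})."""
    T, K = lik.shape
    alpha_hat = np.empty((T, K))
    loglik = 0.0
    for t in range(T):
        a = pi * lik[0] if t == 0 else lik[t] * (alpha_hat[t - 1] @ A)
        c = a.sum()                   # scale constant c_t = p(x_t | x_{1..t-1})
        alpha_hat[t] = a / c          # alpha_hat[t, j] = p(z_t = j | x_{1..t})
        loglik += np.log(c)
    return alpha_hat, loglik
```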