Hidden Markov Models are a technique for dealing with sequence data.

HMMs are designed for sequential data, but can also be applied to continuous signals by discretizing time into windows of features. For example, the dominant frequency within a window can serve as a single observation.
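As a sketch of that idea, the following hypothetical `windowed_peak_freq` helper (its name, the fixed sampling rate, and the peak-frequency choice are all assumptions of this sketch) turns a continuous 1-D signal into one discrete-time observation per window:

```python
import numpy as np

# Hypothetical helper: slice a continuous signal into fixed-size windows and
# use the dominant frequency of each window as one discrete-time observation.
def windowed_peak_freq(signal, window, rate):
    n_windows = len(signal) // window
    obs = []
    for i in range(n_windows):
        chunk = signal[i * window:(i + 1) * window]
        spectrum = np.abs(np.fft.rfft(chunk))
        freqs = np.fft.rfftfreq(window, d=1.0 / rate)
        obs.append(freqs[np.argmax(spectrum)])  # peak frequency in this window
    return np.array(obs)
```

The resulting sequence of window-level features is then what the HMM models.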

The **order** of the model is the number of previous states each node depends on.

Each observation $x_t$ is a d-dimensional vector from $\mathbb{R}^d$, but each observation is dependent on a state $z_t$ defined within a discrete state space of size $K$. The state is hidden. We assume that $x_t$ is conditionally independent of all other variables given its state. We also assume the states form a Markov chain.

$z_t$ is the state at time $t$.

States are assumed to have some type of `transition model` $p(z_t \mid z_{t-1})$ between them. Each state also requires an `observation model` $p(x_t \mid z_t)$.

The full joint is then

$p(x_{1:T}, z_{1:T}) = p(z_1)\, p(x_1 \mid z_1) \prod_{t=2}^{T} p(z_t \mid z_{t-1})\, p(x_t \mid z_t)$

The model parameters are:

- * $K$ `emission densities` or `emission distributions` $p(x_t \mid z_t = k)$, $k = 1, \dots, K$
- e.g. Gaussian: $p(x_t \mid z_t = k) = \mathcal{N}(x_t \mid \mu_k, \Sigma_k)$
- * $A$, a K-by-K Markov transition matrix
- $A_{mk}$ = entry for transition from $m$ to $k$ = $p(z_t = k \mid z_{t-1} = m)$
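Putting the pieces together, a minimal generative sketch, assuming a 2-state HMM with 1-D Gaussian emissions and purely illustrative parameter values:

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative 2-state HMM with 1-D Gaussian emissions (all values made up).
pi = np.array([0.6, 0.4])           # initial state distribution p(z_1)
A = np.array([[0.9, 0.1],           # A[m, k] = p(z_t = k | z_{t-1} = m)
              [0.2, 0.8]])
mu = np.array([0.0, 5.0])           # emission means, one per state
sigma = np.array([1.0, 1.0])        # emission standard deviations

def sample_hmm(T):
    """Draw (z_1..T, x_1..T) from the full joint p(x, z)."""
    z = np.empty(T, dtype=int)
    x = np.empty(T)
    z[0] = rng.choice(2, p=pi)
    x[0] = rng.normal(mu[z[0]], sigma[z[0]])
    for t in range(1, T):
        z[t] = rng.choice(2, p=A[z[t - 1]])       # Markov transition
        x[t] = rng.normal(mu[z[t]], sigma[z[t]])  # emission given the state
    return z, x

z, x = sample_hmm(100)
```

Sampling follows the factorization of the joint exactly: one transition draw and one emission draw per time-step.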

How to select $K$? Depends on the problem. There may not be a “true” $K$ at all.

HMMs have been the industry standard in speech recognition for decades, and are still close to the state of the art there. They are also used in text processing and bioinformatics.

Given $\phi$ = emission density parameters and a transition matrix $A$, the likelihood of the observations is

$p(x_{1:T} \mid \phi, A) = \sum_{z_{1:T}} p(x_{1:T}, z_{1:T} \mid \phi, A)$

This sum is intractable to compute directly: it has $K^T$ terms. Use a graphical model factorization.

Define $\alpha_t(k) = p(z_t = k, x_{1:t})$. Given $\alpha_{t-1}(k)$ for $k = 1, \dots, K$, can compute

$\alpha_t(k) = p(x_t \mid z_t = k) \sum_{m=1}^{K} \alpha_{t-1}(m)\, A_{mk}$

for all $k$ in time $O(K^2)$.

- * $\sum_m \alpha_{t-1}(m)\, A_{mk}$ is the part involving the transition matrix; it sums over all $K^2$ pairs of states
- * $p(x_t \mid z_t = k)$ is the likelihood part, $\mathcal{N}(x_t \mid \mu_k, \Sigma_k)$ for Gaussian models

This is a nice recursive form that allows computing the likelihood of the full model, $p(x_{1:T}) = \sum_k \alpha_T(k)$, in time $O(TK^2)$. This is known as the **Forward Algorithm**, the first step in the Forward-Backward Algorithm.
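A minimal sketch of the forward algorithm, assuming the emission likelihoods $p(x_t \mid z_t = k)$ have been precomputed into a T-by-K matrix `B`:

```python
import numpy as np

def forward(pi, A, B):
    """Forward recursion: alpha[t, k] = p(z_t = k, x_{1:t}).
    B[t, k] = p(x_t | z_t = k) are precomputed emission likelihoods."""
    T, K = B.shape
    alpha = np.zeros((T, K))
    alpha[0] = pi * B[0]
    for t in range(1, T):
        alpha[t] = B[t] * (alpha[t - 1] @ A)   # O(K^2) per time-step
    return alpha                               # p(x_{1:T}) = alpha[-1].sum()
```

The matrix-vector product `alpha[t - 1] @ A` is exactly the $\sum_m \alpha_{t-1}(m) A_{mk}$ term above.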

Define $\beta_t(k) = p(x_{t+1:T} \mid z_t = k)$, with $\beta_T(k) = 1$. This can also be computed recursively in time $O(TK^2)$, via

$\beta_t(k) = \sum_{m=1}^{K} A_{km}\, p(x_{t+1} \mid z_{t+1} = m)\, \beta_{t+1}(m)$
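The backward recursion can be sketched the same way, again assuming precomputed emission likelihoods `B` (and checked against the forward result, since $\sum_k p(z_1 = k)\, p(x_1 \mid z_1 = k)\, \beta_1(k) = p(x_{1:T})$):

```python
import numpy as np

def backward(A, B):
    """Backward recursion: beta[t, k] = p(x_{t+1:T} | z_t = k),
    with B[t, k] = p(x_t | z_t = k) precomputed."""
    T, K = B.shape
    beta = np.ones((T, K))                      # beta_T(k) = 1 by convention
    for t in range(T - 2, -1, -1):
        beta[t] = A @ (B[t + 1] * beta[t + 1])  # O(K^2) per time-step
    return beta
```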

To compute $p(z_t = k \mid x_{1:T})$ for all $t$,

- compute the $\alpha_t(k)$s in the forward step
- compute the $\beta_t(k)$s in the backward step

(This is similar to the E-step of EM)

Can compute $p(z_t = k \mid x_{1:T}) \propto \alpha_t(k)\, \beta_t(k)$ from $\alpha_t$ and $\beta_t$ in time $O(K)$ per time-step.

In total, computing $p(z_t \mid x_{1:T})$ for $t = 1, \dots, T$ is in $O(TK^2)$. This is the forward-backward algorithm.
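Combining the two steps, a sketch that returns the posterior marginals; the `forward_backward` name and the precomputed emission matrix `B` are assumptions of this sketch:

```python
import numpy as np

def forward_backward(pi, A, B):
    """Posterior marginals gamma[t, k] = p(z_t = k | x_{1:T})."""
    T, K = B.shape
    alpha = np.zeros((T, K))
    beta = np.ones((T, K))
    alpha[0] = pi * B[0]
    for t in range(1, T):                       # forward step
        alpha[t] = B[t] * (alpha[t - 1] @ A)
    for t in range(T - 2, -1, -1):              # backward step
        beta[t] = A @ (B[t + 1] * beta[t + 1])
    gamma = alpha * beta                        # proportional to the posterior
    return gamma / gamma.sum(axis=1, keepdims=True)
```

Each row of the result is a normalized distribution over the $K$ states at that time-step.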

The typical method for doing inference with an HMM is the **Forward-Backward Algorithm**.

We can write $p(z_t = k \mid x_{1:T}) \propto p(x_{t+1:T} \mid z_t = k)\, p(z_t = k, x_{1:t})$. Want to compute these two terms.

Consider the second term, $\alpha_t(k) = p(z_t = k, x_{1:t})$:

$\alpha_t(k) = p(x_t \mid z_t = k) \sum_{m=1}^{K} \alpha_{t-1}(m)\, A_{mk}$

Then you can recurse by using the result of the previous time-step, the transition matrix, and the current `evidence` $p(x_t \mid z_t = k)$ to compute the term for the current step.

This is $O(K^2)$ work per time-step to update all $K$ values, $O(TK^2)$ in total. This is known as the **forward step** in the **forward-backward algorithm**.

Consider the first term, $\beta_t(k) = p(x_{t+1:T} \mid z_t = k)$:

$\beta_t(k) = \sum_{m=1}^{K} A_{km}\, p(x_{t+1} \mid z_{t+1} = m)\, \beta_{t+1}(m)$

Now this is another recursive solution, which can be computed in $O(K)$ per time-step for fixed $k$, recursing backward for $t = T-1, \dots, 1$.

This is known as the **backward step** in the **forward-backward algorithm**.

May use Gibbs Sampling, but EM is generally faster. Gibbs sampling is sometimes used as an approximation to the forward-backward algorithm.

In the E-step, we want to compute $\gamma_{tk} = p(z_t = k \mid x_{1:T}, \theta)$, a T-by-K membership weights matrix.

Do this using the F-B algorithm with current parameters $\theta$.

Also need:

- * $N_k = \sum_{t=1}^{T} \gamma_{tk}$, the expected number of times in state $k$.
- * $N_{jk} = \sum_{t=1}^{T-1} \xi_t(j, k)$, the expected number of transitions from $j$ to $k$, where $\xi_t(j, k) = p(z_t = j, z_{t+1} = k \mid x_{1:T}, \theta)$.

Now in the M-step,

- * $A_{jk} = N_{jk} / \sum_{k'} N_{jk'}$
- for a MAP-type estimate, may add some kind of smoothing constant to the expected counts here
- * $p(x_t \mid z_t = k)$ = emission distributions for each state
- computed in a similar manner to the M-Step of EM for finite mixture models, with the $\gamma_{tk}$ as membership weights
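A sketch of these M-step updates for 1-D Gaussian emissions (means only, for brevity), assuming the E-step quantities $\gamma$ and $\xi$ have already been computed; the `m_step` name is an assumption:

```python
import numpy as np

def m_step(gamma, xi, x):
    """M-step updates given E-step quantities:
    gamma[t, k] = p(z_t = k | x, theta),
    xi[t, j, k] = p(z_t = j, z_{t+1} = k | x, theta)."""
    N_jk = xi.sum(axis=0)                        # expected transition counts
    A = N_jk / N_jk.sum(axis=1, keepdims=True)   # row-normalize, A_jk = N_jk / sum_k' N_jk'
    N_k = gamma.sum(axis=0)                      # expected state occupancies
    mu = (gamma.T @ x) / N_k                     # weighted means, as in a mixture model
    pi = gamma[0]                                # initial state distribution
    return pi, A, mu
```

Emission variances (or covariances) follow the same weighted pattern as in the finite mixture model M-step.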

$z_{1:T}$ is known for some training data, but not at prediction time.

Learning:

- * For the emission models, group the $x_t$s by their $z_t$ values and do standard ML or MAP estimation.
- * For $A$, $A_{mk} = n_{mk} / \sum_{k'} n_{mk'}$, where $n_{mk}$ is the number of observed transitions from $m$ to $k$ (for ML estimate)
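The fully observed transition estimate is just normalized counting; a minimal sketch (the `fit_observed_A` name is an assumption, and `smoothing` plays the role of a MAP-style constant added to the counts):

```python
import numpy as np

def fit_observed_A(z, K, smoothing=0.0):
    """ML (or, with smoothing > 0, MAP-style) estimate of the transition
    matrix from a fully observed state sequence z."""
    counts = np.full((K, K), smoothing, dtype=float)
    for m, k in zip(z[:-1], z[1:]):
        counts[m, k] += 1                  # n_mk: observed transitions m -> k
    return counts / counts.sum(axis=1, keepdims=True)
```

With `smoothing=0.0`, a state that never appears in the training sequence produces an undefined (zero-count) row, which is one practical reason to smooth.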

Prediction:

compute $p(z_t \mid x_{1:T})$ using the forward-backward algorithm

$z_{1:T}$ is unknown during training. The $x_t$s are known.

Learning: Use the Expectation Maximization algorithm.

- In the E-step, given current/fixed parameters, compute $p(z_t = k \mid x_{1:T})$ for $t = 1, \dots, T$ and $k = 1, \dots, K$

#* Get this from the F-B algorithm

#* Directly analogous to membership weights of a mixture model

#* $O(TK^2)$ per iteration

- In the M-step, weight parameter estimates by membership weights of the E-step

#* $O(TK)$ per iteration, given the expected counts from the E-step

Autoregressive HMMs also contain dependencies directly between observed variables, e.g. $x_t$ depends on $x_{t-1}$ as well as on $z_t$. This is a bit more complicated, but has been found to be useful in time-series analysis.

In a Markov model, the run-lengths are geometric: each state has a self-transition probability, so the distribution over the duration of a state is geometric. A Semi-Markov extension can create non-geometric run lengths: when a state transition occurs, a run-length is drawn from the model's duration distribution. This has the disadvantage that the model is no longer Markov, and the forward-backward algorithm becomes $O(TK^2 D)$, where the maximum duration $D$ can be as large as $T$.
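A sketch of the semi-Markov generative idea, with illustrative parameters and a Poisson duration model standing in for the duration distribution (both are assumptions of this sketch):

```python
import numpy as np

rng = np.random.default_rng(0)

# Semi-Markov sketch: on each state change, draw an explicit run length
# instead of relying on geometric self-transitions.
A = np.array([[0.0, 1.0],   # transition matrix with no self-transitions
              [1.0, 0.0]])

def sample_states(T):
    states, z = [], int(rng.integers(2))
    while len(states) < T:
        d = 1 + rng.poisson(4.0)        # non-geometric duration model
        states.extend([z] * d)          # stay in state z for d steps
        z = rng.choice(2, p=A[z])       # then transition to a new state
    return np.array(states[:T])
```

Removing self-transitions from $A$ and drawing durations explicitly is what breaks the geometric run-length property.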

Conditional Random Fields: model $p(z_{1:T} \mid x_{1:T}, \phi)$ rather than the full joint $p(x_{1:T}, z_{1:T})$, where $\phi$ represents some set of non-local, non-time-dependent features. Widely used in language modeling, where e.g. $\phi$ might include sentence length. These are like Markov random fields in two dimensions. You usually need labeled data to train them.

HMM structures with continuous state variables: e.g. the state vector $z_t$ now includes velocity values in addition to position. In the case where $p(z_t \mid z_{t-1})$ and $p(x_t \mid z_t)$ are linear-Gaussian, this is known as a **linear dynamical system**, and the equivalent of the forward-backward algorithm is known as the Kalman filter.
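A minimal 1-D Kalman filter sketch under the simplest linear-Gaussian assumptions (a random-walk state observed directly with noise; `q` and `r` are illustrative noise variances, not part of the notes above):

```python
import numpy as np

# Minimal 1-D Kalman filter: the continuous-state analogue of the forward step,
# assuming a random-walk state model z_t = z_{t-1} + noise(q) and
# observations x_t = z_t + noise(r).
def kalman_filter(xs, q=0.1, r=1.0):
    mu, var = 0.0, 1.0                 # Gaussian prior on the initial state
    means = []
    for x in xs:
        var = var + q                  # predict: add transition noise
        k = var / (var + r)            # Kalman gain
        mu = mu + k * (x - mu)         # update with the current observation
        var = (1 - k) * var
        means.append(mu)
    return np.array(means)
```

Each iteration plays the same role as one forward-step update: propagate the belief through the transition model, then condition on the new evidence.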

In computer vision, there is typically a 2-dimensional array of hidden states representing a scene. These are referred to as Markov Random Fields.

It is common to have underflow or overflow problems when computing the chain of likelihoods, since products of many probabilities quickly leave the range of floating point. Log-space is typically used, and a scaling constant may also be carefully maintained at each time-step in order to avoid these numerical problems.
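A sketch of the forward step carried out entirely in log-space, using a max-stabilized log-sum-exp in place of the sums (the `log_forward` name and precomputed log emission matrix `log_B` are assumptions of this sketch):

```python
import numpy as np

def log_forward(log_pi, log_A, log_B):
    """Forward algorithm in log space to avoid underflow on long chains."""
    T, K = log_B.shape
    log_alpha = np.zeros((T, K))
    log_alpha[0] = log_pi + log_B[0]
    for t in range(1, T):
        # s[m, k] = log alpha_{t-1}(m) + log A[m, k]
        s = log_alpha[t - 1][:, None] + log_A
        m = s.max(axis=0)              # stabilize the log-sum-exp by the max
        log_alpha[t] = log_B[t] + m + np.log(np.exp(s - m).sum(axis=0))
    return log_alpha                   # log p(x_{1:T}) = logsumexp(log_alpha[-1])
```

Exponentiating the result recovers the ordinary forward quantities when they are representable, but the log-space version stays finite for sequences far too long for the direct product.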