HMM Algorithms: Viterbi and Baum-Welch
In our article on HMMs, we discussed what Hidden Markov Models are. Here, we'll look at how to train them from observations and how to infer their hidden states. The key concepts are the Viterbi algorithm and the Baum-Welch algorithm. This article assumes some familiarity with statistics, so mathphobic readers may choose to skip this one. We'll work through the algorithms, a worked example, and some limitations.
1. The Hidden Markov Model (HMM)
At its core, a Hidden Markov Model is a statistical model used to describe systems that are believed to have a set of underlying “hidden” states, where the transitions between these states are probabilistic, and the observations we can see are also probabilistically dependent on the current hidden state.
The HMM is defined by five key components. The three learnable parameter sets are conventionally written together as $\lambda = (A, B, \pi)$, while the state set $S$ and observation alphabet $O$ are fixed in advance:
- $S$: A finite set of hidden states, $S = \{s_1, s_2, \dots, s_N\}$. These are the unobservable states we wish to infer.
- $O$: A finite set of possible observations or emissions, $O = \{o_1, o_2, \dots, o_M\}$. These are the visible outputs of the system.
- $A$: The state transition probability matrix, $A = \{a_{ij}\}$, where $a_{ij}$ is the probability of moving from state $s_i$ to state $s_j$: $a_{ij} = P(q_{t+1} = s_j | q_t = s_i)$.
- $B$: The emission probability matrix, $B = \{b_j(k)\}$, where $b_j(k)$ is the probability of observing symbol $o_k$ while in state $s_j$: $b_j(k) = P(v_t = o_k | q_t = s_j)$.
- $\pi$: The initial state probability distribution, $\pi = \{\pi_i\}$, where $\pi_i$ is the probability of the HMM being in state $s_i$ at time $t=1$: $\pi_i = P(q_1 = s_i)$.
The fundamental assumptions of an HMM are:
- The Markov Assumption: The probability of a state at time $t$ depends only on the state at time $t-1$. This is a first-order Markov chain. $P(q_t | q_{t-1}, \dots, q_1) = P(q_t | q_{t-1})$.
- The Output Independence Assumption: The observation at time $t$ depends only on the state at time $t$. $P(v_t | q_t, q_{t-1}, \dots, q_1, v_{t-1}, \dots, v_1) = P(v_t | q_t)$.
These two assumptions simplify the model and enable the development of computationally efficient algorithms to solve the three core problems of HMMs:
- Evaluation: Given the model $\lambda$ and a sequence of observations $V = (v_1, \dots, v_T)$, what is the probability of observing that sequence? This is solved by the Forward-Backward procedure.
- Decoding: Given the model $\lambda$ and an observation sequence $V$, what is the most likely sequence of hidden states $Q = (q_1, \dots, q_T)$ that produced $V$? This is solved by the Viterbi algorithm.
- Learning: Given a set of observation sequences, how do we find the model parameters $\lambda = (A, B, \pi)$ that best fit the data? This is solved by the Baum-Welch algorithm.
The Viterbi algorithm addresses the decoding problem and yields state predictions, while the Baum-Welch algorithm addresses the learning problem and relies on state smoothing. The two algorithms are fundamentally different in their goals and are often used sequentially: first Baum-Welch to train a model, then Viterbi to use that model for prediction.
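Before getting into the two algorithms, it is worth seeing how little code this parameterization takes. Below is a minimal NumPy sketch (the variable names are ours, not from any library), encoding the weather model used in the worked example of Section 2:

```python
import numpy as np

# lambda = (A, B, pi) for a 2-state, 2-symbol HMM.
# States: 0 = Sunny, 1 = Rainy; observations: 0 = Walk, 1 = Shop.
N, M = 2, 2
A = np.array([[0.7, 0.3],   # A[i, j] = P(q_{t+1} = s_j | q_t = s_i)
              [0.4, 0.6]])
B = np.array([[0.8, 0.2],   # B[j, k] = P(v_t = o_k | q_t = s_j)
              [0.1, 0.9]])
pi = np.array([0.6, 0.4])   # pi[i] = P(q_1 = s_i)

# Every row of A and B is a probability distribution, so rows must sum to 1.
assert np.allclose(A.sum(axis=1), 1.0) and np.allclose(B.sum(axis=1), 1.0)
```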
2. The Viterbi Algorithm: Decoding and State Prediction
The Viterbi algorithm is a dynamic programming algorithm that finds the most likely sequence of hidden states (the Viterbi path) that results in a given observation sequence. It solves the HMM decoding problem by efficiently avoiding the need to enumerate every possible state sequence, which would grow exponentially with the length of the observation sequence.
The Viterbi Process
The algorithm works by building a trellis (a grid representing time and states) and at each time step $t$, it computes the most probable path ending in each state.
Let’s define a variable $\delta_t(j)$ as the highest probability of a path of length $t$ ending in state $s_j$. $$\delta_t(j) = \max_{q_1,\dots,q_{t-1}} P(q_1, \dots, q_{t-1}, q_t = s_j, v_1, \dots, v_t | \lambda)$$
The recurrence relation is: $$\delta_t(j) = \left[ \max_{i} \delta_{t-1}(i) \cdot a_{ij} \right] \cdot b_j(v_t)$$
To reconstruct the path, we also need a backpointer variable, $\psi_t(j)$, which stores the state that maximized the probability at time $t-1$. $$\psi_t(j) = \arg\max_{i} [\delta_{t-1}(i) \cdot a_{ij}]$$
The process has three steps:
- Initialization: For each state $s_j$, calculate the probability of starting in that state and emitting the first observation $v_1$. $$\delta_1(j) = \pi_j \cdot b_j(v_1)$$
- Recursion: For each time step $t$ from 2 to $T$, and for each state $s_j$, compute $\delta_t(j)$ and $\psi_t(j)$ using the recurrence relations above.
- Termination & Backtracking: Find the state with the highest probability at the final time step $T$ and then use the backpointers to reconstruct the optimal path from $T$ back to 1. This gives us the most likely sequence of hidden states.
Simple Mathematical Example: State Prediction
Consider a simple HMM with two hidden states: “Sunny” ☀️ ($s_1$) and “Rainy” 🌧️ ($s_2$). Our observations are either “Walk” ($o_1$) or “Shop” ($o_2$).
- Initial Probabilities ($\pi$): $\pi_{Sunny} = 0.6$, $\pi_{Rainy} = 0.4$
- Transition Probabilities ($A$):
- $A_{Sunny \to Sunny} = 0.7$
- $A_{Sunny \to Rainy} = 0.3$
- $A_{Rainy \to Sunny} = 0.4$
- $A_{Rainy \to Rainy} = 0.6$
- Emission Probabilities ($B$):
- $B_{Sunny}(Walk) = 0.8$
- $B_{Sunny}(Shop) = 0.2$
- $B_{Rainy}(Walk) = 0.1$
- $B_{Rainy}(Shop) = 0.9$
Observation Sequence: $V = (\text{Walk}, \text{Shop})$
Step 1: Initialization (Time $t=1$)
- $\delta_1(\text{Sunny}) = \pi_{\text{Sunny}} \cdot B_{\text{Sunny}}(\text{Walk}) = 0.6 \cdot 0.8 = 0.48$
- $\delta_1(\text{Rainy}) = \pi_{\text{Rainy}} \cdot B_{\text{Rainy}}(\text{Walk}) = 0.4 \cdot 0.1 = 0.04$
- $\psi_1(\text{Sunny}) = \psi_1(\text{Rainy}) = 0$ (no previous state)
Step 2: Recursion (Time $t=2$)
- For State “Sunny”:
- Path from Sunny to Sunny: $\delta_1(\text{Sunny}) \cdot A_{\text{Sunny} \to \text{Sunny}} = 0.48 \cdot 0.7 = 0.336$
- Path from Rainy to Sunny: $\delta_1(\text{Rainy}) \cdot A_{\text{Rainy} \to \text{Sunny}} = 0.04 \cdot 0.4 = 0.016$
- $\delta_2(\text{Sunny}) = \max(0.336, 0.016) \cdot B_{\text{Sunny}}(\text{Shop}) = 0.336 \cdot 0.2 = 0.0672$
- $\psi_2(\text{Sunny}) = \arg\max(\text{from Sunny}, \text{from Rainy}) = \text{Sunny}$
- For State “Rainy”:
- Path from Sunny to Rainy: $\delta_1(\text{Sunny}) \cdot A_{\text{Sunny} \to \text{Rainy}} = 0.48 \cdot 0.3 = 0.144$
- Path from Rainy to Rainy: $\delta_1(\text{Rainy}) \cdot A_{\text{Rainy} \to \text{Rainy}} = 0.04 \cdot 0.6 = 0.024$
- $\delta_2(\text{Rainy}) = \max(0.144, 0.024) \cdot B_{\text{Rainy}}(\text{Shop}) = 0.144 \cdot 0.9 = 0.1296$
- $\psi_2(\text{Rainy}) = \arg\max(\text{from Sunny}, \text{from Rainy}) = \text{Sunny}$
Step 3: Termination & Backtracking
- The highest probability at $t=2$ is $\max(\delta_2(\text{Sunny}), \delta_2(\text{Rainy})) = \max(0.0672, 0.1296) = 0.1296$, which corresponds to the state Rainy.
- The most likely final state is $q_2 = \text{Rainy}$.
- We backtrack using the backpointers: $\psi_2(\text{Rainy}) = \text{Sunny}$.
- The most likely state sequence (the Viterbi path) is (Sunny, Rainy).
This demonstrates how Viterbi performs state prediction by finding the most probable underlying sequence of hidden states for a given observation sequence.
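As a sanity check, running the `viterbi()` sketch from above on the `A`, `B`, and `pi` arrays defined in Section 1 reproduces the hand computation exactly:

```python
# States: 0 = Sunny, 1 = Rainy; observations: 0 = Walk, 1 = Shop.
path, prob = viterbi(A, B, pi, obs=[0, 1])
print(path, prob)  # [0, 1] 0.1296  ->  (Sunny, Rainy), matching delta_2(Rainy)
```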
3. The Baum-Welch Algorithm: Parameter Learning and State Smoothing
The Baum-Welch algorithm is an Expectation-Maximization (EM) algorithm used to find the maximum likelihood estimates of the HMM parameters $(A, B, \pi)$ from an observation sequence when the hidden states are unknown. It is an iterative, two-step process that converges to a local optimum.
The algorithm uses the Forward-Backward procedure to compute the probabilities required for the parameter updates.
The Forward Procedure
Let $\alpha_t(i)$ be the probability of observing the partial sequence $(v_1, \dots, v_t)$ and being in state $s_i$ at time $t$. This is the “forward probability.” $$\alpha_t(i) = P(v_1, \dots, v_t, q_t = s_i | \lambda)$$ The recursion is initialized with $\alpha_1(i) = \pi_i \cdot b_i(v_1)$, and for $t \geq 2$: $$\alpha_t(i) = \left[ \sum_{j=1}^{N} \alpha_{t-1}(j) \cdot a_{ji} \right] \cdot b_i(v_t)$$
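In code, each forward step is one matrix-vector product. A minimal sketch, reusing the array conventions from the earlier examples:

```python
def forward(A, B, pi, obs):
    """Forward probabilities; row t holds alpha_{t+1}(i) (0-based time)."""
    T, N = len(obs), A.shape[0]
    alpha = np.zeros((T, N))
    alpha[0] = pi * B[:, obs[0]]                  # alpha_1(i) = pi_i * b_i(v_1)
    for t in range(1, T):
        # alpha_t(i) = [sum_j alpha_{t-1}(j) * a_{ji}] * b_i(v_t)
        alpha[t] = (alpha[t - 1] @ A) * B[:, obs[t]]
    return alpha
```

Summing the last row gives the total sequence probability $P(V | \lambda)$, which is exactly what the evaluation problem asks for.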
The Backward Procedure
Let $\beta_t(i)$ be the probability of observing the remaining partial sequence $(v_{t+1}, \dots, v_T)$ given that we are in state $s_i$ at time $t$. This is the “backward probability.” $$\beta_t(i) = P(v_{t+1}, \dots, v_T | q_t = s_i, \lambda)$$ The recursion is initialized with $\beta_T(i) = 1$ for all states and runs backwards for $t = T-1, \dots, 1$: $$\beta_t(i) = \sum_{j=1}^{N} a_{ij} \cdot b_j(v_{t+1}) \cdot \beta_{t+1}(j)$$
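The backward pass mirrors the forward one, filled in from the end of the sequence:

```python
def backward(A, B, obs):
    """Backward probabilities; row t holds beta_{t+1}(i) (0-based time)."""
    T, N = len(obs), A.shape[0]
    beta = np.zeros((T, N))
    beta[T - 1] = 1.0                             # beta_T(i) = 1 for all i
    for t in range(T - 2, -1, -1):
        # beta_t(i) = sum_j a_{ij} * b_j(v_{t+1}) * beta_{t+1}(j)
        beta[t] = A @ (B[:, obs[t + 1]] * beta[t + 1])
    return beta
```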
State Smoothing and Parameter Updates
The core of the Baum-Welch algorithm lies in using these forward and backward probabilities to compute expected values for state transitions and emissions. These expected values are then used to re-estimate the HMM parameters. This is where state smoothing comes in.
We can define two key variables:
- $\xi_t(i,j)$: The probability of being in state $s_i$ at time $t$ and state $s_j$ at time $t+1$, given the observation sequence and the model. $$\xi_t(i,j) = P(q_t = s_i, q_{t+1} = s_j | V, \lambda) = \frac{\alpha_t(i) \cdot a_{ij} \cdot b_j(v_{t+1}) \cdot \beta_{t+1}(j)}{\sum_{k=1}^{N}\sum_{l=1}^{N} \alpha_t(k) \cdot a_{kl} \cdot b_l(v_{t+1}) \cdot \beta_{t+1}(l)}$$
- $\gamma_t(i)$: The probability of being in state $s_i$ at time $t$ given the observation sequence and the model. This is the smoothed state probability. It is the ratio of the total probability of all paths going through state $s_i$ at time $t$ to the total probability of the entire observation sequence. $$\gamma_t(i) = P(q_t = s_i | V, \lambda) = \frac{\alpha_t(i) \beta_t(i)}{\sum_{j=1}^{N} \alpha_t(j) \beta_t(j)} = \sum_{j=1}^{N} \xi_t(i,j)$$
The $\gamma_t(i)$ value is called a smoothed probability because it takes into account information from the entire observation sequence (both before and after time $t$), not just information up to time $t$. This is in contrast to a filtered probability, which only uses data up to the current time. This smoothing is what allows the algorithm to re-estimate the parameters more accurately.
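Given $\alpha$ and $\beta$, both quantities reduce to element-wise products followed by normalization. A sketch (the function name is ours):

```python
def posteriors(A, B, alpha, beta, obs):
    """Smoothed state probabilities gamma[t, i] and pair probabilities xi[t, i, j]."""
    T, N = alpha.shape
    gamma = alpha * beta
    gamma /= gamma.sum(axis=1, keepdims=True)     # normalize each time step
    xi = np.zeros((T - 1, N, N))
    for t in range(T - 1):
        # xi_t(i, j) ~ alpha_t(i) * a_{ij} * b_j(v_{t+1}) * beta_{t+1}(j)
        xi[t] = alpha[t, :, None] * A * B[:, obs[t + 1]] * beta[t + 1]
        xi[t] /= xi[t].sum()                      # divide by P(V | lambda)
    return gamma, xi
```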
The re-estimation formulas for the new parameters are based on the expected counts:
- New Transition Probabilities ($A$): The expected number of transitions from state $s_i$ to $s_j$ divided by the expected number of transitions from state $s_i$. $$\bar{a}_{ij} = \frac{\sum_{t=1}^{T-1} \xi_t(i,j)}{\sum_{t=1}^{T-1} \gamma_t(i)}$$
- New Emission Probabilities ($B$): The expected number of times in state $s_j$ and observing symbol $o_k$ divided by the expected number of times in state $s_j$. $$\bar{b}_j(k) = \frac{\sum_{t=1}^{T} \mathbb{I}(v_t = o_k) \cdot \gamma_t(j)}{\sum_{t=1}^{T} \gamma_t(j)}$$ where $\mathbb{I}(\cdot)$ is the indicator function.
- New Initial Probabilities ($\pi$): The expected probability of being in state $s_i$ at time $t=1$. $$\bar{\pi}_i = \gamma_1(i)$$
This process is repeated iteratively: calculate $\alpha$ and $\beta$ using the current parameters (E-step), then re-estimate the parameters using the $\xi$ and $\gamma$ values (M-step). This continues until the log-likelihood of the observation sequence converges.
A full numerical Baum-Welch walkthrough is too long to present concisely, since each iteration involves matrix products and summations over every time step. The update loop itself is compact in code, though, as the sketch below suggests.
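Here is what a single iteration might look like, building on the `forward()`, `backward()`, and `posteriors()` sketches above. A production implementation would also rescale $\alpha$ and $\beta$ (or work in log space) to avoid numerical underflow on long sequences:

```python
def baum_welch_step(A, B, pi, obs):
    """One EM iteration on a single observation sequence (a sketch)."""
    alpha = forward(A, B, pi, obs)                # E-step: forward-backward
    beta = backward(A, B, obs)
    gamma, xi = posteriors(A, B, alpha, beta, obs)

    new_pi = gamma[0]                             # pi_i = gamma_1(i)
    # a_ij = expected transitions i -> j / expected transitions out of i
    new_A = xi.sum(axis=0) / gamma[:-1].sum(axis=0)[:, None]
    # b_j(k) = expected time in j emitting o_k / expected time in j
    new_B = np.zeros_like(B)
    for k in range(B.shape[1]):
        mask = np.array([v == k for v in obs])
        new_B[:, k] = gamma[mask].sum(axis=0) / gamma.sum(axis=0)
    return new_A, new_B, new_pi
```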
4. Viterbi vs. Baum-Welch: A Direct Comparison
| Feature | Viterbi Algorithm | Baum-Welch Algorithm |
|---|---|---|
| Purpose | Decoding. Finds the single most likely sequence of hidden states. | Learning. Finds the HMM parameters $(A, B, \pi)$ that best fit the observed data. |
| HMM Problem | Solves the Decoding Problem. | Solves the Learning Problem. |
| Underlying Math | Dynamic programming. Builds the globally optimal path out of optimal sub-paths. | Expectation-Maximization (EM). An iterative two-step procedure that converges to a local maximum of the likelihood. |
| Input | A trained HMM ($\lambda$) and an observation sequence ($V$). | A set of observation sequences ($V$) and an initial, often random, HMM ($\lambda_0$). |
| Output | The single most probable hidden state sequence ($Q^*$). | A new, optimized HMM model ($\lambda_{new}$). |
| Application | Prediction. Infers the hidden state sequence for a new, unseen observation sequence. | Training. Fits the HMM to historical data before it is used for prediction. |
| State Estimation | State prediction at each time step (along the single best path). | State smoothing at each time step (using the full sequence). |
The relationship between the two algorithms is symbiotic: Viterbi requires a trained HMM, which is often a product of the Baum-Welch algorithm. An HMM is typically trained using Baum-Welch on a large corpus of data, and then that trained model is used with the Viterbi algorithm to decode new observation sequences.
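Put together, the workflow might look like the sketch below, which assumes the functions defined earlier; the training and test sequences here are made up purely for illustration:

```python
# Train on historical observations with Baum-Welch, then decode new data
# with Viterbi. obs_train and obs_new are illustrative, not real data.
rng = np.random.default_rng(0)
A_hat = rng.dirichlet(np.ones(N), size=N)     # random row-stochastic init
B_hat = rng.dirichlet(np.ones(M), size=N)
pi_hat = rng.dirichlet(np.ones(N))

obs_train = [0, 1, 1, 0, 0, 1, 0, 0, 1, 1]    # made-up index sequence
for _ in range(50):                           # iterate EM toward a local optimum
    A_hat, B_hat, pi_hat = baum_welch_step(A_hat, B_hat, pi_hat, obs_train)

obs_new = [0, 1, 0]
best_path, _ = viterbi(A_hat, B_hat, pi_hat, obs_new)
```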
5. Applications and Limitations of HMMs
HMMs, along with the Viterbi and Baum-Welch algorithms, have been foundational in several fields:
- Speech Recognition: The most famous application. HMMs model the temporal sequence of phonemes or words, with phonemes as hidden states and acoustic signals as observations.
- Bioinformatics: Used for gene finding and protein structure prediction, where states can represent different regions of DNA (e.g., coding vs. non-coding).
- Natural Language Processing (NLP): Used for tasks like part-of-speech tagging, where the states are grammatical parts of speech (e.g., noun, verb), and observations are words.
Despite their utility, HMMs have significant limitations, primarily stemming from their core assumptions:
- The First-Order Markov Assumption: The model assumes the current state depends only on the previous state. This is a simplification that fails to capture long-term dependencies, which are common in many real-world phenomena like human language or DNA sequences.
- Computational Complexity: The forward-backward procedure, while efficient for an HMM, still requires $O(N^2 T)$ operations, where $N$ is the number of states and $T$ is the length of the observation sequence. For very long sequences or many states, this can be computationally expensive.
- The Local Optimum Problem: As an EM algorithm, Baum-Welch is only guaranteed to find a local optimum, not a global one. This means the final model parameters are highly dependent on the initial random values, and different initializations can lead to different results.
In conclusion, HMMs provide a powerful framework for modeling sequential data, and the Viterbi and Baum-Welch algorithms are the key tools for using them for prediction and training, respectively. While more advanced models like Recurrent Neural Networks (RNNs) have surpassed HMMs in many areas, the principles and algorithms discussed here remain a cornerstone of probabilistic machine learning.