Streaming hidden Markov models

Posted on Apr 10, 2026
tl;dr: A one-dimensional introduction to streaming hidden Markov models (HMMs) for regime detection and prediction.

tags: regimes, HMM, Bayes, sequential, non-stationary

Introduction

A classical method to tackle the problem of time-series regime detection and segmentation is the hidden Markov model (HMM). 1 At the core of the HMM are two sub-models: a model for latent regimes and a predictive model for observations (conditioned on a regime).

In this post, we consider the problem of learning a parametric HMM when neither the regimes nor the model parameters are known. In this case, estimating the observation-model parameters requires knowing which datapoints belong to which regime. However, knowing which datapoints belong to a given regime requires knowledge of the model parameters, which are unknown. This, at first glance, looks like a circular problem.

Here, we present a streaming (online, recursive) approach to tackle this problem. We show that an HMM can be built to (i) detect regimes, (ii) estimate model parameters for each regime, and (iii) make one-step-ahead predictions, all in an online way—one datapoint at a time. This setup resembles what one would observe in real-time problems where decisions must be made using only the available information and with minimal delay.

The formulation of the streaming HMM in this post is based on the BONE framework 2 for online learning and forecasting in non-stationary environments.

For code, see https://github.com/gerdm/streaming-hmm

Problem formulation

Suppose we are presented with a sequence of observations in a stream (one at a time). We denote by $y_t \in \reals$ the observation at time $t$ and by $$ \cY_t \triangleq (y_1,\,y_2,\,\ldots,\,y_t) \in \reals^t $$ the sequence of observations up to time $t$.

We assume that each observation $y_t$ belongs to one of $K \geq 2$ regimes. To represent this, we pair each observation $y_t$ with an unobserved (latent) variable $z_t \in \cK = \{1, \ldots, K\}$ that specifies the regime for $y_t$.

Next, the regime path up to time $t$ is denoted by $$ \cZ_t \triangleq (z_1, \ldots, z_t) \in \cK^t. $$ Finally, we assume that there exists a set of disjoint model parameters $\Theta_K = (\theta_1, \ldots, \theta_K)$—one for each regime—such that the probability of an observation $y_t$, conditioned on $z_t$ and $\Theta_K$, depends only on $\theta_{z_t}$. That is, $$ p(y_t \mid z_t, \Theta_K) = p(y_t \mid \theta_{z_t}). $$

With the assumptions above, we obtain $$ \boxed{ \begin{aligned} p(\Theta_K) &= \prod_{k=1}^K p(\theta_k),\\ p(\cY_t \mid \cZ_t, \Theta_K) &= \prod_{\tau=1}^t p(y_\tau \mid \theta_{z_\tau}). \end{aligned} } $$

The best possible forecast would be obtained if all three quantities were known at each step: $$ (y_1, z_1, \theta_{z_1}),\,(y_2, z_2, \theta_{z_2}),\,\ldots,\,(y_T, z_T, \theta_{z_T}). $$

Our goals are

  1. to determine the expected path given the observations $\E[\cZ_t \mid \cY_t]$,
  2. to determine the model parameters at each step $\E[\Theta_K \mid \cY_t]$, and
  3. to make $k$-step ahead forecasts given the observations up to time $t$ $p(y_{t+k} \mid \cY_t)$.

From a Bayesian perspective, tackling all problems above requires estimation of the joint posterior density over paths and model parameters:

$$\tag{2} p(\Theta_K,\,\cZ_t \mid \cY_t) = \underbrace{ p(\cZ_t \mid \cY_t) }_\text{path posterior} \; \underbrace{ p(\Theta_K \mid \cZ_t, \cY_t) }_\text{parameter posterior}. $$

In this post, we present an algorithm to obtain recursive estimates of $({\rm 2})$. First, we present a Bayesian way to estimate $\Theta_K$ assuming that the path $\cZ_t$ is known. Next, we show how to make estimates for the path posterior. Finally, we put all these ideas together and show how the HMM can be used for forecasting.

One-dimensional Gaussian model

Throughout this post, we model an observation $y_t$, conditioned on regime $z_t$, as Gaussian with known variance $\sigma^2 > 0$ and unknown mean $\theta_{z_t} \in \reals$. Next, the transition probability between any path $\cZ_{t-1}$ and $z_t$ is Markovian and given by $$ p(z_t \mid \cZ_{t-1}) = p(z_t \mid z_{t-1}) = \pi_{z_{t-1}, z_{t}} \in (0,1). $$

This model can be written as a one-dimensional linear state-space model (SSM) of the form $$ \tag{1} \boxed{ \begin{aligned} p(z_t \mid z_{t-1}) &= \pi_{z_{t-1},\,z_{t}},\\ p(y_t \mid \Theta_K, z_t) &= \normal(y_t \mid \theta_{z_t},\,\sigma^2). \end{aligned} } $$
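As a concrete illustration, the generative process in $({\rm 1})$ can be simulated in a few lines. This is only a sketch: the transition matrix, regime means, and noise scale below are illustrative choices, not the values used in the post's experiments.

```python
import numpy as np

def sample_hmm(T, P_transition, means, sigma, seed=0):
    """Sample T steps from the Gaussian HMM in (1)."""
    rng = np.random.default_rng(seed)
    K = len(means)
    z = rng.integers(K)  # initial regime drawn uniformly
    regimes, obs = [], []
    for _ in range(T):
        z = rng.choice(K, p=P_transition[z])  # Markov regime transition
        regimes.append(z)
        obs.append(rng.normal(means[z], sigma))  # y_t ~ N(theta_{z_t}, sigma^2)
    return np.array(regimes), np.array(obs)

# "Sticky" transitions keep each regime active for long stretches
P = np.full((3, 3), 0.025)
np.fill_diagonal(P, 0.95)
regimes, y = sample_hmm(500, P, means=[-2.0, 0.0, 2.0], sigma=1.0)
```

With a sticky transition matrix like this one, the sampled series stays in one regime for tens of steps before jumping, which is the setting in which regime detection is interesting.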

A run of this process with $K=3$ is shown in the figure below.

hidden-markov-model sample

Bayesian learning

Suppose we are given a sequence of observations $\cY_t$ and their corresponding regimes $\cZ_t$ (for instance, through domain expertise). However, we assume that the model parameters $\Theta_K$ are unknown. In this case, we are interested in the parameter posterior $$ p(\Theta_K \mid \cY_t, \cZ_t). $$

Before seeing any data, at $t=0$, we construct a prior belief $$ p(\Theta_K) = \prod_{k=1}^K p(\theta_k). $$

At time $t=1$, we get an observation $y_1$ in regime $z_1$. The posterior over parameters is $$ \begin{aligned} p(\Theta_K \mid z_{1}, y_{1}) &\propto p(\Theta_K)\;p(y_1 \mid z_1, \Theta_K)\\ &= \left(\prod_{k=1}^K p(\theta_k)\right)\;p(y_1 \mid \theta_{z_1})\\ &= \left(\prod_{k\neq z_1} p(\theta_k)\right)\; \Big(p(\theta_{z_1})\;p(y_1 \mid \theta_{z_1})\Big) \end{aligned} $$

At time $t=2$, we get the observation $y_2$ in regime $z_2$. The posterior takes one of two forms $$ \begin{aligned} &p(\Theta_K \mid z_{1}, y_{1}, z_{2}, y_{2})\\ &\propto \begin{cases} \left(\prod_{k\neq z_1,z_2} p(\theta_k)\right)\; \Big(p(\theta_{z_1})\;p(y_1 \mid \theta_{z_1})\Big) \Big(p(\theta_{z_2})\;p(y_2 \mid \theta_{z_2})\Big) & z_1 \neq z_2,\\ \left(\prod_{k\neq z_1} p(\theta_k)\right)\; \Big(p(\theta_{z_1})\;p(y_1 \mid \theta_{z_1})\;p(y_2 \mid \theta_{z_1})\Big) & z_1 = z_2. \end{cases} \end{aligned} $$

At $t = 3$, we obtain $$ \begin{aligned} &p(\Theta_K \mid z_{1}, y_{1}, z_{2}, y_{2}, z_{3}, y_{3})\\ &\propto \begin{cases} \left(\prod_{k\neq z_1,z_2,z_3} p(\theta_k)\right)\; \Big(p(\theta_{z_1})\;p(y_1\mid\theta_{z_1})\Big) \Big(p(\theta_{z_2})\;p(y_2\mid\theta_{z_2})\Big) \Big(p(\theta_{z_3})\;p(y_3\mid\theta_{z_3})\Big) & z_1, z_2, z_3 \text{ all distinct},\\[10pt] \left(\prod_{k\neq z_1,z_3} p(\theta_k)\right)\; \Big(p(\theta_{z_1})\;p(y_1\mid\theta_{z_1})\;p(y_2\mid\theta_{z_1})\Big) \Big(p(\theta_{z_3})\;p(y_3\mid\theta_{z_3})\Big) & z_1 = z_2 \neq z_3,\\[10pt] \left(\prod_{k\neq z_1,z_2} p(\theta_k)\right)\; \Big(p(\theta_{z_1})\;p(y_1\mid\theta_{z_1})\;p(y_3\mid\theta_{z_1})\Big) \Big(p(\theta_{z_2})\;p(y_2\mid\theta_{z_2})\Big) & z_1 = z_3 \neq z_2,\\[10pt] \left(\prod_{k\neq z_1,z_2} p(\theta_k)\right)\; \Big(p(\theta_{z_2})\;p(y_2\mid\theta_{z_2})\;p(y_3\mid\theta_{z_2})\Big) \Big(p(\theta_{z_1})\;p(y_1\mid\theta_{z_1})\Big) & z_2 = z_3 \neq z_1,\\[10pt] \left(\prod_{k\neq z_1} p(\theta_k)\right)\; \Big(p(\theta_{z_1})\;p(y_1\mid\theta_{z_1})\;p(y_2\mid\theta_{z_1})\;p(y_3\mid\theta_{z_1})\Big) & z_1 = z_2 = z_3. \end{cases} \end{aligned} $$

For $t \geq 1$, the posterior over regime parameters is disjoint and takes the form $$ p(\Theta_K \mid \cZ_t, \cY_t) \propto \prod_{k=1}^K \left(p(\theta_k)\,\prod_{\tau=1}^t p(y_\tau \mid \theta_k)^{\delta(k - z_\tau)}\right), $$ where $$ \delta(x) = \begin{cases} 1 & \text{if } x = 0,\\ 0 & \text{if } x\neq 0. \end{cases} $$ This shows that the parameter posterior for the $k$-th regime only depends on the observations whose regime is $k$. We can think of this as training $K$ distinct models, where a subset of the data $\cY_{t,k}$ is used to train the $k$-th model.

To denote this more compactly, let $$ \cY_{t,k} = \{y_\tau \colon z_\tau = k,\; \tau \leq t\} $$ be the subset of observations in $\cY_t$ selected according to the regimes in $\cZ_t$. 3 We denote the joint parameter posterior as $$ \begin{aligned} p(\Theta_K \mid \cZ_t, \cY_t) &= \prod_{k=1}^K p(\theta_k \mid \cY_{t,k}). \end{aligned} $$
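The construction of $\cY_{t,k}$ amounts to a one-line comprehension over the paired sequences. A minimal sketch, assuming regimes are labelled $1, \ldots, K$ (the helper name is hypothetical):

```python
def partition_by_regime(Y, Z, K):
    """Split observations Y into K subsets according to the path Z."""
    return {k: [y for y, z in zip(Y, Z) if z == k] for k in range(1, K + 1)}

# Four observations assigned to three regimes
Y = [0.1, 2.3, 0.2, -1.9]
Z = [1, 2, 1, 3]
subsets = partition_by_regime(Y, Z, K=3)
# subsets[1] collects the observations whose regime is 1, and so on
```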

Conjugate Gaussian updates

Suppose observations $\cY_t$ and path $\cZ_{t}$ are given, with $ \cZ_{t} = \cZ_{t-1} \cup \{k\}, $ and $k\in \{1, \ldots, K \}$. We are interested in the $k$-th regime posterior $$ p(\theta_k \mid \cY_{t,k}) $$

Considering a prior $p(\theta_k) = \normal(\theta_k \mid m_0,\,s_0^2)$, it follows that the posterior after a single observation (belonging to $k$) is Gaussian. This follows by conjugacy.

In general, the updates are recursive and take the form $$ p(\theta_k \mid \cY_{t,k}) \propto p(\theta_k \mid \cY_{t-1,k})\;p(y_t \mid \theta_k) \propto \normal(\theta_k \mid m_{t,k},\,s_{t,k}^2), $$ with $$ \tag{3} \boxed{ \begin{aligned} m_{t,k} &= \alpha_t\;y_t + (1 - \alpha_t)\;m_{t-1,k},\\ s_{t,k}^2 &= (1 - \alpha_t)\;s_{t-1,k}^2,\\ \alpha_t &= \frac{s_{t-1,k}^2}{s_{t-1,k}^2 + \sigma^2}. \end{aligned} } $$
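The update $({\rm 3})$ translates directly into code. A minimal sketch, with an illustrative prior and noise level:

```python
def gaussian_update(m_prev, s2_prev, y, sigma2):
    """Conjugate update of N(theta | m, s2) after observing y ~ N(theta, sigma2); see (3)."""
    alpha = s2_prev / (s2_prev + sigma2)  # gain: how much weight the new observation gets
    m = alpha * y + (1 - alpha) * m_prev
    s2 = (1 - alpha) * s2_prev
    return m, s2

# Prior N(0, 10); three observations assigned to the same regime
m, s2 = 0.0, 10.0
for y in [1.2, 0.8, 1.0]:
    m, s2 = gaussian_update(m, s2, y, sigma2=1.0)
```

As more observations are assigned to the regime, $s_{t,k}^2$ shrinks, $\alpha_t$ decreases, and the posterior mean converges towards the running average of the assigned observations.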

Example: recursive Bayesian update

The figure below shows the posterior mean estimate of the model parameters (with a band of two standard deviations) for the dataset shown in the figure above.

hidden Markov posterior mean estimate with known regimes

Posterior predictive

With the $k$-th parameter posterior at time $t$, a forecast about the next step is done through the posterior predictive $$ \tag{4} \begin{aligned} p(y_{t+1} \mid \cY_{t,k}) &= \int p(\theta_k,\,y_{t+1} \mid \cY_{t,k})\,\d\theta_k\\ &= \int p(\theta_k \mid \cY_{t,k})\;p(y_{t+1} \mid \theta_k)\,\d\theta_k\\ &= \int \normal(\theta_k \mid m_{t,k},\,s_{t,k}^2)\;\normal(y_{t+1} \mid \theta_k, \sigma^2)\,\d\theta_k\\ &= \normal(y_{t+1} \mid m_{t,k},\,s_{t,k}^2 + \sigma^2), \end{aligned} $$ where the last equality follows from basic properties of Gaussians.

As we will shortly see, the posterior predictive is an important component in determining the regime for a datapoint.

Remark

The posterior predictive in this example is a function of the posterior mean and variance, both of which can be updated recursively. In the pseudocode below, however, we use the more general form $p(y_{t+1} \mid \cY_{t,k})$ and write the log-posterior predictive as logp_predictive(y, Y_subset). This notation suggests that the posterior mean and variance are reconstructed from past data after every new observation, rather than updated recursively via $({\rm 3})$ and plugged into $({\rm 4})$.

In the figures shown below, we do update the moments recursively; the more general notation is kept only for simplicity.

Bayesian regime detection: the path posterior

Our next task is to infer the path $\cZ_t$ that generated the data $\cY_t$. This is the classical problem of regime identification or regime detection. In this scenario, the quantity of interest is the path posterior $$ p(\cZ_t \mid \cY_t) = p(\cZ_{t-1},\,z_t \mid \cY_t). $$ Starting with the initial condition (prior) $p(\cZ_0 \mid \cY_0) = p(z_0)$ for $z_0 \in \cK$, the path posterior at time $t$ takes the form $$ \begin{aligned} p(\cZ_t \mid \cY_t) &= p(\cZ_t \mid \cY_{t-1}, y_t)\\ &\propto p(\cZ_t\mid \cY_{t-1}) \; p(y_t \mid \cZ_t, \cY_{t-1}) \\ &= p(\cZ_{t-1} \mid \cY_{t-1})\; p(z_t \mid \cZ_{t-1}, \cY_{t-1})\; p(y_t \mid \cZ_t,\,\cY_{t-1}) \\ &= \underbrace{ p(\cZ_{t-1} \mid \, \cY_{t-1}) }_{\text{path posterior at } t-1}\;\; \underbrace{ p(z_t \mid z_{t-1}) }_\text{transition probability}\;\; \underbrace{ p(y_t \mid \cY_{t-1,\,z_t}) }_{\text{posterior predictive}}. \end{aligned} $$

The equation above shows that estimation of the path posterior at time $t$ requires three sources of information:

  1. the posterior over paths at the previous timestep,
  2. the transition probability from the previous regime $z_{t-1}$ to the next regime $z_t$, and
  3. the posterior predictive density under regime $z_{t}$, evaluated at $y_t$.

In practice, one uses the log-path posterior to avoid numerical underflow. This takes the form $$ \log p(\cZ_t \mid \cY_t) = \log p(\cZ_{t-1} \mid \, \cY_{t-1}) + \log p(z_t \mid z_{t-1}) + \log p(y_t \mid \cY_{t-1,\,z_t}) + {\rm Cst.} $$

Pseudocode of the posterior update is shown in the code snippet below.

def log_path_posterior(Z, logP_prev, log_pi, Y, y_new):
    # Z: candidate path (z_1, ..., z_t), whose last entry is the proposed regime
    regime_next = Z[-1]

    # observations previously assigned to the proposed regime
    Y_subset = [y for y, z in zip(Y, Z[:-1]) if z == regime_next]

    log_p_predictive = logp_predictive(y_new, Y_subset)
    return logP_prev + log_pi + log_p_predictive

Constructing the full path posterior

The path posterior is estimated recursively and is computed for all possible paths.

More precisely, we start by evaluating $p(z_0)$ for all $z_0 \in \cK$. Then, given the observation $y_1$, we define $\cZ_1 = (z_0, \, z_1) \in \cK^2$ and evaluate $$ \log p(\cZ_1 \mid \cY_1) = \log p(z_0) + \log p(z_1 \mid z_0) + \log p(y_1 \mid \cY_{0,z_1}) + {\rm cst}. $$ Because $\cZ_1 \in \cK^2$, the path posterior is evaluated at all $K^2$ possible paths. The process continues for $t \geq 2$ in a similar way.

In pseudocode, computation of the log-path posterior at time $t+1 \geq 2$ takes the form

def path_posterior_full_update(
    log_PZ_prev: Array[float, (K^t,)],
    P_transition: Array[float, (K, K)],
    Y: list[float],
    y_new: float,
) -> Array[float, (K^(t+1),)]:
    t = len(Y)
    K = shape(P_transition)[0]

    log_PZ_new: Array[float, (K^(t+1))] = zeros((K, K, ..., K))
    logP_transition = log(P_transition)

    for k0 in range(K):
        ...
        for kt in range(K):
            logP_prev = log_PZ_prev[k0, ..., kt]
            for ktp1 in range(K):
                Z_new = [k0, ..., kt, ktp1]
                log_pi = logP_transition[kt, ktp1]
                log_p_new = log_path_posterior(Z_new, logP_prev, log_pi, Y, y_new)
                log_PZ_new[k0, ..., kt, ktp1] = log_p_new

    return log_PZ_new

Clearly, computing the posterior over paths is computationally infeasible for moderately large $t$— if there are $K$ possible regimes, then there are $K^t$ possible paths, which are evaluated using $t$ for loops of size $K$.

Constructing the approximate path posterior

To approximate the path posterior, we keep track of only the $S \geq 1$ most probable paths. We use these paths and their unnormalised probabilities to define a new approximate path posterior.

Formally, let $\{\cZ_t^{(s)}\}_{s=1}^S$ be a subset of paths at time $t$ with weights $\{w_t^{(s)}\}_{s=1}^S$ such that $w_t^{(s)} \in (0,1)$ is the weight for path $\cZ_t^{(s)}$, $\sum_{s=1}^S w_t^{(s)} = 1$, and initial condition $w_0^{(s)} = 1/S$ for all $s$. We use the collection of paths and weights to define the path posterior empirical probability mass function (PMF): $$ \nu_t(\cZ_t) \triangleq \sum_{s=1}^S w_{t}^{(s)}\,\delta \left(\cZ_t - \cZ_t^{(s)}\right) \approx p(\cZ_t \mid \cY_t). $$

Note that $\nu_t \colon \cK^t \to [0,1]$.

Under this approximation, only the subset of paths $\{\cZ_{t}^{(s)}\}_{s=1}^S$ are allowed to have non-zero probability. All other paths are assigned zero probability.

For example, suppose $t=3$, $S=4$, with paths and weights $$ \begin{aligned} \cZ_3^{(1)} &= \{1,\, 1, \, 1\}, & w_3^{(1)} &= 0.7,\\ \cZ_3^{(2)} &= \{1,\, 1, \, 2\}, & w_3^{(2)} &= 0.2,\\ \cZ_3^{(3)} &= \{1,\, 3, \, 2\}, & w_3^{(3)} &= 0.05,\\ \cZ_3^{(4)} &= \{3,\, 1, \, 2\}, & w_3^{(4)} &= 0.05.\\ \end{aligned} $$ Then, $$ \begin{aligned} \nu_3(\{1,\,1,\,2\}) &= 0.2,\\ \nu_3(\{3,\,1,\,2\}) &= 0.05,\\ \nu_3(\{3,\,1,\,1\}) &= 0.0,\\ \nu_3(\{3,\,2,\,1\}) &= 0.0. \end{aligned} $$
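The empirical PMF $\nu_t$ can be represented as a lookup table over the retained paths, with every other path implicitly mapped to zero. A minimal sketch (make_nu is a hypothetical helper) reproducing the $t=3$, $S=4$ example above:

```python
def make_nu(paths, weights):
    """Empirical PMF over paths: nonzero only on the S retained paths."""
    table = {tuple(path): w for path, w in zip(paths, weights)}
    return lambda Z: table.get(tuple(Z), 0.0)

nu3 = make_nu(
    paths=[(1, 1, 1), (1, 1, 2), (1, 3, 2), (3, 1, 2)],
    weights=[0.7, 0.2, 0.05, 0.05],
)
```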

Exploring new regimes: from $S$ to $S\times K$

We now turn to the problem of selecting $S$ paths at each step and finding their corresponding weight.

Suppose $\nu_{t-1}(\cZ_{t-1})$ is given, and we are presented with a new observation $y_t$. The path posterior $p(\cZ_t \mid \cY_t)$ is then approximated by

$$ \begin{aligned} p(\cZ_t \mid \cY_t) &\propto {\color{orange} p(\cZ_{t-1} \mid \cY_{t-1})}\; p(z_t \mid z_{t-1})\; p(y_t \mid \cY_{t-1, z_{t}})\\ &\approx {\color{orange} \nu_{t-1}(\cZ_{t-1})}\; p(z_t \mid z_{t-1})\; p(y_t \mid \cY_{t-1, z_{t}})\\ &= {\color{orange} \left( \sum_{s=1}^S w_{t-1}^{(s)}\,\delta \left(\cZ_{t-1} - \cZ_{t-1}^{(s)}\right) \right)}\; p(z_t \mid z_{t-1})\; p(y_t \mid \cY_{t-1, z_{t}})\\ &= \sum_{s=1}^S w_{t-1}^{(s)}\; p(z_t \mid z_{t-1}^{(s)})\; p(y_t \mid \cY_{t-1, z_{t}}^{(s)})\; \delta\left(\cZ_{t-1} - \cZ_{t-1}^{(s)}\right), \end{aligned} $$ with $$ \cY_{t-1, z_t}^{(s)} = \{y_\tau \mid z_\tau^{(s)} = z_t,\; \tau \leq t-1\} $$ the partition of observations in $\cY_{t-1}$, assigned according to the particle-specific path $\cZ_{t-1}^{(s)}$. Thus, the posterior predictive takes the form $$ p(y_{t+1} \mid \cY_{t, k}^{(s)}) = \normal(y_{t+1} \mid m_{t,k}^{(s)},\,(s_{t,k}^2)^{(s)} + \sigma^2). $$ Here, $m_{t,k}^{(s)}$ and $(s_{t,k}^2)^{(s)}$ are the posterior mean and variance for the $k$-th regime at time $t$ estimated using datapoints from $\cY_{t,k}^{(s)}$.

The equation above shows that updating the approximate path posterior increases the search space from $S$ to $S\times K$. To see this, recall that $\cZ_t = \cZ_{t-1} \cup \{z_t\}$. Then,

$$ \begin{aligned} &p(\cZ_t \mid \cY_t)\\ &=p(\cZ_{t-1},\,z_t \mid \cY_t)\\ &\approx \sum_{k=1}^K \sum_{s=1}^S w_{t-1}^{(s)}\; p(k \mid z_{t-1}^{(s)})\; p(y_t \mid \cY_{t-1, k}^{(s)})\; \delta\left(\cZ_{t-1} - \cZ_{t-1}^{(s)}\right)\; \delta(z_t - k)\\ &= \sum_{k=1}^K \sum_{s=1}^S \hat{w}_t^{(s,k)}\; \delta\left(\cZ_{t-1} - \cZ_{t-1}^{(s)}\right)\; \delta(z_t - k), \end{aligned} $$ where $$ \hat{w}_t^{(s,k)} = w_{t-1}^{(s)}\; p(k \mid z_{t-1}^{(s)})\; p(y_t \mid \cY_{t-1, k}^{(s)}) $$ is the unnormalised weight for path $\cZ_{t-1}^{(s)}$ that takes $z_t = k$ as the next regime. Thus, at time $t$, we obtain a collection of $S\times K$ weights $$ \begin{bmatrix} \hat{w}_t^{(1, 1)} & \hat{w}_t^{(1, 2)} & \cdots & \hat{w}_t^{(1, K)} \\ \hat{w}_t^{(2, 1)} & \hat{w}_t^{(2, 2)} & \cdots & \hat{w}_t^{(2, K)} \\ \vdots & \vdots & \ddots & \vdots \\ \hat{w}_t^{(S, 1)} & \hat{w}_t^{(S, 2)} & \cdots & \hat{w}_t^{(S, K)} \end{bmatrix}. $$ In general, these unnormalised weights do not sum to one; they are normalised after pruning.
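The $S \times K$ weight grid can be computed in one vectorised pass. This is a sketch, assuming each particle carries per-regime posterior moments means[s, k] and variances[s, k] (hypothetical names); working in log space avoids underflow, and normalising over the full grid before pruning does not change which paths are retained.

```python
import numpy as np

def expand_weights(w_prev, last_regimes, P_transition, y, means, variances, sigma2):
    """Weights w_hat[s, k] proportional to
    w_prev[s] * pi(k | z_{t-1}^{(s)}) * N(y | means[s, k], variances[s, k] + sigma2)."""
    var = variances + sigma2
    log_pred = -0.5 * (np.log(2 * np.pi * var) + (y - means) ** 2 / var)  # (S, K)
    log_w = np.log(w_prev)[:, None] + np.log(P_transition[last_regimes]) + log_pred
    log_w -= log_w.max()  # stabilise before exponentiating
    w_hat = np.exp(log_w)
    return w_hat / w_hat.sum()  # normalised over the S x K grid

# Two particles, two regimes, identical predictives: weights follow the transition rows
w_hat = expand_weights(
    w_prev=np.array([0.5, 0.5]),
    last_regimes=np.array([0, 1]),
    P_transition=np.array([[0.9, 0.1], [0.1, 0.9]]),
    y=0.0,
    means=np.zeros((2, 2)),
    variances=np.ones((2, 2)),
    sigma2=1.0,
)
```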

Beam search: From $S\times K$ to $S$

Having evaluated all $S\times K$ possible paths, we now reduce the collection of paths from $S \times K$ back to $S$. To do this, we simply sort paths by their weight value, keep the $S$ paths with highest weight, and drop the rest of the paths. This strategy is known as beam search. It is usually discouraged as it can bias the result by limiting exploration, and it requires the evaluation of $S \times K$ elements at each step. However, it provides a simple heuristic to find approximate PMFs.

In outline, beam search proceeds as follows:

  1. Evaluate $\hat{w}_t^{(s, z_t)}$ for all $s \in \{1,\ldots, S\}$ and $z_t \in \{1,\ldots,K\}$,
  2. keep the top $S$ paths with highest unnormalized weights, and
  3. normalize the weights.

Thus, our approximate posterior PMF is $$ \nu_t(\cZ_t) \triangleq \sum_{s=1}^S w_t^{(s)}\; \delta\left(\cZ_t - \cZ_t^{(s)}\right), $$ where $\{\cZ_t^{(s)}\}_{s=1}^S$ are the top $S$ paths and $$ \begin{aligned} \hat{w}_t^{(s)}&= w_{t-1}^{(s)}\;\pi_{z_{t-1}^{(s)},\,z_t^{(s)}} \; p(y_t \mid \cZ_t^{(s)},\,\cY_{t-1}),\\ w_t^{(s)} &\triangleq \frac{\hat{w}_t^{(s)}}{\sum_{s'=1}^S \hat{w}_t^{(s')}}, \end{aligned} $$ are the normalised weights.

In pseudocode, computation of the log-path posterior using this strategy at time $t+1 \geq 2$ takes the form

def path_posterior_beam_update(
    paths: Array[int, (S, t)],
    log_PZ_prev: Array[float, (S,)],
    P_transition: Array[float, (K, K)],
    Y: list[float],
    y_new: float,
) -> tuple[ Array[float, (S,)], Array[int, (S, t+1)]]:

    S, t = shape(paths)
    K = shape(P_transition)[0]

    # Unknown terms
    log_PZ_new = zeros((S, K))
    paths_new = zeros((S, K, t+1))

    # Evaluate collection of path posteriors
    for s, path in enumerate(paths):
        logp_prev = log_PZ_prev[s]
        k_prev = path[-1]
        for k in range(K):
            path_new = [*path, k]
            log_pi = log(P_transition[k_prev, k])
            log_p_new = log_path_posterior(path_new, logp_prev, log_pi, Y, y_new)
            log_PZ_new[s, k] = log_p_new
            paths_new[s, k] = path_new

    # prune candidates down to S best
    log_PZ_new, paths_new = prune_top_S(log_PZ_new, paths_new)  

    # normalize the retained S log-posteriors: exp(log_PZ_new).sum() == 1
    log_PZ_new = log_PZ_new - logsumexp(log_PZ_new)

    return log_PZ_new, paths_new

where prune_top_S is defined as

def prune_top_S(
    log_PZ_grid: Array[float, (S, K)],
    paths_grid: Array[int, (S, K, t+1)]
) -> tuple[Array[float, (S,)], Array[int, (S, t+1)]]:
    # flatten, pick top-S, return (S,) log-posteriors and (S, t+1) paths
    log_flat = reshape(log_PZ_grid, (S*K,))
    paths_flat = reshape(paths_grid,  (S*K, t+1))
    ix_topS = argsort(log_flat, descending=True)[:S]
    return log_flat[ix_topS], paths_flat[ix_topS]

For alternative Bayesian approaches, see Rao-Blackwellised particle filter (RBPF), 4 or Particle learning (PL). 5

Example: regime detection for parameter estimation

In our setup, the main value of regime detection lies in parameter estimation. This is because knowing the regime for each datapoint allows for better partitioning of the data, which in turn induces better posterior estimates.

The plot below illustrates this idea. Each coloured region determines a regime. The coloured lines show the mean posterior estimate—one for each regime.

We observe that at the beginning of the experiment ($t \approx 0$), all estimates are updated. This is because the method is not able to find a dominant pattern belonging to a single regime.

Then, at $t \approx 10$, the blue estimate starts to be updated consistently, while all other estimates stop being updated. This is because the posterior predictive probabilities for the green and orange regimes are smaller than for the blue one.

Then, at $t \approx 100$, the sHMM detects that the blue estimate has small posterior predictive probability relative to the green and orange ones, so it stops updating it. It briefly updates the orange estimate, but then only updates the green estimate. This is because the green one is closer to the true mean value, so it has higher probability.

The process continues for the rest of the timesteps. A good regime detection method would only update its estimate whenever datapoints belong to that regime.

hidden Markov model mean estimate

The streaming hidden Markov model

The two ideas presented above — (i) approximation of the path posterior $p(\cZ_t \mid \cY_t)$ via beam search, and (ii) Bayesian estimation of the model posterior $p(\Theta_K \mid \cZ_t,\,\cY_t)$ — together define the streaming HMM. The name reflects that determination of the regimes (through the path posterior) and learning of the parameters for each regime (through the model posterior) are done in a streaming (sequential) manner.

At each step, the algorithm expands the current set of $S$ paths to $S \times K$ candidates, scores each candidate using the transition probability and the posterior predictive, retains the top $S$ paths, and produces a forecast before ingesting the next observation.

In pseudocode:

@dataclass
class ParticleState:
    means: Array[S]
    variances: Array[S]
    regime: Array[S]
    log_weight: Array[S]


def step(
    particles: ParticleState,
    y: float,
    sigma2: float,
    K: int,
    S: int,
):
    regimes = range(K)

    # Forecast estimate before seeing next observation [S -> 1]
    fcst = forecast(particles, sigma2)

    # Update mean and variance of each particle under all regimes [S -> SxK]
    particles, log_pp, log_p_transition = update_conditional_posterior(
        y, regimes, particles, sigma2
    )

    # Update log-weight of all particles [SxK -> SxK]
    particles = update_log_weights(particles, log_pp, log_p_transition)

    # Choose top S particles according to log-weights [SxK -> S]
    particles = beam_search(particles, S)

    return particles, fcst

Forecasting

With the streaming HMM in place, we make forecasts through the posterior predictive. In particular, the posterior predictive for a one-step-ahead forecast is

$$ p(y_{t+1} \mid \cY_t) \approx \sum_{s=1}^S w_t^{(s)}\;\sum_{k=1}^K \pi_{z_{t}^{(s)}\!,k}\;\normal\Big(y_{t+1} \mid m_{t,k}^{(s)},\,s_{t,k}^{2\,(s)} + \sigma^2\Big). $$

It then follows that the posterior predictive mean is

$$ \E[y_{t+1} \mid \cY_t] \approx \sum_{s=1}^S \nu_t\left(\cZ_t^{(s)}\right) \sum_{k=1}^K \pi_{z_{t}^{(s)}\!,\,k}\;m_{t,k}^{(s)}, $$

and the posterior predictive variance is

$$ {\rm Var}[y_{t+1} \mid \cY_t] = \E[y_{t+1}^2 \mid \cY_t] - \E[y_{t+1} \mid \cY_t]^2, $$ with $$ \E[y_{t+1}^2 \mid \cY_t] \approx \sum_{s=1}^S \nu_t\left(\cZ_t^{(s)}\right)\sum_{k=1}^K \pi_{z_{t}^{(s)}\!,\,k} \left( \left(m_{t,k}^{(s)}\right)^2 + \left(s_{t,k}^{2\,(s)}\right) \right) + \sigma^2 $$

Example: posterior predictive mean and variance

Given an $S$-dimensional array weights of path weights, an $(S, K)$-dimensional array p_transition of transition probabilities, and an $(S, K)$-dimensional array mean of estimated means, the posterior predictive mean can be evaluated by

mean_pp = jnp.einsum("s,sk,sk->", weights, p_transition, mean)

Similarly, if variances is a (s,k)-dimensional array of posterior variances and var_obs is the observation variance $\sigma^2$, the posterior predictive variance is

mean2 = jnp.einsum("s,sk,sk->", weights, p_transition, mean ** 2 + variances)
var_pp = mean2 + var_obs - mean_pp ** 2

The following plot shows the posterior predictive mean and a band of one posterior predictive standard deviation around it.

hidden Markov model estimation

Next, we show a slice of the predictions at an interval $t \in [440, 490]$. Even though the model is well-specified, we observe that the posterior predictive mean does not instantaneously adapt to a change in regime. This is mainly due to the irreducible observation noise $\sigma^2 = 1.0$.

hidden Markov model estimation (slice 1)

To emphasise this point, the figure below shows the same interval with the observation noise removed ($\sigma^2 \approx 0$). Here, the observations track the true regime mean exactly, and the forecast adapts a single step after the regime change.

hidden Markov model estimation (slice, noiseless)

This shows that the inherent noise in the data is a big factor in deciding how fast we adapt to regime changes: the noisier the data, the more datapoints are needed to confidently say that a regime change has occurred.

Appendix

Derivation of the posterior predictive

Let $\cZ_t = (z_1, \ldots, z_t)$ be a path at time $t$. The posterior predictive takes the form

$$ \begin{aligned} p(y_{t+1} \mid \cY_t) &= \sum_{\cZ_{t+1}} \int p(y_{t+1}, \cZ_{t+1}, \Theta_K \mid \cY_t)\;\d\Theta_K\\ &= \sum_{\cZ_{t+1}} \int p(y_{t+1}, \cZ_t, z_{t+1}, \Theta_K \mid \cY_t)\;\d\Theta_K\\ &= \sum_{\cZ_{t+1}} \int p(y_{t+1}, z_{t+1} \mid \cZ_t, \Theta_K, \cY_t)\; p(\cZ_t, \Theta_K \mid \cY_t)\;\d\Theta_K\\ &= \sum_{\cZ_t}\sum_{z_{t+1}} \int p(z_{t+1} \mid \cZ_t)\;p(y_{t+1} \mid \theta_{z_{t+1}})\;p(\cZ_t, \Theta_K \mid \cY_t)\;\d\Theta_K\\ &= \sum_{\cZ_t}\sum_{z_{t+1}} \int p(z_{t+1} \mid \cZ_t)\;p(y_{t+1} \mid \theta_{z_{t+1}})\;p(\Theta_K \mid \cZ_t, \cY_t)\;p(\cZ_t \mid \cY_t)\;\d\Theta_K\\ &= \sum_{\cZ_t}\sum_{z_{t+1}} p(z_{t+1} \mid \cZ_t)\; \left(\int p(y_{t+1} \mid \theta_{z_{t+1}})\;p(\Theta_K \mid \cZ_t, \cY_t)\;\d\Theta_K\right)\;p(\cZ_t \mid \cY_t)\\ &= \sum_{\cZ_t}\sum_{z_{t+1}} p(z_{t+1} \mid \cZ_t)\;p(y_{t+1} \mid \cY_{t, z_{t+1}})\;p(\cZ_t \mid \cY_t)\\ &= \sum_{\cZ_t} p(\cZ_t \mid \cY_t)\sum_{z_{t+1}} p(z_{t+1} \mid \cZ_t)\;p(y_{t+1} \mid \cY_{t, z_{t+1}})\\ &= \sum_{\cZ_t} p(\cZ_t \mid \cY_t)\sum_{z_{t+1}} p(z_{t+1} \mid \cZ_t)\;\normal(y_{t+1} \mid m_{t,z_{t+1}},\, s_{t, z_{t+1}}^2 + \sigma^2). \end{aligned} $$

Finally, the approximation is

$$ p(y_{t+1} \mid \cY_t) \approx \sum_{s=1}^S \nu_t\left(\cZ_t^{(s)}\right)\sum_{k=1}^K \pi_{z_{t}^{(s)}\!,\,k} \normal(y_{t+1} \mid m_{t,k}^{(s)},\, s_{t, k}^{2\,(s)} + \sigma^2) $$

Derivation of the mean posterior predictive

The expectation follows directly from the approximate form of the posterior predictive

$$ \begin{aligned} \E[y_{t+1} \mid \cY_t] &= \int y_{t+1}\,p(y_{t+1} \mid \cY_t) \d y_{t+1}\\ &\approx \int y_{t+1}\,\left( \sum_{s=1}^S \nu_t\left(\cZ_t^{(s)}\right)\sum_{k=1}^K \pi_{z_{t}^{(s)}\!,\,k} \normal(y_{t+1} \mid m_{t,k}^{(s)},\, s_{t, k}^{2\,(s)} + \sigma^2) \right) \d y_{t+1}\\ &= \sum_{s=1}^S \nu_t\left(\cZ_t^{(s)}\right)\sum_{k=1}^K \pi_{z_{t}^{(s)}\!,\,k} \left(\int y_{t+1}\,\normal(y_{t+1} \mid m_{t,k}^{(s)},\, s_{t, k}^{2\,(s)} + \sigma^2)\d y_{t+1}\right)\\ &= \sum_{s=1}^S \nu_t\left(\cZ_t^{(s)}\right) \sum_{k=1}^K \pi_{z_{t}^{(s)}\!,\,k}\;m_{t,k}^{(s)} \end{aligned} $$

Derivation of the variance posterior predictive

Recall that $\var(X) = \E[X^2] - \E[X]^2$. Thus, $$ {\rm Var}[y_{t+1} \mid \cY_t] = \E[y_{t+1}^2 \mid \cY_t] - \E[y_{t+1} \mid \cY_t]^2, $$ with $$ \begin{aligned} \E[y_{t+1}^2 \mid \cY_t] &= \int y_{t+1}^2\,p(y_{t+1} \mid \cY_t) \d y_{t+1}\\ &\approx \sum_{s=1}^S \nu_t\left(\cZ_t^{(s)}\right)\sum_{k=1}^K \pi_{z_{t}^{(s)}\!,\,k} \left(\int y_{t+1}^2\,\normal(y_{t+1} \mid m_{t,k}^{(s)},\, s_{t, k}^{2\,(s)} + \sigma^2)\d y_{t+1}\right)\\ &= \sum_{s=1}^S \nu_t\left(\cZ_t^{(s)}\right)\sum_{k=1}^K \pi_{z_{t}^{(s)}\!,\,k} \left( \left(m_{t,k}^{(s)}\right)^2 + \left(s_{t,k}^{2\,(s)}\right) + \sigma^2 \right)\\ &= \sum_{s=1}^S \nu_t\left(\cZ_t^{(s)}\right)\sum_{k=1}^K \pi_{z_{t}^{(s)}\!,\,k} \left( \left(m_{t,k}^{(s)}\right)^2 + \left(s_{t,k}^{2\,(s)}\right) \right) + \sigma^2 \sum_{s=1}^S \nu_t\left(\cZ_t^{(s)}\right)\sum_{k=1}^K \pi_{z_{t}^{(s)}\!,\,k}\\ &= \sum_{s=1}^S \nu_t\left(\cZ_t^{(s)}\right)\sum_{k=1}^K \pi_{z_{t}^{(s)}\!,\,k} \left( \left(m_{t,k}^{(s)}\right)^2 + \left(s_{t,k}^{2\,(s)}\right) \right) + \sigma^2 \end{aligned} $$


  1. See e.g., Chapter 13.2 in Bishop, Christopher M., and Nasser M. Nasrabadi. Pattern Recognition and Machine Learning. Vol. 4. No. 4. New York: Springer, 2006. ↩︎

  2. Duran-Martin, Gerardo, et al. “A unifying framework for generalised Bayesian online learning in non-stationary environments.” Transactions on Machine Learning Research (2025). ↩︎

  3. Note that $\cY_{t,k}$ is a function of both observations $\cY_t$, and path $\cZ_t$. This distributes the observations among the regimes. ↩︎

  4. Murphy, Kevin, and Stuart Russell. “Rao-Blackwellised particle filtering for dynamic Bayesian networks.” Sequential Monte Carlo methods in practice. New York, NY: Springer New York, 2001. 499-515. ↩︎

  5. Carvalho, Carlos M., et al. “Particle learning and smoothing.” (2010): 88-106. ↩︎