Bayes for sequential decision making (III): Detecting and adapting to regime changes

Posted on Mar 8, 2026
tl;dr: A tutorial on BONE: learning and adapting to regime changes.

tags: bayes, bandits, uncertainty, sequential-decision-making, non-stationary, regimes

Introduction

In this note, we introduce the BONE framework1 for learning and adaptation in non-stationary environments. We present various strategies to adapt the one-dimensional learning procedure we saw in the first note, under the assumption of structural changes in the reward process.

A recap

We assume an agent chooses a sequence of actions $a_t \in {\cal A}$ and receives rewards $r_t \in \reals$. Conditioned on the action, the environment produces a reward, i.e., $$ r_t \sim p_{\rm env}(\cdot \mid a_t). $$ The agent’s goal is to maximise the sum of rewards.
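As a running example, such an environment could be a toy Gaussian bandit, where each action has a fixed mean reward. This is a sketch; the class name and constructor arguments are illustrative, not part of the BONE framework:

```python
import numpy as np

class GaussianBandit:
    """Toy stationary environment: each action has a fixed mean reward."""

    def __init__(self, mean_rewards, obs_std=1.0, seed=0):
        self.mean_rewards = np.asarray(mean_rewards)
        self.obs_std = obs_std
        self.rng = np.random.default_rng(seed)

    def get_reward(self, at):
        # r_t ~ p_env(. | a_t), here a Gaussian centred at the arm's mean
        return self.rng.normal(self.mean_rewards[at], self.obs_std)
```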

The environment $p_{\rm env}$ is unknown to the agent, so it seeks to approximate $p_{\rm env}$ through the choice of model $$ \tag{1} \begin{aligned} \vtheta &= (\theta_1, \ldots, \theta_A),\\ p(\vtheta \mid a) &= \normal(\theta_a \mid m_{a,0},\,s_{a,0}^2),\\ p(r_{t} \mid a, \vtheta) &= \normal(r_{t} \mid \theta_{a},\,\sigma^2). \end{aligned} $$

The agent encodes its belief about the mean of rewards (conditioned on each action) through the posterior over the model parameters, which it updates after pulling action $a_t$ and observing reward $r_t$: $$ p_{\vb_{t}}(\vtheta \mid a_{1:t}) \equiv p(\vtheta \mid a_{1:t}, r_{1:t}) \propto p_{\vb_{t-1}}(\vtheta \mid a_{1:t-1})\,p(r_t \mid a_t, \vtheta). $$ where $\vb_t$ is the belief state, $r_{1:t} = (r_1, \ldots, r_t)$, and $a_{1:t} = (a_1, \ldots, a_t)$.

Algorithmically, we write the process of updating beliefs, given the reward $r_t$ and action $a_t$, by $$ \begin{aligned} \vb_{t} &\gets {\rm update}(\vb_{t-1}, r_t, a_t), \end{aligned} $$ which is done following Bayes’ rule. For the model $({\rm 1})$, the update corresponds to the conjugate Gaussian update for the $a_t$-th arm.
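A minimal sketch of this conjugate Gaussian update, assuming the belief state is stored as a pair of per-arm arrays `(means, variances)` and the observation noise `obs_var` ($\sigma^2$) is known (names are illustrative):

```python
import numpy as np

def update(bel, rt, at, obs_var=1.0):
    """Conjugate Gaussian update of the belief for the pulled arm `at`.

    bel: pair of arrays (means, variances), one entry per arm.
    obs_var: the (assumed known) reward noise sigma^2.
    """
    means, variances = map(np.array, bel)  # copy so the old belief is untouched
    precision = 1.0 / variances[at] + 1.0 / obs_var
    posterior_var = 1.0 / precision
    posterior_mean = posterior_var * (means[at] / variances[at] + rt / obs_var)
    means[at], variances[at] = posterior_mean, posterior_var
    return means, variances
```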

Because the goal is to maximise the sum of rewards, and the rewards are determined by the action we choose, the agent seeks to choose the action that pays the most on average. To do this, it follows the Thompson sampling policy that guides its decision $$ \pi_{\vb_t}(a) = {\rm Pr}_{\vtheta \sim \vb_t}[a = \arg\max_{\hat{a}} h(\vtheta, \hat{a})] = \int {\bf 1}\Big(a = \arg\max_{\hat a} \vtheta_{\hat{a}}\Big)\,p_{\vb_t}(\vtheta)\,{\rm d}\vtheta $$
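Sampling an action from this policy amounts to drawing one parameter per arm from the posterior and acting greedily on the draw. A sketch, assuming the same `(means, variances)` belief representation as above:

```python
import numpy as np

def choose_action(bel, rng):
    """Thompson sampling: draw one mean-reward sample per arm from the
    posterior and act greedily with respect to that sample."""
    means, variances = bel
    theta_hat = rng.normal(means, np.sqrt(variances))  # one draw per arm
    return int(np.argmax(theta_hat))
```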

In pseudocode, the process of learning and decision making looks as follows

def decision_and_update(bel):
    at = agent.choose_action(bel)
    rt = env.get_reward(at)
    bel = agent.update_bel(bel, rt, at)

    out = (rt, at)
    return bel, out

Non-stationary environments

As we saw in the previous note, the action that yields the highest average reward may change over time. Loosely speaking, if we assume that the reward process is changing, so that the best action changes over time, we say that the environment is non-stationary.

Non-stationarity can be described in different ways. For instance, we may assume that the environment

  1. gradually drifts over time at a constant rate (this is the assumption in the previous notes),
  2. resets abruptly, rendering past data useless for inference about the best action,
  3. switches between a finite set of regimes, or
  4. drifts gradually, but at different rates over time.

Mathematically, we encode our assumptions about how the environment is changing through the use of an auxiliary variable $\psi_t$. The meaning of $\psi_t$ depends on the type of non-stationarity:

| $\psi_t$ | domain | description |
| --- | --- | --- |
| constant | $\{c\}$ | constant drift or static environment |
| runlength | $\{0, 1, \ldots, t\}$ | time since last changepoint |
| changepoint position | $\{0, 1\}^t$ | indicators of where changepoints occurred |
| mixture of experts | $\{1, 2, \ldots, K\}$ | index of one of $K$ possible regimes |
| hazard rate | $(0,1)$ | probability of a changepoint at each step |

For now, we will assume that $\psi_t$ is given to us by the environment, i.e., at the start of each timestep the environment gives us $\psi_t$, and after we act it gives us the reward $r_t$. We relax this (very strong) assumption below.

Modifying the prior predictive

For now, assume we always choose the same action $a \in {\cal A}$, so that the sequence of rewards $r_{1:t}$ all belong to the same action. In this case, we are only interested in posterior inference of the mean reward for action $a$.

In this scenario, the posterior over the mean parameter $\vtheta_a = \theta$, given the set of rewards $r_{1:t-1}$, is encoded in the belief state $\vb_{t-1}$. Given the new observation $r_{t}$, the posterior over $\theta$ takes the form $$ p_{\vb_{t}}(\theta) \equiv p(\theta \mid r_{1:t}) \propto p_{\vb_{t-1}}(\theta)\,p(r_{t} \mid \theta). $$

Next, suppose that right after observing $r_{t-1}$, but before observing $r_t$, we are given $\psi_t$. We use $\psi_t$ to modify the belief $\vb_{t-1}$, which yields $\vb_{t|t-1}$.

Thus, just as a new observation $r_t$ triggers an update step that yields $\vb_t$, we say that a new value $\psi_t$ triggers a predict step. Algorithmically, we write these two steps as follows

$$ \begin{aligned} \vb_{t|t-1} &\gets {\rm predict}(\vb_{t-1},\,\psi_t),\\ \vb_{t} &\gets {\rm update}(\vb_{t|t-1}, r_t). \end{aligned} $$

The difference is that the predict step does not necessarily apply Bayes’ rule. Instead, it reshapes our beliefs about $\theta$, according to $\psi_t$, before observing the next datapoint.

In other words, $\psi_t$ acts as a switch: by conditioning on $\psi_t$, we make explicit what kind of non-stationarity we assume and how this form of non-stationarity affects our current beliefs, before seeing any new observations.

The diagram below illustrates this idea: $$ \begin{array}{c c c c c c c c c} \ldots & \xrightarrow{\rm update} & \vb_{t-1} & \xrightarrow{\rm predict} & \vb_{t|t-1} & \xrightarrow{\rm update} & \vb_t& \xrightarrow{\rm predict} & \ldots \\ & & \uparrow & & \uparrow & & \uparrow & & \\ & & r_{t-1} & & \psi_t & & r_t & & \\ \end{array} $$

In pseudocode, we write the Bayesian learning and sequential decision making with the modified predict step as

def decision_and_update(bel, psi):
    bel = agent.predict_bel(bel, psi)
    at = agent.choose_action(bel)
    rt = env.get_reward(at)
    bel = agent.update_bel(bel, rt, at)

    out = (rt, at)
    return bel, out

Examples: Gaussian beliefs

Here, we give some concrete examples of how $\psi_t$ affects $\vb_{t-1}$ to yield $\vb_{t|t-1}$. Note that there are two ad-hoc choices at play here:

  1. the choice of auxiliary variable $\psi_t$ and
  2. how the auxiliary variable affects the posterior at time $t-1$.

Example 1: Constant additive covariance inflation

Suppose $\psi_t \in \{c\}$, then $$ \vb_{t|t-1} \gets {\rm predict}(\vb_{t-1}, \psi_t) = (m_{t-1},\,s_{t-1}^2 + c). $$ This is the simple covariance inflation at every step that we saw in the previous notes.

def predict_bel(bel, psi):
    means, variances = bel
    variances = variances + psi
    bel = (means, variances)
    return bel

Example 2: Abrupt changes with forget

Alternatively, take $\psi_t \in \{0, \ldots, t\}$. Then $$ \vb_{t|t-1} \gets {\rm predict}(\vb_{t-1},\,\psi_t) = (m_{t|t-1}, \, s^2_{t|t-1}), $$ with $$ \begin{aligned} m_{t|t-1} &= m_{t-1}\,\vone(\psi_t > 0) + m_0\,\vone(\psi_t =0),\\ s_{t|t-1}^2 &= s^2_{t-1}\,\vone(\psi_t > 0) + s^2_0\,\vone(\psi_t =0). \end{aligned} $$

Intuitively, this choice of auxiliary variable and predict step means that at each step we have two choices. If $\psi_t = 0$, a changepoint occurred and past data is not useful for estimating the current value of $\theta$, so we reset to some pre-defined mean and variance $(m_0, s_0^2)$. If $\psi_t > 0$, there has not been a changepoint, so we simply keep the last belief we computed.

We write this predict step as:

def predict_bel(bel, psi):
    mean, variance = bel

    if psi == 0:
        # changepoint: reset to the prior
        mean, variance = mean_init, variance_init

    bel = (mean, variance)
    return bel

Example 3: Gradual changes with discount

Alternatively, take $\psi_t \in (0,1)$. Then, $$ \vb_{t|t-1} \gets {\rm predict}(\vb_{t-1},\,\psi_t) = (m_{t|t-1}, \, s^2_{t|t-1}), $$ with $$ \begin{aligned} m_{t|t-1} &= m_{t-1}\,(1 - \psi_t) + m_0\,\psi_t,\\ s_{t|t-1}^2 &= s_{t-1}^2\,(1 - \psi_t)^2 + s_0^2\,\psi_t^2. \end{aligned} $$

In other words, if we think of $\psi_t$ as the probability of a changepoint, this choice of auxiliary variable and predict step discounts the prior mean and variance by this probability and softly resets to some pre-specified mean and variance.

We code this predict step as follows

def predict_bel(bel, psi):
    mean, variance = bel
    mean = mean * (1 - psi) + mean_init * psi
    variance = variance * (1 - psi) ** 2 + variance_init * psi ** 2

    bel = (mean, variance)
    return bel

Example 4: Regime transition

This example is a little different. We assume we have $K$ possible regimes, so that $\psi_t \in \{1,\ldots,K\}$. We assume each regime has its own estimate of the mean and variance, i.e., $$ \begin{aligned} \vm_t &= \{m_{t,1},\,m_{t,2},\,\ldots,\,m_{t,K}\}\\ \vs^2_t &= \{s^2_{t,1},\,s^2_{t,2},\,\ldots,\,s^2_{t,K}\} \end{aligned} $$

Then, the predict step chooses one of the $K$ possible regimes, which it uses to update: $$ \vb_{t|t-1} \gets {\rm predict}(\vb_{t-1},\,\psi_t) = (m_{t-1,\psi_t}, \, s^2_{t-1,\psi_t}). $$

One way to think about this example is to believe that there are $K$ possible regimes (or modes, or experts), and at each timestep we are given the expert that we should use to make the next update.

We code this choice as follows:

def predict_bel(bel, psi):
    means, variances = bel
    bel = (means[psi], variances[psi])
    return bel

The Bayesian way of estimating the latent variable

In practice, the auxiliary variable $\psi_t$ is unknown and has to be inferred from the environment. For us, inferring from the environment means finding the probability of $\psi_t$ conditioned on the data $r_{1:t}$, i.e., we need to find the posterior of the latent variable conditioned on $r_{1:t}$. We know we can do this using Bayes’ rule: $$ p(\psi_t \mid r_{1:t}) \propto p(\psi_t \mid r_{1:t-1})\,p(r_t \mid \psi_t). $$ We’ve seen this equation before! It resembles the recursive posterior for the model parameters $\vtheta_t$, i.e., $p(\vtheta_t \mid r_{1:t}) \propto p(\vtheta_t \mid r_{1:t-1})\,p(r_t \mid \vtheta_t)$, where we saw that under a state-space assumption for the model parameters $\vtheta_t$, the posterior was found by marginalising over $\vtheta_{t-1}$ and making use of the Markovian assumption over the model parameters: $$ p(\vtheta_t \mid r_{1:t}) \propto p(r_t \mid \vtheta_t)\,\int p(\vtheta_t \mid \vtheta_{t-1})\,p(\vtheta_{t-1} \mid r_{1:t-1})\,{\rm d}\vtheta_{t-1}. $$

We can do something similar for the auxiliary variables: assume that the dynamics of the latent variable are Markovian, i.e., $p(\psi_t \mid \psi_{1:t-1}) = p(\psi_t \mid \psi_{t-1})$, and that $\psi_t$ is discrete. Then, the posterior over the auxiliary variable takes the form $$ \tag{2} \begin{aligned} p(\psi_t \mid r_{1:t}) &\propto p(\psi_t \mid r_{1:t-1})\,p(r_t \mid \psi_t)\\ &= \sum_{\psi_{t-1}} p(\psi_t, \psi_{t-1} \mid r_{1:t-1})\,p(r_t \mid \psi_t)\\ &= \sum_{\psi_{t-1}} p(\psi_{t-1} \mid r_{1:t-1})\,p(\psi_t \mid \psi_{t-1}, r_{1:t-1})\,p(r_t \mid \psi_t)\\ &= p(r_t \mid \psi_t)\,\sum_{\psi_{t-1}} p(\psi_{t-1} \mid r_{1:t-1})\,p(\psi_t \mid \psi_{t-1}, r_{1:t-1}). \end{aligned} $$ Because we are assuming a discrete auxiliary variable, the sum over $\psi_{t-1} \in \Psi_{t-1}$ will typically not give a closed-form expression for the posterior. However, by keeping track of the posteriors $$ \Big\{ p(\psi_{t-1} \mid r_{1:t-1}) \mid {\psi_{t-1} \in \Psi_{t-1}}\Big\}, $$ $(2)$ gives a way to construct the posterior over the auxiliary variables $p(\psi_t \mid r_{1:t})$ recursively.

Importantly, the exact computational cost and form of this algorithm will be given by

  1. the domain of the auxiliary variable $\Psi_t$ and
  2. the specific form of $p(\psi_t \mid \psi_{t-1}, r_{1:t-1})$.
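For a finite domain and a fixed transition matrix, recursion $(2)$ reduces to a matrix-vector product followed by normalisation. A sketch, where the hypothetical `trans[i, j]` stands for $p(\psi_t = j \mid \psi_{t-1} = i)$ and `lik[j]` for $p(r_t \mid \psi_t = j)$:

```python
import numpy as np

def update_aux_posterior(nu_prev, trans, lik):
    """One step of recursion (2) for a discrete auxiliary variable.

    nu_prev[i] = p(psi_{t-1} = i | r_{1:t-1})
    trans[i, j] = p(psi_t = j | psi_{t-1} = i)
    lik[j] = p(r_t | psi_t = j)
    """
    nu = lik * (nu_prev @ trans)  # marginalise psi_{t-1}, weight by likelihood
    return nu / nu.sum()          # normalise
```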

Example: detect and reset (infinite non-recurrent regimes)

Suppose $\psi_t \in \{0, \ldots, t\}$ and $p(\psi_t \mid \psi_{t-1}, r_{1:t-1}) = p(\psi_t \mid \psi_{t-1})$ with $$ p(\psi_t \mid \psi_{t-1}) = \begin{cases} \pi & \psi_t = 0,\\ (1 - \pi) & \psi_t = \psi_{t-1} + 1. \end{cases} $$ Then, $(2)$ simplifies to $$ p(\psi_t \mid r_{1:t}) \propto \begin{cases} p(r_t \mid \psi_t)\,(1 - \pi)\,p(\psi_t - 1 \mid r_{1:t-1}) & \psi_t > 0,\\ p(r_t \mid \psi_t)\,\pi\,\sum_{\psi_{t-1}=0}^{t-1} \,p(\psi_{t-1} \mid r_{1:t-1}) & \psi_t = 0. \end{cases} $$

We observe that finding the posterior $p(\psi_t \mid r_{1:t})$ requires keeping track of $t+1$ elements, one per candidate runlength.

This method corresponds to the Bayesian online changepoint detection (BOCD/BOCPD) algorithm first introduced by Adams and MacKay (2007).2

For a more detailed explanation and an implementation with a Beta-Bernoulli model, see this post.
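A minimal sketch of the runlength recursion above, assuming a precomputed vector `lik` with entries $p(r_t \mid \psi_t)$ for each candidate runlength and a fixed changepoint probability `pi_cp` (names illustrative):

```python
import numpy as np

def update_runlength_posterior(nu_prev, lik, pi_cp):
    """Grow the runlength posterior from time t-1 to time t.

    nu_prev[k] = p(psi_{t-1} = k | r_{1:t-1}), k = 0, ..., t-1
    lik[k] = p(r_t | psi_t = k), k = 0, ..., t
    pi_cp = changepoint probability pi.
    """
    nu = np.empty(len(nu_prev) + 1)
    nu[0] = pi_cp * nu_prev.sum()   # changepoint: runlength resets to 0
    nu[1:] = (1 - pi_cp) * nu_prev  # growth: runlength increments by one
    nu *= lik
    return nu / nu.sum()
```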

Predicting and updating

Put together, the choice of

  1. auxiliary variable $\psi_t$,
  2. the transition probability $p(\psi_t \mid \psi_{t-1})$, and
  3. the observation model (as shown in $({\rm 1}))$

induces the following state-space model:

[Figure: graphical model of the induced state-space model]

Here, $\vx_t$ are additional sources of information such as contexts or features.

Thus, put together, inference over this SSM requires keeping track of two collections of posteriors: the posterior for $\psi_t$ and the posterior for $\theta_t$ conditioned on $\psi_t$:

$$ \begin{aligned} p_{\vb_{t}^{\psi_t}}(\theta_t) &\equiv p(\theta_t \mid \psi_t,\,r_{1:t}) \propto p_{\vb_{t|t-1}^{\psi_t}}(\theta_t)\,p(r_t \mid \theta_t),\\ p_{\vnu_{t}}(\psi_t) &\equiv p(\psi_t \mid r_{1:t}) \propto p(\psi_t \mid r_{1:t-1})\,p(r_t \mid \psi_t). \end{aligned} $$

The predict-update diagram is as follows: $$ \begin{array}{cccccc} & & \Psi_t & & \boldsymbol\nu_{t} & & \Psi_{t+1} & & \boldsymbol\nu_{t+1} & \\[3pt] & & \downarrow & & \uparrow & & \downarrow & & \uparrow & \\[-2pt] \ldots & \xrightarrow{\rm predict} & \{\vb^{\psi}_{t|t-1}\}_{\psi} & \xrightarrow{\rm update} & \{\vb^{\psi}_{t}\}_{\psi} & \xrightarrow{\rm predict} & \{\vb^{\psi}_{t+1|t}\}_{\psi} & \xrightarrow{\rm update} & \{\vb^{\psi}_{t+1}\}_{\psi} & \xrightarrow{\rm predict} \ldots \\[3pt] & & & & \uparrow & & & & \uparrow & \\ & & & & r_t & & & & r_{t+1} & \end{array} $$

Thompson sampling with regimes

Under the state-space model formulation, sequential decision making using TS proceeds as follows. Given a dataset $\data_{1:t}$, with $\data_t = (r_t, \vx_t, a_t)$, the TS policy is $$ \pi_{\vb_t}(a) = \int {\bf 1}\Big(a = \arg\max_{\hat a} h(\vtheta_t, \hat{a})\Big)\,p_{\vb_t}(\vtheta_t)\,{\rm d}\vtheta_t, $$ with $ h(\vtheta_t, a) = \mathbb{E}[R \mid a,\,\vtheta_t]$ and $p_{\vb_t}(\vtheta_t) \equiv p(\vtheta_t \mid \data_{1:t})$.

To perform sequential learning and decision making with a state space model we now need to include an additional predict step which modifies the prior before sampling:

  1. For $\psi_t \in \Psi_t$:
    • $\hat{\theta}_{\psi_t} \sim p_{\vb_{\psi_t}}(\theta)$
  2. $\hat{\theta} = \sum_{\psi_t} \hat{\theta}_{\psi_t}\,p_{\boldsymbol\nu_t}(\psi_t)$
  3. $a_{t+1} = \argmax_{a \in {\cal A}} h(\hat{\theta}, a)$
  4. $r_{t+1} \sim p_{\rm env}(\cdot \mid a_{t+1})$
  5. $\data_{t+1} \gets (a_{t+1}, r_{t+1})$
  6. For $\psi_{t} \in \Psi_{t}$:
    • For $\psi_{t+1} \in \{\psi \in \Psi_{t+1} \mid p(\psi \mid \psi_{t}) > 0\}$:
      • $\vb_{\psi_{t+1}|t} \gets \texttt{predict}(\vb_{\psi_{t}},\, \psi_{t+1})$
      • $\vb_{\psi_{t+1}} \gets \texttt{update}\left(\vb_{\psi_{t+1}|t},\,\data_{t+1}\right)$
      • $\boldsymbol\nu_{\psi_{t+1}} \gets \texttt{update.aux}(\boldsymbol\nu_{\psi_t},\ \vb_{\psi_t},\ \psi_{t+1},\ \data_{t+1})$

  1. Duran-Martin, Gerardo, et al. “A unifying framework for generalised Bayesian online learning in non-stationary environments.” arXiv preprint arXiv:2411.10153 (2024). ↩︎

  2. Adams, Ryan Prescott, and David JC MacKay. “Bayesian online changepoint detection.” arXiv preprint arXiv:0710.3742 (2007). ↩︎