Bayes for sequential decision making (II): filtering as learning

Posted on Feb 3, 2026
tl;dr: Adapting to non-stationarity through a state-space model assumption on the model parameters.

tags: bayes, bandits, uncertainty, sequential-decision-making, non-stationary, regimes

Introduction

The algorithms we studied in the previous note assumed a stationary environment: the data-generating process (DGP) does not depend on time. Under this assumption, the learning algorithms gradually converge to a point estimate, which means that they eventually stop exploring, stop learning, or both.

In many real-world scenarios, we do not want our algorithms to converge. The underlying DGP can change over time: user preferences evolve, markets undergo structural changes, systems degrade, etc. If we ignore this, we end up with an agent whose beliefs converge to a point estimate even as the world moves on. The result is an algorithm that makes poor decisions based on stale data and fails to adapt.

In this second part, we explore Bayesian approaches for learning and adaptation in non-stationary environments. Specifically, we see how to extend the recursive Bayesian updates to account for model drift and structural breaks, using forgetting mechanisms that allow the agent to remain adaptive.

We start with a small recap of Bayesian sequential learning and decision making through Thompson sampling (TS).

A recap: Bayesian learning and sequential decision making through Thompson sampling

Given a dataset $\data_{1:t}$, with $\data_t = (r_t, \vx_t, a_t)$, the policy defined by contextual TS is $$ \pi_{\vb_t}(a \mid \vx) = \int {\bf 1}\Big(a = \arg\max_{\hat a} h(\vtheta, \hat{a}, \vx)\Big)\,p_{\vb_t}(\vtheta)\,{\rm d}\vtheta, $$ with $ h(\vtheta, a, \vx) = \mathbb{E}[R \mid a,\,\vx,\,\vtheta]$ and $p_{\vb_t}(\vtheta) \equiv p(\vtheta \mid \data_{1:t})$.

Sequential learning and decision making amounts to performing the following steps (a code sketch follows the list):

  1. $\hat{\vtheta} \sim p_{\vb_t}(\vtheta)$ // sample from the posterior
  2. $a_{t+1} = \argmax_{a \in {\cal A}} h(\hat{\vtheta}, a, \vx_{t+1})$ // choose the arm with the largest expected reward under the sampled parameters
  3. $r_{t+1} \sim p_{\rm env}(\cdot \mid a_{t+1}, \vx_{t+1})$ // observe the reward from the environment
  4. $\data_{t+1} \gets (r_{t+1}, \vx_{t+1}, a_{t+1})$ // build the datapoint
  5. $\vb_{t+1} \gets \texttt{update}(\vb_t,\,\data_{t+1})$ // algorithmic Bayes update $p(\vtheta \mid \data_{1:t+1}) \propto p(\vtheta \mid \data_{1:t})\,p(\data_{t+1} \mid \vtheta)$
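A minimal Python sketch of one such iteration, where sample_posterior, expected_reward, pull_arm, and update are hypothetical stand-ins for the model-specific pieces (they are not defined in these notes):

def thompson_step(key, bel, x, actions,
                  sample_posterior, expected_reward, pull_arm, update):
    # 1. sample parameters from the current posterior p_{b_t}(theta)
    theta_hat = sample_posterior(key, bel)
    # 2. act greedily with respect to the sampled parameters
    action = max(actions, key=lambda a: expected_reward(theta_hat, a, x))
    # 3. observe the reward from the environment
    reward = pull_arm(action, x)
    # 4.-5. build the datapoint and update the belief with Bayes' rule
    bel = update(bel, (reward, x, action))
    return bel, (action, reward)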

Trading robots example with regime change

Suppose we want to invest our money in one of $A$ trading robots (or trading algorithms). Each robot has an average (unknown) daily return $\theta_{a,t}$ for $a=1,\ldots, A$ and $t \geq 1$.

Assumptions.

  • We choose a single robot at the beginning of each day $t$.
  • We only observe the performance of the chosen robot at the end of the day.
  • Choosing robot $i$ does not affect the future performance of any other robot $j$ (stateless assumption).
  • Robot $a$ has an unknown average return at the end of day $t$, denoted $\theta_{a,t} \in \reals$.

At the beginning of each day $t$, the mean return of each robot changes with probability $p_{\epsilon}$, in which case it is redrawn uniformly between $r_{\min}$ and $r_{\max}$. That is, for each robot $a$:

$$ \theta_{a,t} = \begin{cases} \theta_{a,t-1} & \text{w.p. } (1 - p_\epsilon),\\ {\cal U}[r_{\min}, r_{\max}] & \text{w.p. } p_\epsilon. \end{cases} $$

For the examples in these notes, we take the following values (a simulation sketch follows the list):

  • $A = 4$,
  • $p_\epsilon = 1/500$,
  • $r_{\min} = -0.1$,
  • $r_{\max} = 0.1$,
  • $\sigma^2 = 0.1^2$ (the observation-noise variance of the rewards, defined in the model below).
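To make the dynamics concrete, here is a minimal JAX sketch of this DGP for the mean returns only (observed rewards would add ${\cal N}(0,\,\sigma^2)$ noise on top); the variable names are illustrative.

import jax
import jax.numpy as jnp

n_arms, p_eps = 4, 1 / 500
r_min, r_max = -0.1, 0.1

def step_means(theta, key):
    # with probability p_eps, redraw an arm's mean return uniformly in [r_min, r_max]
    key_flip, key_new = jax.random.split(key)
    flip = jax.random.bernoulli(key_flip, p_eps, (n_arms,))
    theta_new = jax.random.uniform(key_new, (n_arms,), minval=r_min, maxval=r_max)
    theta = jnp.where(flip, theta_new, theta)
    return theta, theta  # (carry, output) for jax.lax.scan

key_init, key_steps = jax.random.split(jax.random.PRNGKey(0))
theta0 = jax.random.uniform(key_init, (n_arms,), minval=r_min, maxval=r_max)
_, theta_path = jax.lax.scan(step_means, theta0, jax.random.split(key_steps, 2000))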

The figure below shows the expected return of each robot over time, i.e., the average reward we would obtain if we were to pull each arm. We highlight in gray the arm that pays the most at every timestep.

[Figure: trading-robots mean returns over time.]

Changes in the data-generating process

Now, we assume that the environment changes through time. In the pseudocode below, we store the state of the environment in the variable cfg; the agent has no access to it, but it is helpful for the oracle (us) to evolve the data-generating process.

def decision_and_update(state, xt):
    bel, cfg = state

    cfg = env.update_cfg(cfg)                # (possibly) change the DGP
    at = agent.choose_action(bel, xt)        # Thompson-sampling action choice
    rt = env.get_reward(at, xt, cfg)         # reward under the current environment state
    bel = agent.update_bel(bel, at, xt, rt)  # Bayes update of the belief

    state = (bel, cfg)
    out = (rt, at)
    return state, out

Example: Thompson sampling with static Bayesian updates

Take the probabilistic model $$ \begin{aligned} \vtheta &= (\theta_1, \ldots, \theta_A),\\ p(\theta_a) &= {\cal N}(\theta_a \mid m_{a,0},\,s_{a,0}^2),\\ p(r \mid a, \vtheta) &= {\cal N}(r \mid \theta_{a},\,\sigma^2). \end{aligned} $$ Here, $h(\vtheta, a) = \theta_a$. We run TS to choose, at each step, the action that maximises the expected payoff under the sampled parameters.
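A minimal sketch of this static agent in JAX, assuming the belief is stored as per-arm arrays of posterior means and variances (the names are illustrative):

import jax
import jax.numpy as jnp

sigma2 = 0.1 ** 2  # observation-noise variance

def choose_action_static(key, means, variances):
    # Thompson sampling: draw one value per arm from the posterior and act greedily
    theta_hat = means + jnp.sqrt(variances) * jax.random.normal(key, means.shape)
    return theta_hat.argmax()

def update_static(means, variances, action, reward):
    # conjugate Gaussian update for the pulled arm only (no notion of time)
    m, v = means[action], variances[action]
    k = v / (v + sigma2)                 # gain shrinks to zero as v -> 0
    m_new = m + k * (reward - m)
    v_new = (1 - k) * v
    return means.at[action].set(m_new), variances.at[action].set(v_new)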

First, we show the rolling sum of the number of times each arm was pulled.

We observe that the two arms that get pulled the most are 0 and 2. In particular, for the first 625 steps, the agent prefers arm 0. Then, between steps 625 and 1300, the agent prefers arm 2. Afterwards, the agent prefers arm 0 again.

The two arms that the agent favours (0 and 2) more or less match the true arms with the maximum underlying payoff. However, the agent does not pick up on the change when the robot with the maximum payoff switches to 3.

To understand this, we plot the estimated value of each arm as a function of the timestep.

We observe that the arms with the maximum expected return are arms 0 and 2. As time progresses, the uncertainty slowly concentrates for arms 0 and 2, the arms with the highest payoff. Arms 1 and 3 do not get pulled as much, and their uncertainty remains relatively high throughout the experiment.

Finally, we show the profit and loss (PnL) as a function of time, along with an oracle agent that pulls the correct arm at each timestep.

We observe that the discrepancy between the oracle agent and our TS-static agent grows over time, indicating that the TS-static agent is not able to adapt to the changing DGP of rewards.

Part of the reason why TS-static is not able to get close to the oracle is that it has no notion of non-stationarity. In practice, this means that the uncertainty about the value of each arm decreases as a function of the number of pulls only, regardless of how well or badly the agent performs.

The following figure shows the posterior standard deviation $s_{a,t}$ (the uncertainty estimate) for each arm as a function of time, together with the arm pulled at each step. It shows that the uncertainty decreases only with the number of times an arm has been pulled.

State-space models for adaptation

A classical way to induce adaptation in non-stationary environments is to assume that the model parameters gradually change over time. In this case, we stop assuming that there exists a single random vector $\vtheta$ that governs all rewards, and instead assume that each reward $y_t$ has its own set of parameters $\vtheta_t$, which evolve according to a discrete-time Markov chain, i.e., $p(\vtheta_t \mid \vtheta_{t-1},\,\ldots, \,\vtheta_0) = p(\vtheta_t \mid \vtheta_{t-1})$.

We write the following graphical model (GM) that illustrates this relationship.

[Figure: graphical model of a classic state-space model.]

The GM above implies the following factorisation $$ p(\theta_{0:T},\,y_{1:T}) = p(\theta_0)\,\prod_{t=1}^T p(\theta_t \mid \theta_{t-1})\,p(y_t \mid \theta_t). $$

Similar to the previous notes, we are interested in estimating the posterior density over the average reward $\theta_t$ conditioned on all seen data: $$ p(\theta_t \mid r_{1:t}) \propto p(\theta_t \mid r_{1:t-1})\,p(r_t \mid \theta_t). $$ However, the form of this posterior is different from the one we saw before. First, the posterior is now local in time. Second, the posterior obtained at $t-1$ is $p(\theta_{t-1} \mid r_{1:t-1})$, so we need to predict the density for $\theta_t$ given the information up to $t-1$, i.e., we need to go from $p(\theta_{t-1} \mid r_{1:t-1})$ to $p(\theta_{t} \mid r_{1:t-1})$.

Mathematically, this prediction is done through marginalisation of $\theta_{t-1}$: $$ \begin{aligned} p(\theta_t \mid r_{1:t-1}) &= \int p(\theta_t,\,\theta_{t-1} \mid r_{1:t-1}) \d\theta_{t-1}\\ &= \int p(\theta_{t-1} \mid r_{1:t-1})\,p(\theta_t \mid \theta_{t-1}) \d\theta_{t-1}. \end{aligned} $$ The resulting density is called the prior predictive.

The result of these steps is an algorithm that first predicts the state one step into the future and then corrects (or updates) the prediction with the new information. We illustrate this predict/update loop in the diagram below: $$ \begin{array}{c c c c c c c c c} \ldots & \xrightarrow{\rm update} & p(\theta_{t-1} \mid r_{1:t-1}) & \xrightarrow{\rm predict} & p(\theta_t \mid r_{1:t-1}) & \xrightarrow{\rm update} & p(\theta_t \mid r_{1:t}) & \xrightarrow{\rm predict} & \ldots \\ & & \uparrow & & & & \uparrow & & \\ & & r_{t-1} & & & & r_t & & \\ \end{array} $$

The following table summarises the prior/posterior for the static and the SSM-dynamic case.

Model        | Posterior ($t-1$)                | Prior ($t$)
Static       | $p(\theta \mid r_{1:t-1})$       | $p(\theta \mid r_{1:t-1})$
SSM-dynamic  | $p(\theta_{t-1} \mid r_{1:t-1})$ | $p(\theta_t \mid r_{1:t-1})$

A note on notation

With the predict-update loop, we now write these two steps in belief-propagation form as $$ p_{\vb_{t}}(\theta_t) \propto p_{\vb_{t|t-1}}(\theta_t)\,p(r_t \mid \theta_t). $$ Here, $p_{\vb_{t}}(\theta_t) \equiv p(\theta_t \mid r_{1:t})$ is the posterior at time $t$ and $p_{\vb_{t|t-1}}(\theta_t) \equiv p(\theta_t \mid r_{1:t-1})$ is the prior predictive at time $t$.

The form above shows that there are two computations needed to obtain the posterior at time $t$:

  1. modify the previous posterior to obtain the prior predictive, i.e., $\vb_{t-1} \to \vb_{t|t-1}$,
  2. update the prior predictive to obtain the posterior, i.e., $\vb_{t|t-1} \to \vb_t$.

We illustrate this predict/update loop for the beliefs in the diagram below: $$ \begin{array}{c c c c c c c c c} \ldots & \xrightarrow{\rm update} & \vb_{t-1} & \xrightarrow{\rm predict} & \vb_{t|t-1} & \xrightarrow{\rm update} & \vb_t & \xrightarrow{\rm predict} & \ldots \\ & & \uparrow & & & & \uparrow & & \\ & & r_{t-1} & & & & r_t & & \\ \end{array} $$

Example: linear state-space models and the Kalman filter

A special (and very famous) case of an SSM is the Gaussian linear state-space model: $$ \begin{aligned} p(\vtheta_t \mid \vtheta_{t-1}) &= {\cal N}(\vtheta_t \mid \vF_t\,\vtheta_{t-1},\,\vQ_t),\\ p(\vy_t \mid \vtheta_{t}) &= {\cal N}(\vy_t \mid \vH_t\,\vtheta_{t},\,\vR_t).\\ \end{aligned} $$ In this case, the prior predictive is Gaussian $$ p(\vtheta_t \mid \vy_{1:t-1}) = {\cal N}(\vtheta_t \mid \vmu_{t|t-1},\,\vSigma_{t|t-1}), $$ with $$ \boxed{ \begin{aligned} \vmu_{t|t-1} &= \vF_t\,\vmu_{t-1},\\ \vSigma_{t|t-1} &= \vF_t\,\vSigma_{t-1}\,\vF_t^\intercal + \vQ_t. \end{aligned} } $$ Next, the posterior takes the form $$ p(\vtheta_t \mid \vy_{1:t}) = {\cal N}(\vtheta_t \mid \vmu_t, \vSigma_t) $$ with $$ \boxed{ \begin{aligned} \vS_t &= \vH_t\,\vSigma_{t|t-1}\,\vH_t^\intercal + \vR_t,\\ \vK_t &= \vSigma_{t|t-1}\,\vH_t^\intercal\,\vS_t^{-1},\\ \vSigma_t &= \vSigma_{t|t-1} - \vK_t\,\vS_t\,\vK_t^\intercal,\\ \vmu_t &= \vmu_{t|t-1} + \vK_t\,(\vy_t - \vH_t\,\vmu_{t|t-1}). \end{aligned} } $$

These recursive predict and update steps are known as the Kalman filter (KF) equations. 1

For detailed discussion of linear state-space models, see this post; for a derivation of the KF see this post; for a detailed discussion of the KF, see this post.
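A minimal JAX sketch of these two steps (the names are illustrative; a production filter would typically use the Joseph form or Cholesky factors for numerical stability):

import jax.numpy as jnp

def kf_predict(mu, Sigma, F, Q):
    # prior predictive: push the previous posterior through the linear dynamics
    mu_pred = F @ mu
    Sigma_pred = F @ Sigma @ F.T + Q
    return mu_pred, Sigma_pred

def kf_update(mu_pred, Sigma_pred, y, H, R):
    # condition the prediction on the new observation y
    S = H @ Sigma_pred @ H.T + R                  # innovation covariance S_t
    K = jnp.linalg.solve(S, H @ Sigma_pred).T     # Kalman gain K_t = Sigma_pred H^T S^{-1}
    mu = mu_pred + K @ (y - H @ mu_pred)
    Sigma = Sigma_pred - K @ S @ K.T
    return mu, Sigma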

Example: recursive Gaussian estimation

Suppose that the model parameters evolve over time according to a random walk, i.e., $$ \begin{aligned} p(\theta_0) &= {\cal N}(\theta_0 \mid m_0,\,s_0^2),\\ p(\theta_t \mid \theta_{t-1}) &= {\cal N}(\theta_t \mid \theta_{t-1},\,q^2),\\ p(r_{t} \mid \theta_t ) &= {\cal N}(r_{t} \mid \theta_t,\,\sigma^2).\\ \end{aligned} $$

Here, the prior predictive is $$ \begin{aligned} p(\theta_t \mid r_{1:t-1}) &= \int p(\theta_{t-1} \mid r_{1:t-1})\,p(\theta_t \mid \theta_{t-1}) \d\theta_{t-1}\\ &= \int {\cal N}(\theta_{t-1} \mid m_{t-1},\,s_{t-1}^2)\,{\cal N}(\theta_t \mid \theta_{t-1},\, q^2) \d\theta_{t-1}\\ &= {\cal N}(\theta_{t} \mid m_{t-1},\,s_{t-1}^2 + q^2)\\ &\equiv p_{\vb_{t|t-1}}(\theta_t), \end{aligned} $$ with $\vb_{t|t-1} = (m_{t|t-1}, s_{t|t-1}^2) = (m_{t-1}, s_{t-1}^2 + q^2)$.

With this in mind, we can now estimate the posterior density which is again Gaussian: $$ p_{\vb_{t}}(\theta_t) \propto p_{\vb_{t|t-1}}(\theta_t)\,p(r_t \mid \theta_t) \propto {\cal N}(\theta_t \mid m_t,\,s_t^2). $$

The resulting posterior mean and variance take the form $$ \boxed{ \begin{aligned} m_{t|t-1} &= m_{t-1},\\ s_{t|t-1}^2 &= s_{t-1}^2 + q^2,\\ k_t &= \frac{s_{t|t-1}^2}{s_{t|t-1}^2 + \sigma^2},\\ m_{t} &= k_t\,r_t + (1 - k_t)\,m_{t-1},\\ s_{t}^2 &= (1 - k_t)\,s_{t|t-1}^2. \end{aligned} } $$

Algorithmically, this entails carrying the beliefs $\vb_{t}$ and $\vb_{t|t-1}$ that characterise the posterior and the prior predictive at each step:

$$ \ldots \xrightarrow{\rm update} (m_{t-1},\,s_{t-1}^2) \xrightarrow{\rm predict} (m_{t|t-1},\,s_{t|t-1}^2) \xrightarrow{\rm update} (m_{t},\,s_{t}^2) \xrightarrow{\rm predict} \ldots $$
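In code, these boxed recursions amount to two small functions; a sketch in plain Python (the names are illustrative):

def predict(m, s2, q2):
    # b_{t-1} -> b_{t|t-1}: the random-walk dynamics only inflate the variance
    return m, s2 + q2

def update(m_pred, s2_pred, r, sigma2):
    # b_{t|t-1} -> b_t: condition on the new reward r_t
    k = s2_pred / (s2_pred + sigma2)
    return m_pred + k * (r - m_pred), (1 - k) * s2_pred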

Thompson sampling within state-space models

Under the state-space model formulation, sequential decision making using TS proceeds as follows. Given a dataset $\data_{1:t}$, with $\data_t = (r_t, \vx_t, a_t)$, the TS policy is $$ \pi_{\vb_t}(a) = \int {\bf 1}\Big(a = \arg\max_{\hat a} h(\vtheta_t, \hat{a})\Big)\,p_{\vb_t}(\vtheta_t)\,{\rm d}\vtheta_t, $$ with $ h(\vtheta_t, a) = \mathbb{E}[R \mid a,\,\vtheta_t]$ and $p_{\vb_t}(\vtheta_t) \equiv p(\vtheta_t \mid \data_{1:t})$.

To perform sequential learning and decision making with a state-space model, we now need an additional predict step that modifies the prior before sampling:

  1. $\vb_{t+1|t} \gets \texttt{predict}(\vb_{t})$ // prior predictive step, $p(\vtheta_{t+1} \mid \data_{1:t})$
  2. $\hat{\vtheta} \sim p_{\vb_{t+1|t}}(\vtheta_{t+1})$ // sample from the prior predictive
  3. $a_{t+1} = \argmax_{a \in {\cal A}} h(\hat{\vtheta}, a)$ // make a decision
  4. $r_{t+1} \sim p_{\rm env}(\cdot \mid a_{t+1})$ // obtain the reward
  5. $\data_{t+1} \gets (a_{t+1}, r_{t+1})$ // build the datapoint
  6. $\vb_{t+1} \gets \texttt{update}(\vb_{t+1|t},\,\data_{t+1})$ // algorithmic Bayes update, $p(\vtheta_{t+1} \mid \data_{1:t+1}) \propto p_{\vb_{t+1|t}}(\vtheta_{t+1})\,p(\data_{t+1} \mid \vtheta_{t+1})$

In pseudocode, we write these steps as

def decision_and_update(bel, xt):
    bel = agent.predict_bel(bel)             # new: belief-propagation (predict) step
    at = agent.choose_action(bel, xt)
    rt = env.get_reward(at, xt)
    bel = agent.update_bel(bel, at, xt, rt)

    out = (rt, at)
    return bel, out

which includes the new line for belief propagation.

Example: Thompson sampling with state-space models under regime changes

Here, we put together the ideas presented in the previous section and test the TS within SSM algorithm in the trading robots example.

Pseudocode: Gaussian Thompson sampling

import jax
import jax.numpy as jnp

var_obs = 0.1 ** 2   # observation-noise variance sigma^2

def choose_action(key, bel):
    # Thompson sampling: draw one value per arm from the current belief
    means, variances = bel
    sample = jax.random.normal(key, means.shape)
    est_rewards = sample * jnp.sqrt(variances) + means
    return est_rewards.argmax()

def predict_bel(bel, q):
    # predict step: the random-walk dynamics inflate every arm's variance by q (q^2 in the text)
    means, variances = bel
    return means, variances + q

def update_bel(reward, action, bel):
    # update step: Kalman-style correction for the pulled arm only
    means, variances = bel
    mean, var = means[action], variances[action]

    err = reward - mean
    kt = var / (var + var_obs)        # gain k_t
    mean_new = mean + kt * err
    var_new = kt * var_obs            # equals (1 - kt) * var

    means = means.at[action].set(mean_new)        # jax arrays are immutable
    variances = variances.at[action].set(var_new)
    return means, variances
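
To make the full loop explicit, a minimal driver could look like the sketch below. Here env_step is a hypothetical stand-in for the regime-switching environment (returning a reward and its own updated state), and q is the process-noise variance used by the agent, a tuning knob whose value these notes do not fix.

def run_ts_ssm(key, n_steps, q, env_step, env_state):
    bel = (jnp.zeros(4), jnp.ones(4))   # broad initial belief over the four arms
    rewards = []
    for _ in range(n_steps):
        key, key_act, key_env = jax.random.split(key, 3)
        bel = predict_bel(bel, q)                     # belief-propagation (predict) step
        action = int(choose_action(key_act, bel))
        reward, env_state = env_step(key_env, action, env_state)
        bel = update_bel(reward, action, bel)
        rewards.append(reward)
    return bel, rewards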

Results

First, we compare the posterior mean and variance for each arm as a function of time.

Similarly to the static case, arm 3 does not get pulled much at the very beginning, so its variance remains relatively high for a little more than 1000 steps. Then, at around 1100 steps, the estimate for arm 0 starts to decay, which in turn makes the agent explore all four arms. Eventually, it finds that the new best arm is arm 3.

The number of times each arm gets pulled as a function of time is shown in the figure below.

Finally, we compare the performance of the TS-bandit with static Bayesian updates and adaptive Bayesian updates.

We observe that at the beginning of the experiment both TS-adaptive and TS-static perform on par, but as more regime changes occur, the performance of TS-static degrades. TS-adaptive, on the other hand, manages to adapt to regime changes: we observe a short period where performance degrades after a change, mostly followed by a period where the best arm is pulled again.

On adaptivity*

Here, we discuss some results on the posterior variance and the learning rate as $t \to \infty$.

Limiting posterior variance

Recall that the posterior variance evolves according to the update equations $$ \boxed{ \begin{aligned} s^2_{t|t-1} &= s^2_{t-1} + q^2,\\ k_t &= \frac{s^2_{t|t-1}}{s^2_{t|t-1} + \sigma^2},\\ s^2_t &= (1 - k_t)\,s^2_{t|t-1}.\\ \end{aligned} } $$ Notice that this variance does not depend on the actual observations $r_t$. At each timestep, $s_t^2$ is fully determined by: the process noise $q^2$ (how much uncertainty the dynamics inject, i.e., how much the parameters are moving), the observation variance $\sigma^2$, and the previous posterior variance $s_{t-1}^2$.

We can rewrite the recursion more compactly. Expressing $s_t^2$ in terms of $s_{t-1}^2$ gives $$ \begin{aligned} s^2_t &= \left(1 - \frac{s^2_{t|t-1}}{s^2_{t|t-1} + \sigma^2}\right) s^2_{t|t-1}\\ &= \left(1 - \frac{s^2_{t-1} + q^2}{s^2_{t-1} + q^2 + \sigma^2}\right) (s^2_{t-1} + q^2)\\ &= \frac{\sigma^2\,(s^2_{t-1} + q^2)}{s^2_{t-1} + q^2 + \sigma^2}. \end{aligned} $$ This is a nonlinear recurrence: each posterior variance depends only on the previous one.

To analyse its long-term behaviour, we recall the fixed-point method: if $x_{t+1} = f(x_{t})$ with $f$ continuous and the sequence converges, then $$ x_\infty \triangleq \lim_{t\to\infty} x_{t} = \lim_{t\to\infty} f(x_{t-1}) = f\left(\lim_{t\to\infty} x_{t-1}\right) = f(x_\infty). $$

Applying this to our recurrence yields $$ s_\infty^2 \triangleq \lim_{t \to \infty} s_t^2 = \lim_{t \to \infty} \frac{\sigma^2\,(s_{t-1}^2 + q^2)}{s_{t-1}^2 + q^2 + \sigma^2} = \frac{\sigma^2\,(s_{\infty}^2 + q^2)}{s_{\infty}^2 + q^2 + \sigma^2}. $$ This condition reduces to $$ \left(s_{\infty}^2\right)^2 + q^2\,s_{\infty}^2 - \sigma^2\,q^2 = 0. $$ Solving for $s_\infty^2$ yields the limiting posterior variance $$ s_\infty^2 = \frac{-q^2 + \sqrt{q^4 + 4\,\sigma^2\,q^2}}{2}. $$

If $q = 0$ (no process noise), then $s_\infty^2 = 0$. This means that in a stationary environment, every new observation reduces uncertainty, and with infinitely many datapoints the variance collapses to zero. In the limit, there is nothing new to learn, so we stop learning.

Conversely, if $q > 0$, then $s_\infty^2 > 0$. This means that if the environment is changing, each new observation remains informative for maintaining accurate estimates of the posterior mean. Thus, we never reduce the variance all the way to zero.

Example: posterior variance for the non-stationary robots
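
As a small numerical check (a sketch: $\sigma^2 = 0.1^2$ is taken from the example, while the process-noise value $q^2 = 10^{-4}$ is a hypothetical choice, since the notes do not fix the agent's process noise), we can iterate the variance recursion and compare it against the closed-form limit:

sigma2 = 0.1 ** 2    # observation-noise variance from the example
q2 = 1e-4            # hypothetical process-noise variance

# iterate s_t^2 = sigma^2 (s_{t-1}^2 + q^2) / (s_{t-1}^2 + q^2 + sigma^2)
s2 = 1.0             # arbitrary broad starting variance
for _ in range(1000):
    s2 = sigma2 * (s2 + q2) / (s2 + q2 + sigma2)

# closed-form fixed point derived above
s2_inf = (-q2 + (q2 ** 2 + 4 * sigma2 * q2) ** 0.5) / 2
assert abs(s2 - s2_inf) < 1e-9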

Limiting learning rate

As a consequence of the limiting posterior variance, we can compute the limiting learning rate, i.e., the Kalman gain we obtain as $t \to \infty$. Like the limiting posterior variance, the limiting learning rate, which we denote $k_\infty$, does not depend on the data and takes the form $$ s_\infty^2 = \frac{\sigma^2\,(s_{\infty}^2 + q^2)}{s_{\infty}^2 + q^2 + \sigma^2} \iff \frac{s_\infty^2}{\sigma^2} = \frac{s_{\infty}^2 + q^2}{s_{\infty}^2 + q^2 + \sigma^2} \triangleq k_\infty. $$ Again, $q = 0 \implies s_\infty^2 = 0 \implies k_\infty = 0$: as we obtain more and more data, we update our beliefs less and less. This explains why TS-static is not able to adapt to changes in the DGP after a number of steps.
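
Continuing the numerical sketch above (same hypothetical $q^2$), the limiting gain can be computed either way:

k_inf = s2_inf / sigma2                                # limiting gain k_infty = s_infty^2 / sigma^2
k_inf_direct = (s2_inf + q2) / (s2_inf + q2 + sigma2)  # definition of k_infty
assert abs(k_inf - k_inf_direct) < 1e-9                # both agree; strictly positive since q2 > 0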


  1. Kalman, Rudolph Emil. “A new approach to linear filtering and prediction problems.” Journal of Basic Engineering 82.1 (1960): 35-45. ↩︎