Bayes for sequential decision making (I): preliminaries
Introduction
In this note, we provide an introduction to the problem of sequential learning and decision making from a Bayesian perspective.
With this perspective, we tackle two problems under the same framework: on the one hand, balancing exploration and exploitation, and on the other, learning a parametric model of rewards. As we will see, the first problem can be tackled with the Thompson sampling algorithm and the second problem can be tackled through the Bayesian formulation of recursive learning.
Problem formulation
Assume we have an agent in an environment. The agent interacts with the environment by taking actions $a \in {\cal A}$. As a consequence of taking an action, the agent receives a reward $r \in {\cal R} \subseteq \reals$ from the environment. Through continual interaction, the agent and the environment produce a sequence of actions and rewards: $$ a_1,\,r_1,\,a_2,\,r_2,\,\ldots,\,a_t,\,r_{t},\,\ldots,\,a_{T},\,r_T. $$
The goal of the agent is to find the sequence of actions that maximise the expected sum of rewards. If the rewards depend only on the actions made at each timestep, then the agent’s objective is $$ \max_{a_1, \ldots, a_T} \E\left[\sum_{t=1}^T R_t \mid A_1=a_1, \ldots, A_T=a_T\right] = \sum_{t=1}^T \max_{a_t} \E[R_t \mid A_t=a_t]. $$
If we knew $\E[R_t \mid A_t=a]$ for all $a$, maximising the expected reward would be straightforward: at each step, take the action that maximises the expected reward. In practice, however, we do not know which action yields the highest expected reward. Thus, we need to balance two things:
- exploration of actions to build estimates of the expected rewards,
- exploitation of actions known to be good to maximise the sum of rewards.
This exploration-exploitation tradeoff is at the heart of many problems in sequential decision making.
Beliefs and policies
To balance the exploration-exploitation tradeoff, the agent maintains a stochastic policy $\pi_{\vb_t} \in \Delta({\cal A})$, where $\Delta(\vX)$ is the set of probability distributions over the set $\vX$ and $\vb_t$ are the beliefs of the agent. The policy $\pi_\vb$, together with the beliefs $\vb$, guides the agent on which decision to take at each step.
Thus, the agent interacts with the environment and updates its beliefs as follows: start with beliefs $\vb_1$, so that the initial policy is $\pi_{\vb_1}(a)$. At each round (timestep) $t=1,\ldots, T$:
- $a_{t+1} \sim \pi_{\vb_t}(a)$ // sample an action from the policy given current beliefs
- $r_{t+1} \sim p_{\rm env}(r \mid a_{t+1})$ // obtain reward from the environment
- $\vb_{t+1} \gets \texttt{update}(\vb_t,\, a_{t+1},\,r_{t+1})$ // update beliefs
In pseudocode, we write these steps as
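Below is a minimal Python sketch of one such round; the names `policy`, `environment`, and `update` are placeholders for the components described above, not a fixed API.

```python
import numpy as np

def decision_and_update(rng, beliefs, policy, environment, update):
    """One interaction round: act, observe a reward, update beliefs."""
    probs = policy(beliefs)                    # action probabilities given current beliefs
    action = rng.choice(len(probs), p=probs)   # a_{t+1} ~ pi_{b_t}
    reward = environment(rng, action)          # r_{t+1} ~ p_env(r | a_{t+1})
    beliefs = update(beliefs, action, reward)  # b_{t+1} = update(b_t, a_{t+1}, r_{t+1})
    return beliefs, action, reward
```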
Then, to run an experiment for $T$ steps, we simply iterate the `decision_and_update` function defined above.
In pseudocode, this takes the form
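One possible rendering of the experiment loop, assuming the `decision_and_update` sketch above:

```python
def run_experiment(rng, beliefs, policy, environment, update, n_steps):
    """Iterate decision_and_update for n_steps and log the trajectory."""
    actions, rewards = [], []
    for _ in range(n_steps):
        beliefs, action, reward = decision_and_update(
            rng, beliefs, policy, environment, update
        )
        actions.append(action)
        rewards.append(reward)
    return beliefs, actions, rewards
```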
A running toy example: Trading robots
Suppose we want to invest our money in one of $A$ trading robots (or trading algorithms).
Assumptions.
- We choose a single robot at the beginning of each day.
- We observe the return of the chosen robot at the end of the day.
- Choosing robot $i$ does not affect the future performance of any other robot $j$ (stateless assumption).
- Each robot $a$ has an unknown average return $\E[R_t \mid A_t=a] =\theta_a \in \reals$.
Goal.
- Choose the trading robot that maximises the expected cumulative return over $T$ steps.
This sequential decision making problem comprises two things:
- Find out which trading robot has the highest payoff (exploration).
- Trade with the robot that we believe yields the highest expected payoff (exploitation).
In the sections that follow, we introduce different ways to
1. construct the policy $\pi_\vb$,
2. define the beliefs $\vb$, and
3. update the beliefs given new information.
Each choice of 1-3 yields a different algorithm, which we compare on the trading robots problem.
$\epsilon$-greedy policy with frequentist beliefs
The first kind of algorithm we’ll study is $\epsilon$-greedy. Simply put, this algorithm defines a policy that, with probability $1 - \epsilon$, chooses the action that it believes pays the most, and, with probability $\epsilon$, chooses a random action. The beliefs $\vb_t$ are defined as the (frequentist) estimate of the expected value for each action.
Mathematically, we define the $\epsilon$-greedy policy at time $t$ as $$ \pi_{\vb_t}(a) = \begin{cases} 1 - \epsilon + \epsilon / |{\cal A}| & a = \arg\max_{\hat{a} \in {\cal A}} m_{t}(\hat{a}),\\ {\epsilon}/{|{\cal A}|} & \text{otherwise.} \end{cases} $$ with $$ \vb_{t} = \{ (m_t(a),\,N_{t}(a)) \mid a \in {\cal A} \}, $$ $m_t(a)$ is the average payoff for action $a$ and $N_t(a)$ is the number of times action $a$ has been pulled up to time $t$.
The belief updates at time $t$, having pulled action $a_t$ and received reward $r_t$, are $$ \boxed{ \begin{aligned} m_t(a) &= \frac{1}{N_t(a)} \sum_{\tau = 1}^t r_\tau\,\vone(a_\tau=a),\\ N_{t}(a) &= N_{t-1}(a) + \vone(a_t = a). \end{aligned} } $$
The mean update can equivalently be written in the recursive form $$ m_t(a) = m_{t-1}(a) + \frac{\vone(a_t = a)}{N_{t}(a)}[r_{t} - m_{t-1}(a)]. $$
Pseudocode: epsilon-greedy
A pseudoalgorithm for the $\epsilon$-greedy bandit, using the simple average as the estimate of each action's value, is shown below.
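In place of the original listing, here is a sketch of the $\epsilon$-greedy agent with running-average estimates, reusing the `run_experiment` sketch from above. The three-armed Gaussian environment at the bottom is an illustrative stand-in for the trading robots, not the configuration used in the experiments below.

```python
import numpy as np

def egreedy_policy(beliefs, epsilon):
    """epsilon-greedy action probabilities from frequentist beliefs (m_t(a), N_t(a))."""
    means, _ = beliefs
    n_actions = len(means)
    probs = np.full(n_actions, epsilon / n_actions)   # epsilon / |A| for every action
    probs[np.argmax(means)] += 1 - epsilon            # extra mass on the greedy action
    return probs

def egreedy_update(beliefs, action, reward):
    """Recursive running-average update of the pulled action only."""
    means, counts = beliefs
    counts[action] += 1
    means[action] += (reward - means[action]) / counts[action]
    return means, counts

# Illustrative usage with a hypothetical 3-armed Gaussian environment.
rng = np.random.default_rng(0)
true_means = np.array([0.01, 0.03, -0.02])
environment = lambda rng, a: rng.normal(true_means[a], 0.1)
beliefs = (np.zeros(3), np.zeros(3))
policy = lambda b: egreedy_policy(b, epsilon=0.1)
beliefs, actions, rewards = run_experiment(rng, beliefs, policy, environment,
                                           egreedy_update, n_steps=1000)
```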
Example: $\epsilon$-greedy for trading robot selection
The Figures below show the result of running the $\epsilon$-greedy bandit with $\epsilon=0.1$.
First, we show the estimated value of actions as a function of the timestep.
[Figure: estimated value of each action over time.]
Next, we show the rolling sum of the number of times each action was pulled.
[Figure: rolling count of the number of times each action was pulled.]
Finally, we show the profit and loss (PnL) as a function of time.
[Figure: profit and loss (PnL) over time.]
Recursive Bayesian learning: the one-dimensional Gaussian case
The strategy we used above to estimate the average value of each action was based on a simple average. While straightforward, this approach does not make use of any prior knowledge about the data-generating process (DGP) of the rewards, and thus it cannot make statistically efficient updates.
To illustrate this, suppose we know that the rewards are Gaussian with known variance $\sigma^2$ and unknown mean $\theta$. Furthermore, suppose we believe that the average reward $\theta$ is centred around zero and, with a 95% probability, lies in the interval $[-0.2, 0.2]$.
One way to encode both our knowledge of the DGP (the variance) and our prior beliefs about $\theta$ is to specify a probabilistic model. Given the assumptions above, a natural probabilistic model is $$ \begin{aligned} p(\theta) &= \normal(\theta \mid m_0,\,s_0^2),\\ p(r \mid \theta) &= \normal(r \mid \theta,\,\sigma^2), \end{aligned} $$ with prior mean $m_0 = 0$ and prior variance $s_0^2 = 0.1^2$.
Next, suppose we are given a sequence of rewards $r_{1:t} = (r_1, \ldots, r_t)$ from repeated pulls of a fixed action. Our updated beliefs about $\theta$ are encoded in the posterior density $p(\theta \mid r_{1:t})$.
Since each reward is generated as $r = \theta + e$, where $e \sim \normal(0, \sigma^2)$ is independent of the past, we have $$ m_t = \E[R \mid r_{1:t}] = \E[\theta + e \mid r_{1:t}] = \E[\theta \mid r_{1:t}]. $$
In traditional Bayesian sequential decision making, one of the central challenges is how to estimate (or approximate) the posterior density $p(\theta \mid r_{1:t})$ under different prior and observation models. The key tool for this is Bayes’ rule.
The two faces of Bayes
In the static view of Bayes, the posterior is found by combining the prior with the likelihood: $$ p(\theta \mid r_{1:t}) \propto \underbrace{p(\theta)}_{\rm prior}\; \underbrace{p(r_{1:t} \mid \theta)}_{\rm likelihood}. $$
Here, however, we are mainly interested in the recursive view of Bayes:
$$ p(\theta \mid r_{1:t}) \propto \underbrace{p(\theta \mid r_{1:t-1})}_{\rm prior}\; \underbrace{p(r_t \mid \theta)}_{\rm likelihood}. $$

This recursive formulation gives us a sequence of posterior densities, updated step by step, where the posterior at time $t-1$ becomes the prior at time $t$: $$ \begin{array}{c c c c c c c c c} p(\theta) & \xrightarrow{\rm update} & p(\theta \mid r_1) & \xrightarrow{\rm update} & p(\theta \mid r_{1:2}) & \xrightarrow{\rm update} & \ldots & \xrightarrow{\rm update} & p(\theta \mid r_{1:t}) \\ & & \uparrow & & \uparrow & & & & \uparrow \\ & & r_1 & & r_2 & & \ldots & & r_t \\ \end{array} $$
Example: recursive Gaussian estimation (Gaussian-Gaussian conjugate update rule)
Consider the one-dimensional probabilistic model for rewards $$ \begin{aligned} p(\theta) &= \normal(\theta \mid m_0,\,s_0^2),\\ p(r_{t} \mid \theta ) &= \normal(r_{t} \mid \theta, \sigma^2).\\ \end{aligned} $$ Then, the posterior is $$ p(\theta \mid r_{1:t}) = \normal(\theta \mid m_{t},\,s_t^2),\\ $$ with $$ \boxed{ \begin{aligned} m_{t} &= k_t\,r_t + (1 - k_t) m_{t-1}\\ s_{t}^2 &= (1 - k_t)\,s_{t-1}^2,\\ k_t &= \frac{s_{t-1}^2}{s_{t-1}^2 + \sigma^2}. \end{aligned} } $$
Remarks
- The equation above shows a recursive way to estimate the posterior density.
- The update for $m_t$ is an exponentially-weighted moving average (EWMA) of the rewards $r_t$ with a time-varying forgetting factor.
- The term $k_t \in (0,1)$ (sometimes called the Kalman gain) is a learning rate: it determines how much the current observation $r_t$ shifts our beliefs relative to the accumulated past.
- The variance update reflects that uncertainty shrinks as more data is observed.
For a derivation of the recursive equations, see this post.
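As an illustration, here is a minimal sketch of the boxed update, assuming a known observation variance `sigma2`:

```python
def gaussian_update(m_prev, s2_prev, reward, sigma2):
    """One conjugate Gaussian update: prior N(m_prev, s2_prev), likelihood N(theta, sigma2)."""
    gain = s2_prev / (s2_prev + sigma2)        # k_t, the Kalman gain / learning rate
    m_new = m_prev + gain * (reward - m_prev)  # equivalent to k_t r_t + (1 - k_t) m_{t-1}
    s2_new = (1.0 - gain) * s2_prev            # s_t^2 = (1 - k_t) s_{t-1}^2
    return m_new, s2_new
```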
A note on notation: belief propagation form
For the Gaussian posterior the beliefs are the sufficient statistics that characterise the posterior density over $\theta$ given the dataset $\data_{1:t}$. That is, $$ \vb_t \equiv (m_t, s_t^2), $$ so that the posterior is written compactly as $$ p_{\vb_t}(\theta) \equiv p(\theta \mid \data_{1:t}) \propto p_{\vb_{t-1}}(\theta)\,p(r_t \mid \theta). $$
In what follows, we use the notation $p(\theta \mid \data_{1:t})$ to derive the formulas, and the notation $p_{\vb_t}(\cdot)$ when talking about the algorithmic implementation of the methods.
$\epsilon$-greedy with Bayesian estimation
Armed with the Bayesian recursive update formula for a Gaussian likelihood with unknown mean and known variance under a Gaussian prior, we derive an alternative update rule for $\epsilon$-greedy that performs a Bayesian update after each pull of an action.
In particular, suppose that the reward for action $a \in {\cal A}$ is modelled as Gaussian with known variance $\sigma^2$ and unknown mean $\theta_a$. Then, consider a Gaussian prior density for the unknown mean $\theta_a$, with mean $m_{a,0}$ and variance $s_{a,0}^2$. We have the following probabilistic model:
$$ \begin{aligned} \vtheta &= (\theta_1, \ldots, \theta_A),\\ p(\theta_a) &= \normal(\theta_a \mid m_{a,0},\,s_{a,0}^2),\\ p(r_{t} \mid a, \vtheta) &= \normal(r_{t} \mid \theta_{a},\,\sigma^2). \end{aligned} $$

The policy. Consider again the $\epsilon$-greedy policy $$ \pi_{\vb_t}(a) = \begin{cases} 1 - \epsilon + \epsilon / |{\cal A}| & a = \arg\max_{\hat{a} \in {\cal A}} m_{t}(\hat{a}),\\ {\epsilon}/{|{\cal A}|} & \text{otherwise.} \end{cases} $$
The belief state at time $t$ is the collection of posterior parameters for all actions: $$ \vb_{t} = \{(m_{a,t},\,s_{a,t}^2) \mid a \in {\cal A}\}. $$
Since the posterior for each action’s mean is Gaussian, the expected reward for action $a$ is simply the current posterior mean: $$ m_t(a) = \E[R \mid a, \data_{1:t}] = m_{a,t}. $$
Remarks
- Only the beliefs of the action taken are updated via the recursive Gaussian update. The beliefs for all other actions are simply copied forward.
- This makes $\epsilon$-greedy Bayesian in the sense that decisions are guided by the posterior means instead of the sample averages.
- Uncertainty about the value of each action is retained in the belief state through $s_{a,t}^2$.
Pseudocode: Bayesian epsilon-greedy
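In place of the original listing, here is a sketch of the Bayesian $\epsilon$-greedy belief update, reusing `gaussian_update` from above; the policy is the same `egreedy_policy` as before, applied to the posterior means. The prior values and the observation variance below are illustrative assumptions.

```python
import numpy as np

def bayes_egreedy_update(beliefs, action, reward, sigma2):
    """Update only the chosen action's posterior (m_a, s_a^2); copy the rest forward."""
    means, variances = beliefs
    means, variances = means.copy(), variances.copy()
    means[action], variances[action] = gaussian_update(
        means[action], variances[action], reward, sigma2
    )
    return means, variances

# Illustrative prior beliefs for 3 arms: m_{a,0} = 0 and s_{a,0}^2 = 0.1^2.
prior_beliefs = (np.zeros(3), np.full(3, 0.1**2))
# To plug into run_experiment, fix the (assumed known) observation variance:
update = lambda b, a, r: bayes_egreedy_update(b, a, r, sigma2=0.1**2)
```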
Example: Bayesian $\epsilon$-greedy policy for trading robot selection
The Figures below show the result of running the Bayesian $\epsilon$-greedy bandit with $\epsilon=0.1$.
First, we show the estimated value of actions as a function of the timestep.
[Figure: estimated value of each action over time.]
Next, we show the rolling sum of the number of times each action was chosen.
[Figure: rolling count of the number of times each action was chosen.]
Finally, we show the profit and loss (PnL) as a function of time for both the classical $\epsilon$-greedy and the Bayesian $\epsilon$-greedy agents.
[Figure: PnL over time for the classical and Bayesian $\epsilon$-greedy agents.]
Thompson sampling: Bayesian learning and sequential decision making
A major downside of the $\epsilon$-greedy policy is that the exploration-exploitation tradeoff is hardcoded at the start: with probability $\epsilon$ we explore, and with probability $1 - \epsilon$ we exploit.
Ideally, the exploration-exploitation rate should be adaptive: explore more when we are uncertain about the expected reward of an action; exploit more when we are confident about the mean value of each action. A Bayesian approach that achieves this adaptive balance of the exploration-exploitation tradeoff is Thompson sampling (TS). 1 2
The idea behind TS is to sample from the posterior beliefs over the unknown model parameters $\vtheta=(\theta_1, \ldots, \theta_A)$ and use those samples as proxies for estimates of the rewards. As we obtain more data from each action, the uncertainty around the values of the actions decreases. As a consequence, the overall rate of exploration decreases as well: we increasingly choose the actions that have yielded the highest rewards.
Formally, given a dataset $\data_{1:t}$, with $\data_t = (r_t, a_t)$ and posterior density characterised by $p_{\vb_t}(\vtheta) \equiv p(\vtheta \mid \data_{1:t})$, the policy defined by TS is $$ \pi_{\vb_t}(a) = {\rm Pr}_{\vtheta \sim p_{\vb_t}}\big[a = \arg\max_{\hat{a}} h(\vtheta, \hat{a})\big] = \int {\bf 1}\Big(a = \arg\max_{\hat a} h(\vtheta, \hat{a})\Big)\,p_{\vb_t}(\vtheta)\,{\rm d}\vtheta, $$ with $ h(\vtheta, a) = \mathbb{E}[R \mid a,\,\vtheta]$. In our case, $h(\vtheta, a) = \theta_a$.
In practice, the policy induced by TS first samples from the posterior and then evaluates the sampled parameters through the mean function $h$, which is used to make a decision. That is,
- $\hat{\vtheta} \sim p_{\vb_t}(\vtheta)$ // sample from posterior
- $a_{t+1} = \argmax_{a \in {\cal A}} h(\hat{\vtheta}, a)$ // choose the action with the largest expected reward under the sampled parameters
- $r_{t+1} \sim p_{\rm env}(\cdot \mid a_{t+1})$
- $\data_{t+1} \gets (a_{t+1}, r_{t+1})$
- $\vb_{t+1} \gets \texttt{update}(\vb_t,\,\data_{t+1})$ // Algorithmic Bayes’ rule $p(\theta \mid \data_{1:t}) \propto p(\theta \mid \data_{1:t-1})\,p(\data_t \mid \theta)$
Pseudocode: Gaussian Thompson sampling
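In place of the original listing, here is a sketch of Gaussian Thompson sampling: sample one mean per arm from the current posterior and act greedily on the samples (the belief update is the same per-arm `gaussian_update` as before). Unlike the $\epsilon$-greedy sketches, this policy returns an action directly rather than a vector of probabilities.

```python
import numpy as np

def thompson_action(rng, beliefs):
    """Sample theta_a ~ N(m_a, s_a^2) for every arm and pick the argmax."""
    means, variances = beliefs
    theta_hat = rng.normal(means, np.sqrt(variances))  # one posterior sample per action
    return int(np.argmax(theta_hat))                   # a_{t+1} = argmax_a h(theta_hat, a)
```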
Example: Thompson sampling for trading robot selection
Take the probabilistic model $$ \begin{aligned} \vtheta &= (\theta_1, \ldots, \theta_A),\\ p(\theta_a) &= \normal(\theta_a \mid m_{a,0},\,s_{a,0}^2),\\ p(r \mid a, \vtheta) &= \normal(r \mid \theta_{a},\,\sigma^2). \end{aligned} $$ Here, $ h(\vtheta, a) = \theta_a$.
The Figures below show the result of running the Bayesian TS bandit. First, we show the estimated value of actions as a function of the timestep.
[Figure: estimated value of each action over time.]
Next, we show the rolling sum of the number of times each action was pulled.
[Figure: rolling count of the number of times each action was pulled.]
Finally, we show the profit and loss (PnL) as a function of time for the TS agent alongside the classical and Bayesian $\epsilon$-greedy agents.
[Figure: PnL over time.]
Example: Bandit showdown
Here we compare the performance of $\epsilon$-greedy with frequentist updates (counting), $\epsilon$-greedy with Bayesian updates, and TS.
[Figure: comparison of the frequentist $\epsilon$-greedy, Bayesian $\epsilon$-greedy, and TS agents.]
Contextual bandits
So far, our bandit agents have treated each action in isolation and without input from some external source of information. In many real problems, however, we can make better decisions depending on some context.
| case | reward $r_t$ | context $\vx_t$ |
|---|---|---|
| Trading robots | daily return | market volatility, interest rates, credit spreads |
| Video recommendation | seconds of watch time | number of video likes |
| Dating app | swipe right | profile features |
| Food delivery app | customer rating | time of day, user preferences, type of meal |
In this setup, we assume that the agent has access to external context $\vx \in {\cal X}$ (or state in the RL literature) that provides information about which action to take at time $t$. Thus, the agent first observes a context $\vx_t$, which it uses to determine the action $a_t$, and then it observes a reward $r_t$. Through continual observation of contexts and interaction with the environment, the agent and the environment produce a sequence of contexts, actions and rewards: $$ (\vx_1,\,a_1,\,r_1),\,(\vx_2,\,a_2,\,r_2),\,\ldots,\,(\vx_t,\,a_t,\,r_{t}),\,\ldots,\,(\vx_T,\,a_{T},\,r_T). $$
Modelling a linear contextual bandit
If rewards depend linearly on the context (with Gaussian noise of known variance $\sigma^2$), there are two natural model classes:
Shared weights. In this scenario, $\vtheta \in \reals^D$ and $\phi: {\cal X}\times{\cal A}\to\reals^D$ is some feature transformation. $$ p(r \mid \vtheta, a, \vx) = \normal(r \mid \vtheta^\intercal\,\phi(\vx, a),\,\sigma^2). $$ This model pools information across actions through a global weight vector $\vtheta$.
Individual actions. In this scenario, $\vtheta \in \reals^{D\times A}$, $\vtheta_a \in \reals^D$, and $\phi: {\cal X}\to\reals^D$ is some feature transformation. $$ p(r \mid \vtheta, a, \vx) = \normal(r \mid \vtheta_a^\intercal\,\phi(\vx),\,\sigma^2). $$ This model treats each action independently, so the parameters of an action are updated only when that action is chosen.
Both cases reduce to a Bayesian linear regression problem. The main difference is when and how frequently updates are made: the shared-weights model is updated at every step, regardless of which action is pulled, whereas the individual-actions model only updates the weights of the chosen action. A sketch of the two feature maps is shown below.
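As an illustration of the two parameterisations (the feature maps below are hypothetical choices, not prescribed by the model): shared weights can be encoded by concatenating the context with a one-hot encoding of the action, whereas individual actions keep one weight vector per arm and only featurise the context.

```python
import numpy as np

def phi_shared(x, a, n_actions):
    """Shared weights: a single global theta; features depend on (x, a)."""
    one_hot = np.zeros(n_actions)
    one_hot[a] = 1.0
    return np.concatenate([np.asarray(x), one_hot])  # phi(x, a) in R^{D_x + A}

def phi_individual(x):
    """Individual actions: one theta_a per arm; features depend on x only."""
    return np.asarray(x)                              # phi(x) in R^{D_x}
```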
Modelling a contextual bandit
We can summarise a contextual bandit with shared weights as the following probabilistic graphical model (PGM):
[Figure: probabilistic graphical model of the shared-weights contextual bandit.]
The PGM above assumes that the rewards are conditionally independent given the parameters: $$ p(\vtheta,\,r_{1:T}) = p(\vtheta)\,\prod_{t=1}^T p(r_t \mid \vtheta), $$ where the dependence on contexts and actions is left implicit.
Recursive Bayesian updates for linear regression
Let $\data_t = (r_t, \vx_t, a_t)$ and write $\phi_t$ for the feature vector at time $t$ ($\phi_t = \phi(\vx_t, a_t)$ for shared weights, or $\phi_t = \phi(\vx_t)$ for individual actions). Assume $$ \begin{aligned} p(\vtheta) &= \normal(\vtheta \mid \vm_0,\,\vSigma_0),\\ p(\data_t \mid \vtheta) &= \normal(r_t \mid \vtheta^\intercal\,\phi_t,\,\sigma^2). \end{aligned} $$ The posterior is Gaussian, $$ p(\vtheta \mid \data_{1:t}) = \normal(\vtheta \mid \vm_{t},\,\vSigma_t), $$ with recursive updates $$ \boxed{ \begin{aligned} \vS_t &= \phi_t^\intercal\,\vSigma_{t-1}\,\phi_t + \sigma^2,\\ \vK_t &= \vSigma_{t-1}\,\phi_t\,\vS_t^{-1},\\ \vm_t &= \vm_{t-1} + \vK_t\,(r_t - \vm_{t-1}^\intercal\,\phi_t),\\ \vSigma_t &= \vSigma_{t-1} - \vK_t\,\vS_{t}\,\vK_t^\intercal. \end{aligned} } $$ For individual actions, these updates are applied to the parameters of the chosen action $a_t$ only; the parameters of all other actions are carried forward unchanged.
This is the Bayesian linear regression analogue of the Kalman filter.
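A sketch of one recursive update under these assumptions, with `phi_t` the feature vector of the current step:

```python
import numpy as np

def blr_update(m, Sigma, phi_t, reward, sigma2):
    """One Kalman-style update of a Bayesian linear regression posterior N(m, Sigma)."""
    S = phi_t @ Sigma @ phi_t + sigma2      # predictive variance S_t (a scalar here)
    K = Sigma @ phi_t / S                   # gain K_t
    m_new = m + K * (reward - m @ phi_t)    # mean update
    Sigma_new = Sigma - np.outer(K, K) * S  # covariance update: Sigma_{t-1} - K_t S_t K_t^T
    return m_new, Sigma_new
```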
Recursive Bayesian linear regression and Ridge regression
In the special case where $\vm_0 = \vzero$ and $\vSigma_0 = \alpha\,\vI$, i.e., $p(\vtheta) = \normal(\vtheta \mid \vzero,\,\alpha\,\vI)$, the posterior mean $\vm_t$ matches the Ridge regression estimate 3, i.e., $$ \vm_t = (\Phi^\intercal\,\Phi + \lambda\,\vI)^{-1}\,\Phi^\intercal\,\vr, $$ $$ \Phi = \begin{bmatrix} \phi_1^\intercal\\ \vdots \\ \phi_t^\intercal \end{bmatrix} \in \reals^{t\times D} $$ the time-ordered matrix of feature transformations, $$ \vr = \begin{bmatrix} r_1 \\ \vdots \\ r_t \end{bmatrix} \in \reals^t $$ the time-ordered rewards, and $$ \lambda = \frac{\sigma^2}{\alpha} $$ the ridge penalty. Intuitively, if $\sigma^2$ is small, then $\lambda$ is small: observations are highly informative, so little shrinkage is needed. Alternatively, if $\alpha$ is small, then $\lambda$ is large: the prior is strong, so we are reluctant to move far from it.
Example: Recursive Bayesian linear regression vs. Ridge regression
[Figures: recursive Bayesian linear regression estimates compared with the batch Ridge regression solution.]
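A small numerical check of this equivalence under the stated assumptions (zero prior mean, $\vSigma_0 = \alpha\,\vI$, known $\sigma^2$), reusing the `blr_update` sketch from above; the data below are simulated for illustration only, not the experiment shown in the figures.

```python
import numpy as np

rng = np.random.default_rng(0)
t, D = 200, 3
sigma2, alpha = 0.25, 1.0

Phi = rng.normal(size=(t, D))               # rows are the feature vectors phi_tau^T
theta_true = rng.normal(size=D)
r = Phi @ theta_true + rng.normal(scale=np.sqrt(sigma2), size=t)

# Batch ridge solution with penalty lambda = sigma^2 / alpha.
lam = sigma2 / alpha
m_ridge = np.linalg.solve(Phi.T @ Phi + lam * np.eye(D), Phi.T @ r)

# Recursive Bayesian linear regression with the matching prior.
m, Sigma = np.zeros(D), alpha * np.eye(D)
for phi_t, r_t in zip(Phi, r):
    m, Sigma = blr_update(m, Sigma, phi_t, r_t, sigma2)

print(np.allclose(m, m_ridge))              # True, up to numerical error
```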
Thompson sampling for a contextual bandit
Given a dataset $\data_{1:t}$, with $\data_t = (r_t, \vx_t, a_t)$, the policy defined by contextual TS is $$ \pi_{\vb_t}(a \mid \vx) = \int {\bf 1}\Big(a = \arg\max_{\hat a} h(\vtheta, \hat{a}, \vx)\Big)\,p_{\vb_t}(\vtheta)\,{\rm d}\vtheta, $$ with $ h(\vtheta, a, \vx) = \mathbb{E}[R \mid a,\,\vx,\,\vtheta]$ and $p_{\vb_t}(\vtheta) \equiv p(\vtheta \mid \data_{1:t})$.
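A sketch of one contextual TS step for the shared-weights linear model, assuming current posterior parameters `m`, `Sigma` and a feature map with the signature of `phi_shared` above:

```python
import numpy as np

def contextual_ts_action(rng, m, Sigma, x, phi, n_actions):
    """Sample theta ~ N(m, Sigma) and choose argmax_a theta^T phi(x, a)."""
    theta_hat = rng.multivariate_normal(m, Sigma)              # posterior sample
    scores = [theta_hat @ phi(x, a, n_actions) for a in range(n_actions)]
    return int(np.argmax(scores))                              # greedy w.r.t. the sample
```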
A recap
Bayesian sequential decision making:
- Exploration-exploitation tradeoff through Thompson Sampling.
- Update beliefs at every step through the recursive formulation of Bayes.
- In the linear setting, recursive Bayes yields the same result as batch (offline) Bayes.
Chapelle, Olivier, and Lihong Li. “An empirical evaluation of Thompson sampling.” Advances in Neural Information Processing Systems 24 (2011). ↩︎
Russo, Daniel J., et al. “A tutorial on Thompson sampling.” Foundations and Trends® in Machine Learning 11.1 (2018): 1-96. ↩︎
See e.g., Section 2.3.1 in Duran-Martin, Gerardo. “Adaptive, Robust and Scalable Bayesian Filtering for Online Learning.” arXiv preprint arXiv:2505.07267 (2025). ↩︎