[slides] Adaptive, robust, and scalable Bayesian filtering for online learning

Posted on Jun 4, 2025

tl;dr: Slides for PhD thesis.

Online learning and sequential decision making

Observe sequence of features $\vx_t\in\reals^M$ and observations $\vy_t \in \reals^o$: $$ {\cal D}_{1:t} = \{(\vx_1, \vy_1), \ldots, (\vx_t, \vy_t)\}. $$

We model a mapping from $\vx_t$ to $\vy_t$ through a function $h: \Theta \times \reals^M \to \reals^o$ that maps model parameters $\vtheta \in \Theta$ and features $\vx_t \in \reals^M$ to outputs $\vy_t \in \reals^o$.

Given $\vx_{t+1}$, make a prediction about the next value in the sequence $$ \hat{\vy}_{t+1} = \omega(h, \vtheta, \vx_{t+1}, \data_{1:t}). $$

Examples: online learning and sequential decision making

Bandits¹

Let $\mathcal{A} = {a^{(1)}, \ldots, a^{(K)}}$ be a set of actions. At every time step $t=1,\ldots,T$

we are given a context $\vx_t$
take action $a_t \in {\cal A}$ based on $\vx_t$
obtain a reward $\vy_t$ based on the context $\vx_t$ and the chosen action $a_t$

Our goal is to choose the set of actions that maximise the expected reward $\sum_{t=1}^T\mathbb{E}[Y_t]$.

Example:

Given a seed playlist title and/or initial set of tracks in a playlist, predict the subsequent tracks in the playlist.²

Recommendation turned bandit problem: given user’s features, past actions, and $K$ possible songs, predict probability of the percentage of listening time a particular song, pick the song that yields the highest expected percentage of listening time.

Toxic buy trade in EUR/USD market (fast online classification)

Having a trade from a client, the broker has to decide whether to keep the trade in their book (internalise) or close the position immediately.

Trader: buys at orange line, sell in blue region
Broker: sells in orange line, buys in blue region

Broker wants to externalise as much toxic flow as possible and internalise as much toxic flow as possible (classification problem).

Prequential (one-step-ahead) forecasting of hourly electricity load

Given weather condition $\vx_t$, predict the hour-ahead electricity load $\vy_t$.

Recursive Bayesian online learning

Parametric probabilistic graphical model

Observation $\vy_t$ is modelled through a likelihood $p(\vy_t \cond \vtheta, \vx_t)$ with conditional mean $h(\vtheta, \vx_t)$, e.g., a neural network.

Recursive Bayesian online learning

Suppose we are given the prior (initial condition) $p(\vtheta)$ and the likelihood model $p(\vy_t \cond \vtheta, \vx_t)$ with conditional mean $h(\vtheta, \vx_t)$. We seek to obtain a recursive estimate for the posterior density $$ \begin{aligned} p(\vtheta \cond \data_{1:t}) &\propto \underbrace{p(\vtheta \cond \data_{1:t-1})}_\text{prior density}\, \underbrace{p(\vy_{t} \cond \vtheta, \vx_t)}_\text{likelihood}, \end{aligned} $$

Given $\vx_{t+1}$ and $p(\vtheta)$, the prediction about the (expected) next value in the sequence is given by $$ \hat{\vy}_{t+1} = \int h(\vtheta \cond \vx_{t+1})\,p(\vtheta \cond \data_{1:t})\,\d\vtheta =: \mathbb{E}_{p(\vtheta \cond \data_{1:t})}[h(\vtheta, \vx_{t+1})]. $$

Recursive sequential decision making through Thompson Sampling³ ⁴

Let $\data_t = (x_t, a_t, y_t)$ be the context, action, and reward at time $t$.

At every time step $t=1,\ldots, T$, we follow the following procedure:

Sample $\vtheta_t \sim p(\cdot \vert \mathcal{D}_{1:t-1})$
$a_t = \arg\max_{a \in \mathcal{A}} \mathbb{E}[h(\vtheta, (\vx_t, a))]$
Obtain $\vy_t \sim p(\cdot \cond \vx_t,a_t)$
Store $\mathcal{D}_t = (\vx_t, a_t, \vy_t)$
Update $p(\vtheta \cond \data_{1:t}) \propto p(\vtheta \cond \data_{1:t-1}) p(\vy \cond \vtheta, \vx_t, a)$.

Example: Beta-Bernoulli bandit with $K=4$ arms.

The linear Gaussian (LG) case

Suppose $p(\vtheta) = {\cal N}(\vtheta \cond \vmu_0, \vSigma_0)$ and $p(\vy_t \cond \vtheta, \vx_t) = {\cal N}(\vy_t \cond \vtheta^\intercal\vx_t, \vR_t)$, with known $\vR_t$. Then, $$ p(\vtheta \cond \data_{1:t}) = {\cal N}(\vtheta \cond \vmu_t, \vSigma_t), $$ with

$$ \begin{aligned} \vS_t &= \vx_t^\intercal\vSigma_{t-1}\vx_t + \vR_t,\\ \vK_t &= \vSigma_{t-1}\vx_t\vS_t^{-1},\\ \vmu_t &= \vmu_{t-1} + \vK_t\,(\vy_t - \vx_t^\intercal\,\vmu_{t-1}),\\ \vSigma_t &= \left(\vI - \vK_t\vx_t\right)\vSigma_{t-1}. \end{aligned} $$

The maximum-a-posteriori (MAP) prediction having $\vx_{t+1}$ and $\data_{1:t}$ is $$ \hat{\vy}_{t+1} = \vmu_{t}^\intercal\,\vx_{t+1}. $$

Remark

The linear gaussian case with $\vmu_0 = {\bf 0}$, $\vSigma_0 = \alpha^2,\vI$, and $\vR_t = \beta^2$ for all $t \in {1,\ldots, T}$. matches the Ridge regression algorithm.

Takeaway: we seek methods that update parameters (beliefs) learn one-data point at a time.

The (recursive) variational Bayes (VB)

For cases when $p(\vy_t \cond \vtheta, \vx_t)$ is non-linear, non-Gaussian.

$$ q_t(\vtheta) = \arg\min_{q \in {\cal Q}} {\rm KL}\left({q(\vtheta)} \,\|\, p(\vy_t \cond \vtheta)\,q_{t-1}(\vtheta)\,/\,Z_t\right), $$

yields the fixed-point equations $$ \begin{aligned} \vmu_t &= \vmu_{t-1} + \vSigma_{t-1}\,\mathbb{E}_{q_{t}}[\nabla_\vtheta \log p(\vy_t \cond \vtheta, \vx_t)],\\ \vSigma_t^{-1} &= \vSigma_t^{-1} - \mathbb{E}_{q_{t}}\left[\nabla^2_\vtheta \log p(\vy_t \cond \vtheta, \vx_t)\right]. \end{aligned} $$

State-space models: the regression case

Generalising the sequential online learning problem assuming that parameters evolve over time.

$$ \begin{aligned} p(\vtheta_t \cond \vtheta_{t-1}) &= {\cal N}(\vtheta \cond f(\vtheta_{t-1}), \vQ_t),\\ p(\vy_t \cond \vtheta, \vx_t) &= {\cal N}(\vy_t \cond h(\vtheta_t, \vx_t), \vR_t). \end{aligned} $$

Why?

Tackling misspecification.
Numerical stability.

Probabilistic graphical model

The linearised / moment-matched state-space model

Take $f(\vtheta) = \vtheta$ (random walk assumption) and suppose $h(\vtheta, \vx_t)$ is differentiable w.r.t. $\vtheta$. A first-order approximation of $h$ around the previous mean $\vmu_{t-1}$ yields $$ h(\vtheta, x) \approx \bar{h}_t(\vtheta, x) = H_t\,(\vtheta_t - \vmu_{t-1}) + h(\vmu_{t-1}, x_t). $$

$$ p(y_t \vert \vtheta, x_t) \approx {\cal N}(y_t\,\vert\,\hat{y}_t, R_t), $$ where $R_t$ is the moment-matched variance of the observation process.

The MAP prediction having $\vx_{t+1}$ and $\data_{1:t}$ is $$ \hat{\vy}_{t+1} = \mathbb{E}[\bar{h}(\vtheta_t, \vx_{t+1}) \vert {\cal D}_{1:t}] = h(\vmu_{t}, \vx_{t+1}). $$

Example: moment-matched linear Gaussian for classification

$$ \begin{aligned} \vH_t &= \nabla_\theta h(\vmu_{t-1}, \vx_t) & \text{(Jacobian)}\\ \hat{\vy}_t & = h(\vmu_{t-1}, \vx_t) & \text{ (MAP prediction)} \\ \vR_t &= \hat{\vy}_t\,(1 - \hat{\vy}_t) & \text{ (moment-matched variance)}\\ \vSigma_t^{-1} &= \vSigma_{t-1}^{-1} + \vH_t^\intercal\,\vR_t^{-1}\,\vH_t & \text{(posterior precision)}\\ {\bf K}_t &= \vSigma_{t}\,\vH_t\,{\vR}_t^{-1} & \text{(gain matrix)}\\ \hline \vmu_t &\gets \vmu_{t-1} + \vK_t\,(\vy_t - \hat{\vy}_t) & \text{(update)} \end{aligned} $$

Sequential Online classification

Online Bayesian learning using a single hidden-layer neural network and moment-matched updates.

sequential classification with static dgp

Challenges of filtering as Bayesian online learning

1. Adaptivity to non-stationary environments

The methods above (LG and VB) assume that the data generating process is fixed / known.
Mean estimate $\mu_T$ under LG converges to a point estimate as $T \to \infty$ regardless of error, i.e., $\Sigma_T \to 0 {\bf I}$.
They are sequential, but do not have a notion of non-stationarity (model misspecification).

2. Sensitive to model misspecification and outliers

Gaussian assumption over model parameters and linearised likelihood renders the posterior sensitive to outliers and model misspecification.

3. They are not scalable

Overall computational cost depends on model size.
For EKF, e.g., is $O(D^2)$ in memory and $O(D^3)$ in time.

Contributions

a modular framework that enables the development adaptive approaches for online learning;
a novel, provably robust filter with similar computational cost to standard filters, that employs Generalised Bayes; and
a set of tools for sequentially updating model parameters using approximate second-order optimisation methods that exploit the overparametrisation of high-dimensional parametric models such as neural networks.

Adaptivity

Generalised Bayesian online learning in non-stationary environments: BONE.

A motivating toy example: non-stationary moons dataset

Motivating example (cont’d): moment-matched linear Gaussian

sequential classification with varying dgp

The research question

How do we adapt to changes in the data-generating process when the underlying dynamics are unknown?

Probablistic graphical model

Introduce an auxiliary variable $\psi_t$ that models the non-stationarity.

The modified target posterior and prediction

We seek to obtain recursive estimates for the posterior density $$ \begin{aligned} p(\vtheta_t, \psi_t \cond \data_{1:t}) &\propto p(\vtheta_t, \psi_t \cond \data_{1:t-1})\, p(\vy_{t} \cond \vtheta_t, \vx_t)\\ &= p(\psi_t \cond \data_{1:t-1})\, p(\vtheta_t \cond \psi_t, \data_{1:t-1})\, p(\vy_{t} \cond \vtheta_t, \vx_t). \end{aligned} $$

A one-step-ahead prediction is given by $$ \begin{aligned} \hat{\vy}_{t+1} &= \mathbb{E}_{p(\vtheta_t, \psi_t \cond \data_{1:t})}[h(\vtheta_t,\vx_{t+1})]\\ &= \sum_{\psi_t \in \Psi_t}p(\psi_t \cond \data_{1:t}) \int h(\vtheta_t, \vx_{t+1}) p(\vtheta_t \cond \psi_t, \data_{1:t})\d\vtheta_t. \end{aligned} $$

remark: hard to specify all components and its dynamics

The BONE modification

We find the many methods in the literature can be written as a modified (generalised Bayes) version of the posterior predictive.

$$ \begin{aligned} &q(\vtheta_t;\,\psi_t, \data_{1:t}) \propto \overbrace{\pi(\vtheta_t;\, \psi_t, \data_{1:t-1})}^\text{\rm prior}\, \overbrace{\exp(-\ell(\vy_t; \vtheta_t, \vx_t))}^\text{\rm loss for model},\\ &\hat{\vy}_t := \sum_{\psi_t \in \Psi_t} \underbrace{\nu(\psi_t \cond \data_{1:t})}_{\text{\rm weight}}\, \int \underbrace{h(\vtheta_t, \vx_{t+1})}_{\text{\rm model}}\, \underbrace{q(\vtheta_t;\, \psi_t, \data_{1:t})}_{\text{\rm posterior}} \d\vtheta_t. \end{aligned} $$

Method specification

Auxiliary variable $\psi$: model for changes in the data-generating process.
Weight $\nu$: belief about value of auxiliary variable.
(Conditional) prior $\pi$: ad-hoc modification on parameters before observing new datapoint
Posterior $q$: generalised-Bayes posterior (prior X lossfn).
Model $h$: mapping from parameters and features to outputs.

Methods in the literature

Example: runlength with prior reset

Online Bayesian learning using (M.1) a single hidden-layer neural network and (A.1) moment-matched LG with a runlenght auxiliary variable.

The takeaway

There are various methods in the literature that tackle the problem of adapting to non-stationary environments from a Bayesian perspective:
- Segmentation,
- prequential forecasting,
- bandits, and
- online continual learning.

We show that these methods are all instances of a unifying framewosk we call BONE: generalised (B)ayesian (O)nline learning in (N)on-stationary (E)nvironments.

Robustness

The weighted-observation likelihood filter.

Online learning and the problem of outliers

The influence of the KF to a single outlier

For posterior probability density $q$, let the posterior influence function (PIF) $$ {\rm PIF}_q(\vy_t^{c},\data_{1:t}) = {\rm KL} \left( q(\vtheta_t \cond \data_{1:t}^c) \,\|\, q(\vtheta_t \cond \data_{1:t}) \right), $$ where $\data_{1:t}^c = \data_{1:t-1} \cup (\vx_t, \vy^c)$.

Then, for KF, the influence is unbounded, i.e., $$ \sup_{\vy_t^c \in \reals^o}|{\rm PIF}_q(\vy_t^{c},\data_{1:t})| = \infty. $$

The wolf method: motivation

Many methods in the literature that robustify against outliers. We seek one that is

easy to implement,
provably robust (PIF sense),
scales at a similar compute cost to KF.

The wolf method

$$ q(\vtheta_t \cond \data_{1:t}) \propto \exp(-\ell_t(\vtheta_t))\,q(\vtheta_t \cond \data_{1:t-1}). $$

with $$ \ell_t(\vtheta_t) = -W^2(\vy_t, \hat{\vy}_t)\,\log{\cal N}\left(\vy_t \cond \hat{\vy}_t\, \vR_t\right). $$

The Wolf algorithm

Under linear dynamics, the WoLF algorithm modifies the update equations of the KF to be:

$$ q_{\rm WOLF}(\vtheta_{t} \cond \data_{1:t}) = {\cal N}(\vtheta_t \cond \vmu_{t}, \vSigma_{t}). $$

with, $$ \begin{aligned} \hat{\vy}_t &= \vH_t\,\vmu_{t|t-1}\\ \vS_t &= \vH_t\,\vSigma_{t|t-1}\,\vH_t^\intercal + \vR_t / {\color{orange}{W(\vy_t, \hat{\vy}_t)}},\\ \vK_t &= \vSigma_{t|t-1}\vH_t^\intercal\,\vS_t^{-1},\\ \vmu_t &= \vmu_{t|t-1} + \vK_t\,(\vy_t - \hat{\vy}_t),\\ \vSigma_t &= \left(\vI - \vK_t\vH_t^\intercal\right)\vSigma_{t|t-1}, \end{aligned} $$ and $$ W(\vy_t, \hat{\vy}_t) =\left(1+\frac{\|\vR_t^{-1/2}(\vy_t - \hat{\vy}_t)\|_2^2}{c^{2}}\right)^{-1/2}. $$

The PIF under Wolf

Then, for WoLF, the influence is bounded, i.e., $$ \sup_{\vy_t^c \in \reals^o}|{\rm PIF}_{q_{\rm wolf}}(\vy_t^{c},\data_{1:t})| < \infty. $$

Example: online learning with WoLF

Scalability

An exploration on SSM assumptions and ways to update the SSMs in high-dimensional, misspecified models: subspace, last-layer and subspace, and low-rank VB.

A running example: Fashion-mnist classification with a convolutional neural network

Consider a modified LeNet5 convolutional neural network with d=62_556 parameters.
We seek to perform online learning and classification of the Fashion MNIST dataset.

Running example: cont’d

Under EKF of CNN, we require storage of mem_gb = d ** 2 * 32 * 0.125 * 10 ** (-9) == 125.2 (gigabyte) per step.

The research question

How do we perform efficient and scalable online learning with deep neural networks?

Subspace parameters

Use lottery ticket hypothesis for scalable online learning.

For a neural network with $D$ parameters, rewrite SSM so that inference occurs in a subspace vector $\vz \in \reals^d$, where $d \ll D$.

$$ \begin{aligned} \vz_t &= \vz_{t-1} + \vu_t,\\ \vtheta_t &:= {\bf A}\,\vz_t + \vtheta_*,\\ \vy_t &= h(\vtheta_t, \vx_t) + \ve_t, \end{aligned} $$ with $\vA \in \reals^{D\times d}$.

Update

Follows KF-like equations with modified projection matrix $\vA$.

$$ \begin{aligned} \vH_t &= \nabla_\vtheta h(\vA\,\vmu_{t|t-1} + \vtheta_*, \vx_t)\\ \vS_t &= \vH_t\,\vA\,\vSigma_{t-1}\,\vA^\intercal\,\vH_t^\intercal + \vR_t\\ \vK_t &= \vSigma_{t|t-1}\,\vA^\intercal\,\vH_t^\intercal\,\vS_t^{-1}\\ \vmu_t &= \vmu_{t-1} + \vK_t(\vy_t - \vH_t\,\vA\,\vmu_{t|t-1})\\ \vSigma_t &= \vSigma_{t|t-1} - \vK_t\,\vA^\intercal\,\vH_t^\intercal\vSigma_{t|t-1} \end{aligned} $$

Performance

Pros: Efficient and simple to implement.
Cons Required estimation of $\vA$ offline.

Subspace-hidden last-layer parameters

Use lottery ticket hypothesis and last-layer neural networks for scalable online learning.

Divide last-layer and model parameters, then perform online VB update.

$$ \begin{aligned} \vz_t &= \vz_{t-1} + \vu_{t}^{\rm hidden},\\ \vw_t &= \vw_{t-1} +\vu_t^{\rm last},\\ \vtheta_t &= ({\bf A}\,\vz_t + \psi_*, \vw_t),\\ \vy_t &= h(\vtheta_t, \vx_t) + \ve_t, \end{aligned} $$

Update

$$ q_t(\vw, \vz) = {\cal N}(\vw \cond \nu_t, \Gamma_t)\,{\cal N}(\vz \cond \mu_t, \Sigma_t), $$ with $$ \nu_t, \mu_t, \Gamma_t, \Sigma_t = \argmin_{\nu, \mu, \Gamma, \Sigma} {\rm KL} \left( {\cal N}(\vw \cond \nu, \Gamma)\, {\cal N}(\vz \cond \mu, \Sigma) \,\|\, q_{t-1}(\vw, \vz)\, p(\vy_t \cond \vtheta, \vx_t) \right). $$

Performance

Pros explicit modelling of the last-layer and better performance than subspace method.
Cons required estimation of $\vA$ offline and slower than the subspace methods.

A low-rank Filtering algorithm

Use factor analysis ideas to keep track of a low-rank plus diagonal precision matrix.

Assume a diagonal plus low-rank (D+LR) precision matrix at time $t-1$ then predict, update and and approximate a new D+LR precision matrix.

$$ \begin{aligned} \vtheta_t &= \vtheta_{t-1} + \vu_t,\\ \vy_t &= h(\vtheta_t, \vx_t) + \ve_t, \end{aligned} $$

Update

Suppose $$ q_{t-1}(\vtheta_{t-1}) = {\cal N}\Big(\vtheta_{t-1} \cond \mu_{t-1}, \left(\vW_{t-1}\,\vW_{t-1}^\intercal + \Upsilon_{t-1}\right)^{-1}\Big). $$

Then, $$ \begin{aligned} p(\vtheta_t \cond \data_{1:t-1}) &\propto p(\vy_t \cond \vx_t, \vtheta_t)\,q_{t-1}(\vtheta)\\ &={\cal N}\left(\vtheta_t \cond \vmu_t, \vSigma_{t}^{-1}\right), \end{aligned} $$ with $$ \begin{aligned} \ve_t &= \vy_t - \hat{\vy}_t\,,\\ \vmu_t &= \vmu_{t|t-1} + \left[ \Upsilon_{t|t-1}^{-1} - \Upsilon_{t|t-1}^{-1}\,\tilde{\vW_t} \left( \vI_{d+o} + \tilde{\vW}_t^\intercal\,\Upsilon_{t|t-1}^{-1}\,\tilde{\vW}_t \right)^{-1}\,\tilde{\vW}_t^\intercal\,\Upsilon_{t|t-1}^{-1} \right]\vH_t^\intercal\,\vR_t^{-1}\,\ve_t,\\ \vSigma_{t}^{-1} &= \Upsilon_{t|t-1} + \tilde{\vW}_t\,\tilde{\vW}_t^\intercal. \end{aligned} $$

Projecting back to low-rank space

However, $\tilde{\vW}_t \in \reals^{D\times d + o}$.

LoFI maintains a $(D\times o)$ low-rank matrix approximation to $\tilde{\vW}_t$ using reduced-rank singular value decomposition (SVD).

Performance

Pros update all model parameters in one equation and adaptive on all model parameters.
Cons slowest method.

Fashion MNIST time and performance

Takeaway

Scalability of neural networks is possible with working with low rank decomposition of parameter or low-rank approximations to the posterior covariance.
Choose PULSE when the the data generating process of the features are fixed through time and we have warmup data available.
Choose LoFI if adaptivity over all model parameters are important.

Conclusion

Introduce filtering as a general framework to develop novel methods for online learning and sequential decision making.

The methods developed in this thesis

Adapt to the non-stationary nature often presented in real-world datasets and ML problems,
are robust to outliers and misspecified measurement models, and
scale to high-dimensional overparameterised neural network models.

[slides] Adaptive, robust, and scalable Bayesian filtering for online learning

Online learning and sequential decision making

Examples: online learning and sequential decision making

Bandits1

Toxic buy trade in EUR/USD market (fast online classification)

Prequential (one-step-ahead) forecasting of hourly electricity load

Recursive Bayesian online learning

Parametric probabilistic graphical model

Recursive Bayesian online learning

Recursive sequential decision making through Thompson Sampling3 4

The linear Gaussian (LG) case

Remark

The (recursive) variational Bayes (VB)

State-space models: the regression case

Probabilistic graphical model

The linearised / moment-matched state-space model

Example: moment-matched linear Gaussian for classification

Sequential Online classification

Challenges of filtering as Bayesian online learning

1. Adaptivity to non-stationary environments

2. Sensitive to model misspecification and outliers

3. They are not scalable

Contributions

Adaptivity

A motivating toy example: non-stationary moons dataset

Motivating example (cont’d): moment-matched linear Gaussian

The research question

Probablistic graphical model

The modified target posterior and prediction

The BONE modification

Methods in the literature

Example: runlength with prior reset

The takeaway

Robustness

The influence of the KF to a single outlier

The wolf method: motivation

The wolf method

The Wolf algorithm

The PIF under Wolf

Example: online learning with WoLF

Scalability

A running example: Fashion-mnist classification with a convolutional neural network

Running example: cont’d

The research question

Subspace parameters

Update

Performance

Subspace-hidden last-layer parameters

Update

Performance

A low-rank Filtering algorithm

Update

Projecting back to low-rank space

Performance

Fashion MNIST time and performance

Conclusion

Bandits¹

Recursive sequential decision making through Thompson Sampling³ ⁴