[slides] Adaptive, robust, and scalable Bayesian filtering for online learning
Online learning and sequential decision making
Observe sequence of features $\vx_t\in\reals^M$ and observations $\vy_t \in \reals^o$: $$ {\cal D}_{1:t} = \{(\vx_1, \vy_1), \ldots, (\vx_t, \vy_t)\}. $$
We model a mapping from $\vx_t$ to $\vy_t$ through a function $h: \Theta \times \reals^M \to \reals^o$ that maps model parameters $\vtheta \in \Theta$ and features $\vx_t \in \reals^M$ to outputs $\vy_t \in \reals^o$.
Given $\vx_{t+1}$, make a prediction about the next value in the sequence $$ \hat{\vy}_{t+1} = \omega(h, \vtheta, \vx_{t+1}, \data_{1:t}). $$
Examples: online learning and sequential decision making
Bandits1
Let $\mathcal{A} = {a^{(1)}, \ldots, a^{(K)}}$ be a set of actions. At every time step $t=1,\ldots,T$
- we are given a context $\vx_t$
- take action $a_t \in {\cal A}$ based on $\vx_t$
- obtain a reward $\vy_t$ based on the context $\vx_t$ and the chosen action $a_t$
Our goal is to choose the set of actions that maximise the expected reward $\sum_{t=1}^T\mathbb{E}[Y_t]$.
Example:
Given a seed playlist title and/or initial set of tracks in a playlist, predict the subsequent tracks in the playlist.2
Recommendation turned bandit problem: given user’s features, past actions, and $K$ possible songs, predict probability of the percentage of listening time a particular song, pick the song that yields the highest expected percentage of listening time.
Toxic buy trade in EUR/USD market (fast online classification)
Having a trade from a client, the broker has to decide whether to keep the trade in their book (internalise) or close the position immediately.
- Trader: buys at orange line, sell in blue region
- Broker: sells in orange line, buys in blue region

Broker wants to externalise as much toxic flow as possible and internalise as much toxic flow as possible (classification problem).
Prequential (one-step-ahead) forecasting of hourly electricity load
Given weather condition $\vx_t$, predict the hour-ahead electricity load $\vy_t$.

Recursive Bayesian online learning
Parametric probabilistic graphical model
Observation $\vy_t$ is modelled through a likelihood $p(\vy_t \cond \vtheta, \vx_t)$ with conditional mean $h(\vtheta, \vx_t)$, e.g., a neural network.

Recursive Bayesian online learning
Suppose we are given the prior (initial condition) $p(\vtheta)$ and the likelihood model $p(\vy_t \cond \vtheta, \vx_t)$ with conditional mean $h(\vtheta, \vx_t)$. We seek to obtain a recursive estimate for the posterior density $$ \begin{aligned} p(\vtheta \cond \data_{1:t}) &\propto \underbrace{p(\vtheta \cond \data_{1:t-1})}_\text{prior density}\, \underbrace{p(\vy_{t} \cond \vtheta, \vx_t)}_\text{likelihood}, \end{aligned} $$
Given $\vx_{t+1}$ and $p(\vtheta)$, the prediction about the (expected) next value in the sequence is given by $$ \hat{\vy}_{t+1} = \int h(\vtheta \cond \vx_{t+1})\,p(\vtheta \cond \data_{1:t})\,\d\vtheta =: \mathbb{E}_{p(\vtheta \cond \data_{1:t})}[h(\vtheta, \vx_{t+1})]. $$
Recursive sequential decision making through Thompson Sampling3 4
Let $\data_t = (x_t, a_t, y_t)$ be the context, action, and reward at time $t$.
At every time step $t=1,\ldots, T$, we follow the following procedure:
- Sample $\vtheta_t \sim p(\cdot \vert \mathcal{D}_{1:t-1})$
- $a_t = \arg\max_{a \in \mathcal{A}} \mathbb{E}[h(\vtheta, (\vx_t, a))]$
- Obtain $\vy_t \sim p(\cdot \cond \vx_t,a_t)$
- Store $\mathcal{D}_t = (\vx_t, a_t, \vy_t)$
- Update $p(\vtheta \cond \data_{1:t}) \propto p(\vtheta \cond \data_{1:t-1}) p(\vy \cond \vtheta, \vx_t, a)$.
Example: Beta-Bernoulli bandit with $K=4$ arms.
The linear Gaussian (LG) case
Suppose $p(\vtheta) = {\cal N}(\vtheta \cond \vmu_0, \vSigma_0)$ and $p(\vy_t \cond \vtheta, \vx_t) = {\cal N}(\vy_t \cond \vtheta^\intercal\vx_t, \vR_t)$, with known $\vR_t$. Then, $$ p(\vtheta \cond \data_{1:t}) = {\cal N}(\vtheta \cond \vmu_t, \vSigma_t), $$ with
$$ \begin{aligned} \vS_t &= \vx_t^\intercal\vSigma_{t-1}\vx_t + \vR_t,\\ \vK_t &= \vSigma_{t-1}\vx_t\vS_t^{-1},\\ \vmu_t &= \vmu_{t-1} + \vK_t\,(\vy_t - \vx_t^\intercal\,\vmu_{t-1}),\\ \vSigma_t &= \left(\vI - \vK_t\vx_t\right)\vSigma_{t-1}. \end{aligned} $$The maximum-a-posteriori (MAP) prediction having $\vx_{t+1}$ and $\data_{1:t}$ is $$ \hat{\vy}_{t+1} = \vmu_{t}^\intercal\,\vx_{t+1}. $$
Remark
The linear gaussian case with $\vmu_0 = {\bf 0}$, $\vSigma_0 = \alpha^2,\vI$, and $\vR_t = \beta^2$ for all $t \in {1,\ldots, T}$. matches the Ridge regression algorithm.

- Takeaway: we seek methods that update parameters (beliefs) learn one-data point at a time.
The (recursive) variational Bayes (VB)
For cases when $p(\vy_t \cond \vtheta, \vx_t)$ is non-linear, non-Gaussian.
$$ q_t(\vtheta) = \arg\min_{q \in {\cal Q}} {\rm KL}\left({q(\vtheta)} \,\|\, p(\vy_t \cond \vtheta)\,q_{t-1}(\vtheta)\,/\,Z_t\right), $$yields the fixed-point equations $$ \begin{aligned} \vmu_t &= \vmu_{t-1} + \vSigma_{t-1}\,\mathbb{E}_{q_{t}}[\nabla_\vtheta \log p(\vy_t \cond \vtheta, \vx_t)],\\ \vSigma_t^{-1} &= \vSigma_t^{-1} - \mathbb{E}_{q_{t}}\left[\nabla^2_\vtheta \log p(\vy_t \cond \vtheta, \vx_t)\right]. \end{aligned} $$
State-space models: the regression case
Generalising the sequential online learning problem assuming that parameters evolve over time.
$$ \begin{aligned} p(\vtheta_t \cond \vtheta_{t-1}) &= {\cal N}(\vtheta \cond f(\vtheta_{t-1}), \vQ_t),\\ p(\vy_t \cond \vtheta, \vx_t) &= {\cal N}(\vy_t \cond h(\vtheta_t, \vx_t), \vR_t). \end{aligned} $$Why?
- Tackling misspecification.
- Numerical stability.
Probabilistic graphical model

The linearised / moment-matched state-space model
Take $f(\vtheta) = \vtheta$ (random walk assumption) and suppose $h(\vtheta, \vx_t)$ is differentiable w.r.t. $\vtheta$. A first-order approximation of $h$ around the previous mean $\vmu_{t-1}$ yields $$ h(\vtheta, x) \approx \bar{h}_t(\vtheta, x) = H_t\,(\vtheta_t - \vmu_{t-1}) + h(\vmu_{t-1}, x_t). $$
$$ p(y_t \vert \vtheta, x_t) \approx {\cal N}(y_t\,\vert\,\hat{y}_t, R_t), $$ where $R_t$ is the moment-matched variance of the observation process.
The MAP prediction having $\vx_{t+1}$ and $\data_{1:t}$ is $$ \hat{\vy}_{t+1} = \mathbb{E}[\bar{h}(\vtheta_t, \vx_{t+1}) \vert {\cal D}_{1:t}] = h(\vmu_{t}, \vx_{t+1}). $$
Example: moment-matched linear Gaussian for classification
$$ \begin{aligned} \vH_t &= \nabla_\theta h(\vmu_{t-1}, \vx_t) & \text{(Jacobian)}\\ \hat{\vy}_t & = h(\vmu_{t-1}, \vx_t) & \text{ (MAP prediction)} \\ \vR_t &= \hat{\vy}_t\,(1 - \hat{\vy}_t) & \text{ (moment-matched variance)}\\ \vSigma_t^{-1} &= \vSigma_{t-1}^{-1} + \vH_t^\intercal\,\vR_t^{-1}\,\vH_t & \text{(posterior precision)}\\ {\bf K}_t &= \vSigma_{t}\,\vH_t\,{\vR}_t^{-1} & \text{(gain matrix)}\\ \hline \vmu_t &\gets \vmu_{t-1} + \vK_t\,(\vy_t - \hat{\vy}_t) & \text{(update)} \end{aligned} $$Sequential Online classification
Online Bayesian learning using a single hidden-layer neural network and moment-matched updates.

Challenges of filtering as Bayesian online learning
1. Adaptivity to non-stationary environments
- The methods above (LG and VB) assume that the data generating process is fixed / known.
- Mean estimate $\mu_T$ under LG converges to a point estimate as $T \to \infty$ regardless of error, i.e., $\Sigma_T \to 0 {\bf I}$.
- They are sequential, but do not have a notion of non-stationarity (model misspecification).
2. Sensitive to model misspecification and outliers
- Gaussian assumption over model parameters and linearised likelihood renders the posterior sensitive to outliers and model misspecification.
3. They are not scalable
- Overall computational cost depends on model size.
- For EKF, e.g., is $O(D^2)$ in memory and $O(D^3)$ in time.
Contributions
- a modular framework that enables the development adaptive approaches for online learning;
- a novel, provably robust filter with similar computational cost to standard filters, that employs Generalised Bayes; and
- a set of tools for sequentially updating model parameters using approximate second-order optimisation methods that exploit the overparametrisation of high-dimensional parametric models such as neural networks.
Adaptivity
Generalised Bayesian online learning in non-stationary environments: BONE.
A motivating toy example: non-stationary moons dataset

Motivating example (cont’d): moment-matched linear Gaussian

The research question
How do we adapt to changes in the data-generating process when the underlying dynamics are unknown?
Probablistic graphical model
Introduce an auxiliary variable $\psi_t$ that models the non-stationarity.

The modified target posterior and prediction
We seek to obtain recursive estimates for the posterior density $$ \begin{aligned} p(\vtheta_t, \psi_t \cond \data_{1:t}) &\propto p(\vtheta_t, \psi_t \cond \data_{1:t-1})\, p(\vy_{t} \cond \vtheta_t, \vx_t)\\ &= p(\psi_t \cond \data_{1:t-1})\, p(\vtheta_t \cond \psi_t, \data_{1:t-1})\, p(\vy_{t} \cond \vtheta_t, \vx_t). \end{aligned} $$
A one-step-ahead prediction is given by $$ \begin{aligned} \hat{\vy}_{t+1} &= \mathbb{E}_{p(\vtheta_t, \psi_t \cond \data_{1:t})}[h(\vtheta_t,\vx_{t+1})]\\ &= \sum_{\psi_t \in \Psi_t}p(\psi_t \cond \data_{1:t}) \int h(\vtheta_t, \vx_{t+1}) p(\vtheta_t \cond \psi_t, \data_{1:t})\d\vtheta_t. \end{aligned} $$
remark: hard to specify all components and its dynamics
The BONE modification
We find the many methods in the literature can be written as a modified (generalised Bayes) version of the posterior predictive.
$$ \begin{aligned} &q(\vtheta_t;\,\psi_t, \data_{1:t}) \propto \overbrace{\pi(\vtheta_t;\, \psi_t, \data_{1:t-1})}^\text{\rm prior}\, \overbrace{\exp(-\ell(\vy_t; \vtheta_t, \vx_t))}^\text{\rm loss for model},\\ &\hat{\vy}_t := \sum_{\psi_t \in \Psi_t} \underbrace{\nu(\psi_t \cond \data_{1:t})}_{\text{\rm weight}}\, \int \underbrace{h(\vtheta_t, \vx_{t+1})}_{\text{\rm model}}\, \underbrace{q(\vtheta_t;\, \psi_t, \data_{1:t})}_{\text{\rm posterior}} \d\vtheta_t. \end{aligned} $$Method specification
- Auxiliary variable $\psi$: model for changes in the data-generating process.
- Weight $\nu$: belief about value of auxiliary variable.
- (Conditional) prior $\pi$: ad-hoc modification on parameters before observing new datapoint
- Posterior $q$: generalised-Bayes posterior (prior X lossfn).
- Model $h$: mapping from parameters and features to outputs.
Methods in the literature

Example: runlength with prior reset
Online Bayesian learning using (M.1) a single hidden-layer neural network and (A.1) moment-matched LG with a runlenght auxiliary variable.

The takeaway
- There are various methods in the literature that tackle the problem of adapting to non-stationary environments
from a Bayesian perspective:
- Segmentation,
- prequential forecasting,
- bandits, and
- online continual learning.
We show that these methods are all instances of a unifying framewosk we call BONE: generalised (B)ayesian (O)nline learning in (N)on-stationary (E)nvironments.
Robustness
The weighted-observation likelihood filter.
Online learning and the problem of outliers

The influence of the KF to a single outlier
For posterior probability density $q$, let the posterior influence function (PIF) $$ {\rm PIF}_q(\vy_t^{c},\data_{1:t}) = {\rm KL} \left( q(\vtheta_t \cond \data_{1:t}^c) \,\|\, q(\vtheta_t \cond \data_{1:t}) \right), $$ where $\data_{1:t}^c = \data_{1:t-1} \cup (\vx_t, \vy^c)$.
Then, for KF, the influence is unbounded, i.e., $$ \sup_{\vy_t^c \in \reals^o}|{\rm PIF}_q(\vy_t^{c},\data_{1:t})| = \infty. $$
The wolf method: motivation
Many methods in the literature that robustify against outliers. We seek one that is
- easy to implement,
- provably robust (PIF sense),
- scales at a similar compute cost to KF.
The wolf method
$$ q(\vtheta_t \cond \data_{1:t}) \propto \exp(-\ell_t(\vtheta_t))\,q(\vtheta_t \cond \data_{1:t-1}). $$with $$ \ell_t(\vtheta_t) = -W^2(\vy_t, \hat{\vy}_t)\,\log{\cal N}\left(\vy_t \cond \hat{\vy}_t\, \vR_t\right). $$
The Wolf algorithm
Under linear dynamics, the WoLF algorithm modifies the update equations of the KF to be:
$$ q_{\rm WOLF}(\vtheta_{t} \cond \data_{1:t}) = {\cal N}(\vtheta_t \cond \vmu_{t}, \vSigma_{t}). $$with, $$ \begin{aligned} \hat{\vy}_t &= \vH_t\,\vmu_{t|t-1}\\ \vS_t &= \vH_t\,\vSigma_{t|t-1}\,\vH_t^\intercal + \vR_t / {\color{orange}{W(\vy_t, \hat{\vy}_t)}},\\ \vK_t &= \vSigma_{t|t-1}\vH_t^\intercal\,\vS_t^{-1},\\ \vmu_t &= \vmu_{t|t-1} + \vK_t\,(\vy_t - \hat{\vy}_t),\\ \vSigma_t &= \left(\vI - \vK_t\vH_t^\intercal\right)\vSigma_{t|t-1}, \end{aligned} $$ and $$ W(\vy_t, \hat{\vy}_t) =\left(1+\frac{\|\vR_t^{-1/2}(\vy_t - \hat{\vy}_t)\|_2^2}{c^{2}}\right)^{-1/2}. $$
The PIF under Wolf
Then, for WoLF, the influence is bounded, i.e., $$ \sup_{\vy_t^c \in \reals^o}|{\rm PIF}_{q_{\rm wolf}}(\vy_t^{c},\data_{1:t})| < \infty. $$
Example: online learning with WoLF

Scalability
- An exploration on SSM assumptions and ways to update the SSMs in high-dimensional, misspecified models: subspace, last-layer and subspace, and low-rank VB.
A running example: Fashion-mnist classification with a convolutional neural network
- Consider a modified LeNet5 convolutional neural network with
d=62_556
parameters. - We seek to perform online learning and classification of the Fashion MNIST dataset.

Running example: cont’d
Under EKF of CNN, we require storage of mem_gb = d ** 2 * 32 * 0.125 * 10 ** (-9) == 125.2
(gigabyte) per step.
The research question
How do we perform efficient and scalable online learning with deep neural networks?
Subspace parameters
Use lottery ticket hypothesis for scalable online learning.
For a neural network with $D$ parameters, rewrite SSM so that inference occurs in a subspace vector $\vz \in \reals^d$, where $d \ll D$.
$$ \begin{aligned} \vz_t &= \vz_{t-1} + \vu_t,\\ \vtheta_t &:= {\bf A}\,\vz_t + \vtheta_*,\\ \vy_t &= h(\vtheta_t, \vx_t) + \ve_t, \end{aligned} $$ with $\vA \in \reals^{D\times d}$.
Update
Follows KF-like equations with modified projection matrix $\vA$.
$$ \begin{aligned} \vH_t &= \nabla_\vtheta h(\vA\,\vmu_{t|t-1} + \vtheta_*, \vx_t)\\ \vS_t &= \vH_t\,\vA\,\vSigma_{t-1}\,\vA^\intercal\,\vH_t^\intercal + \vR_t\\ \vK_t &= \vSigma_{t|t-1}\,\vA^\intercal\,\vH_t^\intercal\,\vS_t^{-1}\\ \vmu_t &= \vmu_{t-1} + \vK_t(\vy_t - \vH_t\,\vA\,\vmu_{t|t-1})\\ \vSigma_t &= \vSigma_{t|t-1} - \vK_t\,\vA^\intercal\,\vH_t^\intercal\vSigma_{t|t-1} \end{aligned} $$Performance
Pros: Efficient and simple to implement.
Cons Required estimation of $\vA$ offline.

Subspace-hidden last-layer parameters
Use lottery ticket hypothesis and last-layer neural networks for scalable online learning.
Divide last-layer and model parameters, then perform online VB update.
$$ \begin{aligned} \vz_t &= \vz_{t-1} + \vu_{t}^{\rm hidden},\\ \vw_t &= \vw_{t-1} +\vu_t^{\rm last},\\ \vtheta_t &= ({\bf A}\,\vz_t + \psi_*, \vw_t),\\ \vy_t &= h(\vtheta_t, \vx_t) + \ve_t, \end{aligned} $$Update
$$ q_t(\vw, \vz) = {\cal N}(\vw \cond \nu_t, \Gamma_t)\,{\cal N}(\vz \cond \mu_t, \Sigma_t), $$ with $$ \nu_t, \mu_t, \Gamma_t, \Sigma_t = \argmin_{\nu, \mu, \Gamma, \Sigma} {\rm KL} \left( {\cal N}(\vw \cond \nu, \Gamma)\, {\cal N}(\vz \cond \mu, \Sigma) \,\|\, q_{t-1}(\vw, \vz)\, p(\vy_t \cond \vtheta, \vx_t) \right). $$
Performance
Pros explicit modelling of the last-layer and better performance than subspace method.
Cons required estimation of $\vA$ offline and slower than the subspace methods.

A low-rank Filtering algorithm
Use factor analysis ideas to keep track of a low-rank plus diagonal precision matrix.
Assume a diagonal plus low-rank (D+LR) precision matrix at time $t-1$ then predict, update and and approximate a new D+LR precision matrix.
$$ \begin{aligned} \vtheta_t &= \vtheta_{t-1} + \vu_t,\\ \vy_t &= h(\vtheta_t, \vx_t) + \ve_t, \end{aligned} $$Update
Suppose $$ q_{t-1}(\vtheta_{t-1}) = {\cal N}\Big(\vtheta_{t-1} \cond \mu_{t-1}, \left(\vW_{t-1}\,\vW_{t-1}^\intercal + \Upsilon_{t-1}\right)^{-1}\Big). $$
Then, $$ \begin{aligned} p(\vtheta_t \cond \data_{1:t-1}) &\propto p(\vy_t \cond \vx_t, \vtheta_t)\,q_{t-1}(\vtheta)\\ &={\cal N}\left(\vtheta_t \cond \vmu_t, \vSigma_{t}^{-1}\right), \end{aligned} $$ with $$ \begin{aligned} \ve_t &= \vy_t - \hat{\vy}_t\,,\\ \vmu_t &= \vmu_{t|t-1} + \left[ \Upsilon_{t|t-1}^{-1} - \Upsilon_{t|t-1}^{-1}\,\tilde{\vW_t} \left( \vI_{d+o} + \tilde{\vW}_t^\intercal\,\Upsilon_{t|t-1}^{-1}\,\tilde{\vW}_t \right)^{-1}\,\tilde{\vW}_t^\intercal\,\Upsilon_{t|t-1}^{-1} \right]\vH_t^\intercal\,\vR_t^{-1}\,\ve_t,\\ \vSigma_{t}^{-1} &= \Upsilon_{t|t-1} + \tilde{\vW}_t\,\tilde{\vW}_t^\intercal. \end{aligned} $$
Projecting back to low-rank space
However, $\tilde{\vW}_t \in \reals^{D\times d + o}$.
- LoFI maintains a $(D\times o)$ low-rank matrix approximation to $\tilde{\vW}_t$ using reduced-rank singular value decomposition (SVD).
Performance
Pros update all model parameters in one equation and adaptive on all model parameters.
Cons slowest method.

Fashion MNIST time and performance

Takeaway
- Scalability of neural networks is possible with working with low rank decomposition of parameter or low-rank approximations to the posterior covariance.
- Choose PULSE when the the data generating process of the features are fixed through time and we have warmup data available.
- Choose LoFI if adaptivity over all model parameters are important.
Conclusion
Introduce filtering as a general framework to develop novel methods for online learning and sequential decision making.
The methods developed in this thesis
- Adapt to the non-stationary nature often presented in real-world datasets and ML problems,
- are robust to outliers and misspecified measurement models, and
- scale to high-dimensional overparameterised neural network models.