[slides] Bayesian online learning in non-stationary environments

Posted on Dec 4, 2024
tl;dr: A unifying framework of generalised Bayesian methods for online learning in non-stationary environments.

(B)ayesian (O)nline learning in (N)on-stationary (E)nvironments


Motivation

In recent years there’s been a surge in developing machine learning (ML) methods that tackle non-stationarity.

  • Contextual bandits.
  • Test-time adaptation.
  • Reinforcement learning.
  • (Online) continual learning — catastrophic forgetting / plasticity-stability tradeoff.
  • Dataset shift — covariate shift, prior probability shift, domain shift, etc.

Various Bayesian methods have been proposed to tackle some of the problems above.

However, there hasn’t been a clear way to distinguish these Bayesian methods, nor enough recognition of methods outside the ML literature that also tackle non-stationarity (and that can be applied to ML problems).


The BONE framework:

  • Unify various methods in different fields that tackle inference under non-stationarity
    • machine learning — contextual bandits / continual learning / reinforcement learning
    • statistics — segmentation / switching state-space models
    • engineering — system identification / filtering

Why the BONE framework?

Allows us to

  1. Categorise existing methods under a common (algorithmic) language.
  2. Plug-and-play implementation in JAX: github.com/gerdm/BONE.
  3. Extend application of existing methods (use method developed for field A and apply to field B).
  4. Develop new methods.
  • BONE methods can be seen as a form of generalised state-space model whose inference requires
    • 3 modelling choices and
    • 2 algorithmic choices.

Online (incremental / streaming / sequential) learning

  • Targets $\vy_t \in \reals^o$, features $\vx_t\in\reals^M$.
  • Datapoint $\data_t = (\vx_t, \vy_t)$.
  • Dataset $\data_{1:t} = (\data_1, \ldots, \data_t)$
  • Goal: Estimate $\vy_{t+1}$ given $\vx_{t+1}$ and $\data_{1:t}$.
[Figure: incremental dataset]

Parametric Bayes for online learning: Choice of (M.1) and (A.1)

Let $h(\vtheta, \vx) = \mathbb{E}[\vy \cond \vtheta, \vx]$, with $h: \reals^D\times\reals^M \to \reals^o$, be a model for the mean of $\vy$ conditioned on latent variables (parameters) $\vtheta$ and features $\vx$ (the measurement model), e.g., a neural network (M.1).

Given $\vx_{t+1}$ and $\data_{1:t}$, the prediction (posterior predictive mean) is $$ \hat{\vy}_{t+1} = \mathbb{E}[h(\vtheta_t, \vx_{t+1}) \cond \data_{1:t}] = \int \underbrace{h(\vtheta_t, \vx_{t+1})}_\text{measurement fn.} \overbrace{p(\vtheta_t \cond \data_{1:t})}^\text{posterior density} \,{\rm d}\vtheta_t. $$

Bayesian online learning amounts to finding a sequence of posterior densities.

$$ \begin{aligned} p(\vtheta) &\to& p(\vtheta \cond \data_1) &\to& p(\vtheta \cond \data_{1:2}) &\to& \ldots &\to& p(\vtheta \cond \data_{1:t})\\ \downarrow & & \downarrow & & \downarrow & & & & \downarrow \\ \hat{\vy}_{1} & & \hat{\vy}_{2} & & \hat{\vy}_{3} & & & & \hat{\vy}_{t+1} \\ \uparrow & & \uparrow & & \uparrow & & & & \uparrow \\ \vx_1 & & \vx_2 & & \vx_3 & & & & \vx_{t+1} \\ \end{aligned} $$

Recursive posterior estimation (A.1)

$$ p(\vtheta \cond \data_{1:t}) \propto \underbrace{ p(\vtheta \cond \data_{1:t-1}) }_\text{prior} \, \underbrace{ p(\vy_t \cond \vtheta, \vx_t) }_\text{likelihood}. $$
  • The closed-form solution is intractable except in a few special cases, e.g., linear Gaussian likelihoods with Gaussian priors, or other conjugate prior–likelihood pairs (see the sketch after this list).
  • Need to specify a density for $\vy$ whose conditional mean (given $\vtheta$ and $\vx$) is $h(\vtheta, \vx)$.
  • In most cases, we resort to an approximation. We denote the approximation to the posterior by the algorithmic choice (A.1).
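
Below is a minimal sketch of this recursion in the conjugate case, using a Beta-Bernoulli model in JAX; the prior hyperparameters and data stream are illustrative.

```python
# Recursive Bayes in a conjugate special case: Beta prior, Bernoulli likelihood.
# The posterior after datapoint t becomes the prior for datapoint t+1.
import jax
import jax.numpy as jnp

def step(ab, y):
    alpha, beta = ab
    y_hat = alpha / (alpha + beta)                    # one-step-ahead predictive mean
    ab_new = jnp.array([alpha + y, beta + 1.0 - y])   # closed-form posterior update
    return ab_new, y_hat

ys = jnp.array([1.0, 1.0, 0.0, 1.0, 1.0, 1.0, 0.0, 1.0])  # illustrative stream
ab0 = jnp.array([1.0, 1.0])                           # Beta(1, 1) prior
ab_T, y_hats = jax.lax.scan(step, ab0, ys)            # sequence of predictions
```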

(Recursive) variational Bayes methods

Suppose $p(\vtheta \cond \data_{1:t-1}) = {\cal N}(\vtheta \cond \mu_{t-1}, \Sigma_{t-1})$: $$ (\mu_t, \Sigma_t) = \argmin_{\mu, \Sigma}\, {\bf D}_\text{KL} \left( {\cal N}(\vtheta \cond \mu, \Sigma) \,\|\, p(\vtheta \cond \data_{1:t-1})\,p(\vy_t \cond \vtheta, \vx_t) \right). $$

A convenient choice: the density is specified by its first two moments only.


Moment-matched linear Gaussian (LG)

  • Linearise measurement function (M.1) around the previous mean $\vmu_{t-1}$ and model likelihood as linear Gaussian.
  • Suppose $p(\vtheta \cond \data_{1:t-1}) = {\cal N}(\vtheta \cond \vmu_{t-1}, \vSigma_{t-1})$. Consider the likelihood $$ \begin{aligned} h(\vtheta, \vx_t) &\approx \bar{h}_t(\vtheta, \vx_t) = H_t\,(\vtheta - \vmu_{t-1}) + h(\vmu_{t-1}, \vx_t),\\ p(\vy_t \cond \vtheta, \vx_t) &\approx {\cal N}(\vy_t \cond \bar{h}_t(\vtheta, \vx_t), R_t), \end{aligned} $$ where $R_t$ is the moment-matched variance of the observation process.

Then $$ p(\vtheta \cond \data_{1:t}) = {\cal N}(\vtheta \cond \vmu_t, \vSigma_t). $$

The one-step-ahead forecast is $$ \begin{aligned} \hat{\vy}_{t+1} &= \mathbb{E}[\bar{h}(\vtheta_t, \vx_{t+1}) \cond \data_{1:t}] = h(\vmu_{t}, \vx_{t+1}). \end{aligned} $$


A running example: online classification with neural networks

  • Online Bayesian learning using (M.1) a two-hidden-layer neural network and (A.1) moment-matched LG.
  • Data is seen only once.
  • We evaluate the exponentially-weighted moving average of the accuracy in the one-step-ahead forecast.
[Figure: linreg]

Non-linear classification (M.1)

$$ h(\theta, x) = \sigma(\phi_\theta(x)), $$ with $\phi_\theta: \reals^M \to \reals$ a two-hidden-layer neural network with a real-valued output unit and $\sigma$ the sigmoid function. Then, $p(y_t \cond \theta, x_t) = {\rm Bern}(y_t \cond h(\theta, x_t))$.
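
A sketch of this measurement model over a flat parameter vector, so it can plug into the Gaussian updates that follow; the layer widths and initialisation are illustrative.

```python
# (M.1) as a function of a flat parameter vector theta, using a
# two-hidden-layer MLP with a sigmoid output. Widths are illustrative.
import jax
import jax.numpy as jnp
from jax.flatten_util import ravel_pytree

def init_params(key, n_in, n_hidden=20):
    k1, k2, k3 = jax.random.split(key, 3)
    params = {
        "W1": jax.random.normal(k1, (n_hidden, n_in)) / jnp.sqrt(n_in),
        "b1": jnp.zeros(n_hidden),
        "W2": jax.random.normal(k2, (n_hidden, n_hidden)) / jnp.sqrt(n_hidden),
        "b2": jnp.zeros(n_hidden),
        "w3": jax.random.normal(k3, (n_hidden,)) / jnp.sqrt(n_hidden),
        "b3": jnp.zeros(()),
    }
    theta0, unravel = ravel_pytree(params)  # flat vector <-> parameter pytree
    return theta0, unravel

def h(theta, x, unravel):
    """Measurement function: E[y | theta, x] for Bernoulli y."""
    p = unravel(theta)
    z = jnp.tanh(p["W1"] @ x + p["b1"])
    z = jnp.tanh(p["W2"] @ z + p["b2"])
    return jax.nn.sigmoid(p["w3"] @ z + p["b3"])  # sigma(phi_theta(x))
```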


Example: moment-matched linear Gaussian for classification

Bayes rule under linearisation and a Gaussian assumption yields a second-order, SGD-type update.

$$ \begin{aligned} \vH_t &= \nabla_\theta h(\vmu_{t-1}, \vx_t) & \text{(Jacobian)}\\ \hat{y}_t & = h(\vmu_{t-1}, \vx_t) & \text{(one-step-ahead prediction)} \\ \vR_t &= \hat{y}_t\,(1 - \hat{y}_t) & \text{(moment-matched variance)}\\ \vS_t &= \vH_t\,\vSigma_{t-1}\,\vH_t^\intercal + \vR_t & \text{(innovation variance)}\\ \vK_t &= \vSigma_{t-1}\vH_t\,\vS_t^{-1} & \text{(gain matrix)}\\ \hline \vmu_t &\gets \vmu_{t-1} + \vK_t\,(y_t - \hat{y}_t) & \text{(update mean)}\\ \vSigma_t &\gets \vSigma_{t-1} - \vK_t\,\vS_t\vK_t^\intercal & \text{(update covariance)}\\ p(\vtheta_t \cond \data_{1:t}) &\gets {\cal N}(\vtheta_t \cond \vmu_t, \vSigma_t) &\text{(posterior density)} \end{aligned} $$
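
A sketch of this update in JAX for scalar Bernoulli observations. For self-containment, h is logistic regression here rather than the neural network (a flat-parameter MLP like the one sketched above would slot in the same way); the small jitter term is an assumption to keep $R_t$ positive.

```python
# One moment-matched linear-Gaussian update step, mirroring the equations above.
import jax
import jax.numpy as jnp

def h(theta, x):
    return jax.nn.sigmoid(theta @ x)            # (M.1): stand-in measurement mean

def lg_update(mu, Sigma, x, y, eps=1e-6):
    H = jax.grad(h)(mu, x)                      # Jacobian of the scalar output
    y_hat = h(mu, x)                            # one-step-ahead prediction
    R = y_hat * (1.0 - y_hat) + eps             # moment-matched variance (jittered)
    S = H @ Sigma @ H + R                       # innovation variance (scalar)
    K = Sigma @ H / S                           # gain vector
    mu_new = mu + K * (y - y_hat)               # update mean
    Sigma_new = Sigma - S * jnp.outer(K, K)     # update covariance
    return mu_new, Sigma_new

# usage: stream a single datapoint through the update
mu, Sigma = jnp.zeros(3), jnp.eye(3)
mu, Sigma = lg_update(mu, Sigma, jnp.array([1.0, 0.5, -0.2]), 1.0)
```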

Online classification with neural networks

$$ \begin{aligned} p(\vtheta_0) &\to& p(\vtheta_1 \cond \data_1) &\to& \ldots &\to& p(\vtheta_t \cond \data_{1:t})\\ (\vmu_0, \vSigma_0) &\to& (\vmu_1, \vSigma_1) &\to& \ldots &\to& (\vmu_t, \vSigma_t)\\ & & \uparrow & & & & \uparrow\\ & & \data_1 & & \ldots & & \data_t. \end{aligned} $$

[Figure: sequential classification with a static DGP]

Changes in the data-generating process

Simply applying Bayes rule to a parametrised model fails under model misspecification (which describes most ML problems) and lack of model capacity.

In this case, conditioning on more data does not lead to better performance.


Non-stationary moons dataset: an online continual learning example

The DGP changes every 200 steps. The agent is not aware of the changepoints.

Goal: Estimate the class $\vy_{t+1}$ given data $\data_{1:t}$ and $\vx_{t+1}$.

[Figure: non-stationary moons, split by task]

The full dataset (without knowledge of the task boundaries)

[Figure: non-stationary moons, full dataset]

Static Bayesian updates — non-stationary moons

  • Online Bayesian learning using (M.1) a single hidden-layer neural network and (A.1) moment-matched LG.
  • Keep applying Bayes rule as new data streams in.
  • We observe the so-called loss of plasticity.
[Figure: sequential classification with a varying DGP]

Is being “Bayesian” enough for adaptivity?

Some machine learning works have implied that being “Bayesian” endows adaptivity, and that it is the lack of a good approximation to the posterior that limits the ability to adapt.

Continual learning […] is a form of online learning in which data continuously arrive in a possibly non i.i.d way. […] There already exists an extremely general framework for continual learning: Bayesian inference. — Nguyen et al., “Variational Continual Learning” (2017)

  • In our work, we argue that the approximation to the posterior (A.1) and choice of measurement model (M.1) can be considered separately from the adaptation strategy.

Tackling non-stationarity in a Bayesian way: we cannot update what we don’t model.

Fix (M.1) and (A.1). Non-stationarity is modelled through

  • (M.2) an auxiliary variable,
  • (M.3) the effect of the auxiliary variable on the prior, and
  • (A.2) a posterior over choices of auxiliary variables.

Considering classes of (M.2) and (M.3) recovers various ways in which non-stationarity has been tackled.

[Figure: overview of BONE methods]

Tracking regimes through an auxiliary variable (M.2)

  1. Denote by $\psi_t$ the auxiliary variable.
  2. The nature of $\psi_t$ determines how we track regimes.

Runlength (RL)

  • Number of timesteps since the last changepoint (adaptive lookback window).
  • $\psi_t = r_t \geq 0 $.
[Figure: runlength auxiliary variable]

Runlength with changepoint count (RLCC)

  • Number of timesteps since the last changepoint (lookback window) and count of the number of changepoints.
  • $\psi_t = (r_t, c_t)$.
[Figure: runlength and changepoint-count auxiliary variable]

Changepoint location (CPL)

  • Binary sequence of changepoint locations.
  • Encodes number of timesteps since last changepoint, count of changepoints, and location of changepoints.
  • $\psi_t = s_{1:t} \in \{0,1\}^t$.
[Figure: changepoint location auxiliary variable]

Changepoint location (CPL) alt.

  • Binary sequence of values belonging to the current regime.
  • Allows for non-consecutive data conditioning.
  • $\psi_t = s_{1:t} \in \{0,1\}^t$.
[Figure: changepoint location auxiliary variable]

Changepoint probability (CPP)

  • Changepoint probabilities.
  • $\psi_t = \upsilon_t \in [0,1]$.
[Figure: changepoint probability auxiliary variable]

Mixture of experts (ME)

  • Choice over a fixed number of models.
  • $\psi_t = \alpha_t \in \{1, \ldots, K\}$.
[Figure: mixture-of-experts auxiliary variable]

Constant (C)

  • Single choice of model.
  • $\psi_t = c$.
[Figure: constant auxiliary variable]

Summary of auxiliary variables

[Table: summary of auxiliary variables]

Modifying beliefs based on regime (M.3)

  • Construct the posterior based on two modelling choices:
    • (M.3) a conditional prior $\pi$ and
    • (M.1) a choice of loss function $\ell$.
$$ \underbrace{q(\vtheta_t; \psi_t, \data_{1:t})}_{\text{posterior (A.1)}} \propto \underbrace{\pi(\vtheta_t \cond \psi_t, \data_{1:t-1})}_\text{past information (M.3)}\, \overbrace{ \exp\left(-\ell(\vy_t;\,\vtheta_t, \vx_t)\right) }^\text{current information (M.1)} $$

The idea of the conditional prior $\pi$ as a “forgetting operator” is formalised in Kulhavý and Zarrop, 1993.


Choice of conditional prior (M.3) — the Gaussian case

$$ \pi(\vtheta_t \cond \psi_t,\, \data_{1:t-1}) = {\cal N}\big(\vtheta_t \cond g_{t}(\psi_t, \data_{1:t-1}), G_{t}(\psi_t, \data_{1:t-1})\big), $$
  • $g_t(\cdot, \data_{1:t-1}): \Psi_t \to \reals^D$ — mean vector of model parameters.
  • $G_t(\cdot, \data_{1:t-1}): \Psi_t \to \reals^{D\times D}$ — covariance matrix of model parameters.

C-Static — constant auxiliary variable with static update

  • $\psi_t = c$.
  • Classical (static) Bayesian update: $$ \begin{aligned} g_t(c, \data_{1:t-1}) &= \mu_{t-1}\\ G_t(c, \data_{1:t-1}) &= \Sigma_{t-1}\\ \end{aligned} $$

RL-PR — runlength with prior reset

  • $\psi_t = r_t$.
  • Corresponds to the Bayesian online changepoint detection (BOCD) algorithm (Adams and MacKay, 2007).
$$ \begin{aligned} g_t(r_t, \data_{1:t-1}) &= \mu_0\,\mathbb{1}(r_t = 0) + \mu_{(r_{t-1})}\mathbb{1}(r_t > 0),\\ G_t(r_t, \data_{1:t-1}) &= \Sigma_0\,\mathbb{1}(r_t = 0) + \Sigma_{(r_{t-1})}\mathbb{1}(r_t > 0),\\ \end{aligned} $$

where $\mu_{(r_{t-1})}, \Sigma_{(r_{t-1})}$ denote the posterior belief computed from observations at indices $t - r_t$ to $t - 1$, and $\mu_0$, $\Sigma_0$ are the pre-defined prior mean and covariance.


CPP-OU — changepoint probability with Ornstein-Uhlenbeck process

  • $\psi_t = \upsilon_t$.
  • Mean reversion to the prior as a function of the probability of a changepoint: $$ \begin{aligned} g_t(\upsilon_t, \data_{1:t-1}) &= \upsilon_t \mu_{t-1} + (1 - \upsilon_t) \mu_0 \,,\\ G_t(\upsilon_t, \data_{1:t-1}) &= \upsilon_t^2 \Sigma_{t-1} + (1 - \upsilon_t^2) \Sigma_0\,. \end{aligned} $$

CPL-sub — changepoint location with subset of data

  • $\psi_t = s_{1:t}$.

$$ \begin{aligned} g_t(s_{1:t}, \data_{1:t-1}) &= \mu_{(s_{1:t})},\\ G_t(s_{1:t}, \data_{1:t-1}) &= \Sigma_{(s_{1:t})},\\ \end{aligned} $$ where $\mu_{(s_{1:t})}, \Sigma_{(s_{1:t})}$ denote the posterior beliefs computed using only the observations for which $s_i = 1$.


C-ACI — constant with additive covariance inflation

  • $\psi_t = c$.
  • Random-walk assumption. Inject noise at every new timestep. Special case of a linear state-space model (SSM). $$ \begin{aligned} g_t(c, \data_{1:t-1}) &= \mu_{t-1},\\ G_t(c, \data_{1:t-1}) &= \Sigma_{t-1} + Q_t.\\ \end{aligned} $$ Here, $Q_t$ is a positive semi-definite matrix. Typically, $Q_t = \alpha {\bf I}$ with $\alpha > 0$. (See the sketch below for this and the previous conditional priors as code.)
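
A sketch of three of the Gaussian conditional priors above as plain functions from the previous belief (and the auxiliary variable) to the prior used at time $t$; $\mu_0$, $\Sigma_0$ are the pre-defined prior and all arguments are illustrative.

```python
# (M.3) in the Gaussian case: each function returns (g_t, G_t).
import jax.numpy as jnp

def c_static(mu_prev, Sigma_prev, mu0, Sigma0):
    return mu_prev, Sigma_prev                        # classical Bayesian update

def cpp_ou(mu_prev, Sigma_prev, mu0, Sigma0, upsilon):
    mu = upsilon * mu_prev + (1.0 - upsilon) * mu0    # revert mean towards prior
    Sigma = upsilon**2 * Sigma_prev + (1.0 - upsilon**2) * Sigma0
    return mu, Sigma

def c_aci(mu_prev, Sigma_prev, mu0, Sigma0, alpha=1e-4):
    Q = alpha * jnp.eye(mu_prev.shape[0])             # additive covariance inflation
    return mu_prev, Sigma_prev + Q
```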

A prediction conditioned on $\psi_t$ and $\data_{1:t}$

$$ \hat{y}_{t+1}^{(\psi_t)} = \mathbb{E}_{q_t}[h(\theta_t;\, x_{t+1}) \cond \psi_t] := \int h(\theta_t;\, x_{t+1})\,q(\theta_t;\,\psi_t, \data_{1:t})d \theta_t\,. $$

(A.2) Weighting function for regimes

An algorithmic choice to weight over elements $\psi_t \in \Psi_t$.


(A.2) The recursive Bayesian choice

$$ \begin{aligned} \nu_t(\psi_t) &= p(\psi_t \cond \data_{1:t})\\ &\propto p(y_t \cond x_t, \psi_t, \data_{1:t-1})\, \sum_{\psi_{t-1} \in \Psi_{t-1}} p(\psi_{t-1} \cond \data_{1:t-1})\, p(\psi_t \cond \psi_{t-1}, \data_{1:t-1}). \end{aligned} $$

For certain choices of $\psi_t$ and transition model $p(\psi_t \cond \psi_{t-1})$, this yields fully recursive update rules, as in the runlength sketch below.
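
A sketch of this recursion for runlength hypotheses (the BOCD case), assuming a constant changepoint (hazard) probability and a user-supplied vector of per-runlength predictive densities $p(y_t \cond x_t, r_{t-1}, \data_{1:t-1})$.

```python
# (A.2) recursive weights over runlengths, BOCD-style.
import jax.numpy as jnp

def runlength_weights(nu_prev, pred_lik, hazard=0.01):
    """nu_prev[r] = p(r_{t-1} = r | D_{1:t-1}); pred_lik[r] is the predictive
    density of y_t under runlength r. Returns weights over r_t = 0, ..., t."""
    joint = nu_prev * pred_lik
    growth = (1.0 - hazard) * joint               # r_t = r_{t-1} + 1 (no change)
    reset = hazard * jnp.sum(joint)               # r_t = 0 (changepoint)
    nu = jnp.concatenate([jnp.array([reset]), growth])
    return nu / jnp.sum(nu)                       # normalise over hypotheses
```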


(A.2) A loss-based approach

Suppose $\psi_t = \alpha_t \in \{1, \ldots, K\}$, take $$ \nu_t(\alpha_t) = 1 - \frac{\ell(y_{t+1}, \hat{y}_{t+1}^{(\alpha_t)})} {\sum_{k=1}^K \ell(y_{t+1}, \hat{y}_{t+1}^{(k)})}, $$ with $\ell$ a loss function (lower is better).
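
A sketch of this weighting with squared error as the (illustrative) loss; the weights are renormalised here so they sum to one, a small deviation from the formula above.

```python
# (A.2) loss-based weights over K experts.
import jax.numpy as jnp

def loss_weights(y, y_hats):
    """y_hats[k] is expert k's one-step-ahead prediction."""
    losses = (y - y_hats) ** 2          # lower is better
    nu = 1.0 - losses / jnp.sum(losses)
    return nu / jnp.sum(nu)             # renormalise to a probability vector

print(loss_weights(1.0, jnp.array([0.9, 0.4, 0.1])))  # best expert, biggest weight
```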


(A.2) An empirical-Bayes (point-estimate) approach

For $\psi_t = \upsilon_t \in [0,1]$,

$$ \upsilon_t^* = \argmax_{\upsilon \in [0,1]} p(y_t \cond x_t, \upsilon, \data_{1:t-1}). $$ Then, $$ \nu(\upsilon_t) = \delta(\upsilon_t = \upsilon_t^*). $$
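
A sketch of the point-estimate choice by grid search; `log_pred` is a placeholder for the one-step-ahead log predictive density that (A.1) makes tractable.

```python
# (A.2) empirical-Bayes point estimate of upsilon by grid search.
import jax
import jax.numpy as jnp

def empirical_bayes_upsilon(log_pred, n_grid=21):
    grid = jnp.linspace(0.0, 1.0, n_grid)   # candidate values of upsilon
    scores = jax.vmap(log_pred)(grid)       # log p(y_t | x_t, upsilon, D_{1:t-1})
    return grid[jnp.argmax(scores)]         # place a delta mass at the maximiser

# usage with a toy, hypothetical predictive peaked near upsilon = 0.8
print(empirical_bayes_upsilon(lambda u: -((u - 0.8) ** 2)))
```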


BONE — Bayesian online learning in non-stationary environments

  • (M.1) A model for observations (conditioned on features $x_t$) — $h(\theta, x_t)$.
  • (M.2) An auxiliary variable for regime changes — $\psi_t$.
  • (M.3) A model for prior beliefs (conditioned on $\psi_t$ and data $\data_{1:t-1}$) — $\pi(\theta \cond \psi_t, \data_{1:t-1})$.
  • (A.1) An algorithm to weight over choices of $\theta$ (conditioned on data $\data_{1:t}$) — $q(\theta;\,\psi_t, \data_{1:t}) \propto \pi(\theta \cond \psi_t, \data_{1:t-1}) p(y_t \cond \theta, x_t)$.
  • (A.2) An algorithm to weight over choices of $\psi_t$ (conditioned on data $\data_{1:t}$).

The BONE framework is motivated by the following state-space model

Given a measurement model (M.1) and an algorithm to update model parameters (A.1), specify

  1. auxiliary variable to track regimes (M.2),
  2. how to modify beliefs based on the auxiliary variable (M.3), and
  3. probability (weights) over possible regimes (A.2).

Inference in BONE can then be seen as a generalised form of filtering in the state-space model below.

[Figure: BONE state-space model]

BONE (generalised) posterior predictive

$$ \begin{aligned} \hat{\vy}_{t+1} &:= \sum_{\psi_t \in \Psi_t} \underbrace{\nu_t(\psi_t)}_{\text{(A.2: weight)}}\, \int \underbrace{h(\theta_t, \vx_{t+1})}_{\text{(M.1: model)}}\, \underbrace{q(\theta_t;\, \psi_t, \data_{1:t})}_{\text{(A.1: posterior)}} d\theta_t, \end{aligned} $$ with $$ q(\theta_t;\,\psi_t, \data_{1:t}) \propto \underbrace{\pi(\theta_t;\, \psi_t, \data_{1:t-1})}_\text{(M.3: prior)}\, \underbrace{\exp(-\ell(\vy_t; \theta_t, \vx_t))}_\text{(M.1: loss)} $$
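
A sketch of this combination for the moment-matched LG case, where the plug-in prediction $h(\mu, x)$ equals each hypothesis's posterior predictive mean; the logistic measurement function and shapes are illustrative.

```python
# BONE predictive: an (A.2)-weighted mixture of per-hypothesis predictions.
import jax
import jax.numpy as jnp

def bone_predict(h, mus, nus, x):
    """mus[i]: posterior mean under hypothesis psi_i; nus[i]: its (A.2) weight."""
    per_hyp = jax.vmap(lambda mu: h(mu, x))(mus)   # (M.1)/(A.1) predictions
    return jnp.sum(nus * per_hyp)                  # weighted mixture

h = lambda theta, x: jax.nn.sigmoid(theta @ x)     # stand-in measurement model
mus = jnp.zeros((3, 2))                            # three hypotheses
nus = jnp.array([0.5, 0.3, 0.2])                   # (A.2) weights
y_hat = bone_predict(h, mus, nus, jnp.array([1.0, -1.0]))
```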


Back to the non-stationary moons example

Suppose the measurement model (M.1) is a two-hidden-layer neural network with the linearised moment-matched Gaussian update (A.1).

Consider three combinations of (M.2), (M.3), and (A.2):

  1. Runlength with prior reset and a single hypothesis — RL[1]-PR.
  2. Changepoint probability with OU dynamics — CPP-OU.
  3. Constant auxiliary variable with additive covariance inflation — C-ACI.

RL[1]-PR

  • When a changepoint is detected: reset back to the initial weights.
  • Auxiliary variable (M.2) proposed in Adams and MacKay, 2007 — BOCD. $$ \begin{aligned} g_t(r_t, \data_{1:t-1}) &= \mu_0\,\mathbb{1}(r_t = 0) + \mu_{(r_{t-1})}\mathbb{1}(r_t > 0),\\ G_t(r_t, \data_{1:t-1}) &= \Sigma_0\,\mathbb{1}(r_t = 0) + \Sigma_{(r_{t-1})}\mathbb{1}(r_t > 0).\\ \end{aligned} $$

[Figure: RL[1]-PR on sequential classification]

CPP-OU

  • Revert to the prior in proportion to the probability of a changepoint.
  • Auxiliary variable (M.2) proposed in Titsias et al., 2023 for online continual learning (OCL) in classification. $$ \begin{aligned} g_t(\upsilon_t, \data_{1:t-1}) &= \upsilon_t \mu_{t-1} + (1 - \upsilon_t) \mu_0 \,,\\ G_t(\upsilon_t, \data_{1:t-1}) &= \upsilon_t^2 \Sigma_{t-1} + (1 - \upsilon_t^2) \Sigma_0\,. \end{aligned} $$

[Figure: CPP-OU on sequential classification]

C-ACI

  • Constantly forget past information.
  • Used in Chang et al., 2023 for scalable non-stationary online learning. $$ \begin{aligned} g_t(c, \data_{1:t-1}) &= \mu_{t-1},\\ G_t(c, \data_{1:t-1}) &= \Sigma_{t-1} + Q_t.\\ \end{aligned} $$

[Figure: C-ACI on sequential classification]

Creating a new method: RL-OUPR

  • Combines robustness to both gradual and abrupt changes.
  • Single runlength (RL[1]) as the choice of auxiliary variable (M.2).

RL[1]-OUPR — choice of (M.2) and (M.3)

  • Reset if the probability of staying in the current regime falls below some threshold $\varepsilon$.
  • OU-like reversion rate otherwise.
$$ g_t(r_t, \data_{1:t-1}) = \begin{cases} \mu_0\,(1 - \nu_t(r_t)) + \mu_{(r_t)}\,\nu_t(r_t) & \nu_t(r_t) > \varepsilon,\\ \mu_0 & \nu_t(r_t) \leq \varepsilon, \end{cases} $$ $$ G_t(r_t, \data_{1:t-1}) = \begin{cases} \Sigma_0\,(1 - \nu_t(r_t)^2) + \Sigma_{(r_t)}\,\nu_t(r_t)^2 & \nu_t(r_t) > \varepsilon,\\ \Sigma_0 & \nu_t(r_t) \leq \varepsilon. \end{cases} $$

RL[1]-OUPR — choice of (A.2)

  • Posterior predictive ratio test

$$ \nu_t(r_t^{(1)}) = \frac{p(y_t \cond r_t^{(1)}, x_t, \data_{1:t-1})\,(1 - \pi)} {p(y_t \cond r_t^{(0)}, x_t, \data_{1:t-1})\,\pi + p(y_t \cond r_t^{(1)}, x_t, \data_{1:t-1})\,(1-\pi)}. $$ Here, $r_{t}^{(1)} = r_{t-1} + 1$, $r_{t}^{(0)} = 0$, and $\pi$ is the prior probability of a changepoint.
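
A sketch of the RL[1]-OUPR step: the ratio test above, followed by the reset-or-forget conditional prior; `p1` and `p0` stand for the posterior predictive densities under continuation and changepoint, and the default values are illustrative.

```python
# RL[1]-OUPR: (A.2) ratio test and (M.3) reset-or-forget conditional prior.
import jax.numpy as jnp

def oupr_weight(p1, p0, pi_cp=0.01):
    """nu_t for the 'continue' hypothesis; pi_cp = prior changepoint probability."""
    return p1 * (1.0 - pi_cp) / (p0 * pi_cp + p1 * (1.0 - pi_cp))

def oupr_prior(nu, mu_r, Sigma_r, mu0, Sigma0, eps=0.1):
    """Forget (OU-style) when nu > eps; hard reset to the prior otherwise."""
    reset = nu <= eps
    mu = jnp.where(reset, mu0, nu * mu_r + (1.0 - nu) * mu0)
    Sigma = jnp.where(reset, Sigma0, nu**2 * Sigma_r + (1.0 - nu**2) * Sigma0)
    return mu, Sigma
```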


RL[1]-OUPR

  • A novel combination of (M.2) and (M.3) — reset or forget
    • Revert to prior proportional to probability of changepoint (forget).
    • Reset completely if prior is the most likely hypothesis (reset).
[Figure: RL[1]-OUPR on sequential classification]

Sequential classification — comparison

In-sample hyperparameter optimisation.

[Figure: comparison of methods on sequential classification]

Unified view of examples in the literature

[Figure: examples of BONE methods in the literature]

Experiment: hourly electricity load

Features:

  • pressure (kPa), cloud cover (%), humidity (%), temperature (°C), wind direction (deg), and wind speed (km/h).

One target variable:

  • hour-ahead electricity load (kW).

Features are lagged by one hour.


Experiment: hourly electricity load

[Figure: day-ahead electricity forecasting]

Electricity forecasting during normal times

[Figure: day-ahead forecasting, normal period]

Electricity forecasting before and after Covid shock

[Figure: day-ahead forecasting, before and after the Covid shock]


Electricity forecasting results (2018-2020)

[Figure: day-ahead forecasting results]

Experiment — heavy-tailed linear regression

[Figure: heavy-tailed linear regression panel]

Heavy-tailed linear regression (sample run)

[Figure: heavy-tailed linear regression, sample run]

Even more experiments!

  • Bandits.
  • Online continual learning.
  • Segmentation and prediction with dependence between changepoints.

See paper for details.


Conclusion

We introduce a framework for Bayesian online learning in non-stationary environments (BONE). BONE methods are written as instances of:

Three modelling choices

  • (M.1) Measurement model
  • (M.2) Auxiliary variable
  • (M.3) Conditional prior

Two algorithmic choices

  • (A.1) Weighting over model parameters (posterior computation)
  • (A.2) Weighting over choices of the auxiliary variable