BONE

A unifying framework for generalised Bayesian online learning in non-stationary environments

G. Duran-Martin\({}^{a,b}\), A. Y. Shestopaloff\({}^{a}\), L. Sanchez-Betancourt\({}^{b,c}\), K. Murphy\({}^{d}\).
\({}^{a}\)Queen Mary University of London, \({}^{b}\)Oxford-Man Institute of Quantitative Finance, \({}^{c}\)University of Oxford, \({}^{d}\)Google DeepMind.
TMLR.

TL;DR

The BONE framework enables both the classification of existing online learning algorithms for non-stationary environments and the development of new ones.

Our framework encompasses methods in the filtering literature (e.g., Kalman filter), the segmentation literature (e.g., Bayesian online changepoint detection), the continual learning literature (e.g., variational continual learning), the contextual bandit literature (e.g., Bayesian neural bandits), and the forecasting literature (e.g., Bayesian ensemble models).

State-space model (SSM) induced by BONE.

BONE components

The expected posterior predictive under BONE is $$ \begin{aligned} &\hat{\bf y}_t = \sum_{\psi_t \in \Psi_t} \nu(\psi_t, {\cal D}_{1:t})\, \int q(\theta; \psi_t, {\cal D}_{1:t})\, h(\theta, {\bf x}_t)\,d\theta,\\ &\\ &q(\theta; \psi_t, {\cal D}_{1:t}) \propto \exp(-\ell_t(\theta))\,\pi(\theta; \psi_t, {\cal D}_{1:t-1}). \end{aligned} $$
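For a linear-Gaussian model, the inner integral reduces to \(\mu_{\psi_t}^\intercal {\bf x}_t\), so the predictive mean is a weighted sum of per-hypothesis predictions. A minimal numpy sketch with illustrative numbers (the weights, means, and features below are made up, not from the paper):

```python
import numpy as np

# Three auxiliary-variable hypotheses (e.g. runlengths), each with a Gaussian
# posterior over theta and a weight nu(psi_t, D_{1:t}).
# For a linear model h(theta, x) = theta @ x, the inner integral is mu_psi @ x.
nu = np.array([0.2, 0.5, 0.3])          # hypothesis weights, sum to 1
mus = np.array([[1.0, 0.0],             # posterior mean under each hypothesis
                [0.8, 0.2],
                [0.5, 0.5]])
x_t = np.array([2.0, 1.0])

# Expected posterior predictive: weighted sum of per-hypothesis predictions.
y_hat = nu @ (mus @ x_t)
print(y_hat)
```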

(M.1) Model for observations

Represents the function that maps the features \({\bf x}_t\) to the observations \({\bf y}_t\). The choice depends on the nature of the data and the environment.

Simple linear models $$ \begin{aligned} &\text{(Regression)}\\ &h(\theta, {\bf x}_t) = \theta^\intercal\,{\bf x}_t\\ &\\ &\text{(Classification)}\\ &h(\theta, {\bf x}_t) = \sigma(\theta^\intercal\,{\bf x}_t) \end{aligned} $$
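The two linear measurement models above can be sketched in a few lines (function names are illustrative):

```python
import numpy as np

def h_regression(theta, x):
    # Linear regression: h(theta, x) = theta^T x
    return theta @ x

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def h_classification(theta, x):
    # Linear classification: h(theta, x) = sigma(theta^T x)
    return sigmoid(theta @ x)

theta = np.array([0.5, -0.25])
x_t = np.array([2.0, 4.0])
print(h_regression(theta, x_t))      # 0.0
print(h_classification(theta, x_t))  # sigmoid(0) = 0.5
```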

Extensions to complex models

  1. Neural networks: For non-linear relationships and complex patterns (e.g., CIFAR-100 dataset).
  2. Time-series models: to model temporal dependencies (e.g., LSTM, ARIMA).
Choice of measurement model \(h(\theta, {\bf x}_t)\) in BONE.

(M.2) Choice of auxiliary variable

Encodes our belief about what a "regime" is.
| name | values | cardinality |
| --- | --- | --- |
| C | \(\{c\}\) | \(1\) |
| CPT | \(2^{\{0, 1, \ldots, t\}}\) | \(2^{t+1}\) |
| CPP | \([0,1]\) | \(\infty\) |
| CPL | \(\{0,1\}^t\) | \(2^t\) |
| CPV | \((0,1)^t\) | \(\infty\) |
| ME | \(\{1, \ldots, K\}\) | \(K\) |
| RL | \(\{0, 1, \ldots, t\}\) | \(t+1\) |
| RLCC | \(\{(0, t), \ldots, (t, 0)\}\) | \(2 + t(t+1)/2\) |

C is constant, CPT is changepoint timestep, CPP is changepoint probability, CPL is changepoint location, CPV is changepoint probability vector, ME is mixture of experts, RL is runlength, RLCC is runlength with changepoint count.

Choice of auxiliary variable \(\psi_t\).
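Two of the discrete hypothesis spaces from the table can be enumerated directly for a small \(t\); note that \(\{0, \ldots, t\}\) contains \(t + 1\) elements (illustrative check, not from the paper):

```python
# Hypothesis-space sizes for two discrete auxiliary variables, at t = 5.
t = 5
RL = list(range(0, t + 1))   # runlength hypotheses {0, 1, ..., t}
CPL = 2 ** t                 # binary changepoint-location vectors {0,1}^t
print(len(RL), CPL)          # 6 32
```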

(M.3) Model for conditional priors

Modifies the prior beliefs about the model parameters based on past data and the auxiliary variable.

In the Gaussian case, the conditional prior modifies the mean and covariance $$ \begin{aligned} &\pi(\theta_t;\, \psi_t,\, {\cal D}_{1:t-1}) \\ &={\cal N}\big(\theta_t\vert g_{t}(\psi_t, {\cal D}_{1:t-1}), G_{t}(\psi_t, {\cal D}_{1:t-1})\big). \end{aligned} $$

Examples

C-static: constant auxvar with static updates. $$ \begin{aligned} &g_{t}(\psi_t, {\cal D}_{1:t-1}) = \mu_{t-1},\\ &G_{t}(\psi_t, {\cal D}_{1:t-1}) = \Sigma_{t-1}. \end{aligned} $$
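Under a linear-Gaussian observation model, C-static reduces to a standard recursive Bayesian (Kalman-style) update in which the prior at time \(t\) is simply the previous posterior. A minimal sketch (the observation noise \(R\) and the data are illustrative assumptions):

```python
import numpy as np

def static_update(mu, Sigma, x, y, R=1.0):
    """One recursive Bayesian linear-regression step.

    C-static choice: the conditional prior at time t is the previous
    posterior, g_t = mu_{t-1}, G_t = Sigma_{t-1}.
    """
    S = x @ Sigma @ x + R                    # predictive variance of y_t
    K = Sigma @ x / S                        # gain
    mu_new = mu + K * (y - x @ mu)           # posterior mean
    Sigma_new = Sigma - np.outer(K, x) @ Sigma
    return mu_new, Sigma_new

mu, Sigma = np.zeros(2), np.eye(2)           # initial beliefs
mu, Sigma = static_update(mu, Sigma, np.array([1.0, 0.0]), 2.0)
print(mu)
```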


RL-PR: runlength with prior reset $$ \begin{aligned} &g_{t}(\psi_t, {\cal D}_{1:t-1}) = \mu_{0}{\mathbb 1}(\psi_t = 0) + \mu_{(\psi_{t-1})}{\mathbb 1}(\psi_t > 0),\\ &G_{t}(\psi_t, {\cal D}_{1:t-1}) = \Sigma_{0}{\mathbb 1}(\psi_t = 0) + \Sigma_{(\psi_{t-1})}{\mathbb 1}(\psi_t > 0). \end{aligned} $$ The subscript \((\psi_{t-1})\) denotes the posterior mean and covariance computed under the previous runlength hypothesis \(\psi_{t-1}\), i.e., from the observations since the last changepoint.
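The reset logic of RL-PR can be sketched directly; here `mu_run`, `Sigma_run` stand in for the runlength posterior \(\mu_{(\psi_{t-1})}, \Sigma_{(\psi_{t-1})}\) (names and numbers are illustrative):

```python
import numpy as np

mu0, Sigma0 = np.zeros(2), np.eye(2)   # initial prior beliefs

def rl_pr_prior(psi_t, mu_run, Sigma_run):
    # Changepoint hypothesis (psi_t = 0): reset to the initial prior.
    if psi_t == 0:
        return mu0, Sigma0
    # Otherwise keep the posterior accumulated over the current runlength.
    return mu_run, Sigma_run

mu, Sigma = rl_pr_prior(0, np.array([3.0, -1.0]), 2 * np.eye(2))
print(mu)   # reset back to mu0
```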


RL-OUPR: Runlength with OU prior discount or prior reset $$ g_{t}(\psi_t, {\cal D}_{1:t-1}) = \begin{cases} \mu_0\,(1 - \nu_t(\psi_t)) + \mu_{(\psi_{t-1})}\,\nu_t(\psi_t) & \text{if } \psi_t > 0,\\ \mu_0 & \text{if } \psi_t = 0. \end{cases} $$
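The case structure above is a convex combination of the initial mean and the runlength posterior mean, weighted by \(\nu_t(\psi_t) \in [0,1]\); a one-function sketch (names and numbers are illustrative):

```python
import numpy as np

def rl_oupr_mean(psi_t, nu, mu0, mu_run):
    # psi_t = 0: hard reset to the initial mean.
    if psi_t == 0:
        return mu0
    # psi_t > 0: interpolate between initial mean and runlength posterior mean.
    return mu0 * (1.0 - nu) + mu_run * nu

mu0 = np.zeros(2)
mu_run = np.array([2.0, 4.0])
print(rl_oupr_mean(3, 0.5, mu0, mu_run))   # halfway: [1. 2.]
```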

Choice of conditional prior \(\pi\) in BONE.

(A.1 / A.2) Algorithm for posterior update for parameters \(\theta\) and auxiliary variables \(\psi_t\)

$$ \begin{aligned} q(\theta; \psi_t, {\cal D}_{1:t}) & \text{ posterior update}\\ \nu(\psi_t, {\cal D}_{1:t}) & \text{ weight update} \end{aligned} $$
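For runlength auxiliary variables, the weight update \(\nu(\psi_t, {\cal D}_{1:t})\) takes the familiar Bayesian-online-changepoint-detection form: each runlength hypothesis either grows by one or resets to zero. A numpy sketch with a constant hazard rate and made-up per-hypothesis predictive likelihoods (all numbers are illustrative):

```python
import numpy as np

hazard = 0.1                               # constant changepoint probability
nu_prev = np.array([0.6, 0.3, 0.1])        # weights over runlengths 0..2 at t-1
lik = np.array([0.5, 0.2, 0.8, 0.4])       # p(y_t | psi_t, D) for runlengths 0..3

# Growth: each runlength r at t-1 becomes r+1 at t with no changepoint.
growth = nu_prev * (1 - hazard) * lik[1:]
# Change: all hypotheses can jump to runlength 0 with probability `hazard`.
change = np.sum(nu_prev * hazard) * lik[0]

nu_new = np.concatenate([[change], growth])
nu_new /= nu_new.sum()                     # normalise to a distribution
print(nu_new)
```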

Examples in the literature

Examples of BONE in the literature.

Experiment: Forecasting under distribution shift

We evaluate members of the BONE framework on electricity load forecasting during the Covid-19 pandemic. Our choice of measurement model \(h\) (M.1) is a two-hidden-layer multilayer perceptron (MLP) with four units per layer and ReLU activation functions. Features are lagged by one hour.
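A sketch of that measurement model in plain numpy (the weights here are random and the feature dimension is an illustrative assumption; in BONE these weights are the parameters \(\theta\) updated online):

```python
import numpy as np

rng = np.random.default_rng(0)

def mlp(theta, x):
    """Two-hidden-layer MLP with 4 units per layer and ReLU activations."""
    W1, b1, W2, b2, w3, b3 = theta
    h1 = np.maximum(0.0, W1 @ x + b1)    # hidden layer 1 (ReLU)
    h2 = np.maximum(0.0, W2 @ h1 + b2)   # hidden layer 2 (ReLU)
    return w3 @ h2 + b3                  # scalar load forecast

d = 3                                    # number of lagged features (illustrative)
theta = (rng.normal(size=(4, d)), np.zeros(4),
         rng.normal(size=(4, 4)), np.zeros(4),
         rng.normal(size=4), 0.0)
y_hat = mlp(theta, rng.normal(size=d))
print(float(y_hat))
```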

Forecasting under distribution shift: changes in the output \({\bf y}_t\) (electricity load) are not reflected in the input \({\bf x}_t\) (weather features).

Results

Top: Forecasting results. Bottom: 5-day mean absolute error (MAE).