Filtering notes (III): the Kalman filter
Part III of the filtering notes series:
- Part I: signal plus noise models
- Part II: state-space models
- Part III: the Kalman filter
Introduction
In this post, we derive the Kalman filter (KF) algorithm and explore how it can be applied to time series forecasting and sequential (online) learning. As we will see, the KF algorithm arises naturally from the covariance properties of state-space models discussed in the previous post.
The setup
Consider the linear state-space model $$ \tag{SSM.1} \begin{aligned} \Theta_t &= \vF_{t-1}\,\Theta_{t-1} + U_{t-1},\\ Y_t &= \vH_t\,\Theta_t + E_t. \end{aligned} $$
Here, $Y_t \in \reals^o$ is the observation (or measurement) random vector; $\Theta_\tau \in \reals^d$ is the latent (unobserved) state; $U_\tau$ is the zero-mean dynamics noise process with known $(d\times d)$ covariance matrix $\var(U_\tau) = Q_\tau$; $E_t$ is the zero-mean observation noise process with known $(o\times o)$ covariance matrix $\var(E_t) = R_t$; $\vF_\tau$ is the known $(d \times d)$ transition matrix; and $\vH_t$ is the known $(o\times d)$ projection matrix. Here, $\tau \geq 0$ indexes state-side quantities and $t \geq 1$ indexes measurement-side quantities.
Throughout this post, we denote by $Y_t$ the measurement random vector at time $t$, and by $y_t$ a sample of that random vector at time $t$. We write the sequence of random vectors $Y_1, \ldots, Y_t$ as the block-column vector $Y_{1:t} \in \reals^{t ~ o}$ and the sequence of observed samples as the block-column vector $y_{1:t} \in \reals^{t ~ o}$.
Given the sample of measurements $y_{1:k}$, for some $k \geq 1$, which arrive in a stream, our goal is (i) to estimate the value of $\Theta_t$ given the observations $y_{1:k}$, and (ii) to estimate the expected error in that estimate, i.e., to quantify how far off we expect our estimate of the latent process to be from the true latent process $\Theta_t$.
Here, our estimate of the latent state $\Theta_t$, given the observations $y_{1:k}$, is obtained through the linear mapping $\vA_{t|k} \in \reals^{d \times (k ~ o)}$: $$ \Theta_{t|k} = \vA_{t|k}~y_{1:k}, $$ where the choice of linear mapping is the best linear unbiased estimate 1 $$ \vA_{t|k} = \argmin_{\vA} \mathbb{E}[\| \Theta_t - \vA~Y_{1:k} \|_2^2] = \cov(\Theta_t, Y_{1:k})\,\var(Y_{1:k})^{-1}. $$
Next, our estimate of the expected error, denoted $\Sigma_{t|k}$, is given by $$ \Sigma_{t|k} = \var(\Theta_t - \Theta_{t|k}). $$
In words, $\Theta_{t|k}$ is our estimate of the latent state at time $t$ having seen $y_{1:k}$; and $\Sigma_{t|k}$ is the uncertainty about the value of the true latent state $\Theta_t$ given the estimate $\Theta_{t|k}$. We call $\Theta_{t|k}$ the best linear unbiased predictor (BLUP) and $\Sigma_{t|k}$ the error variance-covariance matrix (EVC).
In this post, we are interested in predicting the one-step-ahead latent state and its expected error $(\Theta_{t|t-1}, \Sigma_{t|t-1})$, as well as the updated estimate of the latent state and its expected error $(\Theta_{t|t}, \Sigma_{t|t})$.
We write our quantities of interest in pseudocode as $$ \boxed{ \begin{aligned} &\text{// Predict step}\\ &\Theta_{t|t-1} = \vA_{t|t-1}~y_{1:t-1},\\ &\Sigma_{t|t-1} = \var(\Theta_t - \Theta_{t|t-1}),\\ &\text{// Update step}\\ &\Theta_{t|t} = \vA_{t|t}~y_{1:t},\\ &\Sigma_{t|t} = \var(\Theta_t - \Theta_{t|t}). \end{aligned} } $$ As we will see, the KF algorithm is a recursive and efficient way to estimate the sequence of predicted and updated pairs. $$ (\Theta_{0|0}, \Sigma_{0|0}) \xrightarrow{\text{predict}} (\Theta_{1|0}, \Sigma_{1|0}) \xrightarrow{\text{update}} (\Theta_{1|1}, \Sigma_{1|1}) \xrightarrow{\text{predict}} (\Theta_{2|1}, \Sigma_{2|1}) \xrightarrow{\text{update}} \cdots $$
The Kalman filter algorithm
Assuming that $\Theta_{0|0}$ and $\Sigma_{0|0}$ are known, and that the measurements $y_{1:t}$ arrive in a sequence, the KF algorithm first makes a predict step that computes the pair $(\Theta_{t|t-1}, \Sigma_{t|t-1})$, and then makes an update step that computes the pair $(\Theta_{t|t}, \Sigma_{t|t})$.
We formalise the predict and update steps in the following propositions. Below, Proposition 1 shows how to compute $(\Theta_{t|t-1}, \Sigma_{t|t-1})$, Proposition 2 shows how to compute $\Theta_{t|t}$, and Proposition 3 shows how to compute $\Sigma_{t|t}$.
Proposition 1: The Kalman filter prediction
Let $\Sigma_{1|0} = \var(\Theta_1)$. The predicted BLUP and EVC take the form
$$ \begin{aligned} \Theta_{t|t-1} &= \vF_{t-1}\,\Theta_{t-1|t-1},\\ \Sigma_{t|t-1} &= \vF_{t-1}\,\Sigma_{t-1|t-1}\,\vF_{t-1}^\intercal + Q_{t-1}. \end{aligned} $$
For a proof, see the Appendix.
Proposition 2: The Kalman filter mean update
Let $(\Theta_{t|t-1}, \Sigma_{t|t-1})$ be the predicted BLUP and predicted EVC respectively, $\vH_t$ be the projection matrix, $R_t$ be the observation covariance, and $y_t$ be the observation at time $t$. The updated BLUP $\Theta_{t|t}$ can be written in one of two ways $$ \Theta_{t|t} = \begin{cases} \Theta_{t|t-1} + \vK_t\,\cal{E}_t,\\ \vK_t\,y_t + (\vI - \vK_t\,\vH_t)\,\Theta_{t|t-1},\\ \end{cases} $$ where $$ \begin{aligned} {\cal E}_t &= y_t - \vH_t\,\Theta_{t|t-1},\\ \vK_t &= \Sigma_{t|t-1}\vH_t^\intercal\,S_t^{-1},\\ S_t &= \vH_t\,\Sigma_{t|t-1}\,\vH_t^\intercal + R_t. \end{aligned} $$ Here, ${\cal E}_t$ is the innovation at time $t$, $\vK_t$ the Kalman gain matrix, and $S_t$ is the variance of the innovation at time $t$.
For a proof, see the Appendix.
Proposition 3: The Kalman filter covariance update
Let $\Sigma_{t|t-1}$ be the predicted EVC. The updated EVC matrix $\Sigma_{t|t}$ can be written in one of three ways: $$ \Sigma_{t|t} = \begin{cases} \left[\vI - \vK_t\,\vH_t\right]\,\Sigma_{t|t-1}\,\left[\vI - \vK_t\,\vH_t\right]^\intercal + \vK_t R_t\,\vK_t^\intercal,\\ \Sigma_{t|t-1} - \Sigma_{t|t-1}\,\vH_t^\intercal\,S_t^{-1}\,\vH_t\,\Sigma_{t|t-1},\\ \Sigma_{t|t-1} - \vK_t\,S_t\,\vK_t^\intercal. \end{cases} $$
For a proof, see the Appendix.
The algorithm
Propositions 1 - 3 encapsulate the KF algorithm. We show the KF pseudoalgorithm below.
Take $\Sigma_{1|0} = \var(\Theta_1) = \vF_0~\Sigma_{0|0}~\vF_0^\intercal + Q_0$ and assume $\Theta_{0|0}$ is known. Then, for $t=1, \ldots, T$, a step of the KF algorithm is given by
$$ \boxed{ \begin{array}{l|l} \text{// Predict step}\\ \Theta_{t|t-1} = \vF_{t-1}\,\Theta_{t-1|t-1} & \cov(\Theta_t, Y_{1:t-1})\,\var(Y_{1:t-1})^{-1}\,y_{1:t-1}\\ \Sigma_{t|t-1} = \vF_{t-1}\,\Sigma_{t-1|t-1}\,\vF_{t-1}^\intercal + Q_{t-1} & \var(\Theta_t - \Theta_{t|t-1})\\ Y_{t|t-1} = \vH_t~\Theta_{t|t-1} & \cov(Y_t, Y_{1:t-1})\,\var(Y_{1:t-1})^{-1}\,y_{1:t-1}\\ \text{// Gain and innovation} \\ S_t = \vH_t\,\Sigma_{t|t-1}\,\vH_t^\intercal + R_t & \var({\cal E}_t)\\ \vK_t = \Sigma_{t|t-1}\vH_t^\intercal\,S_t^{-1} & \cov(\Theta_t, {\cal E}_t)\,\var({\cal E}_t)^{-1} \\ {\cal E}_t = y_t - Y_{t|t-1}\\ \text{// Update step}\\ \Theta_{t|t} = \Theta_{t|t-1} + \vK_t\,{\cal E}_t & \cov(\Theta_t, Y_{1:t})\,\var(Y_{1:t})^{-1}\,y_{1:t}\\ \Sigma_{t|t} = \Sigma_{t|t-1} - \vK_t\,S_t\,\vK_t^\intercal & \var(\Theta_t - \Theta_{t|t}) \end{array} } $$ In the pseudocode above, the term $Y_{t|t-1}$ denotes the one-step-ahead forecast of the measurement $y_t$ given the observations $y_{1:t-1}$.
Python implementation
The pseudoalgorithm above can be readily implemented in Python.
Here, we define a jax.lax.scan-compatible function, which we use in the experiments below.
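Below is a minimal sketch of such a scan-compatible step under the conventions above; the exact signature, the initialisation, and the keys of the out dictionary (yhat for $Y_{t|t-1}$, y_corrected for $Y_{t|t}$) are assumptions based on how the function is used later in this post.

```python
import jax.numpy as jnp
from jax.scipy.linalg import solve

def kalman_filter_step(state, y, F, H, Q, R):
    """One KF step; scan-compatible: (carry, y_t) -> (carry, out)."""
    theta, Sigma = state
    # Predict step
    theta_pred = F @ theta
    Sigma_pred = F @ Sigma @ F.T + Q
    y_hat = H @ theta_pred                # one-step-ahead forecast Y_{t|t-1}
    # Gain and innovation
    S = H @ Sigma_pred @ H.T + R          # innovation variance S_t
    K = solve(S, H @ Sigma_pred.T).T      # Kalman gain, K_t = Sigma_pred H^T S^{-1}
    err = y - y_hat                       # innovation E_t
    # Update step
    theta = theta_pred + K @ err
    Sigma = Sigma_pred - K @ S @ K.T
    out = {"yhat": y_hat, "y_corrected": H @ theta}
    return (theta, Sigma), out
```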
The implementation above obtains the Kalman gain matrix $\vK_t$ through solve(S, H @ Sigma_pred.T). This step avoids having to invert the matrix S explicitly.
Instead, we observe that the Kalman gain can be written as $$ \vK_t = \Sigma_{t|t-1}~\vH_t^\intercal~S_t^{-1} \iff \vK_t~S_t = \Sigma_{t|t-1}\vH_t^\intercal. $$ Transposing both sides of the equation above (and using the symmetry of $S_t$ and $\Sigma_{t|t-1}$), we obtain the system of equations $$ S_t~\vK_t^\intercal = \vH_t~\Sigma_{t|t-1}, $$ which takes the form $\vA ~ \vX = \vB$, with $\vA$ an $(o\times o)$ positive-definite matrix, $\vB$ a known $(o \times d)$ matrix, and $\vX$ the unknown $(o \times d)$ matrix. Thus, we use the solve function to find $\vX = \vK_t^\intercal$.
Experiments
Here, we present two use cases for the Kalman filter: time-series forecasting and sequential (online) learning.
In the first experiment, we use the KF algorithm to perform multi-step forecasting of a moving average (MA) model. We show that if the underlying statistical model can be written as a linear SSM, then sequential forecasting can be readily done using the KF equations.
In the second experiment, we show that the flexibility of the SSM assumption allows us to tackle a harder problem: multi-step forecasting and online estimation of the coefficients of a time series model.
Experiment 1: The Kalman filter for forecasting
We consider the classical problem of time-series forecasting. See this notebook for the code we use in this experiment.
Suppose we are given observations $y_{1:T}$ sampled from some SSM process. Having seen observations up to time $t$, our goal is to make $k \geq 1$ future estimates of the observations, i.e., given $y_{1:t}$, we seek to construct the $k$-dimensional vector of forecasts $Y_{t+1:t+k|t} = (Y_{t+1|t}, \ldots, Y_{t+k|t})$, where $Y_{t+k|t} = \vH_{t+k}~\Theta_{t+k|t}$. Because $\vH_t$ is known for all $t$, a $k$-step-ahead forecast requires only the estimation of the BLUP $\Theta_{t+\ell|t}$ for $\ell=1,\ldots, k$ (we relax this assumption in Experiment 2).
Environment setup: the moving-average (MA) model
Take the observations $y_{1:T}$ sampled from an $m$-th order moving average (MA($m$)) process with known $m\geq 1$ and known SSM coefficients. As we saw in the previous post, the SSM form of an MA($m$) process is $$ \tag{MA.1} \boxed{ \begin{aligned} \Theta_t &= \vF_{\rm MA}\,\Theta_{t-1} + {\cal T}_m\,E_{t-1},\\ Y_t &= \vH_{\rm MA}\,\Theta_t + E_t. \end{aligned} } $$ In this experiment, we take the $(1\times m)$ matrix of coefficients $\vH_{\rm MA} = [1, 1, \ldots, 1]$, set $\var(E_t) = R_t = 1.0$ for all $t$, and take $\vF_{\rm MA}$ and ${\cal T}_m$ as defined in F.02, MA process.
In the code below, we specify the configuration of the MA process we use to sample the sequence of observations.
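As a stand-in for the original configuration cell, here is a rough sketch of the MA($m$) setup (the explicit shift-matrix construction inside init_ma_components is an assumption consistent with $({\rm MA.1})$; the actual helper lives in F.02):

```python
import jax
import jax.numpy as jnp

def init_ma_components(m):
    # Assumed stand-in for the helper defined in F.02: F_MA shifts the past
    # noise terms down by one position and T_m injects the newest noise term.
    F_MA = jnp.eye(m, k=-1)                      # (m x m) shift matrix
    T_m = jnp.zeros((m, 1)).at[0, 0].set(1.0)    # (m x 1) selection matrix
    return F_MA, T_m

m = 10                          # order of the MA process
R = jnp.eye(1)                  # observation noise variance, var(E_t) = 1.0
H_MA = jnp.ones((1, m))         # MA coefficients (all ones in this experiment)
F_MA, T_m = init_ma_components(m)
Q = T_m @ R @ T_m.T             # dynamics covariance, var(T_m E_{t-1})
key = jax.random.PRNGKey(314)   # PRNG key used to sample the series
```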
The function init_ma_components is defined in F.02, MA process.
A sample of an MA($m$) process with $m=10$ and $E_t \sim {\cal N}(0, R_t)$ is shown below. 2
Filtering and one-step-ahead forecasts
Consider a sample run of observations $y_{1:t}$ which are stored in a t-dimensional array y.
Then, consecutive one-step-ahead forecasts of the time-series, i.e., $Y_{1|0}, Y_{2|1}, \ldots, Y_{t|t-1}$, can be obtained by simply running the KF algorithm. This is because the innovation term in the KF is the difference between the observation $y_t$ and the one-step-ahead forecast $Y_{t|t-1} := \vH_{\rm MA}~\Theta_{t|t-1}$.
In code, we find these one-step-ahead forecasts by running kalman_filter_step over the observations and storing y_hat.
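A sketch of this run, assuming the kalman_filter_step sketched above, the configuration from the previous snippet, and an initial state $\Theta_{0|0} = \vzero$, $\Sigma_{0|0} = \vI$ (the notebook's initialisation may differ):

```python
from functools import partial
import jax.numpy as jnp
from jax import lax

# Close over the (time-invariant) system matrices and scan over the observations.
kf_step = partial(kalman_filter_step, F=F_MA, H=H_MA, Q=Q, R=R)
theta0 = jnp.zeros(m)
Sigma0 = jnp.eye(m)
_, hist = lax.scan(kf_step, (theta0, Sigma0), y[:, None])  # y has shape (t,)

y_hat = hist["yhat"]               # one-step-ahead forecasts Y_{t|t-1}
y_corrected = hist["y_corrected"]  # filtered (updated) estimates Y_{t|t}
```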
The output hist["y_corrected"]
stores the filtered (updated) estimate of the observations, i.e. $Y_{t|t}$, and
the output hist["yhat"]
stores the one-step-ahead forecast, i.e., $Y_{t|t-1}$.
The figure below shows the sampled MA($m$) process (y
) as well as the one-step-ahead forecast hist["yhat"]
.
The black lines show the distance between the forecasts and the realised values.
The takeaway from this plot is that one-step-ahead predictions are obtained as a by-product of rewriting the model as an SSM and running the KF algorithm. We can easily extend this idea to multi-step forecasting. We show this next.
Multi-step-ahead forecast
In various time-series problems, the goal is to obtain multi-step forecasts $Y_{t + k|t}$ for fixed $t$ and varying $k \geq 1$. Following Proposition 2.2 in F.02, the $k$-step-ahead estimate $\Theta_{t+k|t}$ is obtained by pre-multiplying $\Theta_{t|t}$ by the next $k$ transition matrices, i.e., $\Theta_{t+k|t} = \left(\prod_{\tau=0}^{k-1} \vF_{t+\tau}\right)\Theta_{t|t}$. Then, the $k$-step-ahead forecast $Y_{t+k|t}$ is given by pre-multiplying the predicted state $\Theta_{t+k|t}$ by $\vH_{t+k}$, i.e., $Y_{t+k|t} = \vH_{t+k}~\Theta_{t+k|t}$. 3
We specify how to carry out multiple $k$-step ahead predictions for the MA model in the code snippet below. The required inputs for the function are the state $\Theta_{t|t}$, the known transition matrix $\vF_{\rm MA}$, the known projection matrix $\vH_{\rm MA}$, and the number of forecasts $k$.
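A minimal sketch of such a function is shown below (the name forecast_observations and the use of lax.scan are illustrative; the notebook's implementation may differ):

```python
import jax.numpy as jnp
from jax import lax

def forecast_observations(theta, F, H, n_forecast):
    """Return the k-step-ahead forecasts Y_{t+k|t}, for k = 1, ..., n_forecast."""
    def step(theta, _):
        theta = F @ theta           # Theta_{t+k|t} = F Theta_{t+k-1|t}
        return theta, H @ theta     # Y_{t+k|t} = H Theta_{t+k|t}
    _, y_forecast = lax.scan(step, theta, jnp.arange(n_forecast))
    return y_forecast.squeeze()     # shape (n_forecast,)
```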
Next, a time-series of forecasted values can be obtained by modifying the out dictionary in kalman_filter_step so that it also stores n_forecast >= 1 forecasts at every step.
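For instance, under the sketch above, the modified dictionary could look as follows (the key names mirror those referenced in the plots below, and the extra n_forecast argument is an assumption):

```python
# Inside kalman_filter_step, with n_forecast passed as an extra argument:
out = {
    "yhat": y_hat,                  # one-step-ahead forecast      Y_{t|t-1}
    "y_filter": H @ theta,          # updated (filtered) estimate  Y_{t|t}
    "y_forecast": forecast_observations(theta, F, H, n_forecast),
}
```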
The following plot shows the updated estimate for $y_t$, namely $Y_{t|t}$ (hist["y_filter"]), and the $10$-step-ahead forecasts $Y_{t+k|t}$ for $k=1,\ldots, 10$ (hist["y_forecast"]). The blue lines correspond to forecasts where $Y_{t|t} > 0$ and the red lines correspond to forecasts where $Y_{t|t} < 0$. We can see a clear pattern for the MA($m$) forecast, which we can roughly summarise as follows: if $Y_{t|t}$ is above zero, predict a downward trend; if $Y_{t|t}$ is below zero, predict an upward trend.
Multi-step error forecast
Finally, we combine our forecasts $Y_{t+k|t}$ with the expected forecast error $\var(Y_{t+k} - Y_{t+k|t})$. In particular, we compute the error bound $Y_{t+k|t} \pm \sqrt{\var(Y_{t+k} - Y_{t+k|t})}$, which quantifies our uncertainty about the value of the future observation (under the assumption of symmetric errors).
The estimated forecast error $\var(Y_{t+k} - Y_{t+k|t})$ can be computed directly from the predicted EVC $\Sigma_{t+k|t}$ derived in Proposition 2.2 in F.02. This is because $$ \begin{aligned} \var(Y_{t+k} - Y_{t+k|t}) &= \var\left(\vH_{t+k}\,\Theta_{t+k} + E_{t+k} - \vH_{t+k}\,\Theta_{t+k|t}\right)\\ &= \var\left(\vH_{t+k}\,\Theta_{t+k} - \vH_{t+k}\,\Theta_{t+k|t}\right) + \var(E_{t+k})\\ &= \vH_{t+k}\,\var\left(\Theta_{t+k} - \Theta_{t+k|t}\right)\,\vH_{t+k}^\intercal + R_{t+k}\\ &=\vH_{t+k}\,\Sigma_{t+k|t}\,\vH_{t+k}^\intercal + R_{t+k}, \end{aligned} $$ where the second equality follows from property (P.2) in Proposition 2.1 in F.02. We code this result as follows.
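A minimal sketch of this computation, assuming the time-invariant $\vH_{\rm MA}$, $Q$, and $R$ of the MA setup (the function name is illustrative):

```python
import jax.numpy as jnp
from jax import lax

def forecast_error_variance(Sigma, F, H, Q, R, n_forecast):
    """Return var(Y_{t+k} - Y_{t+k|t}) = H Sigma_{t+k|t} H^T + R for k = 1, ..., n_forecast."""
    def step(Sigma, _):
        Sigma = F @ Sigma @ F.T + Q          # Sigma_{t+k|t}, rolled forward one step
        return Sigma, H @ Sigma @ H.T + R    # forecast error variance at step k
    _, var_forecast = lax.scan(step, Sigma, jnp.arange(n_forecast))
    return var_forecast.squeeze()

# Two-standard-deviation error bounds around the forecasts:
# y_forecast +/- 2 * jnp.sqrt(forecast_error_variance(Sigma, F_MA, H_MA, Q, R, n_forecast))
```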
The following figure shows the updated (filtered) steps $Y_{t|t}$ in black, the forecasted observations $Y_{t+k|t}$ in red, their error bounds at two standard deviations as the red shaded area, and the true future observations $Y_{t+k}$ in blue.
Experiment 2: Kalman filtering for (online) learning and forecasting
The previous experiment showed that knowing the coefficients of an MA process ($\vH_{\rm MA}$) enables efficient sequential forecasting of the time-series. In real-life scenarios, however, we rarely (if ever) get to observe the true coefficients of an MA process. Consequently, they must be estimated. See this notebook for the code we use in this experiment.
The classical approach to estimating the coefficients of traditional time series models, such as the MA model (assuming that the noise terms $E_t$ are zero-mean Gaussians), is through maximum-likelihood estimation: given some training data of past observations $y_{1:t}$, we aim to find the matrix $\vH$ that maximises the likelihood of the data having been generated by $\vH$. 4 5
In this experiment, however, we adopt an online approach to estimating $\vH_{\rm MA}$. Because the observations $y_{1:T}$ arrive sequentially (one datapoint at a time), we continuously update our estimate as new data streams in.
As we show below, sequential estimation of the matrix $\vH_{\rm MA}$ and sequential forecasting of $Y_{t+k}$ can be achieved jointly if we introduce an additional latent variable $\vH_t$ that evolves over time. As a consequence, we obtain an extended SSM, over which we can filter to obtain estimates $(\Theta_{t+k|t}, \vH_{t+k|t}, Y_{t+k|t})$ for fixed $t$ and varying $k\geq 1$.
This mode of learning via an extended SSM allows us to process time series data in real time and adapt to shifts in the data-generating process.6 While online learning of ARMA-style models has been studied in previous works, 7 8 the method here follows a simplified approach based on a linearisation technique known as the extended Kalman filter.9
Problem setup
Suppose we are given a sequence of observations $y_{1:T}$, which we are told is sampled from an MA($m$) process with known $m$, but unknown coefficients $\vH_{\rm MA} = [a_1, \ldots, a_m]$. Our task is to estimate the future observations $Y_{t+k|t}$ given $y_{1:t}$. Because we do not know the coefficients of the MA($m$) process, computation of future observations $Y_{t+k|t}$ requires the estimation of (i) the latent vector $\Theta_{t}$, as well as (ii) the coefficients (or parameters) $\vH_{\rm MA}$.
A practical approach to handling SSMs with unknown system components is to expand the SSM by incorporating those unknowns into the latent process. In our case, for example, the unknown component is the vector of coefficients $\vH_{\rm MA}$. Thus, we expand $({\rm MA.1})$ to include an $m$-dimensional vector $\vH_t$ that follows a random walk, and which represents the evolution of the coefficients. Under these considerations, the expanded SSM takes the form $$ \tag{MA.2} \boxed{ \begin{aligned} \vH_t &= \vH_{t-1} + J_t,\\ \Theta_t &= \vF_{\rm MA}\,\Theta_{t-1} + {\cal T}_m\,E_{t-1},\\ Y_t &= \vH_t^\intercal\,\Theta_t + E_t. \end{aligned} } $$ where $\vH_t \in \reals^{m}$ is a column vector (the transpose of $\vH_{\rm MA}$), $\mathbb{E}[J_t] = 0$, and $\var(J_t) = O_t$, with $O_t$ an $m\times m$ positive-definite covariance matrix. Note that the true MA coefficients $\vH_{\rm MA}$ do not evolve over time. The assumption of a random walk for $\vH_t$ is simply to prevent fast (and possibly biased) convergence of the coefficients, as well as to provide numerical stability.
The form for $Y_t$ in $({\rm MA.2})$ is said to be in bilinear form.10 11 Because of this, standard Kalman filtering over $(\vH_t, \Theta_t)$ is not possible.
A classical approach to deal with this problem is to approximate the observation model as a linear model through a first-order approximation of the observation $Y_t$ around the previous estimates $\vH_{t-1|t-1}$ and $\Theta_{t-1|t-1}$. This linearisation procedure dates back to the 1960s and is known as the Extended Kalman filter. 12 For the MA model, linearisation around $(\vH_{t-1|t-1}, \Theta_{t-1|t-1})$ takes the form $$ \tag{Y.1} \begin{aligned} \bar{Y}_t &= \vH_{t-1|t-1}^\intercal\,\Theta_{t-1|t-1} + \vH_{t-1|t-1}^\intercal\,\left( \Theta_t - \Theta_{t-1|t-1} \right) + \Theta_{t-1|t-1}^\intercal\,\left(\vH_t - \vH_{t-1|t-1}\right)\\ &= \vH_{t-1|t-1}^\intercal\,\Theta_t + \Theta_{t-1|t-1}^\intercal \vH_t - \vH_{t-1|t-1}^\intercal\Theta_{t-1|t-1}. \end{aligned} $$
Here, $\vH_t \in \reals^{m\times 1}$ and $\Theta_t \in \reals^{m\times 1}$ are both column vectors, so each term in $({\rm Y.1})$ is a scalar.
The expanded SSM in KF-compatible form
With the assumptions in $({\rm MA.2})$ and the linearised measurement $({\rm Y.1})$, the new linearised SSM takes the form $$ \tag{MA.3} \boxed{ \begin{aligned} \vH_t &= \vH_{t-1} + J_t,\\ \Theta_t &= \vF_{\rm MA}\,\Theta_{t-1} + {\cal T}_m\,E_{t-1},\\ \bar{Y}_t &= \vH_{t-1|t-1}^\intercal\,\Theta_t + \Theta_{t-1|t-1}^\intercal \vH_t - \vH_{t-1|t-1}^\intercal\Theta_{t-1|t-1}. \end{aligned} } $$
Next, to perform Kalman filtering, we rewrite $({\rm MA.3})$ in the form $({\rm SSM.1})$. Doing so will allow us to make use of the kalman_filter_step function defined above.
We begin by introducing an expanded latent variable, an expanded transition matrix, and an expanded projection matrix: $$ \begin{aligned} \vW_t = \begin{bmatrix} \vH_t \\ \Theta_t \end{bmatrix}, & & \vA = \begin{bmatrix} \vI_m & \vzero \\ \vzero & \vF_{\rm MA} \end{bmatrix}, & & \vC_t = \begin{bmatrix} \Theta_{t-1|t-1}^\intercal & \vH_{t-1|t-1}^\intercal \end{bmatrix}. \end{aligned} $$ Then, we define the dynamics noise $\vB~U_t$ with $$ \begin{aligned} \vB = \begin{bmatrix} \vI_m & \vzero \\ \vzero & {\cal T}_m \end{bmatrix}, & & U_t = \begin{bmatrix} J_t \\ E_t \end{bmatrix}. \end{aligned} $$ Next, the SSM $({\rm MA.3})$ is written in canonical SSM form as $$ \tag{MA.4} \boxed{ \begin{aligned} \vW_t &= \vA\,\vW_{t-1} + \vB~U_t\\ \bar{Y}_t &= \vC_t\,\vW_t - \vH_{t-1|t-1}^\intercal\Theta_{t-1|t-1}. \end{aligned} } $$
We see that $({\rm MA.4})$ resembles $({\rm SSM.1})$ up to a matrix that multiplies the dynamics noise $U_t$ and a bias term in the measurements that takes the form $\vH_{t-1|t-1}^\intercal\Theta_{t-1|t-1}$. We also require the dynamics covariance matrix $Q_t$ and the innovations ${\cal E}_t$. For the former, we have $$ \begin{aligned} \var(\vB\,U_t) &= \vB\,\var(U_t)\,\vB^\intercal\\ &= \vB\, \begin{bmatrix} O_t & \vzero\\ \vzero & R_t \end{bmatrix} \,\vB^\intercal\\ &= \begin{bmatrix} \vI_m & \vzero \\ \vzero & {\cal T}_m \end{bmatrix} \begin{bmatrix} O_t & \vzero\\ \vzero & R_t \end{bmatrix} \begin{bmatrix} \vI_m & \vzero \\ \vzero & {\cal T}_m^\intercal \end{bmatrix} \\ &= \begin{bmatrix} O_t & \vzero \\ \vzero & {\cal T}_m\,R_t\,{\cal T}_m^\intercal \end{bmatrix}\\ &= Q_t. \end{aligned} $$
For the latter, we have $$ {\cal E}_t = y_t - \left(\vC_t\,\vW_{t|t-1} - \vH_{t-1|t-1}^\intercal\,\Theta_{t-1|t-1}\right). $$
Finally, setting up the code for the new SSM so that it works with the kalman_filter_step function becomes a straightforward (but tedious) coding exercise. See the definition of KFState and kf_step_inner in the notebook for details.
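As a rough illustration of that bookkeeping (the notebook's KFState and kf_step_inner are not reproduced here; the helper below, its name, and its shapes are assumptions), the components of $({\rm MA.4})$ could be assembled as follows:

```python
import jax.numpy as jnp

def build_expanded_ssm(H_filt, theta_filt, F_MA, T_m, O, R):
    """Assemble the expanded SSM (MA.4) around the previous estimates (H_{t-1|t-1}, Theta_{t-1|t-1})."""
    m = F_MA.shape[0]
    Z = jnp.zeros((m, m))
    A = jnp.block([[jnp.eye(m), Z], [Z, F_MA]])                # expanded transition matrix
    C = jnp.concatenate([theta_filt, H_filt])[None, :]         # expanded (1 x 2m) projection matrix
    Q = jnp.block([[O, Z], [Z, T_m @ jnp.atleast_2d(R) @ T_m.T]])  # var(B U_t)
    bias = H_filt @ theta_filt                                 # H_{t-1|t-1}^T Theta_{t-1|t-1}
    return A, C, Q, bias
```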
By filtering $({\rm MA.4})$, we are performing both forecasting of the time-series and learning of the MA parameters online, i.e., one datapoint at a time. Of course, one can expect worse performance using the learned $\vH_{t|t}$ than using the true matrix $\vH_{\rm MA}$. However, these forecasts should gradually improve over time.
Next, we analyse the performance of filtering the expanded SSM on the sampled dataset shown above.
Results
We analyse the result of filtering the expanded SSM with the sampled series $y_{1:T}$ shown in the previous experiment.
We begin by plotting the evolution of the filtered (or learned) parameters $\vH_{t|t}$ over time.
One of the most salient points from the figure above is that the estimated model parameters do not converge to the true model parameters $\vH_{\rm MA}$, which, in this experiment, is set to be a vector of ones. This is because of the so-called observability condition for Kalman filtering,13 which specifies the conditions an SSM must satisfy for the true value of the state to be recoverable. As it turns out, our choice of SSM is not observable. One easy check for this is to note that we obtain the same prediction at time $t$ if we rescale $\vH_{t|t}$ to $\alpha\,\vH_{t|t}$ and $\Theta_{t|t}$ to $\frac{1}{\alpha}\,\Theta_{t|t}$, for some $\alpha > 0$.
One-step-ahead forecast
Despite the non-observability of our SSM, we can still get meaningful predictions. The following plot shows the (best) one-step-ahead prediction using $({\rm MA.1})$, the (learned) one-step-ahead prediction using $({\rm MA.4})$, and the true one-step-ahead value.

As expected, at the beginning of the experiment, the one-step-ahead forecasts using the learned $\vH_{t|t}$ do not align with the forecasted values using the true $\vH_{\rm MA}$. However, they gradually start to resemble the best predictions. To validate this observation, the following plot shows the 100-step rolling root mean squared error between the one-step-ahead forecast and the true forecast.

We see that, at the beginning of the experiment, the RMSE of the forecasts made with the learned projection matrix $\vH_{t|t}$ is much higher than that of the optimal $\vH_{\rm MA}$. However, as we process more data, the RMSE obtained with the learned projection matrix approaches that of the true projection matrix.
Multi-step forecast
Finally, we test the performance of filtering $({\rm MA.4})$ for multi-step forecasting. To evaluate the performance of the multi-step-ahead forecast, we compute the correlation between the $k$-step-ahead predictions and the observed values, i.e., $$ \rho_k = \frac {\sum_t (Y_{t+k|t} - \bar{Y}_{t+k})(y_{t+k} - \bar{y}_{t+k})} {\left(\sum_t (Y_{t+k|t} - \bar{Y}_{t+k})^2~ \sum_t (y_{t+k} - \bar{y}_{t+k})^2\right)^{1/2}} $$ for varying $k$. Here, $ \bar{Y}_{t+k} = \frac{1}{T-k}\sum_t Y_{t+k|t} $, and $ \bar{y}_{t+k} = \frac{1}{T-k}\sum_t y_{t+k} $. The sums run from $t=1$ to $T-k$.
The figure above shows that the correlation coefficient $\rho_k$ obtained using the true $\vH_{\rm MA}$ is similar to that obtained using the learned matrix $\vH_{t|t}$.
Appendix
Recall the following properties of an SSM and innovations
- (P.1) $Y_{1:t} = \vL\,{\cal E}_{1:t}$ — measurement / innovation relationship,
- (P.2) ${\cal E}_t = Y_t - \vH_t\,\Theta_{t|t-1}$ — innovation under an SSM,
- (P.3) $\var(Y_{1:t}) = \vL\,{S}_{1:t}\,\vL^\intercal$ — Cholesky decomposition for variance of measurements,
- (P.4) $\var({\cal E}_{1:t}) = S_{1:t} = {\rm diag}(S_1, \ldots, S_t)$ — variance of innovations, and
- (P.5) $\cov(\Theta_t, {\cal E}_t) = \Sigma_{t|t-1}\,\vH_t^{\intercal}$.
Next, recall that the BLUP and EVC matrix take the form $$ \begin{aligned} \Theta_{t|k} &= \cov(\Theta_t, Y_{1:k})\,\var(Y_{1:k})^{-1}\,Y_{1:k}\\ \Sigma_{t|k} &= \var(\Theta_t - \Theta_{t|k}). \end{aligned} $$
Proof of Proposition 1
Proposition 1 is a special case of Proposition 2.2, F.02 and Proposition 2.3, F.02; however, we derive it here for completeness.
First, the predicted BLUP gets rewritten as: $$ \begin{aligned} \Theta_{t|t-1} &= \cov(\Theta_t, Y_{1:t-1})\,\var(Y_{1:t-1})^{-1}\,Y_{1:t-1}\\ &= \cov(\vF_{t-1}\,\Theta_{t-1} + U_{t-1}, Y_{1:t-1})\,\var(Y_{1:t-1})^{-1}\,Y_{1:t-1}\\ &= \vF_{t-1}\,\cov(\Theta_{t-1}, Y_{1:t-1})\,\var(Y_{1:t-1})^{-1}\,Y_{1:t-1}\\ &= \vF_{t-1}\,\Theta_{t-1|t-1}, \end{aligned} $$ where the third equality follows from condition (P.3) in Proposition 2.1, F.02.
Next, the predicted EVC takes the form $$ \begin{aligned} \Sigma_{t|t-1} &= \var(\Theta_t - \Theta_{t|t-1})\\ &= \var\left(\vF_{t-1}\,\Theta_{t-1} + U_{t-1} - \vF_{t-1}\,\Theta_{t-1|t-1}\right)\\ &= \var\left(\vF_{t-1}\,\Theta_{t-1} - \vF_{t-1}\,\Theta_{t-1|t-1}\right) + \var(U_{t-1})\\ &= \vF_{t-1}\,\var\left(\Theta_{t-1} - \Theta_{t-1|t-1}\right)\,\vF_{t-1}^\intercal + Q_{t-1}\\ &= \vF_{t-1}\,\Sigma_{t-1|t-1}\,\vF_{t-1}^\intercal + Q_{t-1}. \end{aligned} $$ $$ \ \tag*{$\blacksquare$} $$
Proof of Proposition 2
Here, we show Proposition 2.
First, recall the additive representation of the BLUP form for $\Theta_{t|t}$: $$ \begin{aligned} \Theta_{t|t} &=\cov(\Theta_t, Y_{1:t})\var(Y_{1:t})^{-1}\,Y_{1:t}\\ &=\cov(\Theta_t, \vL{\cal E}_{1:t})(\vL\,S_{1:t}\,\vL^\intercal)^{-1}\,\vL{\cal E}_{1:t}\\ &= \cov(\Theta_t, {\cal E}_{1:t})\vL^\intercal\,\vL^{-\intercal}S_{1:t}^{-1}\vL^{-1}\,\vL\,{\cal E}_{1:t}\\ &= \cov(\Theta_t, {\cal E}_{1:t})\,{\rm diag}(S_1^{-1}, \ldots, S_t^{-1})\,{\cal E}_{1:t}\\ &= \begin{bmatrix} \cov(\Theta_t, {\cal E}_1) & \ldots & \cov(\Theta_t, {\cal E}_t) \end{bmatrix} \begin{bmatrix} S_1^{-1}\,{\cal E}_1 \\ \vdots \\ S_t^{-1}{\cal E}_t \end{bmatrix}\\ &=\sum_{k=1}^t \cov(\Theta_t, {\cal E}_k)\,S_k^{-1}\,{\cal E}_k, \end{aligned} $$ where we have used Properties (P.1), (P.3), and (P.4) above.
Then, we decompose the BLUP into its predicted and updated forms $$ \begin{aligned} \Theta_{t|t} &= \sum_{k=1}^t \cov(\Theta_t, {\cal E}_k)\,S_k^{-1}\,{\cal E}_k\\ &= \underbrace{ \cov(\Theta_t, {\cal E}_t) }_{\Sigma_{t|t-1}\,\vH_t^\intercal} \,S_t^{-1}\,{\cal E}_t + \underbrace{ \sum_{k=1}^{t-1} \cov(\Theta_t, {\cal E}_k)\,S_k^{-1}\,{\cal E}_k }_{\Theta_{t|t-1}}\\ &= \Sigma_{t|t-1}\,\vH_t^\intercal\,S_t^{-1}{\cal E}_t + \Theta_{t|t-1}\\ &=\Theta_{t|t-1} + \vK_t\,{\cal E}_t\\ &= \vK_t\,(Y_t - \vH_t\,\Theta_{t|t-1}) + \Theta_{t|t-1}\\ &= \vK_t\,Y_t + (\vI - \vK_t\,\vH_t)\Theta_{t|t-1}, \end{aligned} $$ where the second equality follows from property (P.5), the fifth equality follows from property (P.2), $\vK_t = \Sigma_{t|t-1}\vH_t^\intercal S_t^{-1}$, and $S_t = \vH_t\,\Sigma_{t|t-1}\,\vH_t^\intercal + R_t$ was derived in Proposition 2.5, F.02. $$ \ \tag*{$\blacksquare$} $$
Proof of Proposition 3
Here, we show Proposition 3.
Recall $\Sigma_{t|t-1} = \var(\Theta_t - \Theta_{t|t-1})$ and $\Theta_{t|t} = \Theta_{t|t-1} + \cov(\Theta_t, {\cal E}_t)\,S_t^{-1}\,{\cal E}_t$. Then, $$ \Sigma_{t|t} = \var(\Theta_t - \Theta_{t|t}) = \var\left(\Theta_t - \Theta_{t|t-1} - \cov(\Theta_t, {\cal E}_t)S_t^{-1}\,{\cal E}_t\right). $$
Form 1 (also known as the Joseph form 14) follows from Property (P.3) and rewriting the expression above in terms of $\Sigma_{t|t-1}$ and $R_t$: $$ \begin{aligned} \Sigma_{t|t} &= \var(\Theta_t - \Theta_{t|t-1} - \cov(\Theta_t, {\cal E}_t)S_t^{-1}\,{\cal E}_t)\\ &= \var((\Theta_t - \Theta_{t|t-1}) - \vK_t\,{\cal E}_t)\\ &= \var((\Theta_t - \Theta_{t|t-1}) - \vK_t\,(\vH_t[\Theta_t - \Theta_{t|t-1}] + E_t))\\ &= \var((\Theta_t - \Theta_{t|t-1}) - \vK_t\,\vH_t[\Theta_t - \Theta_{t|t-1}] - \vK_t\,E_t)\\ &= \var\Big([\vI - \vK_t\,\vH_t](\Theta_t - \Theta_{t|t-1}) - \vK_t\,E_t\Big)\\ &= [\vI - \vK_t\,\vH_t]\var(\Theta_t - \Theta_{t|t-1})[\vI - \vK_t\,\vH_t]^\intercal + \vK_t\var(E_t)\vK_t^\intercal\\ &= [\vI - \vK_t\,\vH_t]\Sigma_{t|t-1}[\vI - \vK_t\,\vH_t]^\intercal + \vK_t R_t\vK_t^\intercal, \end{aligned} $$ where the sixth line follows from Property (P.2) in F.02.
Alternatively, form 2 follows from Property (P.5) and rewriting the EVC in terms of $\Sigma_{t|t-1}$ and $S_t$: $$ \begin{aligned} \Sigma_{t|t} &= \var(\Theta_t - \Theta_{t|t-1} - \cov(\Theta_t, {\cal E}_t)\,S_t^{-1}\,{\cal E}_t)\\ &= \var(\Theta_t - \Theta_{t|t-1}) - \var(\cov(\Theta_t, {\cal E}_t)\,S_t^{-1}\,{\cal E}_t)\\ &= \var(\Theta_t - \Theta_{t|t-1}) - \cov(\Theta_t, {\cal E}_t)\,S_t^{-1}\,\,\var({\cal E}_t)\,S_t^{-1}\,\cov({\cal E}_t, \Theta_t)\\ &= \Sigma_{t|t-1} - \Sigma_{t|t-1}\,\vH_t^\intercal\,S_t^{-1}\,\vH_t\,\Sigma_{t|t-1}. \end{aligned} $$
Finally, form 3 is a direct consequence of form 2. From the expression above, we have $$ \begin{aligned} \Sigma_{t|t} &= \var(\Theta_t - \Theta_{t|t-1}) - \cov(\Theta_t, {\cal E}_t)\,S_t^{-1}\,\,\var({\cal E}_t)\,S_t^{-1}\,\cov({\cal E}_t, \Theta_t)\\ &= \var(\Theta_t - \Theta_{t|t-1}) - \vK_t\,\var({\cal E}_t)\,\vK_t^\intercal\\ &= \Sigma_{t|t-1} - \vK_t\,S_t\,\vK_t^\intercal. \end{aligned} $$ $$ \ \tag*{$\blacksquare$} $$
In general, the noise term $E_t$ does not need to be Gaussian, but this assumption is useful for sampling and optimal filtering. ↩︎
This illustrates that the class of AR/MA/ARIMA and related models has quite a strong inductive bias embedded in its assumptions, which might explain why, when well-calibrated, these models can match or outperform their (non-linear) neural-network counterparts in the single-dataset setting; the neural networks, in contrast, have to find a good feature transformation for the evolution of the state using limited or noisy information. See, e.g., https://cbergmeir.com/talks/neurips2024/. ↩︎
https://en.wikipedia.org/wiki/Maximum_likelihood_estimation ↩︎
McLeod, A. Ian, and Ying Zhang. “Faster ARMA maximum likelihood estimation.” Computational Statistics & Data Analysis 52.4 (2008): 2166-2176. ↩︎
Duran-Martin, G., et al. “A unifying framework for generalised Bayesian online learning in non-stationary environments.” Transactions on Machine Learning Research (2025). ↩︎
Anava, Oren, et al. “Online learning for time series prediction.” Conference on learning theory. PMLR, 2013. ↩︎
Liu, C., Hoi, S. C., Zhao, P., & Sun, J. (2016). Online ARIMA Algorithms for Time Series Prediction. Proceedings of the AAAI Conference on Artificial Intelligence, 30(1). ↩︎
The name extended is not due to the extended SSM $(\vH_t, \Theta_t)$ we work with; rather, it refers to extending the KF to work with non-linear state-space models. ↩︎
Benesty, Jacob, Constantin Paleologu, and Silviu Ciochină. “On the identification of bilinear forms with the Wiener filter.” IEEE Signal Processing Letters 24.5 (2017): 653-657. ↩︎
Harville, David A. “Matrix algebra from a statistician’s perspective.” (1998): Chapter 14. ↩︎
Battin, Richard H. “Space guidance evolution-a personal narrative.” Journal of Guidance, Control, and Dynamics 5.2 (1982): 97-110. ↩︎
Southall, B., B. Buxton, and J. Marchant. “Controllability and observability: Tools for Kalman filter design.” British Machine Vision Conference. Vol. 98. 1998. ↩︎
Zanetti, Renato, and Kyle J. DeMars. “Joseph formulation of unscented and quadrature filters with application to consider states.” Journal of Guidance, Control, and Dynamics 36.6 (2013): 1860-1864. ↩︎