Filtering notes (III): the Kalman filter

Posted on May 6, 2025
tl;dr: Kalman filter derivation and its application to time-series forecasting and online learning.

Part III of the filtering notes series.

Introduction

In this post, we derive the Kalman filter (KF) algorithm and explore how it can be applied to time series forecasting and sequential (online) learning. As we will see, the KF algorithm arises naturally from the covariance properties of state-space models discussed in the previous post.

The setup

Consider the linear state-space model $$ \tag{SSM.1} \begin{aligned} \Theta_t &= \vF_{t-1}\,\Theta_{t-1} + U_{t-1},\\ Y_t &= \vH_t\,\Theta_t + E_t. \end{aligned} $$

Here, $Y_t \in \reals^o$ is the observation (or measurement) random vector; $\Theta_\tau \in \reals^d$ is the latent (unobserved) state; $U_t$ is the zero-mean dynamics noise process with known $d$-dimensional covariance matrix $\var(U_\tau) = Q_\tau$; $E_t$ is the zero-mean observation noise process with known $o$-dimensional covariance matrix $\var(E_t) = R_t$; $\vH_t$ is the known $(o\times d)$ projection matrix; and $\vF_\tau$ is the known $(d \times d)$ transition matrix. Here $\tau \geq 0$ and $t \geq 1$.

Throughout this post, we denote by $Y_t$ the measurement random vector at time $t$ and by $y_t$ a sample of this random vector at time $t$. We write the sequence of random vectors $Y_1, \ldots, Y_t$ as the block-column vector $Y_{1:t} \in \reals^{t ~ o}$ and the sequence of observed samples as the block-column vector $y_{1:t} \in \reals^{t ~ o}$.

Given the sample of measurements $y_{1:k}$, for some $k \geq 1$, which arrive in a stream, our goal is (i) to estimate the value of $\Theta_t$ given the observations $y_{1:k}$, and (ii) to estimate the expected error in this estimate, i.e., to quantify how far off we expect our estimate of the latent process to be from the true latent process $\Theta_t$.

Here, our estimate of the latent state $\Theta_t$, given the observations $y_{1:k}$, is the linear mapping $\vA_{t|k} \in \reals^{d \times (k ~ o)}$: $$ \Theta_{t|k} = \vA_{t|k}~y_{1:k}, $$ where the choice of linear mapping is the best linear unbiased estimate 1 $$ \vA_{t|k} = \argmin_{\vA} \mathbb{E}[\| \Theta_t - \vA~Y_{1:k} \|_2^2] = \cov(\Theta_t, Y_{1:k})\,\var(Y_{1:k})^{-1}. $$

Next, our estimate of the expected error, denoted $\Sigma_{t|k}$, is given by $$ \Sigma_{t|k} = \var(\Theta_t - \Theta_{t|k}). $$

In words, $\Theta_{t|k}$ is our estimate of the latent state at time $t$ having seen $y_{1:k}$; and $\Sigma_{t|k}$ is the uncertainty about the value of the true latent state $\Theta_t$ given the estimate $\Theta_{t|k}$. We call $\Theta_{t|k}$ the best linear unbiased predictor (BLUP) and $\Sigma_{t|k}$ the error variance-covariance matrix (EVC).
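
As a quick sanity check (not part of the post's code), we can verify on synthetic, jointly Gaussian samples that the covariance-based mapping above coincides with an ordinary least-squares fit of $\Theta$ on $Y$; all names and values below are illustrative.

import jax
import jax.numpy as jnp

# Sample a latent vector and a noisy linear measurement of it
key_theta, key_noise = jax.random.split(jax.random.PRNGKey(0))
theta = jax.random.normal(key_theta, (10_000, 2))                # latent samples, d = 2
H = jnp.array([[1.0, 0.5], [0.0, 2.0], [1.0, 1.0]])              # arbitrary projection, o = 3
y = theta @ H.T + 0.1 * jax.random.normal(key_noise, (10_000, 3))

# Empirical BLUP mapping A = cov(Theta, Y) var(Y)^{-1}
cov_ty = (theta - theta.mean(0)).T @ (y - y.mean(0)) / (len(y) - 1)
var_y = jnp.cov(y, rowvar=False)
A_blup = cov_ty @ jnp.linalg.inv(var_y)

# A least-squares fit of theta on y recovers (approximately) the same mapping
A_ls, *_ = jnp.linalg.lstsq(y, theta)
print(jnp.allclose(A_blup, A_ls.T, atol=1e-2))                   # True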

In this post, we are interested in predicting the one-step-ahead latent state and its expected error $(\Theta_{t|t-1}, \Sigma_{t|t-1})$, as well as the updated estimate of the latent state and its expected error $(\Theta_{t|t}, \Sigma_{t|t})$.

We write our quantities of interest in pseudocode as $$ \boxed{ \begin{aligned} &\text{// Predict step}\\ &\Theta_{t|t-1} = \vA_{t|t-1}~y_{1:t-1},\\ &\Sigma_{t|t-1} = \var(\Theta_t - \Theta_{t|t-1}),\\ &\text{// Update step}\\ &\Theta_{t|t} = \vA_{t|t}~y_{1:t},\\ &\Sigma_{t|t} = \var(\Theta_t - \Theta_{t|t}). \end{aligned} } $$ As we will see, the KF algorithm is a recursive and efficient way to estimate the sequence of predicted and updated pairs. $$ (\Theta_{0|0}, \Sigma_{0|0}) \xrightarrow{\text{predict}} (\Theta_{1|0}, \Sigma_{1|0}) \xrightarrow{\text{update}} (\Theta_{1|1}, \Sigma_{1|1}) \xrightarrow{\text{predict}} (\Theta_{2|1}, \Sigma_{2|1}) \xrightarrow{\text{update}} \cdots $$

The Kalman filter algorithm

Assuming that $\Theta_{0|0}$ and $\Sigma_{0|0}$ are known, and that the measurements $y_{1:t}$ arrive in a sequence, the KF algorithm first makes a predict step that computes the pair $(\Theta_{t|t-1}, \Sigma_{t|t-1})$, and then makes an update step that computes the pair $(\Theta_{t|t}, \Sigma_{t|t})$.

We formalise the predict and update steps in the following propositions. Below, Proposition 1 shows how to compute $(\Theta_{t|t-1}, \Sigma_{t|t-1})$, Proposition 2 shows how to compute $\Theta_{t|t}$, and Proposition 3 shows how to compute $\Sigma_{t|t}$.

Proposition 1: The Kalman filter prediction

Let $\Sigma_{1|0} = \var(\Theta_1)$. The predicted BLUP and EVC take the form

$$ \begin{aligned} \Theta_{t|t-1} &= \vF_{t-1}\,\Theta_{t-1|t-1},\\ \Sigma_{t|t-1} &= \vF_{t-1}\,\Sigma_{t-1|t-1}\,\vF_{t-1}^\intercal + Q_{t-1}. \end{aligned} $$

For a proof, see the Appendix.

Proposition 2: The Kalman filter mean update

Let $(\Theta_{t|t-1}, \Sigma_{t|t-1})$ be the predicted BLUP and predicted EVC respectively, $\vH_t$ be the projection matrix, $R_t$ be the observation covariance, and $y_t$ be the observation at time $t$. The updated BLUP $\Theta_{t|t}$ can be written in one of two ways $$ \Theta_{t|t} = \begin{cases} \Theta_{t|t-1} + \vK_t\,\cal{E}_t,\\ \vK_t\,y_t + (\vI - \vK_t\,\vH_t)\,\Theta_{t|t-1},\\ \end{cases} $$ where $$ \begin{aligned} {\cal E}_t &= y_t - \vH_t\,\Theta_{t|t-1},\\ \vK_t &= \Sigma_{t|t-1}\vH_t^\intercal\,S_t^{-1},\\ S_t &= \vH_t\,\Sigma_{t|t-1}\,\vH_t^\intercal + R_t. \end{aligned} $$ Here, ${\cal E}_t$ is the innovation at time $t$, $\vK_t$ the Kalman gain matrix, and $S_t$ is the variance of the innovation at time $t$.

For a proof, see the Appendix.
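
As a quick numerical check (illustrative only, with arbitrary values of compatible shapes), the two forms of the mean update above agree:

import jax
import jax.numpy as jnp

# Arbitrary stand-ins for Sigma(t|t-1), Theta(t|t-1), H_t, R_t, and y_t
d, o = 4, 2
key_H, key_mu, key_y = jax.random.split(jax.random.PRNGKey(1), 3)
H = jax.random.normal(key_H, (o, d))
Sigma_pred = jnp.eye(d)
R = jnp.eye(o) * 0.5
mu_pred = jax.random.normal(key_mu, (d,))
y = jax.random.normal(key_y, (o,))

S = H @ Sigma_pred @ H.T + R
K = Sigma_pred @ H.T @ jnp.linalg.inv(S)
innovation = y - H @ mu_pred

form_1 = mu_pred + K @ innovation
form_2 = K @ y + (jnp.eye(d) - K @ H) @ mu_pred
print(jnp.allclose(form_1, form_2, atol=1e-5))   # True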

Proposition 3: The Kalman filter covariance update

Let $\Sigma_{t|t-1}$ be the predicted EVC. The updated EVC matrix $\Sigma_{t|t}$ can be written in one of three ways: $$ \Sigma_{t|t} = \begin{cases} \left[\vI - \vK_t\,\vH_t\right]\,\Sigma_{t|t-1}\,\left[\vI - \vK_t\,\vH_t\right]^\intercal + \vK_t R_t\,\vK_t^\intercal,\\ \Sigma_{t|t-1} - \Sigma_{t|t-1}\,\vH_t^\intercal\,S_t^{-1}\,\vH_t\,\Sigma_{t|t-1},\\ \Sigma_{t|t-1} - \vK_t\,S_t\,\vK_t^\intercal. \end{cases} $$

For a proof, see the Appendix.
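
Similarly, a small numerical check (again with arbitrary, illustrative values) confirms that the three covariance forms coincide:

import jax
import jax.numpy as jnp

# Arbitrary stand-ins for Sigma(t|t-1), H_t, and R_t
d, o = 4, 2
H = jax.random.normal(jax.random.PRNGKey(3), (o, d))
Sigma_pred = jnp.eye(d) * 2.0
R = jnp.eye(o) * 0.5

S = H @ Sigma_pred @ H.T + R
K = Sigma_pred @ H.T @ jnp.linalg.inv(S)

Sigma_joseph = (jnp.eye(d) - K @ H) @ Sigma_pred @ (jnp.eye(d) - K @ H).T + K @ R @ K.T
Sigma_form_2 = Sigma_pred - Sigma_pred @ H.T @ jnp.linalg.inv(S) @ H @ Sigma_pred
Sigma_form_3 = Sigma_pred - K @ S @ K.T
print(jnp.allclose(Sigma_joseph, Sigma_form_2, atol=1e-4),
      jnp.allclose(Sigma_form_2, Sigma_form_3, atol=1e-4))   # True True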

The algorithm

Propositions 1 - 3 encapsulate the KF algorithm. We show the KF pseudoalgorithm below.

Take $\Sigma_{1|0} = \var(\Theta_1) = \vF_0~\Sigma_{0|0}~\vF_0^\intercal + Q_0$ and assume $\Theta_{0|0}$ is known. Then, for $t=1, \ldots, T$, a step of the KF algorithm is given by

$$ \boxed{ \begin{array}{l|l} \text{// Predict step}\\ \Theta_{t|t-1} = \vF_{t-1}\,\Theta_{t-1|t-1} & \cov(\Theta_t, Y_{1:t-1})\,\var(Y_{1:t-1})^{-1}\,y_{1:t-1}\\ \Sigma_{t|t-1} = \vF_{t-1}\,\Sigma_{t-1|t-1}\,\vF_{t-1}^\intercal + Q_{t-1} & \var(\Theta_t - \Theta_{t|t-1})\\ Y_{t|t-1} = \vH_t~\Theta_{t|t-1} & \cov(Y_t, Y_{1:t-1})\,\var(Y_{1:t-1})^{-1}\,y_{1:t-1}\\ \text{// Gain and innovation} \\ S_t = \vH_t\,\Sigma_{t|t-1}\,\vH_t^\intercal + R_t & \var({\cal E}_t)\\ \vK_t = \Sigma_{t|t-1}\vH_t^\intercal\,S_t^{-1} & \cov(\Theta_t, {\cal E}_t)\,\var({\cal E}_t)^{-1} \\ {\cal E}_t = y_t - Y_{t|t-1}\\ \text{// Update step}\\ \Theta_{t|t} = \Theta_{t|t-1} + \vK_t\,{\cal E}_t & \cov(\Theta_t, Y_{1:t})\,\var(Y_{1:t})^{-1}\,y_{1:t}\\ \Sigma_{t|t} = \Sigma_{t|t-1} - \vK_t\,S_t\,\vK_t^\intercal & \var(\Theta_t - \Theta_{t|t}) \end{array} } $$ In the pseudocode above, the term $Y_{t|t-1}$ denotes the one-step-ahead forecast of the measurement $y_t$ having observed $y_{1:t-1}$.

Python implementation

The pseudoalgorithm above can be readily implemented in Python. Here, we define a jax.lax.scan-compatible function, which we use in the experiments below.

import jax.numpy as jnp


def kalman_filter_step(bel, y, H, F, R, Q):
    """
    jax.lax.scan compatible step function to run the
    Kalman filter with constant and known dynamics (H, F, R, Q).

    Parameters
    ----------
    bel: ((d,), (d,d)) tuples of arrays --- BLUP and EVC from previous step
    y: (o,) array --- observation (measurement)
    H: (o, d) array --- projection matrix
    F: (d, d) array --- transition matrix
    R: (o, o) array --- observation covariance
    Q: (d, d) array --- dynamics covariance
    """
    mu, Sigma = bel

    # Proposition 1
    # Predict mean and covariance
    mu_pred = F @ mu
    Sigma_pred = F @ Sigma @ F.T + Q
    yhat = H @ mu_pred

    # Proposition 2
    # Innovation, variance of the innovation, and the Kalman gain
    err = y - yhat
    S = H @ Sigma_pred @ H.T + R
    K = jnp.linalg.solve(S, H @ Sigma_pred).T

    # Proposition 3
    # Update mean and covariance
    mu = mu_pred + K @ err
    # can be either of the forms in Proposition 3
    Sigma = Sigma_pred - K @ S @ K.T 

    bel_next = (mu, Sigma)
    out = {
        "mu": mu,
        "yhat": yhat # one-step-ahead forecast
    }

    return bel_next, out

The implementation above obtains the Kalman gain matrix $\vK_t$ through jnp.linalg.solve(S, H @ Sigma_pred).T. This step avoids having to invert the matrix S explicitly. Instead, we observe that the Kalman gain can be written as $$ \vK_t = \Sigma_{t|t-1}~\vH_t^\intercal~S_t^{-1} \iff \vK_t~S_t = \Sigma_{t|t-1}\vH_t^\intercal. $$ If we transpose each of the terms in the equation above, we obtain the system of equations $$ S_t~\vK_t^\intercal = \vH_t~\Sigma_{t|t-1}, $$ which takes the form $\vA ~ \vX = \vB$, with $\vA$ an $(o\times o)$ positive-definite matrix, $\vB$ a known $(o \times d)$ matrix, and $\vX$ the unknown $(o \times d)$ matrix. Thus, we use the solve function to find $\vX = \vK_t^\intercal$.
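
To make the equivalence concrete, here is a small, self-contained check (with arbitrary shapes, not part of the post's code) that the solve-based gain matches the one obtained by forming $S_t^{-1}$ explicitly:

import jax
import jax.numpy as jnp

d, o = 5, 3
H = jax.random.normal(jax.random.PRNGKey(2), (o, d))
Sigma_pred = jnp.eye(d) * 2.0
R = jnp.eye(o)
S = H @ Sigma_pred @ H.T + R

K_inverse = Sigma_pred @ H.T @ jnp.linalg.inv(S)     # explicit inverse
K_solve = jnp.linalg.solve(S, H @ Sigma_pred).T      # linear solve, as in the code above
print(jnp.allclose(K_inverse, K_solve, atol=1e-4))   # True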


Experiments

Here, we present two use cases for the Kalman filter: time-series forecasting and sequential (online) learning.

In the first experiment, we use the KF algorithm to perform multi-step forecasting of a moving average (MA) model. We show that if the underlying statistical model can be written as a linear SSM, then sequential forecasting can be readily done using the KF equations.

In the second experiment, we show that the flexibility of the SSM assumption allows us to tackle a harder problem: multi-step forecasting and online estimation of the coefficients of a time series model.

Experiment 1: The Kalman filter for forecasting

We consider the classical problem of time-series forecasting. See this notebook for the code we use in this experiment.

Suppose we are given observations $y_{1:T}$ sampled from some SSM process. Having seen observations up to time $t$, our goal is to make $k \geq 1$ future estimates of the observations, i.e., having $y_{1:t}$, we seek to construct the $k$-dimensional vector of forecasts $Y_{t+1:t+k|t} = Y_{t+1|t}, \ldots, Y_{t+k|t}$, where $Y_{t+k|t} = \vH_{t+k}~\Theta_{t+k|t}$. Because $\vH_t$ is known for all $t$, a $k$-step-ahead forecast requires only the estimation of the BLUP $\Theta_{t+\ell|t}$ for $\ell=1,\ldots, k$ (we relax this assumption in Experiment 2).

Environment setup: the moving-average (MA) model

Take the observations $y_{1:T}$ sampled from an $m$-th order moving average (MA($m$)) process with known $m\geq 1$ and known SSM coefficients. As we saw in the previous post, the SSM form of an MA($m$) process is $$ \tag{MA.1} \boxed{ \begin{aligned} \Theta_t &= \vF_{\rm MA}\,\Theta_{t-1} + {\cal T}_m\,E_{t-1},\\ Y_t &= \vH_{\rm MA}\,\Theta_t + E_t. \end{aligned} } $$ In this experiment, we take the $(1\times m)$ matrix of coefficients $\vH_{\rm MA} = [1, 1, \ldots, 1]$, $\var(E_t) = R_t = 1.0$ for all $t$, and $\vF_{\rm MA}$ and ${\cal T}_m$ as defined in F.02, MA process.

In the code below, we specify the configuration of the MA process we use to sample the sequence of observations.

# Define SSM configuration
ma_n = 10
R = jnp.eye(1) * 1.0 ** 2
H_ma = jnp.ones(ma_n) * 1.
H, F, T = init_ma_components(H_ma=H_ma)
Q = T @ R @ T.T # Qt = var(T E_t)

The function init_ma_components is defined in F.02, MA process. A sample of an MA($m$) process with $m=10$ and $E_t \sim {\cal N}(0, R_t)$ is shown below. 2 ma(10) sample
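
For reference, here is a hedged sketch of what init_ma_components could look like under the usual shift-matrix construction of an MA($m$) process (the actual definition lives in F.02): the latent state carries the last $m$ noise terms, $\vF_{\rm MA}$ shifts them down by one position, and ${\cal T}_m$ places the newest noise term in the first slot.

import jax.numpy as jnp

def init_ma_components(H_ma):
    """Possible shift-matrix construction of the MA(m) SSM components (sketch only)."""
    m = len(H_ma)
    H = jnp.asarray(H_ma)                        # (m,) projection coefficients
    F = jnp.eye(m, k=-1)                         # (m, m) shift matrix
    T = jnp.zeros((m, 1)).at[0, 0].set(1.0)      # (m, 1) selector for the newest noise term
    return H, F, T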

Filtering and one-step-ahead forecasts

Consider a sample run of observations $y_{1:t}$ which are stored in a t-dimensional array y. Then, consecutive one-step-ahead forecasts of the time-series, i.e., $Y_{1|0}, Y_{2|1}, \ldots, Y_{t|t-1}$, can be obtained by simply running the KF algorithm. This is because the innovation term in the KF is the difference between the observation $y_t$ and the one-step-ahead forecast $Y_{t|t-1} := \vH_{\rm MA}~\Theta_{t|t-1}$.

In code, we find these one-step-ahead forecasts by running kalman_filter_step and storing yhat.

import jax
from functools import partial

mu_init = jnp.zeros(ma_n) # initialise mean mu(0|0)
Sigma_init = jnp.eye(ma_n) # initialise covariance Sigma(0|0)
bel_init = (mu_init, Sigma_init)

# Initialise step function and run
_step = partial(kalman_filter_step, H=H[None, :], F=F, R=R, Q=Q)
(mu_final, Sigma_final), hist = jax.lax.scan(_step, bel_init, y[:, None])

The output hist["y_corrected"] stores the filtered (updated) estimate of the observations, i.e. $Y_{t|t}$, and the output hist["yhat"] stores the one-step-ahead forecast, i.e., $Y_{t|t-1}$.

The figure below shows the sampled MA($m$) process (y) as well as the one-step-ahead forecast hist["yhat"]. The black lines show the distance between the forecasts and the realised values. ma(10) sample The takeaway from this plot is that one-step-ahead predictions are obtained as a by-product of rewriting the model as an SSM and running the KF algorithm. We can easily extend this idea to multi-step forecasting. We show this next.

Multi-step-ahead forecast

In various time-series problems, the goal is to obtain multi-step forecasts $Y_{t + k|t}$ for fixed $t$ and varying $k \geq 1$. Following Proposition 2.2 in F.02, the $k$-step-ahead estimate $\Theta_{t+k|t}$ is obtained by pre-multiplying $\Theta_{t|t}$ by the next $k$ transition matrices, i.e., $\Theta_{t+k|t} = \left(\prod_{\tau=0}^{k-1} \vF_{t+\tau}\right)~\Theta_{t|t}$. Then, the $k$-step-ahead forecast $Y_{t+k|t}$ is given by pre-multiplying the predicted state $\Theta_{t+k|t}$ by $\vH_{t+k}$, i.e., $Y_{t+k|t} = \vH_{t+k}~\Theta_{t+k|t}$. 3

We specify how to carry out multiple $k$-step ahead predictions for the MA model in the code snippet below. The required inputs for the function are the state $\Theta_{t|t}$, the known transition matrix $\vF_{\rm MA}$, the known projection matrix $\vH_{\rm MA}$, and the number of forecasts $k$.

def forecast_obs_mean(F, H, mu, k):
    """
    Following Proposition 2.2 in F.02.
    This function defines an inner step function
    that pre-multiplies H in the carry output of the _step function.

    Parameters
    ----------
    F: (d,d) array --- transition matrix
    H: (o,d) array --- projection matrix
    mu: (d,) array --- BLUP theta(t|t)
    k: int --- number of forecasts
    """
    def _step(mu, _):
        mu_next = F @ mu # forecast latent variable
        y_pred = H @ mu_next # forecast observation
        return mu_next, y_pred
    
    # Specify the number of steps and run the _step function,
    # which returns Y(t+j|t) for j = 1, ..., k.
    steps = jnp.arange(k)
    _, y_pred = jax.lax.scan(_step, mu, steps)
    
    # Prepend the filtered estimate Y(t|t)
    y_init = H @ mu
    y_pred = jnp.concatenate([y_init[None, :], y_pred])
    return y_pred

Next, a time-series of forecasted values can be obtained by modifying the out dictionary in kalman_filter_step to be

    out = {
        "y_filter": H @ mu,  # updated (filtered) estimate Y(t|t)
        "y_forecast": forecast_obs_mean(F, H, mu, n_forecast),
    }

with n_forecast >= 1.
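
One possible way to wire this up, assuming the modified step function also accepts an n_forecast keyword argument (a hypothetical signature, not shown in the post), is to fix it with partial and reuse the same scan call as before:

# Hypothetical wiring: assume the modified kalman_filter_step takes an extra
# n_forecast argument and returns the dictionary shown above.
n_forecast = 10
_step = partial(kalman_filter_step, H=H[None, :], F=F, R=R, Q=Q, n_forecast=n_forecast)
(mu_final, Sigma_final), hist = jax.lax.scan(_step, bel_init, y[:, None])

y_filter = hist["y_filter"]      # (T, 1): updated estimates Y(t|t)
y_forecast = hist["y_forecast"]  # (T, n_forecast + 1, 1): multi-step forecasts per time step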

The following plot shows the updated estimate for $y_t$, namely $Y_{t|t}$ (hist["y_filter"]) and the $10$-step-ahead forecasts $Y_{t+k|t}$ for $k=1,\ldots, 10$ (hist["y_forecast"]). The blue lines correspond to forecasts where $Y_{t|t} > 0$ and the red lines correspond to forecasts where $Y_{t|t} < 0$. ma(10) multi-step-forecast We can see a clear pattern for the MA($m$) forecast which we can roughly summarise as follows: if $Y_{t|t}$ is above zero, predict a downward trend; if $Y_{t|t}$ is below zero, predict an upward trend.

Multi-step error forecast

Finally, we combine our forecasts $Y_{t+k|t}$ with the expected forecast error $\var(Y_{t+k} - Y_{t+k|t})$. In particular, we compute the error bound $Y_{t+k|t} \pm \sqrt{\var(Y_{t+k} - Y_{t+k|t})}$, which quantifies our uncertainty about the value of the future observation (under the assumption of symmetric errors).

The estimated forecast error $\var(Y_{t+k} - Y_{t+k|t})$ can be computed directly from the predicted EVC $\Sigma_{t+k|t}$ derived in Proposition 2.2 in F.02. This is because $$ \begin{aligned} \var(Y_{t+k} - Y_{t+k|t}) &= \var\left(\vH_{t+k}\,\Theta_{t+k} + E_{t+k} - \vH_{t+k}\,\Theta_{t+k|t}\right)\\ &= \var\left(\vH_{t+k}\,\Theta_{t+k} - \vH_{t+k}\,\Theta_{t+k|t}\right) + \var(E_{t+k})\\ &= \vH_{t+k}\,\var\left(\Theta_{t+k} - \Theta_{t+k|t}\right)\,\vH_{t+k}^\intercal + R_{t+k}\\ &=\vH_{t+k}\,\Sigma_{t+k|t}\,\vH_{t+k}^\intercal + R_{t+k}, \end{aligned} $$ where the second equality follows from property (P.2) in Proposition 2.1 in F.02. We code this result as follows.

def forecast_obs_cov(F, Sigma, Q, R, H, k):
    """
    Compute sequence of forecasted error covariance
    Following Proposition 2.3 in F.02

    Parameters
    ----------
    F: (d,d) array --- transition matrix
    Sigma: (d,d) array --- EVC Sigma(t|t)
    Q: (d,d) array --- dynamics covariance
    R: (o,o) array --- observation covariance
    H: (o,d) array --- projection matrix
    k: int --- number of forecasts
    """
    def _step(state, _):
        Sigma_mult, dynamics_carry, F_mult = state
        dynamics_carry = dynamics_carry + F_mult @ Q @ F_mult.T
        F_mult = F @ F_mult
        Sigma_mult = F @ Sigma_mult @ F.T
        # EVC
        Sigma_pred = Sigma_mult + dynamics_carry
        # Predicted error
        var_pred = H @ Sigma_pred @ H.T + R
        state_next = (Sigma_mult, dynamics_carry, F_mult)
        return state_next, var_pred

    steps = jnp.arange(k)
    dim_state = len(Q)
    dynamics_init = jnp.zeros((dim_state, dim_state))
    # start F_mult at the identity so that the first accumulated dynamics term is Q itself
    state_init = (Sigma, dynamics_init, jnp.eye(dim_state))
    _, var_pred = jax.lax.scan(_step, state_init, steps)
    return var_pred
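
As a usage sketch, the two forecast functions can be combined to build the two-standard-deviation band plotted below; here mu_final and Sigma_final denote the final belief returned by the jax.lax.scan call in the filtering section, and the variable names are illustrative.

k = 10
y_forecast = forecast_obs_mean(F, H[None, :], mu_final, k)            # (k + 1, 1)
var_forecast = forecast_obs_cov(F, Sigma_final, Q, R, H[None, :], k)  # (k, 1, 1)

std_forecast = jnp.sqrt(var_forecast.squeeze())
upper_bound = y_forecast[1:].squeeze() + 2 * std_forecast
lower_bound = y_forecast[1:].squeeze() - 2 * std_forecast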

The following figure shows the updated (filtered) steps $Y_{t|t}$ in black, the forecasted observations $Y_{t+k|t}$ in red, their two-standard-deviation error bounds as the red area, and the true future observations $Y_{t+k}$ in blue. ma(10) forecasted BLUP and error standard deviation

Experiment 2: Kalman filtering for (online) learning and forecasting

The previous experiment showed that knowing the coefficients of an MA process ($\vH_{\rm MA}$) enables efficient sequential forecasting of the time-series. In real life scenarios, however, we rarely (if ever) get to observe the true coefficients of an MA process. Consequently, they must be estimated. See this notebook for the code we use in this experiment.

The classical approach to estimating the coefficients of traditional time series models, such as the MA model (assuming that the noise terms $E_t$ are zero-mean Gaussians), is through maximum-likelihood estimation: given some training data of past observations $y_{1:t}$, we aim to find the matrix $\vH$ that maximises the likelihood of the data having been generated by $\vH$. 4 5

In this experiment, however, we adopt an online approach to estimating $\vH_{\rm MA}$. Because the observations $y_{1:T}$ arrive sequentially (one datapoint at a time), we continuously update our estimate as new data streams in.

As we show below, sequential estimation of the matrix $\vH_{\rm MA}$ and sequential forecasting of $Y_{t+k}$ can be achieved jointly if we introduce an additional latent variable $\vH_t$ that evolves over time. As a consequence, we obtain an extended SSM, over which we can filter to obtain estimates $(\Theta_{t+k|t}, \vH_{t+k|t}, Y_{t+k|t})$ for fixed $t$ and varying $k\geq 1$.

This mode of learning via an extended SSM allows us to process time series data in real time and adapt to shifts in the data generating process.6 While online learning of ARMA-style models has been studied in previous works, 7 8 the method here follows a simplified approach based on a linearisation technique known as the extended Kalman filter.9

Problem setup

Suppose we are given a sequence of observations $y_{1:T}$, which we are told is sampled from an MA($m$) process with known $m$, but unknown coefficients $\vH_{\rm MA} = [a_1, \ldots, a_m]$. Our task is to estimate the future observations $Y_{t+k|t}$ given $y_{1:t}$. Because we do not know the coefficients of the MA($m$) process, computation of future observations $Y_{t+k|t}$ requires the estimation of (i) the latent vector $\Theta_{t}$, as well as (ii) the coefficients (or parameters) $\vH_{\rm MA}$.

A practical approach to handling SSMs with unknown system components is to expand the SSM by incorporating those unknowns into the latent process. In our case, for example, the unknown component is the vector of coefficients $\vH_{\rm MA}$. Thus, we expand $({\rm MA.1})$ to include an $m$-dimensional vector $\vH_t$ that follows a random walk, and which represents the evolution of the coefficients. Under these considerations, the expanded SSM takes the form $$ \tag{MA.2} \boxed{ \begin{aligned} \vH_t &= \vH_{t-1} + J_t,\\ \Theta_t &= \vF_{\rm MA}\,\Theta_{t-1} + {\cal T}_m\,E_{t-1},\\ Y_t &= \vH_t^\intercal\,\Theta_t + E_t. \end{aligned} } $$ Where $\vH_t \in \reals^{m}$ is a column vector (the transpose of $\vH_{\rm MA}$), $\mathbb{E}[J_t] = 0$ and $\var(J_t) = O_t$, with $O_t$ an $m\times m$ positive-definite covariance matrix. Note that the true MA coefficients $\vH_{\rm MA}$ do not evolve over time. The assumption of a random walk for $\vH_t$ is simply to prevent fast (and possibly biased) convergence of the coefficients, as well as to provide numerical stability.

The form for $Y_t$ in $({\rm MA.2})$ is said to be in bilinear form.10 11 Because of this, standard Kalman filtering over $(\vH_t, \Theta_t)$ is not possible.

A classical approach to deal with this problem is to approximate the observation model as a linear model through a first-order approximation of the observation $Y_t$ around the previous estimates $\vH_{t-1|t-1}$ and $\Theta_{t-1|t-1}$. This linearisation procedure dates back to the 1960s and is known as the Extended Kalman filter. 12 For the MA model, linearisation around $(\vH_{t-1|t-1}, \Theta_{t-1|t-1})$ takes the form $$ \tag{Y.1} \begin{aligned} \bar{Y}_t &= \vH_{t-1|t-1}^\intercal\,\Theta_{t-1|t-1} + \vH_{t-1|t-1}^\intercal\,\left( \Theta_t - \Theta_{t-1|t-1} \right) + \Theta_{t-1|t-1}^\intercal\,\left(\vH_t - \vH_{t-1|t-1}\right)\\ &= \vH_{t-1|t-1}^\intercal\,\Theta_t + \Theta_{t-1|t-1}^\intercal \vH_t - \vH_{t-1|t-1}^\intercal\Theta_{t-1|t-1}. \end{aligned} $$

Here, both $\vH_t \in \reals^{m\times 1}$ and $\Theta_t \in \reals^{m\times 1}$ are column vectors.
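
The linearisation in $({\rm Y.1})$ can also be checked (or automated) with JAX: the gradients of the bilinear measurement $h(\vH, \Theta) = \vH^\intercal\,\Theta$, evaluated at the previous estimates, recover the two coefficient blocks used above. The values below are illustrative.

import jax
import jax.numpy as jnp

def h(H_vec, theta):
    # bilinear measurement function, h(H, theta) = H^T theta
    return H_vec @ theta

m = 3
H_prev = jnp.arange(1.0, m + 1.0)        # stands in for H(t-1|t-1)
theta_prev = jnp.ones(m) * 0.5           # stands in for Theta(t-1|t-1)

dh_dH = jax.grad(h, argnums=0)(H_prev, theta_prev)       # equals theta_prev
dh_dtheta = jax.grad(h, argnums=1)(H_prev, theta_prev)   # equals H_prev
print(jnp.allclose(dh_dH, theta_prev), jnp.allclose(dh_dtheta, H_prev))   # True True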

The expanded SSM in KF-compatible form

With the assumptions in $({\rm MA.2})$ and the linearised measurement $({\rm Y.1})$, the new linearised SSM takes the form $$ \tag{MA.3} \boxed{ \begin{aligned} \vH_t &= \vH_{t-1} + J_t,\\ \Theta_t &= \vF_{\rm MA}\,\Theta_{t-1} + {\cal T}_m\,E_{t-1},\\ \bar{Y}_t &= \vH_{t-1|t-1}^\intercal\,\Theta_t + \Theta_{t-1|t-1}^\intercal \vH_t - \vH_{t-1|t-1}^\intercal\Theta_{t-1|t-1}. \end{aligned} } $$

Next, to perform Kalman filtering, we rewrite $({\rm MA.3})$ in the form $({\rm SSM.1})$. Doing so will allow us to make use of the kalman_filter_step function defined above.

We begin by introducing an expanded latent variable, an expanded transition matrix, and an expanded projection matrix: $$ \begin{aligned} \vW_t = \begin{bmatrix} \vH_t \\ \Theta_t \end{bmatrix}, & & \vA = \begin{bmatrix} \vI_m & \vzero \\ \vzero & \vF_{\rm MA} \end{bmatrix}, & & \vC_t = \begin{bmatrix} \Theta_{t-1|t-1}^\intercal & \vH_{t-1|t-1}^\intercal \end{bmatrix}. \end{aligned} $$ Then, we define the dynamics noise $\vB~U_t$ with $$ \begin{aligned} \vB = \begin{bmatrix} \vI_m & \vzero \\ \vzero & {\cal T}_m \end{bmatrix}, & & U_t = \begin{bmatrix} J_t \\ E_t \end{bmatrix}. \end{aligned} $$ Next, the SSM $({\rm MA.3})$ is written in canonical SSM form as $$ \tag{MA.4} \boxed{ \begin{aligned} \vW_t &= \vA\,\vW_{t-1} + \vB~U_t\\ \bar{Y}_t &= \vC_t\,\vW_t - \vH_{t-1|t-1}^\intercal\Theta_{t-1|t-1}. \end{aligned} } $$

We see that $({\rm MA.4})$ resembles $({\rm SSM.1})$ up to a matrix $\vB$ that multiplies the dynamics noise $U_t$ and a bias term in the measurements that takes the form $\vH_{t-1|t-1}^\intercal\Theta_{t-1|t-1}$. We also require the dynamics covariance matrix $Q_t$ and the innovations ${\cal E}_t$. For the former, we have $$ \begin{aligned} \var(\vB\,U_t) &= \vB\,\var(U_t)\,\vB^\intercal\\ &= \vB\, \begin{bmatrix} O_t & \vzero\\ \vzero & R_t \end{bmatrix} \,\vB^\intercal\\ &= \begin{bmatrix} \vI_m & \vzero \\ \vzero & {\cal T}_m \end{bmatrix} \begin{bmatrix} O_t & \vzero\\ \vzero & R_t \end{bmatrix} \begin{bmatrix} \vI_m & \vzero \\ \vzero & {\cal T}_m^\intercal \end{bmatrix} \\ &= \begin{bmatrix} O_t & \vzero \\ \vzero & {\cal T}_m\,R_t\,{\cal T}_m^\intercal \end{bmatrix}\\ &= Q_t. \end{aligned} $$
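
The block-diagonal identity above is easy to verify numerically; the sketch below reuses the MA components F, ${\cal T}_m$ (T), and R from the configuration in Experiment 1, and O_coef is an arbitrary stand-in for $O_t$.

import jax.numpy as jnp
from jax.scipy.linalg import block_diag

m = 10
O_coef = jnp.eye(m) * 1e-4                        # stands in for O_t

B = block_diag(jnp.eye(m), T)                     # (2m, m + 1)
U_cov = block_diag(O_coef, R)                     # var(U_t)
Q_expanded = B @ U_cov @ B.T
Q_blocks = block_diag(O_coef, T @ R @ T.T)        # block-diagonal form from the derivation
print(jnp.allclose(Q_expanded, Q_blocks))         # True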

For the latter, we have $$ {\cal E}_t = y_t - \left(\vC_t\,\vW_{t|t-1} - \vH_{t-1|t-1}^\intercal\,\Theta_{t-1|t-1}\right). $$

Finally, setting up the code for the new SSM that we can use with the kalman_filter_step function becomes a straightforward (but tedious) coding exercise. See the definition of KFState and kf_step_inner in the notebook for details.
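
For concreteness, here is a minimal, self-contained sketch of what such a scan-compatible step for $({\rm MA.4})$ might look like; it is not the notebook's KFState / kf_step_inner, and the names are illustrative. The belief carries the stacked state $\vW_t = [\vH_t, \Theta_t]$ and its EVC; F, T (${\cal T}_m$), R, and O_coef are assumed to be defined as above.

import jax.numpy as jnp
from jax.scipy.linalg import block_diag

def expanded_kf_step(bel, y, F, T, R, O_coef):
    """One filtering step over the expanded SSM (MA.4) --- sketch only."""
    w, Sigma = bel
    m = len(O_coef)
    H_prev, theta_prev = w[:m], w[m:]       # previous estimates H(t-1|t-1), Theta(t-1|t-1)

    # Expanded transition matrix and dynamics covariance
    A = block_diag(jnp.eye(m), F)
    B = block_diag(jnp.eye(m), T)
    Q = B @ block_diag(O_coef, R) @ B.T

    # Predict step
    w_pred = A @ w
    Sigma_pred = A @ Sigma @ A.T + Q

    # Linearised measurement: coefficient row C_t and bias term from (MA.4)
    C = jnp.concatenate([theta_prev, H_prev])[None, :]   # (1, 2m)
    bias = H_prev @ theta_prev
    yhat = C @ w_pred - bias

    # Gain, innovation, and update
    S = C @ Sigma_pred @ C.T + R
    K = jnp.linalg.solve(S, C @ Sigma_pred).T
    err = y - yhat
    w_update = w_pred + K @ err
    Sigma_update = Sigma_pred - K @ S @ K.T

    out = {"H": w_update[:m], "yhat": yhat}
    return (w_update, Sigma_update), out

Running it only requires stacking an initial guess for $(\vH_{0|0}, \Theta_{0|0})$ into a single vector and passing the function through partial and jax.lax.scan, exactly as in Experiment 1.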

By filtering $({\rm MA.4})$, we are performing both forecasting of the time-series and learning of the MA parameters online, i.e., one datapoint at a time. Of course, one can expect worse performance using the learned $\vH_{t|t}$ than using the true matrix $\vH_{\rm MA}$. However, these forecasts should gradually improve over time.

Next, we analyse the performance of filtering the expanded SSM on the sampled dataset shown above.

Results

We analyse the result of filtering the expanded SSM with the sampled series $y_{1:T}$ shown in the previous experiment. We begin by plotting the evolution of the filtered (or learned) parameters $\vH_{t|t}$ over time. online-ma-params One of the most salient points from the figure above is that the estimated model parameters do not converge to the true model parameters $\vH_{\rm MA}$ which, in this experiment, are set to a vector of ones. This is because of the so-called observability condition for Kalman filtering,13 which states a condition that the SSM must satisfy for the true value of the state to be recoverable. As it turns out, our choice of SSM is non-observable. One easy check for this is to note that we obtain the same prediction at time $t$ if we replace $\vH_{t|t}$ with $\alpha\,\vH_{t|t}$ and $\Theta_{t|t}$ with $\Theta_{t|t}/\alpha$, for some $\alpha > 0$.

One-step-ahead forecast

Despite the non-observability of our SSM, we can still get meaningful predictions. The following plot shows the (best) one-step-ahead prediction using $({\rm MA.1})$, the (learned) one-step-ahead prediction using $({\rm MA.4})$, and the true one-step-ahead value.

ohe-forecast-comparison

As expected, at the beginning of the experiment, the one-step-ahead forecasts using the learned $\vH_{t|t}$ do not align with the forecasted values using the true $\vH_{\rm MA}$. However, they gradually start to resemble the best predictions. To validate this observation, the following plot shows the 100-step rolling root mean squared error between the one-step-ahead forecast and the true forecast.

one-rolling-rmse-forecast

We see that at the beginning of the experiment, the RMSE of the learned projection matrix $\vH_{t|t}$ is much higher than that of the optimal $\vH_{\rm MA}$. However, as we process more data, the RMSE of predictions using the learned projection matrix approaches that of the true projection matrix.

Multi-step forecast

Finally, we test the performance of filtering $({\rm MA.4})$ for multi-step forecasting. To evaluate the performance of the multi-step-ahead forecast, we compute the correlation between the $k$-step-ahead predictions and the observed values, i.e., $$ \rho_k = \frac {\sum_t (Y_{t+k|t} - \bar{Y}_{t+k})(y_{t+k} - \bar{y}_{t+k})} {\left(\sum_t (Y_{t+k|t} - \bar{Y}_{t+k})^2~ \sum_t (y_{t+k} - \bar{y}_{t+k})^2\right)^{1/2}} $$ for varying $k$. Here, $ \bar{Y}_{t+k} = \frac{1}{T-k}\sum_t Y_{t+k|t} $, and $ \bar{y}_{t+k} = \frac{1}{T-k}\sum_t y_{t+k} $. The sums run from $t=1$ to $T-k$.
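
One way to compute $\rho_k$ in code (illustrative names; y_forecasts[t] holds $Y_{t+k|t}$ and y_true[t] holds $y_{t+k}$ for a fixed $k$) is the following.

import jax.numpy as jnp

def forward_correlation(y_forecasts, y_true):
    """Correlation between k-step-ahead forecasts and realised values."""
    f_centred = y_forecasts - y_forecasts.mean()
    y_centred = y_true - y_true.mean()
    numerator = jnp.sum(f_centred * y_centred)
    denominator = jnp.sqrt(jnp.sum(f_centred ** 2) * jnp.sum(y_centred ** 2))
    return numerator / denominator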

Forward correlation comparison The figure above shows that the correlation coefficient $\rho_k$ obtained using the true $\vH_{\rm MA}$ is similar to that obtained using the learned $\vH_{t|t}$ matrix.


Appendix

Recall the following properties of an SSM and innovations

  • (P.1) $Y_{1:t} = \vL\,{\cal E}_{1:t}$ — measurement / innovation relationship,
  • (P.2) ${\cal E}_t = Y_t - \vH_t\,\Theta_{t|t-1}$ — innovation under an SSM,
  • (P.3) $\var(Y_{1:t}) = \vL\,{S}_{1:t}\,\vL^\intercal$ — Cholesky decomposition for variance of measurements,
  • (P.4) $\var({\cal E}_{1:t}) = S_{1:t} = {\rm diag}(S_1, \ldots, S_t)$ — variance of innovations, and
  • (P.5) $\cov(\Theta_t, {\cal E}_t) = \Sigma_{t|t-1}\,\vH_t^{\intercal}$.

Next, recall that the BLUP and EVC matrix take the form $$ \begin{aligned} \Theta_{t|k} &= \cov(\Theta_t, Y_{1:k})\,\var(Y_{1:k})^{-1}\,Y_{1:k}\\ \Sigma_{t|k} &= \var(\Theta_t - \Theta_{t|k}). \end{aligned} $$

Proof of Proposition 1

Proposition 1 is a special case of Proposition 2.2, F.02 and Proposition 2.3, F.02; however, we derive it here for completeness.

First, the predicted BLUP gets rewritten as: $$ \begin{aligned} \Theta_{t|t-1} &= \cov(\Theta_t, Y_{1:t-1})\,\var(Y_{1:t-1})^{-1}\,Y_{1:t-1}\\ &= \cov(\vF_{t-1}\,\Theta_{t-1} + U_{t-1}, Y_{1:t-1})\,\var(Y_{1:t-1})^{-1}\,Y_{1:t-1}\\ &= \vF_{t-1}\,\cov(\Theta_{t-1}, Y_{1:t-1})\,\var(Y_{1:t-1})^{-1}\,Y_{1:t-1}\\ &= \vF_{t-1}\,\Theta_{t-1|t-1}, \end{aligned} $$ where the third equality follows from condition (P.3) in Proposition 2.1, F.02.

Next, the predicted EVC takes the form $$ \begin{aligned} \Sigma_{t|t-1} &= \var(\Theta_t - \Theta_{t|t-1})\\ &= \var\left(\vF_{t-1}\,\Theta_{t-1} + U_{t-1} - \vF_{t-1}\,\Theta_{t-1|t-1}\right)\\ &= \var\left(\vF_{t-1}\,\Theta_{t-1} - \vF_{t-1}\,\Theta_{t-1|t-1}\right) + \var(U_{t-1})\\ &= \vF_{t-1}\,\var\left(\Theta_{t-1} - \Theta_{t-1|t-1}\right)\,\vF_{t-1}^\intercal + Q_{t-1}\\ &= \vF_{t-1}\,\Sigma_{t-1|t-1}\,\vF_{t-1}^\intercal + Q_{t-1}. \end{aligned} $$ $$ \ \tag*{$\blacksquare$} $$

Proof of Proposition 2

Here, we show Proposition 2.

First, recall the additive representation of the BLUP form for $\Theta_{t|t}$: $$ \begin{aligned} \Theta_{t|t} &=\cov(\Theta_t, Y_{1:t})\var(Y_{1:t})^{-1}\,Y_{1:t}\\ &=\cov(\Theta_t, \vL{\cal E}_{1:t})(\vL\,S_{1:t}\,\vL^\intercal)^{-1}\,\vL{\cal E}_{1:t}\\ &= \cov(\Theta_t, {\cal E}_{1:t})\vL^\intercal\,\vL^{-\intercal}S_{1:t}^{-1}\vL^{-1}\,\vL\,{\cal E}_{1:t}\\ &= \cov(\Theta_t, {\cal E}_{1:t})\,{\rm diag}(S_1^{-1}, \ldots, S_t^{-1})\,{\cal E}_{1:t}\\ &= \begin{bmatrix} \cov(\Theta_t, {\cal E}_1) & \ldots & \cov(\Theta_t, {\cal E}_t) \end{bmatrix} \begin{bmatrix} S_1^{-1}\,{\cal E}_1 \\ \vdots \\ S_t^{-1}{\cal E}_t \end{bmatrix}\\ &=\sum_{k=1}^t \cov(\Theta_t, {\cal E}_k)\,S_k^{-1}\,{\cal E}_k, \end{aligned} $$ where we have used Properties (P.1), (P.3), and (P.4) above.

Then, we decompose the BLUP into its predicted and updated forms $$ \begin{aligned} \Theta_{t|t} &= \sum_{k=1}^t \cov(\Theta_t, {\cal E}_k)\,S_k^{-1}\,{\cal E}_k\\ &= \underbrace{ \cov(\Theta_t, {\cal E}_t) }_{\Sigma_{t|t-1}\,\vH_t^\intercal} \,S_t^{-1}\,{\cal E}_t + \underbrace{ \sum_{k=1}^{t-1} \cov(\Theta_t, {\cal E}_k)\,S_k^{-1}\,{\cal E}_k }_{\Theta_{t|t-1}}\\ &= \Sigma_{t|t-1}\,\vH_t^\intercal\,S_t^{-1}{\cal E}_t + \Theta_{t|t-1}\\ &=\Theta_{t|t-1} + \vK_t\,{\cal E}_t\\ &= \vK_t\,(Y_t - \vH_t\,\Theta_{t|t-1}) + \Theta_{t|t-1}\\ &= \vK_t\,Y_t + (\vI - \vK_t\,\vH_t)\Theta_{t|t-1}, \end{aligned} $$ where the second equality follows from property (P.5), the fifth equality follows from property (P.2), $\vK_t = \Sigma_{t|t-1}\vH_t^\intercal S_t^{-1}$, and $S_t = \vH_t\,\Sigma_{t|t-1}\,\vH_t^\intercal + R_t$ was derived in Proposition 2.5, F.02. $$ \ \tag*{$\blacksquare$} $$

Proof of Proposition 3

Here, we show Proposition 3.

Recall $\Sigma_{t|t-1} = \var(\Theta_t - \Theta_{t|t-1})$ and $\Theta_{t|t} = \Theta_{t|t-1} + \cov(\Theta_t, {\cal E}_t)\,S_t^{-1}\,{\cal E}_t$. Then, $$ \Sigma_{t|t} = \var(\Theta_t - \Theta_{t|t}) = \var\left(\Theta_t - \Theta_{t|t-1} - \cov(\Theta_t, {\cal E}_t)S_t^{-1}\,{\cal E}_t\right). $$

Form 1 (also known as the Joseph form 14) follows from Property (P.3) and rewriting the expression above in terms of $\Sigma_{t|t-1}$ and $R_t$: $$ \begin{aligned} \Sigma_{t|t} &= \var(\Theta_t - \Theta_{t|t-1} - \cov(\Theta_t, {\cal E}_t)S_t^{-1}\,{\cal E}_t)\\ &= \var((\Theta_t - \Theta_{t|t-1}) - \vK_t\,{\cal E}_t)\\ &= \var((\Theta_t - \Theta_{t|t-1}) - \vK_t\,(\vH_t[\Theta_t - \Theta_{t|t-1}] + E_t))\\ &= \var((\Theta_t - \Theta_{t|t-1}) - \vK_t\,\vH_t[\Theta_t - \Theta_{t|t-1}] - \vK_t\,E_t)\\ &= \var\Big([\vI - \vK_t\,\vH_t](\Theta_t - \Theta_{t|t-1}) - \vK_t\,E_t\Big)\\ &= [\vI - \vK_t\,\vH_t]\var(\Theta_t - \Theta_{t|t-1})[\vI - \vK_t\,\vH_t]^\intercal + \vK_t\var(E_t)\vK_t^\intercal\\ &= [\vI - \vK_t\,\vH_t]\Sigma_{t|t-1}[\vI - \vK_t\,\vH_t]^\intercal + \vK_t R_t\vK_t^\intercal, \end{aligned} $$ where the sixth line follows from Property (P.2) in F.02.

Alternatively, form 2 follows from Property (P.5) and rewriting the EVC in terms of $\Sigma_{t|t-1}$ and $S_t$: $$ \begin{aligned} \Sigma_{t|t} &= \var(\Theta_t - \Theta_{t|t-1} - \cov(\Theta_t, {\cal E}_t)\,S_t^{-1}\,{\cal E}_t)\\ &= \var(\Theta_t - \Theta_{t|t-1}) - \var(\cov(\Theta_t, {\cal E}_t)\,S_t^{-1}\,{\cal E}_t)\\ &= \var(\Theta_t - \Theta_{t|t-1}) - \cov(\Theta_t, {\cal E}_t)\,S_t^{-1}\,\,\var({\cal E}_t)\,S_t^{-1}\,\cov({\cal E}_t, \Theta_t)\\ &= \Sigma_{t|t-1} - \Sigma_{t|t-1}\,\vH_t^\intercal\,S_t^{-1}\,\vH_t\,\Sigma_{t|t-1}. \end{aligned} $$

Finally, form 3 is a direct consequence of form 2. From the expression above, we have $$ \begin{aligned} \Sigma_{t|t} &= \var(\Theta_t - \Theta_{t|t-1}) - \cov(\Theta_t, {\cal E}_t)\,S_t^{-1}\,\,\var({\cal E}_t)\,S_t^{-1}\,\cov({\cal E}_t, \Theta_t)\\ &= \var(\Theta_t - \Theta_{t|t-1}) - \vK_t\,\var({\cal E}_t)\,\vK_t^\intercal\\ &= \Sigma_{t|t-1} - \vK_t\,S_t\,\vK_t^\intercal. \end{aligned} $$ $$ \ \tag*{$\blacksquare$} $$


  1. See $({\rm BLUP.1})$ and $({\rm EVC.2})$ in F.02 ↩︎

  2. In general, the noise term $E_t$ does not need to be Gaussian, but this assumption is useful for sampling and optimal filtering. ↩︎

  3. This illustrates that the class of AR/MA/ARIMA and related models have quite a big inductive bias embedded in their assumptions, which might explain why, when well-calibrated, they can match or outperform their (non-linear) neural-network counterparts in the single-dataset setting; the latter, in contrast, have to find a good feature transformation for the evolution of the state using either limited or noisy information, see e.g., https://cbergmeir.com/talks/neurips2024/ ↩︎

  4. https://en.wikipedia.org/wiki/Maximum_likelihood_estimation ↩︎

  5. McLeod, A. Ian, and Ying Zhang. “Faster ARMA maximum likelihood estimation.” Computational Statistics & Data Analysis 52.4 (2008): 2166-2176. ↩︎

  6. Duran-Martin, G., et al. “A unifying framework for generalised Bayesian online learning in non-stationary environments.” Transactions on Machine Learning Research (2025). ↩︎

  7. Anava, Oren, et al. “Online learning for time series prediction.” Conference on learning theory. PMLR, 2013. ↩︎

  8. Liu, C., Hoi, S. C., Zhao, P., & Sun, J. (2016). Online ARIMA Algorithms for Time Series Prediction. Proceedings of the AAAI Conference on Artificial Intelligence, 30(1). ↩︎

  9. The name extended is not due to the extended SSM $(\vH_t, \Theta_t)$ we work with; rather, it refers to extending the KF to work with non-linear state-space models. ↩︎

  10. Benesty, Jacob, Constantin Paleologu, and Silviu Ciochină. “On the identification of bilinear forms with the Wiener filter.” IEEE Signal Processing Letters 24.5 (2017): 653-657. ↩︎

  11. Harville, David A. “Matrix algebra from a statistician’s perspective.” (1998): Chapter 14. ↩︎

  12. Battin, Richard H. “Space guidance evolution-a personal narrative.” Journal of Guidance, Control, and Dynamics 5.2 (1982): 97-110. ↩︎

  13. Southall, B., B. Buxton, and J. Marchant. “Controllability and observability: Tools for Kalman filter design.” British Machine Vision Conference. Vol. 98. 1998. ↩︎

  14. Zanetti, Renato, and Kyle J. DeMars. “Joseph formulation of unscented and quadrature filters with application to consider states.” Journal of Guidance, Control, and Dynamics 36.6 (2013): 1860-1864. ↩︎