Filtering notes (II): state-space models

Posted on Mar 6, 2025
tl;dr: Introduction to linear state-space models: basic properties, measures of uncertainty, and the ARMA model.

Part II of the filtering notes series.

Introduction

Whether modelling the trajectory of a rocket, forecasting economic trends, or powering large language models, discrete linear state-space models (SSMs) provide a unifying framework for estimation, prediction, and learning.

At their core, a linear SSM models observations as projections of an unknown, time-dependent latent process.1 A classic example is object tracking, where noisy measurements—like the angle between a rocket and two stars—are used to reconstruct the rocket’s trajectory.2 A simpler toy example of an SSM is presented in the figure below, where we show the trajectory of a 2D particle with noisy observations: here, the solid line represents the true (but unknown) path, while the dots denote the noisy measurements.

Thanks to their mathematical tractability and computational efficiency, linear SSMs extend far beyond tracking. They appear in time-series forecasting, online learning, and even in large language models.

In this post, we explore the mathematical foundations that make SSMs tractable and practical. We outline the assumptions needed for an SSM to be well-specified and highlight a key result: error uncertainty in an SSM is fully characterised by its components and does not require access to measurements.

The structure

This post is structured as follows:

  1. We define the linear state-space model (SSM), outline assumptions for a well-specified SSM, and illustrate this with a simple two-dimensional example.

  2. We derive the best linear unbiased predictor (BLUP) of the signal at time t, given measurements up to time j, covering filtering, prediction, and smoothing.

  3. We explore key SSM properties: we examine forecasting, derive innovations (residuals) and their variance, and present covariance formulas crucial for closed-form computation of the BLUP.

  4. We derive the error variance covariance (EVC) matrix for one-step-ahead prediction and filtering, showing that uncertainty in an SSM can be computed without observed measurements, highlighting the known result that classical “optimal filters” are not adaptive.

  5. We review autoregressive (AR) and moving average (MA) models, showing their SSM form, and extend this to ARMA models.

  6. Finally, we conclude with a recap and discussion.


The state-space model

Consider a $d$-dimensional signal-plus-noise (SPN) measurement process, i.e., $Y_t = F_t + E_t$, where $Y_t$ is the measurement, $F_t$ is the signal, and $E_t$ is the noise.

In an SSM, we decompose the signal $F_t$ as the product of a $d\times m$ projection matrix $\vH_t$ times a latent variable $\Theta_t \in {\mathbb R}^m$. Using this decomposition, the SPN process is written as $$ \tag{M.1} \begin{aligned} F_t &= \vH_t\,\Theta_t,\\ Y_t &= F_t + E_t. \end{aligned} $$ Furthermore, the latent variable $\Theta_t$ is taken to be time-dependent and evolving according to the equation $$ \Theta_t = \vF_{t-1}\,\Theta_{t-1} + U_{t-1}, $$ where $\vF_t$ is an $m\times m$ transition matrix, and $U_t$ is a zero-mean random variable with $\var(U_t) = Q_t$. Finally, the SPN process under the SSM assumption at time $t$ takes the form $$ \tag{SSM.1} \boxed{ \begin{aligned} \Theta_t &= \vF_{t-1}\,\Theta_{t-1} + U_{t-1},\\ Y_t &= \vH_t\,\Theta_t + E_t. \end{aligned} } $$

The SSM evolves over time as follows: $$ \begin{array}{c|ccc} \text{time} & \text{latent} & \text{signal} & \text{measurement}\\ \hline 0 & \Theta_0 & - & - \\ 1 & \Theta_1 = \vF_0\,\Theta_0 + U_0 & F_1 = \vH_1\,\Theta_1 & Y_1 = F_1 + E_1\\ 2 & \Theta_2 = \vF_1\,\Theta_1 + U_1 & F_2 = \vH_2\,\Theta_2 & Y_2 = F_2 + E_2\\ \vdots & \vdots & \vdots & \vdots \\ T & \Theta_T = \vF_{T-1}\,\Theta_{T-1} + U_{T-1} & F_T = \vH_T\,\Theta_T & Y_T = F_T + E_T\\ \end{array} $$

The model $(\text{SSM.1})$

  1. decouples the dimension of the latent state $\Theta_t \in \mathbb{R}^m$ from that of the observed process $Y_t \in \mathbb{R}^d$, and
  2. imposes a time-dependent structure over the latent state $\Theta_t$.

A 2d-tracking example

Consider the two-dimensional linear dynamical system $$ \begin{aligned} &\vF_t = \begin{bmatrix} 1 - dt + \frac{dt^2}{2} & -dt + dt^2 \\ 6\,dt - 6\,dt^2 & 1 - dt + \frac{dt^2}{2} \end{bmatrix}, &Q_t = 0.1^2\,\vI_2,\\ &\vH_t = \vI_2, &R_t = 0.05^2\,\vI_2, \end{aligned} $$ with $dt = 0.005$, $\Theta_0 \sim {\cal U}[0.5, 1.2]^2$, and $\vI_2$ an identity matrix of size $2$.

A sample of the latent process $\Theta_{1:T}$ and measurements $Y_{1:T}$ is shown in the figure above. There, the dots represent the sampled measurements $y_{1:T}$ and the solid line represents the (unknown) sampled latent process $\theta_{1:T}$.
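
To make the example concrete, the following is a minimal JAX sketch (not the code used to produce the figure) that simulates this system. It assumes Gaussian noise for $U_t$ and $E_t$, since the post only specifies their means and variances; names such as `step` and the horizon `T` are illustrative.

import jax
import jax.numpy as jnp

dt = 0.005
F = jnp.array([[1 - dt + dt**2 / 2, -dt + dt**2],
               [6 * dt - 6 * dt**2, 1 - dt + dt**2 / 2]])
H = jnp.eye(2)
Q = 0.1 ** 2 * jnp.eye(2)   # var(U_t)
R = 0.05 ** 2 * jnp.eye(2)  # var(E_t)

def step(theta_prev, key):
    key_u, key_e = jax.random.split(key)
    u = jax.random.multivariate_normal(key_u, jnp.zeros(2), Q)
    e = jax.random.multivariate_normal(key_e, jnp.zeros(2), R)
    theta = F @ theta_prev + u   # latent transition
    y = H @ theta + e            # noisy measurement
    return theta, (theta, y)

key_init, key_steps = jax.random.split(jax.random.PRNGKey(0))
theta0 = jax.random.uniform(key_init, (2,), minval=0.5, maxval=1.2)
T = 1_000
_, (thetas, ys) = jax.lax.scan(step, theta0, jax.random.split(key_steps, T))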


The best linear unbiased predictor: from signals to latent variables

Let $F_t \in \reals^d$ be the signal process and let $ Y_{1:j} \in \reals^{(j\,d)\times 1} $ be a block vector of measurements. Our goal is to estimate the block matrix $ \vA\in \reals^{d\times(j\,d)} $ that minimises the expected L2 distance between the signal $F_t$ and the linear combination $\vA\,Y_{1:j}$ of the measurements. Following F.01, Proposition 1, this matrix is given by

$$ \vA^* = \argmin_{\vA} \mathbb{E}[\|F_t - \vA\,Y_{1:j}\|_2^2] = \cov(F_t, Y_{1:j})\,\var(Y_{1:j})^{-1}, $$ where $\cov(F_t, Y_{1:j}) \in \reals^{d \times (j\,d)}$ and $\var(Y_{1:j})^{-1} \in \reals^{(j\,d)\times (j\,d)}$.

Having found $\vA^*$, the estimate $$ F_{t|j} = \vA^*\,Y_{1:j} = \cov(F_t, Y_{1:j})\,\var(Y_{1:j})^{-1}\,Y_{1:j} $$ is called the best linear unbiased predictor (BLUP) of the signal $F_t$ given the measurements $Y_{1:j}$. The name of this estimate depends on the relationship between $t$ and $j$:

  • filtering: when $t = j$,
  • prediction: when $t > j$, and
  • smoothing: when $t \leq j = T$.

Because $F_t = \vH_t\,\Theta_t$, it follows that $$ \begin{aligned} \vA^* &= \cov(F_t, Y_{1:j})\,\var(Y_{1:j})^{-1}\\ &= \cov(\vH_t\,\Theta_t, Y_{1:j})\,\var(Y_{1:j})^{-1}\\ &= \vH_t\,\cov(\Theta_t, Y_{1:j})\,\var(Y_{1:j})^{-1}. \end{aligned} $$ Conveniently, the term $\cov(\Theta_t, Y_{1:j})\,\var(Y_{1:j})^{-1}$ corresponds to the $m\times (j\,d)$ matrix $\vB^*$ such that $$ \vB^* = \argmin_{\vB} \mathbb{E}\left[\|\Theta_t - \vB\,Y_{1:j}\|_2^2\right] = \cov(\Theta_t, Y_{1:j})\,\var(Y_{1:j})^{-1}. $$ As a consequence, the BLUP of the latent variable $\Theta_t$ given measurements $Y_{1:j}$ is $$ \tag{BLUP.1} \boxed{ \Theta_{t|j} = \vB^*\,Y_{1:j} = \cov(\Theta_t, Y_{1:j})\,\var(Y_{1:j})^{-1}\,Y_{1:j}. } $$

This result shows that the BLUP for the latent state is linked to the BLUP for the signal. Therefore, we will focus on deriving BLUPs for the latent state, from which the BLUP for the signal can be easily obtained. This relationship is highlighted in the remark below.

🔎 Tip
In a linear SSM, the BLUP of the signal and the BLUP of the latent variable are equivalent up to a matrix that projects from latent space to signal space. That is $F_{t|j} = \vH_t\,\Theta_{t|j}.$

Building on the previous post, our next step—having derived $({\rm BLUP.1})$—is to express the BLUP in terms of innovations rather than measurements. As the following proposition shows, expressing the BLUP in terms of innovations results in a sum of $j$ terms, each requiring the inversion of a $(d \times d)$ matrix. This approach is notably more efficient than inverting a $(j\,d) \times (j\,d)$ matrix at every time frame $j$ of interest.

Proposition 1: BLUP of the latent state and innovations

The best linear unbiased predictor (BLUP) of the latent variable $\Theta_t$, given innovations ${\cal E}_{1:j}$, is $$ \tag{BLUP.2} \boxed{ \begin{aligned} \Theta_{t|j} &= \sum_{k=1}^j \vK_{t,k}\, {\cal E}_k,\\ \vK_{t,j} &= \cov(\Theta_t, {\cal E}_j)\,\var({\cal E}_j)^{-1},\\ {\cal E}_j &= Y_j - \sum_{\ell=1}^{j-1} \cov(Y_j, {\cal E}_\ell)\,\var({\cal E}_\ell)^{-1}\,{\cal E}_\ell. \end{aligned} } $$ Furthermore, the error variance-covariance (EVC) matrix between the latent variable and its BLUP takes the form $$ \tag{EVC.2} \Sigma_{t|j} := \var(\Theta_t - \Theta_{t|j}) = \var(\Theta_t) - \sum_{k=1}^j \vK_{t,k}\,\var({\cal E}_k)\,\vK_{t,k}^\intercal. $$ Here, $\cov(\Theta_t, {\cal E}_k) \in \reals^{m\times d}$, $\var({\cal E}_k)\in \reals^{d\times d}$, and $\Sigma_{t|j} \in \reals^{m\times m}$.

See the Appendix for a proof.

As we see, the main quantities to compute the BLUP are

  1. the innovations ${\cal E}_\ell$,
  2. the variance of the innovations $\var({\cal E}_k)$, and
  3. the covariance between the latent states and innovations $\cov(\Theta_t, {\cal E}_k)$.

In the rest of this post, we show that each of these terms can be computed in closed form.


Analytical properties of state-space models

In this section, we derive the terms needed to estimate the BLUP and the EVC. These building blocks will later allow us to derive filtering methods.

We begin by writing down the assumptions for an SSM.

Assumptions for an SSM

As we will see, the computation of the BLUP for an SSM is relatively straightforward thanks to its mathematical properties. However, for an SSM to be well-specified, we require the following list of assumptions about the data-generating process:

  • (A.1) $\var(\Theta_0) = \Sigma_{0|0}$ — known and positive definite,
  • (A.2) $\var(U_t) = Q_t$ — known and positive semi-definite,
  • (A.3) $\var(E_t) = R_t$ — known and positive definite,
  • (A.4) $\cov(E_t, E_s) = 0$ for $t\neq s$,
  • (A.5) $\cov(U_t, U_s) = 0$ for $t \neq s$,
  • (A.6) $\cov(E_t, U_s) = 0$ for all $t,s$,
  • (A.7) $\cov(E_t, \Theta_0) = 0$,
  • (A.8) $\cov(U_t, \Theta_0) = 0$,
  • (A.9) $\vH_t$ — known $d\times m$ projection matrix, and
  • (A.10) $\vF_t$ — known $m\times m$ transition matrix.

Conditions (A.2), (A.3), and (A.6)–(A.10) hold for $t \in {\cal T}$.

Some important assumptions to note are:

  • (A.2) and (A.3) — the SSM is a heteroscedastic process with known time-varying variances.3
  • (A.4), (A.5), and (A.6) — the observation noise and the dynamics noise are uncorrelated through time and with each other.
  • (A.7) and (A.8) — the initial condition is uncorrelated with the noise terms.

As a consequence of assumptions (A.1) - (A.10), we arrive at the following basic properties of an SSM.

Proposition 2.1 — basic properties of SSMs

  • (P.1) $\cov({\cal E}_t, {\cal E}_s) = 0$ for all $t \neq s$,
  • (P.2) $\cov(E_t, \Theta_s) = 0$ for all $(t,s) \in {\cal T}^2$.
  • (P.3) $\cov(\Theta_s, U_t) = \cov(Y_s, U_t) = \cov({\cal E}_s, U_t) = 0$ for $s \leq t$.
  • (P.4) $\cov({\cal E}_s, E_t) = 0$ for $s < t$.

For a proof, see the Appendix.

Proposition 2.1 offers the following intuitive insights:

  • (P.1) — innovations are uncorrelated.
  • (P.2) — the latent variable and the observation noise are uncorrelated, meaning that errors in measurement space do not affect the path of the latent process,
  • (P.3) — latent variables, measurements, and innovations do not affect current and future noise dynamics, and
  • (P.4) — innovations do not affect future observation noise terms.

Proposition 2.1 provides the foundation for four key results: the estimation and prediction of the BLUP and the EVC, the derivation of the form of the innovations in an SSM, the analytical expression for the variance of the innovations, and the covariance between the latent state and the innovations.

We outline these results below.

Proposition 2.2 — forecasting the filtered BLUP

For $\ell \geq 1$, the BLUP of $\Theta_{t + \ell}$ given $Y_{1:t}$ is $$ \Theta_{t+\ell|t} = \vF_{t+\ell-1}\,\vF_{t+\ell-2}\,\ldots\,\vF_{t}\,\Theta_{t|t} $$

See the appendix for a proof.

🔎 Tip
Forecasting the BLUP in an SSM amounts to pre-multiplying the filtered BLUP by the product of the future (and known) transition matrices.

Proposition 2.3 – error variance covariance of the forecasted filtered BLUP

For $k \geq 1$, the EVC of $\Theta_{t + k}$ given the forecasted BLUP $\Theta_{t+k | t}$ is $$ \var\left(\Theta_{t+k} - \Theta_{t+k|t}\right) = \left[\prod_{\ell=1}^k \vF_{t+k-\ell}\right]\,\Sigma_{t|t}\,\left[\prod_{\ell=1}^k \vF_{t+k-\ell}\right]^\intercal + \sum_{\ell=1}^{k}\,\left[\prod_{j=1}^{\ell-1}\vF_{t+k-j}\right]\,Q_{t+k-\ell}\,\left[\prod_{j=1}^{\ell-1}\vF_{t+k-j}\right]^\intercal, $$ where we use the convention $\prod_{j=1}^0 \vA_{j} = \vI$, for some matrix $\vA_j$.

See the Appendix for a proof.
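
As a quick illustration (not from the post), the two propositions above translate into a short recursion: start from the filtered pair $(\Theta_{t|t}, \Sigma_{t|t})$ and fold in the known future transition matrices and dynamics covariances. The function and argument names below are illustrative; `Fs[i]` and `Qs[i]` play the role of $\vF_{t+i}$ and $Q_{t+i}$.

def forecast(theta_filt, Sigma_filt, Fs, Qs):
    # Fold over F_{t}, ..., F_{t+k-1} and Q_{t}, ..., Q_{t+k-1}
    theta, Sigma = theta_filt, Sigma_filt
    for F, Q in zip(Fs, Qs):
        theta = F @ theta             # Proposition 2.2: forecasted BLUP
        Sigma = F @ Sigma @ F.T + Q   # Proposition 2.3, written recursively
    return theta, Sigma

Unrolling the loop recovers the closed-form products and sums stated in Proposition 2.3.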

Proposition 2.4 — innovations under an SSM

The innovation process ${\cal E}_{1:T}$ in an SSM takes the form $$ \begin{aligned} {\cal E}_t &= Y_t - \hat{Y}_{t|t-1}\\ &= \vH_t\,(\Theta_t - \Theta_{t|t-1}) + E_t. \end{aligned} $$ with $\hat{Y}_{t|t-1} = \vH_t\,\Theta_{t|t-1}$.

See the Appendix for a proof.

🔎 Tip
In an SSM, the innovation is the one-step-ahead prediction error (also known as the residual).

Proposition 2.5 — variance of innovation

Let $\Sigma_{1|0} = \var(\Theta_1)$. Then, for $t \in {\cal T}$, $$ S_t = \var({\cal E}_t) = \vH_t\,\Sigma_{t|t-1}\,\vH_t^\intercal + R_t. $$

See the Appendix for a proof.

Theorem 2.6 — the fundamental covariance structure

Let $\Sigma_{1|0} = \var(\Theta_1)$, then $$ \cov(\Theta_t, {\cal E}_j) = \begin{cases} {\color{orange} \overleftarrow\vF_{t-1:j}}\,\Sigma_{j | j-1}\vH_j^\intercal & j \leq t-1,\\ \Sigma_{t|t-1}\,\vH_t^\intercal & j = t,\\ \Sigma_{t|t-1}\,{\color{#00B7EB} \overrightarrow\vM_{t:j-1}^\intercal}\,\vH_j^\intercal & j \geq t+1. \end{cases} $$ Here, $$ \begin{aligned} {\color{orange}\overleftarrow{\vF}_{t-1:j}} &= \prod_{\ell=1}^{t-j} \vF_{t-\ell}, & \text{ for } j \leq t-1,\\ {\color{#00B7EB} \overrightarrow{\vM}_{t:j-1}^\intercal} &= \prod_{\ell=0}^{j-t-1} \vM_{t+\ell}^\intercal, & \text{ for } j \geq t+1, \end{aligned} $$ with $$ \vM_t = \vF_t\,(\vI - \Sigma_{t|t-1}\,\vH_t^\intercal\,S_t^{-1}\,\vH_t). $$

See the Appendix for a proof.


Quantifying uncertainty: the error-variance covariance matrix

The measure of uncertainty in an SSM is represented through the error variance-covariance (EVC) matrix. Two of the most common measures of uncertainty are

  1. the predicted EVC: the error variance between $\Theta_t$ and the BLUP $\Theta_{t|t-1}$, i.e., $\Sigma_{t|t-1} = \var(\Theta_t - \Theta_{t|t-1})$,
  2. the updated EVC: the error variance between $\Theta_t$ and the BLUP $\Theta_{t|t}$, i.e., $\Sigma_{t|t} = \var(\Theta_t - \Theta_{t|t})$.

The first option quantifies the uncertainty about the value of the next state, while the second option quantifies the uncertainty about the value of the current state. As we will see in a later post, these two quantities are important to derive the so-called Kalman filter.

The following three propositions provide an analytic form for the predicted and updated EVC.

Proposition 3.1 — initial EVC

Let $\Sigma_{1|0} = \var(\Theta_1)$. Then, $$ \Sigma_{1|0} = \vF_0\,\Sigma_{0|0}\,\vF_0^\intercal + Q_0. $$

See the Appendix for a proof.

Proposition 3.2 — predicted EVC

Let $\Sigma_{1|0} = \var(\Theta_1)$. Then, for $t \in {\cal T}$, $$ \Sigma_{t|t-1} = \vF_{t-1}\,\Sigma_{t-1|t-1}\,\vF_{t-1}^\intercal + Q_{t-1}. $$ See the Appendix for a proof.

Proposition 3.3 — updated EVC

Let $\Sigma_{1|0} = \var(\Theta_1)$. Then, for $t \in {\cal T}$, $$ \Sigma_{t|t} = \Sigma_{t|t-1} - \vK_t\,S_t\,\vK_t^\intercal, $$ with $\vK_t = \Sigma_{t|t-1}\,\vH_t^\intercal\,S_t^{-1}$.

Alternatively, $$ \Sigma_{t|t} = \Sigma_{t|t-1} - \Sigma_{t|t-1}\,\vH_t^\intercal\, S_t^{-1}\,\vH_t\,\Sigma_{t|t-1} $$

See the Appendix for a proof.

From a Bayesian filtering perspective, the error variance-covariance matrix $\Sigma_{t|t}$ is known as the posterior covariance, which, congruent with the intuition developed in this section, represents the uncertainty about the value of the latent variable $\Theta_t$—it measures how far off we can expect our estimate for the latent state to be.

Pseudocode for the error covariance matrix

Following Propositions 3.1, 3.2, and 3.3, the predicted and filtered EVC can be computed sequentially. We show this in the pseudocode below: $$ \boxed{ \begin{array}{l|l} \Sigma_{1|0} = \vF_0\,\Sigma_{0|0}\,\vF_0^\intercal + Q_0 & \var(\Theta_1)\\ S_1 = \vH_1\,\Sigma_{1|0}\,\vH_1^\intercal + R_1 & \var({\cal E}_1)\\ \Sigma_{1|1} = \Sigma_{1|0} - \Sigma_{1|0}\,\vH_1^\intercal\,S_1^{-1}\,\vH_1\, \Sigma_{1|0} & \var(\Theta_1 - \Theta_{1|1})\\ \text{for } t=2, \ldots, T:\\ \quad \Sigma_{t|t-1} = \vF_{t-1}\,\Sigma_{t-1|t-1}\,\vF_{t-1}^\intercal + Q_{t-1} & \var(\Theta_t - \Theta_{t|t-1})\\ \quad S_t = \vH_t\,\Sigma_{t|t-1}\,\vH_t^\intercal + R_t & \var({\cal E}_t)\\ \quad \Sigma_{t|t} = \Sigma_{t|t-1} - \Sigma_{t|t-1}\,\vH_t^\intercal\,S_t^{-1}\,\vH_t\,\Sigma_{t|t-1} & \var(\Theta_t - \Theta_{t|t}) \end{array} } $$
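
For concreteness, here is a minimal JAX sketch of the recursion above, assuming time-invariant components `F`, `H`, `Q`, `R` for brevity (the time-varying case only requires indexing these by $t$); the function name and arguments are illustrative, not part of the post.

import jax
import jax.numpy as jnp

def evc_recursion(F, H, Q, R, Sigma00, T):
    # One iteration of the pseudocode: predict, innovation variance, update.
    def step(Sigma_filt, _):
        Sigma_pred = F @ Sigma_filt @ F.T + Q        # predicted EVC
        S = H @ Sigma_pred @ H.T + R                 # innovation variance
        K = Sigma_pred @ H.T @ jnp.linalg.inv(S)     # gain
        Sigma_filt_next = Sigma_pred - K @ S @ K.T   # updated EVC
        return Sigma_filt_next, (Sigma_pred, S, Sigma_filt_next)

    _, (Sigma_pred_seq, S_seq, Sigma_filt_seq) = jax.lax.scan(
        step, Sigma00, xs=None, length=T
    )
    return Sigma_pred_seq, S_seq, Sigma_filt_seq

Note that no measurements enter the computation, in line with the tip below.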

🔎 Tip
Uncertainty estimation in a linear SSM does not depend on the seen data and is fully characterised by the known building blocks of the SSM, namely $\vF_t$, $\vH_t$, $Q_t$, and $R_t$. The uncertainty is considered misspecified if the EVC is estimated using components that differ from the true model, for example, if the EVC is modelled using $\bar\vF_t \neq \vF_t$.

The autoregressive moving average model (ARMA) as a state-space model

We conclude this post by presenting a classical example of time-series forecasting: the unidimensional ARMA model in state-space form.4 As we will see, these models can be written in the form $$ \begin{aligned} \tag{ARMA.0} \Theta_t &= \vF\,\Theta_{t-1} + \vT\,E_{t-1},\\ Y_t &= \vH\,\Theta_t + E_t, \end{aligned} $$ with $U_{t-1} = \vT\,E_{t-1}$ the dynamics of the model.

For a full code example, see this notebook.

Sampling an SSM

To sample a step of the SSM above, we require the components $\vF$, $\vH$, $\vT$, and $R$. This can be easily written as a JAX step, as shown in the code below.

import jax
import jax.numpy as jnp


def step_ssm(state, key, F, H, T, R):
    # One step of (ARMA.0); the (state, key) -> (state, out) signature is
    # compatible with jax.lax.scan.
    theta, noise = state  # Theta_{t-1} and E_{t-1}

    # Build next state: Theta_t = F Theta_{t-1} + T E_{t-1}
    theta = F @ theta + T @ noise
    # Sample the new noise E_t (kept as a length-1 array for consistent shapes)
    noise = jax.random.normal(key, (1,)) * jnp.sqrt(R)

    # Build observation: Y_t = H Theta_t + E_t
    y = H @ theta + noise

    out = {
        "y": y,
        "theta": theta
    }

    state_next = (theta, noise)
    return state_next, out

Definition: a moving average (MA) process

A time series $Y_{1:T}$ is said to be an $m$-th order moving average process — MA($m$) — if there are known $a_1, \ldots, a_m$ coefficients such that $$ \tag{MA.1} \boxed{ Y_t = \sum_{k=1}^{m}a_k\,E_{t - k} + E_t, } $$ for uncorrelated zero-mean random variables $E_{1}, \ldots, E_T$ with common covariance $R = \var(E_t)$ for $t \in {\cal T}$, and initialising terms $E_{-1} = \ldots = E_{-m} = 0$.

The MA process as a state-space model

To write $({\rm MA.1})$ as an SSM, define the $m$-dimensional random vector $$ \Theta_{t} = \begin{bmatrix} E_{t-1}\\ \vdots\\ E_{t-m} \end{bmatrix}, $$

Next, define the $(m\times m)$ transition matrix $$ \vF_{\rm MA} = \begin{bmatrix} 0 & 0 & \cdots & 0 & 0\\ 1 & 0 & \cdots & 0 & 0\\ 0 & 1 & \cdots & 0 & 0\\ \vdots & \vdots & \ddots & \vdots & \vdots\\ 0 & 0 & \cdots & 1 & 0 \end{bmatrix}. $$ Let $$ U_{t-1} = {\cal T}_m\,E_{t-1}, $$ be the dynamics of the model with $$ {\cal T}_m = \begin{bmatrix} 1 \\ 0 \\ \vdots \\ 0 \end{bmatrix}. $$

Next, the projection matrix takes the form $$ \vH_{\rm MA} = \begin{bmatrix} a_1& a_2& \ldots& a_m \end{bmatrix}. $$

Finally, the SSM form for an MA($m$) process takes the form $$ \tag{MA.2} \boxed{ \begin{aligned} \Theta_t &= \vF_{\rm MA}\,\Theta_{t-1} + {\cal T}_m\,E_{t-1},\\ Y_t &= \vH_{\rm MA}\,\Theta_t + E_t. \end{aligned} } $$ The term $\vF_{\rm MA}\,\Theta_{t-1}$ maintains the most recent $m-1$ noise terms and $U_{t-1} = {\cal T}_m\,E_{t-1}$ includes the last error term.

Because the SSM components are fixed and specified at the beginning of the experiment, we can easily write a function that builds each of these components given the vector of coefficients $\vH_{\rm MA}$:

def init_ma_components(H_ma):
    H_ma = jnp.atleast_1d(H_ma)
    m = len(H_ma.ravel())
    # Create the shift matrix: ones on the first subdiagonal, zeros elsewhere
    F_ma = jnp.diagflat(jnp.ones(m-1), k=-1)
    # Create a column vector of zeros with a one in the first entry
    T_m = jnp.zeros(m).at[0].set(1)[:, None]
    return H_ma, F_ma, T_m
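
As a hypothetical usage example (the coefficient values, seed, and sample length are illustrative and not from the post), these components can be fed to `step_ssm` above and scanned over time, e.g., for an MA($3$) process with unit coefficients as in the plots below:

from functools import partial

import jax
import jax.numpy as jnp

H_ma, F_ma, T_m = init_ma_components(jnp.ones(3))  # a_1 = a_2 = a_3 = 1
step = partial(step_ssm, F=F_ma, H=H_ma, T=T_m, R=1.0)

keys = jax.random.split(jax.random.PRNGKey(0), 300)
state0 = (jnp.zeros(3), jnp.zeros(1))  # Theta_0 = 0 and E_0 = 0
_, out = jax.lax.scan(step, state0, keys)
y_ma = out["y"].squeeze()  # sampled MA(3) path of length 300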

Example — MA(1)

The following plot shows a sample of an MA($m$) process where $a_1 = a_2 = \ldots = a_m = 1$ and we vary $m$.


As we see, higher values of $m$ result in a smoother process, while lower values of $m$ result in a rougher process.

Next, we show the autocorrelation of the sample above. Here, the autocorrelation is defined by $$ \tag{ACorr.1} \rho_{k} = \frac{\cov(Y_t, Y_{t+k})}{\sqrt{\var(Y_t)}\,\sqrt{\var(Y_{t+k})}}. $$

Here, we observe that larger values of $m$ result in higher autocorrelation, which corresponds to a longer dependence between the current and future values.
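
An empirical counterpart of $({\rm ACorr.1})$ can be sketched as follows, assuming the series is stationary so that a single sample path estimates $\rho_k$; the function name and `max_lag` value are illustrative.

import jax.numpy as jnp

def autocorrelation(y, max_lag=50):
    # Empirical version of (ACorr.1): rho_k = sum_t y_t y_{t+k} / sum_t y_t^2
    y = y - y.mean()
    denom = jnp.sum(y ** 2)
    rho = [jnp.sum(y[: len(y) - k] * y[k:]) / denom for k in range(max_lag + 1)]
    return jnp.stack(rho)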

Definition: An autoregressive (AR) process

A time series $Y_{1:T}$ is said to be an $r$-th order autoregressive process—AR($r$)—if there are known $b_1, \ldots, b_r$ coefficients such that

$$ \tag{AR.1} \boxed{ Y_t = \sum_{k=1}^r b_k\,Y_{t - k} + E_t, } $$ with $\var(E_{1:T}) = R\,\vI_T$ and initial terms $Y_{-1} = \ldots = Y_{-r} = 0$.

The AR process as a state-space model

To write $({\rm AR.1})$ as an SSM, define the $r$-dimensional state vector $$ \Theta_t = \begin{bmatrix} Y_{t-1}\\ \vdots \\ Y_{t-r} \end{bmatrix}, $$ for $t \geq 0$. Observe that $\Theta_0 = \vzero$.

Next, define the $r\times r$ transition matrix that maintains the previous $y_{t-2}, \ldots, y_{t-r}$ observations and builds the noiseless observation $y_{t-1}$: $$ \vF_{\rm AR} = \begin{bmatrix} b_1 & b_2 & \cdots & b_{r-1} & b_r\\ 1 & 0 & \cdots & 0 & 0 \\ 0 & 1 & \cdots & 0 & 0 \\ \vdots & \vdots & \ddots & \vdots & \vdots\\ 0 & 0 & \cdots & 1 & 0 \\ \end{bmatrix}, $$ and $$ U_{t-1} = {\cal T}_r\,E_{t-1}. $$ Next, the projection matrix takes the form $$ \vH_{\rm AR} = \begin{bmatrix} b_1 & b_2 & \ldots & b_r \end{bmatrix}. $$

We write the AR($r$) process as an SSM as $$ \tag{AR.2} \boxed{ \begin{aligned} \Theta_t &= \vF_{\rm AR}\,\Theta_{t-1} + {\cal T}_r\,E_{t-1},\\ Y_t &= \vH_{\rm AR}\,\Theta_t + E_t. \end{aligned} } $$

The term $\vF_{\rm AR}\,\Theta_{t-1}$ maintains the most recent $(r-1)$ measurements and builds the noiseless observation at time $(t-1)$. Next, $U_{t-1} = {\cal T}_r\,E_{t-1}$ includes the noise term at time $(t-1)$.

Similar to the MA case, we can write a function that builds the components of the AR process, provided we are given $\vH_{\rm AR}$:

def init_ar_components(H_ar):
    H_ar = jnp.atleast_1d(H_ar)
    r = len(H_ar.ravel())
    # Create the shift matrix: ones on the first subdiagonal, zeros elsewhere
    F_ar = jnp.diagflat(jnp.ones(r-1), k=-1)
    # Set the first row to the AR coefficients H_ar
    F_ar = F_ar.at[0].set(H_ar)
    # Create a column vector of zeros with a one in the first entry
    T_r = jnp.zeros(r).at[0].set(1)[:, None]
    return H_ar, F_ar, T_r

Example — AR(1)

The following plot shows a sample of an AR($1$) process where $b_1 \in [0,1]$.

AR(1) samples with varying parameter

As we see, values of $b_1$ close to $0$ result in a mean-reverting process, while values of $b_1$ close to $1$ result in a trend-following (or momentum) process.

Next, we plot the autocorrelation of the AR($1$) sample above.

AR(1) autocorrelation

As expected, negative values of $b_1$ result in a mean-reverting process, whose autocorrelation oscillates, while positive values of $b_1$ result in a trend-following process, whose autocorrelation is positive.

Definition: An autoregressive moving average (ARMA) model.

A time series $Y_{1:T}$ is said to be an $r$-th order autoregressive, $m$-th order moving average process — ARMA($r$, $m$) — if there are known $(b_1, \ldots, b_r)$ and $(a_1, \ldots, a_m)$ coefficients such that $$ \tag{ARMA.1} \boxed{ Y_t = \sum_{k=1}^{r}b_k\,Y_{t - k} + \sum_{j=1}^m\,a_j\,E_{t-j} + E_t } $$ with $\var(E_{1:T}) = R\,\vI_T$ and initial terms $Y_{-1} = \ldots = Y_{-r} = E_{-1} = \ldots = E_{-m} = 0$.

Following $({\rm AR.2})$ and $({\rm MA.2})$, we rewrite the ARMA($r$, $m$) process as the SSM $$ \tag{ARMA.2} \boxed{ \begin{aligned} \Theta_t &= \vF_{\rm ARMA}\,\Theta_{t-1} + {\cal T}_{r,m}\,E_{t-1},\\ Y_t &= \vH_{\rm ARMA}\,\Theta_t + E_t, \end{aligned} } $$

with components $$ \begin{aligned} \vF_{\rm ARMA} &= \begin{bmatrix} \vF_{\rm AR} & \vzero_{r,m} \\ \vzero_{m,r} & \vF_{\rm MA} \end{bmatrix},\\ \vH_{\rm ARMA} &= \begin{bmatrix} \vH_{\rm AR} & \vH_{\rm MA} \end{bmatrix},\\ {\cal T}_{r,m} &= \begin{bmatrix} {\cal T}_r \\ {\cal T}_m \end{bmatrix}. \end{aligned} $$

Programmatically, to define an ARMA process, we simply stack the building blocks defined above

def build_arma_components(H_ar, H_ma):
    H_ar, F_ar, T_ar = init_ar_components(H_ar)
    H_ma, F_ma, T_ma = init_ma_components(H_ma)

    r, m = len(T_ar), len(T_ma)
    F_arma = jnp.zeros((r + m, r + m))
    F_arma = F_arma.at[:r,:r].set(F_ar)
    F_arma = F_arma.at[r:, r:].set(F_ma)

    H_arma = jnp.c_[H_ar, H_ma]

    T_arma = jnp.r_[T_ar, T_ma]
    return H_arma, F_arma, T_arma
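
As a hypothetical end-to-end example (the coefficients, seed, and sample length below are illustrative and not from the post), the stacked components can be combined with `step_ssm` to sample an ARMA($1$, $1$) path:

from functools import partial

import jax
import jax.numpy as jnp

H_arma, F_arma, T_arma = build_arma_components(H_ar=0.8, H_ma=0.5)

step = partial(step_ssm, F=F_arma, H=H_arma, T=T_arma, R=1.0)
keys = jax.random.split(jax.random.PRNGKey(314), 500)
state0 = (jnp.zeros(F_arma.shape[0]), jnp.zeros(1))  # Theta_0 = 0 and E_0 = 0
_, out = jax.lax.scan(step, state0, keys)
y_arma = out["y"].squeeze()  # sampled ARMA(1, 1) path of length 500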

A note on misspecification

The form of the dynamics noise $U_t$ in $({\rm ARMA.0})$ is such that $$ \cov(U_{t+1}, E_s) = \cov(\vT\,E_{t}, E_{s}) = \vT\,\var(E_{t})\,\mathbb{1}(t = s), $$ which directly conflicts with assumption (A.6).

An alternative approach reformulates the SSM so that the only source of noise is through the dynamics $U_t$, so that $R_t = 0$.5 This version, however, contradicts assumption (A.3), which stipulates that $R_t$ must be positive definite. Despite this contradiction, it remains possible to work with this alternative formulation, provided that $\vH_t\,\Sigma_{t|t-1}\,\vH_t^\intercal$ remains positive definite, which is necessary because we need to invert this quantity.


Closing thoughts: a computational argument for state-space models

In the previous post, we introduced a data-driven approach to filtering, prediction, and smoothing6. By decorrelating measurements in a pre-processing step, we obtained a linear (in-time) algorithm for estimating these quantities.

However, this approach relies on three limiting assumptions:

  1. the experiment’s length must be fixed and known,
  2. we need multiple simulations of the true signal, and
  3. the signal’s dimension must match that of the measurements.

These constraints are often impractical. In real-world scenarios, experiment durations are unknown (e.g., robot navigation or stock trading); we rarely have multiple observations of an underlying signal; and incorporating external data (e.g., news sentiment in stock trading) requires flexibility in dimensions.

SSMs overcome these computational challenges by:

  1. running indefinitely,
  2. requiring only one sample run for estimation, and
  3. decoupling the signal and measurement dimensions via a latent process.

In the next post, we will use the properties of SSMs derived here to develop the classical Kalman filtering algorithm.

Despite their computational advantages, SSMs have limitations. Notably, as we have seen, obtaining a well-specified estimate of uncertainty requires full knowledge of the system dynamics—a constraint that may not always hold in practical applications.


Appendix

Some helpful quantities

The latent variable at time $t$ can be decomposed into the initial state $\Theta_0$ and the noise terms $(U_{0}, \ldots, U_{t-1})$ as $$ \tag{ssm.aux.1} \begin{aligned} \Theta_t &= \vF_{t-1}\,\Theta_{t-1} + U_{t-1}\\ &= \left(\prod_{k=1}^{t} \vF_{t-k}\right)\Theta_0 + \sum_{j=1}^{t}\left(\prod_{k=1}^{j-1} \vF_{t-k}\right)U_{t-j}, \end{aligned} $$ with the convention $\prod_{k=a}^{b} \vF_k = \vI$ for $b < a$. As a consequence, the observation at time $t$ takes the form $$ \begin{aligned} \tag{ssm.aux.2} Y_t &= \vH_t\,\Theta_t + E_t\\ &= \vH_t\,\left(\prod_{k=1}^{t} \vF_{t-k}\right)\Theta_0 + \vH_t\,\sum_{j=1}^{t}\left(\prod_{k=1}^{j-1} \vF_{t-k}\right)U_{t-j} + E_t. \end{aligned} $$

Recall from (I.3) shown in filtering-notes-i that $Y_{1:T} = \vL\,{\cal E}_{1:T} \implies {\cal E}_t = \sum_{i=1}^t\vL_{t,i}^{-1}\,Y_{i}$, which follows from $\vL$ being a lower-triangular7 matrix. Furthermore, the Cholesky decomposition for the measurements takes the form $\var(Y_{1:T}) = \vL\,S\,\vL^\intercal$, with $S$ a diagonal matrix.

Proof of Proposition 1

Here we show Proposition 1

$$ \begin{aligned} \Theta_{t|j} &=\cov(\Theta_t, Y_{1:j})\var(Y_{1:j})^{-1}\,Y_{1:j}\\ &=\cov(\Theta_t, \vL{\cal E}_{1:j})\var(\vL{\cal E}_{1:j})^{-1}\,\vL{\cal E}_{1:j}\\ &= \cov(\Theta_t, {\cal E}_{1:j})\,\vL^\intercal\left[ \vL\var({\cal E}_{1:j}) \vL^\intercal\right]^{-1}\vL{\cal E}_{1:j}\\ &= \cov(\Theta_t, {\cal E}_{1:j})\,\vL^\intercal\vL^{-\intercal}\var({\cal E}_{1:j})^{-1}\vL^{-1}\vL{\cal E}_{1:j}\\ &= \cov(\Theta_t, {\cal E}_{1:j})\,{\rm diag}(S_1, \ldots, S_j)^{-1}\,{\cal E}_{1:j}\\ &= \begin{bmatrix} \cov(\Theta_t, {\cal E}_1) & \ldots & \cov(\Theta_t, {\cal E}_j) \end{bmatrix} \begin{bmatrix} S_1 & 0 & \ldots & 0\\ 0 & S_2 & \ldots & 0\\ \vdots & \vdots & \ddots & \vdots\\ 0 & 0 & \ldots & S_j\\ \end{bmatrix}^{-1} \begin{bmatrix} {\cal E}_1 \\ \vdots \\ {\cal E}_j \end{bmatrix}\\ &= \begin{bmatrix} \cov(\Theta_t, {\cal E}_1) & \ldots & \cov(\Theta_t, {\cal E}_j) \end{bmatrix} \begin{bmatrix} S_1^{-1}\,{\cal E}_1 \\ \vdots \\ S_j^{-1}{\cal E}_j \end{bmatrix}\\ &=\sum_{k=1}^j \cov(\Theta_t, {\cal E}_k)\,S_k^{-1}\,{\cal E}_k\\ &=\sum_{k=1}^j \vK_{t,k}\,{\cal E}_k, \end{aligned} $$ with $\vK_{t,k} = \cov(\Theta_t, {\cal E}_k)\,S_k^{-1}$, $S_k = \var({\cal E}_k)$, and ${\cal E}_j$ takes the form $({\rm I.1})$ in F.01, innovations.

Next, to derive the error variance-covariance (EVC) matrix, note that $$ \begin{aligned} \Sigma_{t|j} &= \var(\Theta_t - \Theta_{t|j})\\ &= \var\left(\Theta_t - \sum_{k=1}^j \vK_{t,k}\,{\cal E}_k\right)\\ &= \var(\Theta_t) + \var\left(\sum_{k=1}^j \vK_{t,k}\,{\cal E}_k\right) - \cov\left(\Theta_t,\,\sum_{k=1}^j \vK_{t,k}\,{\cal E}_k\right) - \cov\left(\sum_{k=1}^j \vK_{t,k}\,{\cal E}_k,\,\Theta_t\right). \end{aligned} $$

Here, $$ \begin{aligned} \var\left(\sum_{k=1}^j \vK_{t,k}\,{\cal E}_k\right) &= \cov\left(\sum_{k=1}^j \vK_{t,k}\,{\cal E}_k,\,\sum_{\ell=1}^j \vK_{t,\ell}\,{\cal E}_\ell\right)\\ &= \sum_{k=1}^j\sum_{\ell=1}^j \vK_{t,k}\,\cov({\cal E}_k,\, {\cal E}_{\ell})\,\vK_{t,\ell}^\intercal\\ &= \sum_{k=1}^j \vK_{t,k}\cov({\cal E}_{k},\,{\cal E}_{k})\,\vK_{t,k}^\intercal\\ &= \sum_{k=1}^j \vK_{t,k}\var({\cal E}_{k})\,\vK_{t,k}^\intercal, \end{aligned} $$ $$ \begin{aligned} \cov\left(\sum_{k=1}^j \vK_{t,k}\,{\cal E}_k,\,\Theta_t\right) &= \sum_{k=1}^j \vK_{t,k} \cov({\cal E}_k,\,\Theta_t)\\ &= \sum_{k=1}^j \vK_{t,k}\var({\cal E}_k)\var({\cal E}_k)^{-1}\cov({\cal E}_k,\,\Theta_t)\\ &= \sum_{k=1}^j \vK_{t,k}\var({\cal E}_k)\big(\cov(\Theta_t,\,{\cal E}_k)\var({\cal E}_k)^{-1}\big)^\intercal\\ &= \sum_{k=1}^j \vK_{t,k}\var({\cal E}_k)\,\vK_{t,k}^\intercal, \end{aligned} $$ and similarly $$ \cov\left(\Theta_t,\,\sum_{k=1}^j \vK_{t,k}\,{\cal E}_k\right) = \sum_{k=1}^j \vK_{t,k}\var({\cal E}_k)\,\vK_{t,k}^\intercal. $$

So that $$ \begin{aligned} \Sigma_{t|j} &= \var(\Theta_t) + \sum_{k=1}^j \vK_{t,k}\var({\cal E_k})\,\vK_{t,k}^\intercal - \sum_{k=1}^j \vK_{t,k}\var({\cal E_k})\,\vK_{t,k}^\intercal - \sum_{k=1}^j \vK_{t,k}\var({\cal E_k})\,\vK_{t,k}^\intercal\\ &= \var(\Theta_t) - \sum_{k=1}^j \vK_{t,k}\var({\cal E_k})\,\vK_{t,k}^\intercal. \end{aligned} $$

$$ \ \tag*{$\blacksquare$} $$

Proof of Proposition 2.1

Here we show the properties for an SSM outlined in Proposition 2.1.

Proof of (P.1)

Because an SSM is a signal plus noise model, it follows from F.01, Proposition 2 that innovations are uncorrelated.

Proof of (P.2)

The proof follows from $({\rm ssm.aux.1})$, and assumptions (A.6) and (A.7).

Let $(s,t) \in {\cal T}^2$, then $$ \begin{aligned} \cov\left(\Theta_s, E_t\right) &= \cov\left( \left(\prod_{k=1}^{s} \vF_{s-k}\right)\Theta_0 + \sum_{j=1}^{s}\left(\prod_{k=1}^{j-1} \vF_{s-k}\right)U_{s-j}, E_t \right)\\ &= \left(\prod_{k=1}^{s}\vF_{s-k}\right) \underbrace{\cov(\Theta_0, E_t)}_{{\rm (A.7)} = 0} + \sum_{j=1}^s\left(\prod_{k=1}^{j-1}\vF_{s-k}\right) \underbrace{\cov(U_{s-j}, E_t)}_{{\rm (A.6)} = 0}\\ &= 0. \end{aligned} $$ $$ \ \tag*{$\blacksquare$} $$

Proof of (P.3)

For all results below, we take $s \leq t$.

First, we show $\cov(\Theta_s, U_t) = 0$, i.e., the latent variable $\Theta_s$ is uncorrelated to future (and current) latent movements.

Following $({\rm ssm.aux.1})$, and assumptions (A.5) and (A.8), we obtain $$ \begin{aligned} \tag{P.3.1} \cov(\Theta_s, U_t) &= \cov\left( \left(\prod_{k=1}^{s} \vF_{s-k}\right)\Theta_0 + \sum_{j=1}^{s}\left(\prod_{k=1}^{j-1} \vF_{s-k}\right)U_{s-j}, U_t \right)\\ &= \left(\prod_{k=1}^{s}\vF_{s-k}\right) \underbrace{\cov(\Theta_0, U_{t})}_{{\rm (A.8)} = 0} + \sum_{j=1}^s\left(\prod_{k=1}^{j-1}\vF_{s-k}\right) \underbrace{\cov(U_{s-j}, U_t)}_{{\rm (A.5)} = 0}\\ &= 0 \end{aligned} $$

Second, we show $\cov(Y_s, U_t) = 0$, i.e., the measurement $Y_s$ is uncorrelated to future (and current) latent movements.

The result follows from $({\rm P.3.1})$ and assumption (A.6) $$ \begin{aligned} \tag{P.3.2} \cov(Y_s, U_t) &= \cov(\vH_s\,\Theta_s + E_s, U_t)\\ &= \vH_s\,\cov(\Theta_s, U_t) + \cov(E_s, U_t)\\ &= 0. \end{aligned} $$

Third, we show $\cov({\cal E}_s, U_t) = 0$, i.e., the innovation ${\cal E}_s$ is uncorrelated to future (and current) latent movements $U_t$.

Suppose $s=k$ and $t \geq k$. Then, $$ \begin{aligned} \cov({\cal E}_k, U_t) &= \cov\left(\sum_{i=1}^T\,\vL_{k,i}^{-1}\,Y_i, U_t\right)\\ &= \sum_{i=1}^T\vL_{k,i}^{-1}\,\cov(Y_i, U_t)\\ &= \sum_{i=1}^k\vL_{k,i}^{-1}\,\cov(Y_i, U_t)\\ &= 0. \end{aligned} $$ Here, the third equality follows from $\vL$ being lower-triangular—and hence $\vL^{-1}$ being lower-triangular—and the last equality follows from $({\rm P.3.2})$. $$ \ \tag*{$\blacksquare$} $$

Proof of (P.4)

The property (P.4) follows from property (P.2), assumption (A.4), and the decomposition of innovations shown in F.01, (I.3). Here we compute $\cov({\cal E}_t, E_s)$ for $t < s$, which matches the statement with the roles of $s$ and $t$ exchanged: $$ \begin{aligned} \cov({\cal E}_t, E_s) &= \cov\left(\sum_{i=1}^t\vL_{t,i}^{-1}\,Y_{i}, E_s\right)\\ &= \sum_{i=1}^t\vL_{t,i}^{-1}\,\cov\left(Y_{i}, E_s\right)\\ &= \sum_{i=1}^t\vL_{t,i}^{-1}\,\cov\left(\vH_i\,\Theta_i + E_i, E_s\right)\\ &= \sum_{i=1}^t\vL_{t,i}^{-1}\,\big(\vH_i\,\cov\left(\Theta_i, E_s\right) + \cov\left(E_i, E_s\right)\big). \end{aligned} $$ Finally, $\cov\left(\Theta_i, E_s\right) = 0$ by property (P.2) and $ \cov\left(E_i, E_s\right) = 0$ by assumption (A.4), since $i \leq t < s$. $$ \ \tag*{$\blacksquare$} $$

Proof of Proposition 2.2

Here we show Proposition 2.2.

Following $({\rm BLUP.1})$, the BLUP of $\Theta_{t+\ell}$ given $Y_{1:t}$ takes the form $$ \Theta_{t+\ell|t} = \cov\left(\Theta_{t+\ell}, Y_{1:t}\right)\,\var\left(Y_{1:t}\right)^{-1}\,Y_{1:t}. $$ Next, by definition, $$ \Theta_{t+\ell} = \left(\prod_{k=1}^\ell \vF_{t+\ell-k}\right)\Theta_t + \sum_{j=1}^\ell\left(\prod_{k=1}^{j-1}\vF_{t+\ell-k}\right)\,U_{t+\ell-j}. $$ As a consequence, $$ \begin{aligned} &\cov\left(\Theta_{t+\ell}, Y_{1:t}\right)\\ &=\cov \left( \left(\prod_{k=1}^\ell \vF_{t+\ell-k}\right)\Theta_t + \sum_{j=1}^\ell\left(\prod_{k=1}^{j-1}\vF_{t+\ell-k}\right)\,U_{t+\ell-j},\, Y_{1:t} \right)\\ &=\left(\prod_{k=1}^\ell \vF_{t+\ell-k}\right)\cov\left(\Theta_t, Y_{1:t}\right) + \sum_{j=1}^\ell\left(\prod_{k=1}^{j-1}\vF_{t+\ell-k}\right)\cov\left(U_{t+\ell-j}, Y_{1:t}\right)\\ &=\left(\prod_{k=1}^\ell \vF_{t+\ell-k}\right)\cov\left(\Theta_t, Y_{1:t}\right), \end{aligned} $$ where the last line follows from $({\rm P.3.2})$, i.e., $\cov(Y_s, U_t) = 0$ for $s \leq t$. Finally, $$ \begin{aligned} \Theta_{t+\ell|t} &= \cov(\Theta_{t+\ell}, Y_{1:t})\,\var(Y_{1:t})^{-1}\,Y_{1:t}\\ &= \left(\prod_{k=1}^\ell \vF_{t+\ell-k}\right)\cov\left(\Theta_t, Y_{1:t}\right)\,\var(Y_{1:t})^{-1}Y_{1:t}\\ &= \left(\prod_{k=1}^\ell \vF_{t+\ell-k}\right)\Theta_{t|t} \end{aligned} $$ as desired. $$ \ \tag*{$\blacksquare$} $$

Proof of Proposition 2.3

Here we show Proposition 2.3

By definition of the latent process and Proposition 2.2, $$ \begin{aligned} \Theta_{t+\ell} &= \left(\prod_{k=1}^\ell \vF_{t+\ell-k}\right)\Theta_t + \sum_{j=1}^\ell\left(\prod_{k=1}^{j-1}\vF_{t+\ell-k}\right)\,U_{t+\ell-j},\\ \Theta_{t+\ell|t} &= \left(\prod_{k=1}^\ell \vF_{t+\ell-k}\right)\Theta_{t|t}. \end{aligned} $$ Next, the EVC matrix takes the form $$ \begin{aligned} \var\left(\Theta_{t+\ell} - \Theta_{t+\ell|t} \right) &= \var\left( \left(\prod_{k=1}^\ell \vF_{t+\ell-k}\right)\Theta_t + \sum_{j=1}^\ell\left(\prod_{k=1}^{j-1}\vF_{t+\ell-k}\right)\,U_{t+\ell-j} - \left(\prod_{k=1}^\ell \vF_{t+\ell-k}\right)\Theta_{t|t} \right)\\ &= \var\left( \left(\prod_{k=1}^\ell \vF_{t+\ell-k}\right)\left[\Theta_t - \Theta_{t|t}\right] + \sum_{j=1}^\ell\left(\prod_{k=1}^{j-1}\vF_{t+\ell-k}\right)\,U_{t+\ell-j} \right) \end{aligned} $$ By (P.3), $\cov(\Theta_s, U_t) = 0$ for $s \leq t$. Next, by (A.5), $\cov(U_t, U_s) = 0$ for all $t\neq s$. So that, $$ \begin{aligned} &\var\left(\Theta_{t+\ell} - \Theta_{t+\ell|t} \right)\\ &= \var\left( \left(\prod_{k=1}^\ell \vF_{t+\ell-k}\right)\left[\Theta_t - \Theta_{t|t}\right] \right) + \var\left( \sum_{j=1}^\ell\left(\prod_{k=1}^{j-1}\vF_{t+\ell-k}\right)\,U_{t+\ell-j} \right)\\ &= \var\left( \left(\prod_{k=1}^\ell \vF_{t+\ell-k}\right)\left[\Theta_t - \Theta_{t|t}\right] \right) + \sum_{j=1}^\ell\var\left( \left(\prod_{k=1}^{j-1}\vF_{t+\ell-k}\right)\,U_{t+\ell-j} \right)\\ &= \left(\prod_{k=1}^\ell \vF_{t+\ell-k}\right)\var\left( \left[\Theta_t - \Theta_{t|t}\right] \right) \left(\prod_{k=1}^\ell \vF_{t+\ell-k}\right)^\intercal + \sum_{j=1}^\ell \left(\prod_{k=1}^{j-1}\vF_{t+\ell-k}\right)\var\left(U_{t+\ell-j}\right) \left(\prod_{k=1}^{j-1}\vF_{t+\ell-k}\right)^\intercal\\ &= \left(\prod_{k=1}^\ell \vF_{t+\ell-k}\right) \Sigma_{t|t} \left(\prod_{k=1}^\ell \vF_{t+\ell-k}\right)^\intercal + \sum_{j=1}^\ell \left(\prod_{k=1}^{j-1}\vF_{t+\ell-k}\right)Q_{t+\ell-j} \left(\prod_{k=1}^{j-1}\vF_{t+\ell-k}\right)^\intercal. \end{aligned} $$ $$ \ \tag*{$\blacksquare$} $$

Proof of Proposition 2.4

Here we show Proposition 2.4.

The proof follows from the definitions of the innovation process, the state-space model (SSM), and the best linear unbiased predictor (BLUP), as well as property (P.4).

We begin by expanding the innovation: $$ \begin{aligned} {\cal E}_t &= Y_t - \sum_{k=1}^{t-1}\cov(Y_t, {\cal E}_k)\,\var({\cal E}_k)^{-1}\,{\cal E}_k\\ &= Y_t - \sum_{k=1}^{t-1}\cov(\vH_t\,\Theta_t + E_t, {\cal E}_k)\,\var({\cal E}_k)^{-1}\,{\cal E}_k\\ &= Y_t - \vH_t\, \underbrace{\sum_{k=1}^{t-1}\cov(\Theta_t, {\cal E}_k)\,\var({\cal E}_k)^{-1}\,{\cal E}_k}_{\Theta_{t|t-1}} - \sum_{k=1}^{t-1} \underbrace{\cov(E_t, {\cal E}_k)}_{=0 \text{ ({P.4})}}\,\var({\cal E}_k)^{-1}\,{\cal E}_k\\ &= Y_t - \vH_t\,\Theta_{t|t-1}. \end{aligned} $$

Next, because $Y_t = \vH_t\,\Theta_t + E_t$, the innovation takes the alternative form $$ \begin{aligned} {\cal E}_t &= Y_t - \vH_t\,\Theta_{t|t-1}\\ &= \vH_t\,\Theta_t + E_t - \vH_t\,\Theta_{t|t-1}\\ &= \vH_t\,(\Theta_t - \Theta_{t|t-1}) + E_t. \end{aligned} $$ $$ \ \tag*{$\blacksquare$} $$

Proof of Proposition 2.5

Here we show Proposition 2.5

Let $t \geq 2$. Following Proposition 2.4, Properties (P.2) and (P.4), and the definition of the innovation, we obtain $$ \begin{aligned} \var({\cal E}_t) &= \var\Big(\vH_t(\Theta_t - \Theta_{t|t-1}) + E_t\Big)\\ &= \vH_t\,\var(\Theta_t - \Theta_{t|t-1})\,\vH_t^\intercal + \var(E_t) + \vH_t\,\cov(\Theta_t - \Theta_{t|t-1}, E_t) + \cov(E_t, \Theta_t - \Theta_{t|t-1})\,\vH_t^\intercal\\ &= \vH_t\,\var(\Theta_t - \Theta_{t|t-1})\,\vH_t^\intercal + \var(E_t)\\ &= \vH_t\,\Sigma_{t|t-1}\,\vH_t^\intercal + R_t, \end{aligned} $$ where the cross terms vanish because $\cov(\Theta_t, E_t) = 0$ by (P.2) and $\cov(\Theta_{t|t-1}, E_t) = 0$ by (P.4), given that $\Theta_{t|t-1}$ is a linear combination of ${\cal E}_{1:t-1}$. $$ \ \tag*{$\blacksquare$} $$

Proof of Theorem 2.6

Here we show Theorem 2.6.

Begin by noting that, for $t,j \in {\cal T}$, $$ \begin{aligned} \cov(\Theta_t, {\cal E}_j) &= \cov(\Theta_t,\, \vH_j\,[\Theta_j - \Theta_{j|j-1}] + E_j)\\ &= \cov(\Theta_t,\, \Theta_j - \Theta_{j|j-1})\,\vH_j^\intercal + \cov(\Theta_t, E_j)\\ &= \cov(\Theta_t - \Theta_{t|j-1},\, \Theta_j - \Theta_{j|j-1})\,\vH_j^\intercal, \end{aligned} $$ where the last equality follows from Property (P.2) and the invariance property of covariances to fixed vectors.8 We now need to determine $$ \cov(\Theta_t - \Theta_{t|j-1},\, \Theta_j - \Theta_{j|j-1}) $$ for (I) $j = t$, (II) $j \leq t -1$, and (III) $j \geq t+1$.

Case $j=t$

Take $j=t$, then $$ \cov(\Theta_t - \Theta_{t|t-1},\, \Theta_t - \Theta_{t|t-1}) = \var(\Theta_t - \Theta_{t|t-1}) = \Sigma_{t|t-1}. $$ So that $ \cov(\Theta_t, {\cal E}_t) = \Sigma_{t|t-1}\,\vH_t^\intercal $ as desired.

Case $j\leq t-1$

Take $t - j = \ell \geq 1$, then $j + \ell = t$. By definition of the latent process and Proposition 2.2, $$ \begin{aligned} \Theta_t &= \left(\prod_{k=1}^{t-j}\vF_{t-k}\right)\Theta_j + \sum_{k=1}^{t-j}\left(\prod_{m=1}^{k-1}\vF_{t-m}\right)\,U_{t-k},\\ \Theta_{t|j-1} &= \left(\prod_{k=1}^{t-j}\vF_{t-k}\right)\,\Theta_{j|j-1} \end{aligned} $$

Then, $$ \begin{aligned} &\cov(\Theta_t - \Theta_{t|j-1},\, \Theta_j - \Theta_{j|j-1})\\ =&\cov\left( \left(\prod_{k=1}^{t-j}\vF_{t-k}\right)\Theta_j + \sum_{k=1}^{t-j}\left(\prod_{m=1}^{k-1}\vF_{t-m}\right)\,U_{t-k} - \left(\prod_{k=1}^{t-j}\vF_{t-k}\right)\,\Theta_{j|j-1},\, \Theta_j - \Theta_{j|j-1} \right)\\ =&\cov\left( \left(\prod_{k=1}^{t-j}\vF_{t-k}\right)\Theta_j - \left(\prod_{k=1}^{t-j}\vF_{t-k}\right)\,\Theta_{j|j-1},\, \Theta_j - \Theta_{j|j-1} \right) + \cov\left(\sum_{k=1}^{t-j}\left(\prod_{m=1}^{k-1}\vF_{t-m}\right)\,U_{t-k},\, \Theta_j - \Theta_{j|j-1} \right)\\ =&\left(\prod_{k=1}^{t-j}\vF_{t-k}\right)\,\cov\left( \Theta_j - \Theta_{j|j-1},\, \Theta_j - \Theta_{j|j-1} \right) + \sum_{k=1}^{t-j}\left(\prod_{m=1}^{k-1}\vF_{t-m}\right)\,\cov(U_{t-k}, \Theta_j)\\ =&\left(\prod_{k=1}^{t-j}\vF_{t-k}\right)\,\var\left( \Theta_j - \Theta_{j|j-1} \right)\\ =&\left(\prod_{k=1}^{t-j}\vF_{t-k}\right)\,\Sigma_{j|j-1}, \end{aligned} $$ where $\cov(U_{t-k}, \Theta_j) = 0$ follows from property (P.3).

Hence $ \cov(\Theta_t, {\cal E}_j) = \left(\prod_{k=1}^{t-j}\vF_{t-k}\right)\Sigma_{j|j-1}\,\vH_j^\intercal $ as desired.

Case $j \geq t+1$

Take $j - t = \ell \geq 1$, then $j = t + \ell$. Following $({\rm BLUP.2})$, we have $$ \tag{BLUP.3} \begin{aligned} \Theta_{t|t-1} &= \sum_{k=1}^{t-1}\cov(\Theta_t, {\cal E}_k)\,\var({\cal E}_k)^{-1}\,{\cal E}_k,\\ \Theta_{t|t} &= \sum_{k=1}^{t-1}\cov(\Theta_t, {\cal E}_k)\,\var({\cal E}_k)^{-1}\,{\cal E}_k + \cov(\Theta_t, {\cal E}_t)\,\var({\cal E}_t)^{-1}\,{\cal E}_t\\ &= \Theta_{t|t-1} + \cov(\Theta_t, {\cal E}_t)\,\var({\cal E}_t)^{-1}\,{\cal E}_t. \end{aligned} $$ Next, denote $$ \vM_{t} = \vF_{t}[\vI - \Sigma_{t|t-1}\,\vH_t^\intercal\,S_t^{-1}\,\vH_t]. $$

The rest of the proof follows by induction.

Show $j=t+1$

First, suppose $j=t+1$. Then $$ \begin{aligned} &\cov(\Theta_t - \Theta_{t|t}, \Theta_{t+1} - \Theta_{t+1|t})\\ &=\cov(\Theta_t - \Theta_{t|t}, \vF_t\,\Theta_t + U_t - \vF_t\,\Theta_{t|t})\\ &= \cov(\Theta_t - \Theta_{t|t}, \Theta_t - \Theta_{t|t})\,\vF_t^\intercal, \end{aligned} $$ where we used Property (P.3). Next, following $({\rm BLUP.3})$, $$ \begin{aligned} &\cov(\Theta_t - \Theta_{t|t}, \Theta_t - \Theta_{t|t})\\ &= \var\Big(\Theta_t - \Theta_{t|t-1} - \cov(\Theta_t, {\cal E}_t)\,\var({\cal E}_t)^{-1}\,{\cal E}_t\Big)\\ &= \var\Big(\Theta_t - \Theta_{t|t-1}\Big)\\ &\quad + \var\Big(\cov(\Theta_t, {\cal E}_t)\,\var({\cal E}_t)^{-1}\,{\cal E}_t\Big)\\ &\quad - \cov\Big(\Theta_t - \Theta_{t|t-1}, \cov(\Theta_t, {\cal E}_t)\,\var({\cal E}_t)^{-1}\,{\cal E}_t\Big)\\ &\quad - \cov\Big(\cov(\Theta_t, {\cal E}_t)\,\var({\cal E}_t)^{-1}\,{\cal E}_t,\,\Theta_t - \Theta_{t|t-1}\Big)\\ &=\var\Big(\Theta_t - \Theta_{t|t-1}\Big)\\ &\quad + \cov(\Theta_t, {\cal E}_t)\,\var({\cal E}_t)^{-1}\,\var({\cal E}_t)\,\var({\cal E}_t)^{-1}\,\cov({\cal E}_t, \Theta_t)\\ &\quad - \cov(\Theta_t, {\cal E}_t)\,\var({\cal E}_{t})^{-1}\,\cov({\cal E}_t, \Theta_t)\\ &\quad - \cov(\Theta_t, {\cal E}_t)\,\var({\cal E}_{t})^{-1}\,\cov({\cal E}_t, \Theta_t)\\ &=\var\Big(\Theta_t - \Theta_{t|t-1}\Big) - \cov(\Theta_t, {\cal E}_t)\,\var({\cal E}_{t})^{-1}\,\cov({\cal E}_t, \Theta_t)\\ &= \Sigma_{t|t-1} - \Sigma_{t|t-1}\,\vH_t^\intercal\,S_t^{-1}\,\vH_t\,\Sigma_{t|t-1}. \end{aligned} $$ From the last two equations, we have $$ \begin{aligned} &\cov(\Theta_t - \Theta_{t|t}, \Theta_{t+1} - \Theta_{t+1|t})\\ &=\cov(\Theta_t - \Theta_{t|t}, \Theta_t - \Theta_{t|t})\,\vF_t^\intercal\\ &= \Big(\Sigma_{t|t-1} - \Sigma_{t|t-1}\,\vH_t^\intercal\,S_t^{-1}\,\vH_t\,\Sigma_{t|t-1}\Big)\,\vF_t^\intercal\\ &= \Sigma_{t|t-1}\,\Big(\vI -\vH_t^\intercal\,S_t^{-1}\,\vH_t\,\Sigma_{t|t-1}\Big)\,\vF_t^\intercal\\ &= \Sigma_{t|t-1}\,\Big(\vF_t\,[\vI -\vH_t^\intercal\,S_t^{-1}\,\vH_t\,\Sigma_{t|t-1}]^\intercal\Big)^\intercal\\ &= \Sigma_{t|t-1}\,\Big(\vF_t\,[\vI - \Sigma_{t|t-1}\,\vH_t^\intercal\,S_t^{-1}\,\vH_t]\Big)^\intercal\\ &= \Sigma_{t|t-1}\,\vM_t^\intercal, \end{aligned} $$ so that $\cov(\Theta_t, {\cal E}_{t+1}) = \Sigma_{t|t-1}\,\vM_t^\intercal\,\vH_{t+1}^\intercal$, which matches the case $j = t+1$ of the theorem.

Suppose true for $j=t+k$

Suppose, for $j=t+k$, $$ \cov(\Theta_t - \Theta_{t|t+k-1}, \Theta_{t+k} - \Theta_{t+k | t+k-1}) = \Sigma_{t|t-1}\,\left(\prod_{\ell=0}^{k-1}\vM_{t+\ell}^\intercal\right). $$ So that $$ \cov(\Theta_t, {\cal E}_{t+k}) = \Sigma_{t|t-1}\,\left(\prod_{\ell=0}^{k-1}\vM_{t+\ell}^\intercal\right)\,\vH_{t+k}^\intercal $$

Show for $j=t+k+1$

Take $j = t+k+1$, then $$ \begin{aligned} &\cov(\Theta_t - \Theta_{t|t+k}, \Theta_{t+k+1} - \Theta_{t+k+1|t+k})\\ &=\cov(\Theta_t - \Theta_{t|t+k}, \vF_{t+k}[\Theta_{t+k} - \Theta_{t+k|t+k}] + U_{t+k})\\ &=\cov(\Theta_t - \Theta_{t|t+k}, \Theta_{t+k} - \Theta_{t+k|t+k})\,\vF_{t+k}^\intercal, \end{aligned} $$ where the term involving $U_{t+k}$ vanishes by (P.3). Next, following $({\rm BLUP.3})$ we have $$ \begin{aligned} \Theta_{t|t+k} &= \Theta_{t|t+k-1} + \cov(\Theta_t, {\cal E}_{t+k})\,\var({\cal E}_{t+k})^{-1}\,{\cal E}_{t+k}, \\ \Theta_{t+k|t+k} &= \Theta_{t+k|t+k-1} + \cov(\Theta_{t+k}, {\cal E}_{t+k})\,\var({\cal E}_{t+k})^{-1}\,{\cal E}_{t+k}.\\ \end{aligned} $$

So that, $$ \begin{aligned} &\cov(\Theta_t - \Theta_{t|t+k}, \Theta_{t+k} - \Theta_{t+k|t+k})\\ &=\cov( \Theta_t - \Theta_{t|t+k-1} - \cov(\Theta_t, {\cal E}_{t+k})\,\var({\cal E}_{t+k})^{-1}{\cal E}_{t+k},\, \Theta_{t+k} - \Theta_{t+k|t+k-1} - \cov(\Theta_{t+k}, {\cal E}_{t+k})\,\var({\cal E}_{t+k})^{-1}\,{\cal E}_{t+k})\\ &= \cov(\Theta_t - \Theta_{t|t+k-1},\,\Theta_{t+k} - \Theta_{t+k|t+k-1}) - \cov(\Theta_t, {\cal E}_{t+k})\,S_{t+k}^{-1}\,\cov({\cal E}_{t+k}, \Theta_{t+k}). \end{aligned} $$

Then, by the induction hypothesis, $$ \begin{aligned} &\cov(\Theta_t - \Theta_{t|t+k}, \Theta_{t+k} - \Theta_{t+k|t+k})\\ &= \Sigma_{t|t-1}\,\left(\prod_{\ell=0}^{k-1}\vM_{t+\ell}^\intercal\right) - \Sigma_{t|t-1}\,\left(\prod_{\ell=0}^{k-1}\vM_{t+\ell}^\intercal\right)\,\vH_{t+k}^\intercal\,S_{t+k}^{-1}\,\vH_{t+k}\,\Sigma_{t+k|t+k-1}\\ &= \Sigma_{t|t-1}\,\left(\prod_{\ell=0}^{k-1}\vM_{t+\ell}^\intercal\right) \Big(\vI - \vH_{t+k}^\intercal\,S_{t+k}^{-1}\,\vH_{t+k}\,\Sigma_{t+k|t+k-1}\Big) \end{aligned} $$

Finally, $$ \begin{aligned} &\cov(\Theta_t - \Theta_{t|t+k}, \Theta_{t+k+1} - \Theta_{t+k+1|t+k})\\ &= \cov(\Theta_t - \Theta_{t|t+k}, \Theta_{t+k} - \Theta_{t+k|t+k})\,\vF_{t+k}^\intercal\\ &= \Sigma_{t|t-1}\,\left(\prod_{\ell=0}^{k-1}\vM_{t+\ell}^\intercal\right) \Big(\vI - \vH_{t+k}^\intercal\,S_{t+k}^{-1}\,\vH_{t+k}\,\Sigma_{t+k|t+k-1}\Big)\,\vF_{t+k}^\intercal\\ &= \Sigma_{t|t-1}\,\left(\prod_{\ell=0}^{k-1}\vM_{t+\ell}^\intercal\right) \Big(\vF_{t+k}\,[\vI - \vH_{t+k}^\intercal\,S_{t+k}^{-1}\,\vH_{t+k}\,\Sigma_{t+k|t+k-1}]^\intercal\Big)^\intercal\\ &= \Sigma_{t|t-1}\,\left(\prod_{\ell=0}^{k-1}\vM_{t+\ell}^\intercal\right) \Big(\vF_{t+k}\,[\vI - \Sigma_{t+k|t+k-1}\,\vH_{t+k}^\intercal\,S_{t+k}^{-1}\,\vH_{t+k}]\Big)^\intercal\\ &= \Sigma_{t|t-1}\,\left(\prod_{\ell=0}^{k}\vM_{t+\ell}^\intercal\right), \end{aligned} $$ so that $\cov(\Theta_t, {\cal E}_{t+k+1}) = \Sigma_{t|t-1}\,\left(\prod_{\ell=0}^{k}\vM_{t+\ell}^\intercal\right)\vH_{t+k+1}^\intercal$, as desired.

$$ \ \tag*{$\blacksquare$} $$

Proof of Proposition 3.1

Here we show Proposition 3.1.

By definition, $$ \begin{aligned} \Sigma_{1|0} &= \var(\Theta_1)\\ &= \var(\vF_0\,\Theta_0 + U_0)\\ &= \vF_0\,\var(\Theta_0)\,\vF_0^\intercal + \var(U_0), \end{aligned} $$ where the last equality follows from assumption (A.8)—$\cov(\Theta_0, U_t) = 0$ for all $t$. $$ \ \tag*{$\blacksquare$} $$

Proof of Proposition 3.2

Here we show Proposition 3.2.

Recall $\Sigma_{t|t-1} = \var(\Theta_t - \Theta_{t|t-1})$.

By definition $({\rm SSM.1})$ and Proposition 2.2., the difference between the latent variable and the predicted latent variable takes the form $$ \begin{aligned} \Theta_t - \Theta_{t|t-1} &= \vF_{t-1}\,\Theta_{t-1} + U_{t-1} - \vF_{t-1}\,\Theta_{t-1|t-1}\\ &= \vF_{t-1}\,\left(\Theta_{t-1} - \Theta_{t-1|t-1}\right) + U_{t-1}. \end{aligned} $$

Next, following Property (P.3) and Assumption (A.2), the EVC takes the form $$ \begin{aligned} &\Sigma_{t|t-1}\\ &=\var(\Theta_t - \Theta_{t|t-1})\\ &= \var\Big(\vF_{t-1}\,\left[\Theta_{t-1} - \Theta_{t-1|t-1}\right] + U_{t-1}\Big)\\ &= \vF_{t-1}\,\var(\Theta_{t-1} - \Theta_{t-1|t-1})\,\vF_{t-1}^\intercal + \var(U_{t-1})\\ &= \vF_{t-1}\,\Sigma_{t-1|t-1}\,\vF_{t-1}^\intercal + Q_{t-1}. \end{aligned} $$ $$ \ \tag*{$\blacksquare$} $$

Proof of Proposition 3.3

Here we show Proposition 3.3.

Following $({\rm EVC.2})$ in Proposition 1, we have $$ \begin{aligned} \Sigma_{t|t} &= \var(\Theta_t) - \sum_{k=1}^t \vK_{t,k}\,\var({\cal E}_k)\,\vK_{t,k}^\intercal\\ \Sigma_{t|t-1} &= \var(\Theta_t) - \sum_{k=1}^{t-1} \vK_{t,k}\,\var({\cal E}_k)\,\vK_{t,k}^\intercal, \end{aligned} $$ with $\vK_{t,j} = \cov(\Theta_t, {\cal E}_j)\,\var({\cal E}_j)^{-1}$.

Next, taking the difference between the predicted EVC and the EVC at time $t$, we obtain $$ \begin{aligned} \Sigma_{t|t} - \Sigma_{t|t-1} &= -\vK_{t,t}\,\var({\cal E}_t)\,\vK_{t,t}^\intercal. \end{aligned} $$

To obtain the desired form, let $\vK_t = \vK_{t,t} = \cov(\Theta_t, {\cal E}_t)\var({\cal E}_t)^{-1}$. Following Theorem 2.6 with $j=t$, we obtain $ \cov(\Theta_t, {\cal E}_t) = \Sigma_{t|t-1}\,\vH_t^\intercal $. So that $$ \begin{aligned} \vK_t &= \cov(\Theta_t, {\cal E}_t)\,\var({\cal E}_t)^{-1}\\ &= \Sigma_{t|t-1}\,\vH_t^\intercal\,S_t^{-1}. \end{aligned} $$ Finally, we obtain $$ \begin{aligned} \vK_t &= \Sigma_{t|t-1}\,\vH_t^\intercal\,S_{t}^{-1},\\ \Sigma_{t|t} &= \Sigma_{t|t-1} - \vK_t\,S_t\,\vK_t^\intercal, \end{aligned} $$ and alternatively $$ \Sigma_{t|t} = \Sigma_{t|t-1} - \Sigma_{t|t-1}\,\vH_t^\intercal\,S_t^{-1}\,\vH_t\,\Sigma_{t|t-1}. $$ $$ \ \tag*{$\blacksquare$} $$


References and further reading

  1. Svetunkov, Ivan. Forecasting and Analytics with the Augmented Dynamic Adaptive Model (ADAM), Section 8.1. United States, CRC Press, 2023.
  2. Nau, Robert. “Statistical forecasting: notes on regression and time series analysis. 2019.” URL https://people.duke.edu/~rnau/411arim.htm. (2019).

  1. I like to think of SSMs as the equivalent of Plato’s cave for statistical models: our platonic objects are the latent variables, which we never get to see, and the noisy observations are the distorted shadows of those objects. ↩︎

  2. Battin, Richard H. “Space guidance evolution-a personal narrative.” Journal of Guidance, Control, and Dynamics 5.2 (1982): 97-110. ↩︎

  3. See https://en.wikipedia.org/wiki/Homoscedasticity_and_heteroscedasticity ↩︎

  4. Eubank, Randall L. A Kalman Filter Primer. Chapman and Hall/CRC, 2005. ↩︎

  5. See Zivot, Eric. “State Space Models and the Kalman Filter.” Apr 9 (2006): 1-8. ↩︎

  6. We defined each of these quantities as the best linear unbiased predictor (BLUP) for the signal at time $t$, given measurements up to time $j$. ↩︎

  7. See the inverse of a lower triangular matrix is lower triangular on math.stackexchange. ↩︎

  8. See property 4 in https://en.wikipedia.org/wiki/Covariance#Covariance_of_linear_combinations ↩︎