Filtering notes (II): state-space models
Part II of the filtering notes series:
- Part I: signal plus noise models
- Part II: state-space models*
Introduction
Whether modelling the trajectory of a rocket, forecasting economic trends, or powering large language models, discrete linear state-space models (SSMs) provide a unifying framework for estimation, prediction, and learning.
At their core, a linear SSM models observations as projections of an unknown, time-dependent latent process.1 A classic example is object tracking, where noisy measurements—like the angle between a rocket and two stars—are used to reconstruct the rocket’s trajectory.2 A simpler toy example of an SSM is presented in the figure below, which shows the trajectory of a 2D particle with noisy observations: the solid line represents the true (but unknown) path, while the dots denote the noisy measurements.
Thanks to their mathematical tractability and computational efficiency, linear SSMs extend far beyond tracking. They appear in time-series forecasting, online learning, and even in large language models.
In this post, we explore the mathematical foundations that make SSMs tractable and practical. We outline the assumptions needed for an SSM to be well-specified and highlight a key result: error uncertainty in an SSM is fully characterised by its components and does not require access to measurements.
The structure
This post is structured as follows:
- We define the linear state-space model (SSM), outline assumptions for a well-specified SSM, and illustrate this with a simple two-dimensional example.
- We derive the best linear unbiased predictor (BLUP) of the signal at time $t$, given measurements up to time $j$, covering filtering, prediction, and smoothing.
- We explore key SSM properties: we examine forecasting, derive innovations (residuals) and their variance, and present covariance formulas crucial for closed-form computation of the BLUP.
- We derive the error variance-covariance (EVC) matrix for one-step-ahead prediction and filtering, showing that uncertainty in an SSM can be computed without observed measurements, highlighting the known result that classical “optimal filters” are not adaptive.
- We review autoregressive (AR) and moving average (MA) models, showing their SSM form, and extend this to ARMA models.
- Finally, we conclude with a recap and discussion.
The state-space model
Consider a $d$-dimensional signal-plus-noise (SPN) measurement process, i.e., $Y_t = F_t + E_t$, where $Y_t$ is the measurement, $F_t$ is the signal, and $E_t$ is the noise.
In an SSM, we decompose the signal $F_t$ as the product of a $d\times m$ projection matrix $\vH_t$ and a latent variable $\Theta_t \in {\mathbb R}^m$. Using this decomposition, the SPN process is written as $$ \tag{M.1} \begin{aligned} F_t &= \vH_t\,\Theta_t,\\ Y_t &= F_t + E_t. \end{aligned} $$ Furthermore, the latent variable $\Theta_t$ is taken to be time-dependent, evolving according to the equation $$ \Theta_t = \vF_{t-1}\,\Theta_{t-1} + U_{t-1}, $$ where $\vF_t$ is an $m\times m$ transition matrix and $U_t$ is a zero-mean random variable with $\var(U_t) = Q_t$. Finally, the SPN process under the SSM assumption at time $t$ takes the form $$ \tag{SSM.1} \boxed{ \begin{aligned} \Theta_t &= \vF_{t-1}\,\Theta_{t-1} + U_{t-1},\\ Y_t &= \vH_t\,\Theta_t + E_t. \end{aligned} } $$
The SSM evolves over time as follows: $$ \begin{array}{c|ccc} \text{time} & \text{latent} & \text{signal} & \text{measurement}\\ \hline 0 & \Theta_0 & - & - \\ 1 & \Theta_1 = \vF_0\,\Theta_0 + U_0 & F_1 = \vH_1\,\Theta_1 & Y_1 = F_1 + E_1\\ 2 & \Theta_2 = \vF_1\,\Theta_1 + U_1 & F_2 = \vH_2\,\Theta_2 & Y_2 = F_2 + E_2\\ \vdots & \vdots & \vdots & \vdots \\ T & \Theta_T = \vF_{T-1}\,\Theta_{T-1} + U_{T-1} & F_T = \vH_T\,\Theta_T & Y_T = F_T + E_T\\ \end{array} $$
The model $(\text{SSM.1})$
- decouples the dimension of the latent state $\Theta_t \in \mathbb{R}^m$ from that of the observed process $Y_t \in \mathbb{R}^d$, and
- imposes a time-dependent structure over the latent state $\Theta_t$.
A 2d-tracking example
Consider the two-dimensional linear dynamical system $$ \begin{aligned} &\vF_t = \begin{bmatrix} 1 - dt + \frac{dt^2}{2} & -dt + dt^2 \\ 6\,dt - 6\,dt^2 & 1 - dt + \frac{dt^2}{2} \end{bmatrix}, &Q_t = 0.1^2\,\vI_2,\\ &\vH_t = \vI_2, &R_t = 0.05^2\,\vI_2, \end{aligned} $$ with $dt = 0.005$, $\Theta_0 \sim {\cal U}[0.5, 1.2]^2$, and $\vI_2$ an identity matrix of size $2$.
A sample of the latent process $\Theta_{1:T}$ and measurements $Y_{1:T}$ is shown in the figure above. There, the dots represent the sampled measurements $y_{1:T}$ and the solid line represents the (unknown) sampled latent process $\theta_{1:T}$.
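As a concrete reference, here is a minimal JAX sketch of how one might sample such a trajectory from the components above; the key, the number of steps, and the Gaussian noise are illustrative choices rather than part of the model definition (the SSM only specifies zero-mean noise with the given covariances):

```python
import jax
import jax.numpy as jnp

dt = 0.005
F = jnp.array([[1 - dt + dt**2 / 2, -dt + dt**2],
               [6 * dt - 6 * dt**2, 1 - dt + dt**2 / 2]])
Q = 0.1 ** 2 * jnp.eye(2)   # dynamics-noise covariance
H = jnp.eye(2)              # projection matrix
R = 0.05 ** 2 * jnp.eye(2)  # measurement-noise covariance

def step(theta_prev, key):
    """One step of (SSM.1): propagate the latent state, then measure it."""
    key_u, key_e = jax.random.split(key)
    # Gaussian noise is assumed here purely for illustration.
    theta = F @ theta_prev + jnp.linalg.cholesky(Q) @ jax.random.normal(key_u, (2,))
    y = H @ theta + jnp.linalg.cholesky(R) @ jax.random.normal(key_e, (2,))
    return theta, (theta, y)

key_init, key_scan = jax.random.split(jax.random.PRNGKey(314))
theta0 = jax.random.uniform(key_init, (2,), minval=0.5, maxval=1.2)
_, (thetas, ys) = jax.lax.scan(step, theta0, jax.random.split(key_scan, 1000))
```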
The best linear unbiased predictor: from signals to latent variables
Let $F_t \in \reals^d$ be the signal process and let $ Y_{1:j} \in \reals^{(j\,d)\times 1} $ be a block vector of measurements. Our goal is to estimate the block matrix $ \vA\in \reals^{d\times(j\,d)} $ that minimises the expected L2 distance between the signal $F_t$ and the linear combination of measurements $\vA\,Y_{1:j}$. Following F01, Proposition 1, this matrix is given by
$$ \vA^* = \argmin_{\vA} \mathbb{E}[\|F_t - \vA\,Y_{1:j}\|_2^2] = \cov(F_t, Y_{1:j})\,\var(Y_{1:j})^{-1}, $$ where $\cov(F_t, Y_{1:j}) \in \reals^{d \times (j\,d)}$ and $\var(Y_{1:j})^{-1} \in \reals^{(j\,d)\times (j\,d)}$.
Having found $\vA^*$, the estimate $$ F_{t|j} = \vA^*\,Y_{1:j} = \cov(F_t, Y_{1:j})\,\var(Y_{1:j})^{-1}\,Y_{1:j} $$ is called the best linear unbiased predictor (BLUP) of the signal $F_t$ given the measurements $Y_{1:j}$. The name of this estimate depends on the relationship between $t$ and $j$:
- filtering: when $t = j$,
- prediction: when $t > j$, and
- smoothing: when $t \leq j = T$.
Because $F_t = \vH_t\Theta_t$, it follows that $$ \begin{aligned} \vA^* &= \cov(F_t, Y_{1:j})\,\var(Y_{1:j})^{-1}\\ &= \cov(\vH_t\,\Theta_t, Y_{1:j})\,\var(Y_{1:j})^{-1}\\ &= \vH_t\,\cov(\Theta_t, Y_{1:j})\,\var(Y_{1:j})^{-1}. \end{aligned} $$ Conveniently, the term $\cov(\Theta_t, Y_{1:j})\,\var(Y_{1:j})^{-1}$ corresponds to the $m\times (j\,d)$ matrix $\vB^*$ such that $$ \vB^* = \argmin_{\vB} \mathbb{E}\left[\|\Theta_t - \vB\,Y_{1:j}\|_2^2\right] = \cov(\Theta_t, Y_{1:j})\,\var(Y_{1:j})^{-1}. $$ As a consequence, the BLUP of the latent variable $\Theta_t$ given measurements $Y_{1:j}$ is $$ \tag{BLUP.1} \boxed{ \Theta_{t|j} = \vB^*\,Y_{1:j} = \cov(\Theta_t, Y_{1:j})\,\var(Y_{1:j})^{-1}\,Y_{1:j}. } $$
This result shows that the BLUP for the latent state is linked to the BLUP for the signal. Therefore, we will focus on deriving BLUPs for the latent state, from which the BLUP for the signal can be easily obtained. This relationship is highlighted in the remark below.
In a linear SSM, the BLUP of the signal and the BLUP of the latent variable are equivalent up to a matrix that projects from latent space to signal space. That is $F_{t|j} = \vH_t\,\Theta_{t|j}.$
Building on the previous post, our next step—having derived $({\rm BLUP.1})$—is to express the BLUP in terms of innovations rather than measurements. As the following proposition shows, expressing the BLUP in terms of innovations results in a sum of $j$ terms, each requiring the inversion of a $(d \times d)$ matrix. This approach is notably more efficient than inverting a $(j\,d) \times (j\,d)$ matrix at every time frame $j$ of interest.
Proposition 1: BLUP of the latent state and innovations
The best linear unbiased predictor (BLUP) of the latent variable $\Theta_t$, given innovations ${\cal E}_{1:j}$, is $$ \tag{BLUP.2} \boxed{ \begin{aligned} \Theta_{t|j} &= \sum_{k=1}^j \vK_{t,k}\, {\cal E}_k,\\ \vK_{t,j} &= \cov(\Theta_t, {\cal E}_j)\,\var({\cal E}_j)^{-1},\\ {\cal E}_j &= Y_j - \sum_{\ell=1}^{j-1} \cov(Y_j, {\cal E}_\ell)\,\var({\cal E}_\ell)^{-1}\,{\cal E}_\ell. \end{aligned} } $$ Furthermore, the error variance-covariance (EVC) matrix between the BLUP and the latent variable takes the form $$ \tag{EVC.2} \Sigma_{t|j} := \var(\Theta_t - \Theta_{t|j}) = \var(\Theta_t) - \sum_{k=1}^j \vK_{t,k}\,\var({\cal E}_k)\,\vK_{t,k}^\intercal. $$ Here, $\cov(\Theta_t, {\cal E}_k) \in \reals^{m\times d}$, $\var({\cal E}_k)\in \reals^{d\times d}$, and $\Sigma_{t|j} \in \reals^{m\times m}$.
See the Appendix for a proof.
As we see, the main quantities to compute the BLUP are
- the innovations ${\cal E}_\ell$,
- the variance of the innovations $\var({\cal E}_k)$, and
- the covariance between the latent states and innovations $\cov(\Theta_t, {\cal E}_k)$.
In the rest of this post, we show that each of these terms can be computed in closed form.
Analytical properties of state-space models
In this section, we derive the terms needed to estimate the BLUP and the EVC. These building blocks will later allow us to derive filtering methods.
We begin by writing down the assumptions for an SSM.
Assumptions for an SSM
As we will see, the computation of the BLUP for an SSM is relatively straightforward thanks to its mathematical properties. However, for an SSM to be well-specified, we require the following list of assumptions about the data-generating process:
- (A.1) $\var(\Theta_0) = \Sigma_{0|0}$ — known and positive definite,
- (A.2) $\var(U_t) = Q_t$ — known and positive semi-definite,
- (A.3) $\var(E_t) = R_t$ — known and positive definite,
- (A.4) $\cov(E_t, E_s) = 0$ for $t\neq s$,
- (A.5) $\cov(U_t, U_s) = 0$ for $t \neq s$,
- (A.6) $\cov(E_t, U_s) = 0$ for all $t,s$,
- (A.7) $\cov(E_t, \Theta_0) = 0$,
- (A.8) $\cov(U_t, \Theta_0) = 0$,
- (A.9) $\vH_t$ — known $d\times m$ projection matrix, and
- (A.10) $\vF_t$ — known $m\times m$ transition matrix.
Conditions (A.2, A.3, A.6 — A.10) hold for $t \in {\cal T}$.
Some important assumptions to note are:
- (A.2) and (A.3) — the SSM is a heteroscedastic process with known time-varying variances.3
- (A.4), (A.5), and (A.6) — the observation noise and the dynamics noise are uncorrelated through time and with one another.
- (A.7) and (A.8) — the initial condition is uncorrelated with the noise terms.
As a consequence of assumptions (A.1) - (A.10), we arrive at the following basic properties of an SSM.
Proposition 2.1 — basic properties of SSMs
- (P.1) $\cov({\cal E}_t, {\cal E}_s) = 0$ for all $t \neq s$,
- (P.2) $\cov(E_t, \Theta_s) = 0$ for all $(t,s) \in {\cal T}^2$.
- (P.3) $\cov(\Theta_s, U_t) = \cov(Y_s, U_t) = \cov({\cal E}_s, U_t) = 0$ for $s \leq t$.
- (P.4) $\cov({\cal E}_s, E_t) = 0$ for $s < t$.
For a proof, see the Appendix.
Proposition 2.1 offers the following intuitive insights:
- (P.1) — innovations are uncorrelated.
- (P.2) — the latent variable and the observation noise are uncorrelated, meaning that errors in measurement space do not affect the path of the latent process,
- (P.3) — latent variables, measurements, and innovations do not affect current and future noise dynamics, and
- (P.4) — innovations do not affect future observation noise terms.
Proposition 2.1 provides the foundation for four key results, including the estimation and prediction of the BLUP and the EVC, the derivation of the form of innovations in an SSM, the analytical expression for the variance of the innovation, and the covariance between the latent state and the innovations.
We outline these results below.
Proposition 2.2 — forecasting the filtered BLUP
For $\ell \geq 1$, the BLUP of $\Theta_{t + \ell}$ given $Y_{1:t}$ is $$ \Theta_{t+\ell|t} = \vF_{t+\ell-1}\,\vF_{t+\ell-2}\,\ldots\,\vF_{t}\,\Theta_{t|t} $$
See the appendix for a proof.
Forecasting the BLUP in an SSM amounts to pre-multiplying the filtered BLUP by the product of future (and known) transition matrices.
Proposition 2.3 — error variance-covariance of the forecasted filtered BLUP
For $k \geq 1$, the EVC of $\Theta_{t + k}$ given the forecasted BLUP $\Theta_{t+k | t}$ is $$ \var\left(\Theta_{t+k} - \Theta_{t+k|t}\right) = \left[\prod_{\ell=1}^k \vF_{t+k-\ell}\right]\,\Sigma_{t|t}\,\left[\prod_{\ell=1}^k \vF_{t+k-\ell}\right]^\intercal + \sum_{\ell=1}^{k}\,\left[\prod_{j=1}^{\ell-1}\vF_{t+k-j}\right]\,Q_{t+k-\ell}\,\left[\prod_{j=1}^{\ell-1}\vF_{t+k-j}\right]^\intercal, $$ where we use the convention $\prod_{j=1}^0 \vA_{j} = \vI$, for some matrix $\vA_j$.
See the Appendix for a proof.
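To make the formula concrete, here is a small JAX-style sketch that computes the $k$-step-ahead EVC by iterating the one-step prediction $\Sigma \leftarrow \vF\,\Sigma\,\vF^\intercal + Q$, which unrolls to exactly the expression above; the function name and the list-based interface are illustrative choices:

```python
import jax.numpy as jnp

def forecast_evc(F_mats, Q_mats, Sigma_filt):
    """EVC of the k-step-ahead forecast, Var(Theta_{t+k} - Theta_{t+k|t}).

    F_mats, Q_mats hold F_t, ..., F_{t+k-1} and Q_t, ..., Q_{t+k-1};
    Sigma_filt is the filtered EVC Sigma_{t|t}."""
    Sigma = Sigma_filt
    for F, Q in zip(F_mats, Q_mats):
        # one-step prediction of the EVC
        Sigma = F @ Sigma @ F.T + Q
    return Sigma
```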
Proposition 2.4 — innovations under an SSM
The innovation process ${\cal E}_{1:T}$ in an SSM takes the form $$ \begin{aligned} {\cal E}_t &= Y_t - \hat{Y}_{t|t-1}\\ &= \vH_t\,(\Theta_t - \Theta_{t|t-1}) + E_t. \end{aligned} $$ with $\hat{Y}_{t|t-1} = \vH_t\,\Theta_{t|t-1}$.
See the Appendix for a proof.
In an SSM, the innovation is the one-step-ahead prediction error (also known as the residual).
Proposition 2.5 — variance of innovation
Let $\Sigma_{1|0} = \var(\Theta_1)$. Then, for $t \in {\cal T}$, $$ S_t = \var({\cal E}_t) = \vH_t\,\Sigma_{t|t-1}\,\vH_t^\intercal + R_t. $$
See the Appendix for a proof.
Theorem 2.6 — the fundamental covariance structure
Let $\Sigma_{1|0} = \var(\Theta_1)$, then $$ \cov(\Theta_t, {\cal E}_j) = \begin{cases} {\color{orange} \overleftarrow\vF_{t-1:j}}\,\Sigma_{j | j-1}\vH_j^\intercal & j \leq t-1,\\ \Sigma_{t|t-1}\,\vH_t^\intercal & j = t,\\ \Sigma_{t|t-1}\,{\color{#00B7EB} \overrightarrow\vM_{t:j-1}^\intercal}\,\vH_j^\intercal & j \geq t+1. \end{cases} $$ Here, $$ \begin{aligned} {\color{orange}\overleftarrow{\vF}_{t-1:j}} &= \prod_{\ell=1}^{t-j} \vF_{t-\ell}, & \text{ for } j \leq t-1,\\ {\color{#00B7EB} \overrightarrow{\vM}_{t:j-1}^\intercal} &= \prod_{\ell=0}^{j-t-1} \vM_{t+\ell}^\intercal, & \text{ for } j \geq t+1, \end{aligned} $$ with $$ \vM_t = \vF_t\,(\vI - \Sigma_{t|t-1}\,\vH_t^\intercal\,S_t^{-1}\,\vH_t). $$
See the Appendix for a proof.
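For reference, the matrix $\vM_t$ that drives the forward recursion is simple to compute; the sketch below (the function name is an illustrative choice) builds it from the quantities already defined:

```python
import jax.numpy as jnp

def M_matrix(F, H, Sigma_pred, S):
    """M_t = F_t (I - Sigma_{t|t-1} H_t^T S_t^{-1} H_t) from Theorem 2.6."""
    m = Sigma_pred.shape[0]
    return F @ (jnp.eye(m) - Sigma_pred @ H.T @ jnp.linalg.inv(S) @ H)
```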
Quantifying uncertainty: the error-variance covariance matrix
The measure of uncertainty in an SSM is represented through the error variance-covariance (EVC) matrix. Two of the most common measures of uncertainty are
- the predicted EVC, $\Sigma_{t|t-1} = \var(\Theta_t - \Theta_{t|t-1})$: the error variance between $\Theta_t$ and the BLUP $\Theta_{t|t-1}$, and
- the updated EVC, $\Sigma_{t|t} = \var(\Theta_t - \Theta_{t|t})$: the error variance between $\Theta_t$ and the BLUP $\Theta_{t|t}$.
The first option quantifies the uncertainty about the value of the next state, while the second option quantifies the uncertainty about the value of the current state. As we will see in a later post, these two quantities are important to derive the so-called Kalman filter.
The following three propositions provide an analytic form for the predicted and updated EVC.
Proposition 3.1 — initial EVC
Let $\Sigma_{1|0} = \var(\Theta_1)$. Then, $$ \Sigma_{1|0} = \vF_0\,\Sigma_{0|0}\,\vF_0^\intercal + Q_0. $$
See the Appendix for a proof.
Proposition 3.2 — predicted EVC
Let $\Sigma_{1|0} = \var(\Theta_1)$. Then, for $t \in {\cal T}$, $$ \Sigma_{t|t-1} = \vF_{t-1}\,\Sigma_{t-1|t-1}\,\vF_{t-1}^\intercal + Q_{t-1}. $$ See the Appendix for a proof.
Proposition 3.3 — updated EVC
Let $\Sigma_{1|0} = \var(\Theta_1)$. Then, for $t \in {\cal T}$, $$ \Sigma_{t|t} = \Sigma_{t|t-1} - \vK_t\,S_t\,\vK_t^\intercal, $$ with $\vK_t = \Sigma_{t|t-1}\,\vH_t^\intercal\,S_t^{-1}$.
Alternatively, $$ \Sigma_{t|t} = \Sigma_{t|t-1} - \Sigma_{t|t-1}\,\vH_t^\intercal\, S_t^{-1}\,\vH_t\,\Sigma_{t|t-1} $$
See the Appendix for a proof.
From a Bayesian filtering perspective, the error variance-covariance matrix $\Sigma_{t|t}$ is known as the posterior covariance, which, congruent with the intuition developed in this section, represents the uncertainty about the value of the latent variable $\Theta_t$—it measures how far off we can expect our estimate of the latent state to be.
Pseudocode for the error covariance matrix
Following Propositions 3.1, 3.2, and 3.3, the predicted and filtered EVC can be computed sequentially. We show this in the pseudocode below $$ \boxed{ \begin{array}{l|l} \Sigma_{1|0} = \vF_0\,\Sigma_{0|0}\,\vF_0^\intercal + Q_0 & \var(\Theta_1)\\ S_1 = \vH_1\,\Sigma_{1|0}\,\vH_1^\intercal + R_1 & \var({\cal E}_1)\\ \Sigma_{1|1} = \Sigma_{1|0} - \Sigma_{1|0}\,\vH_1^\intercal\,S_1^{-1}\,\vH_1\, \Sigma_{1|0} & \var(\Theta_1 - \Theta_{1|1})\\ \text{for } t=2, \ldots, T:\\ \quad \Sigma_{t|t-1} = \vF_{t-1}\,\Sigma_{t-1|t-1}\,\vF_{t-1}^\intercal + Q_{t-1} & \var(\Theta_t - \Theta_{t|t-1})\\ \quad S_t = \vH_t\,\Sigma_{t|t-1}\,\vH_t^\intercal + R_t & \var({\cal E}_t)\\ \quad \Sigma_{t|t} = \Sigma_{t|t-1} - \Sigma_{t|t-1}\,\vH_t^\intercal\,S_t^{-1}\,\vH_t\,\Sigma_{t|t-1} & \var(\Theta_t - \Theta_{t|t}) \end{array} } $$
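As a rough JAX sketch of the pseudocode above (assuming, for brevity, time-invariant $\vF$, $\vH$, $Q$, and $R$; the time-varying case simply indexes these matrices inside the scan), note that no measurements enter the computation:

```python
import jax
import jax.numpy as jnp

def evc_recursion(F, H, Q, R, Sigma0, n_steps):
    """Run the predicted / updated EVC recursion; no data is required."""
    def step(Sigma_filt_prev, _):
        Sigma_pred = F @ Sigma_filt_prev @ F.T + Q   # Var(Theta_t - Theta_{t|t-1})
        S = H @ Sigma_pred @ H.T + R                 # Var(innovation_t)
        K = Sigma_pred @ H.T @ jnp.linalg.inv(S)     # gain used in the update
        Sigma_filt = Sigma_pred - K @ S @ K.T        # Var(Theta_t - Theta_{t|t})
        return Sigma_filt, (Sigma_pred, Sigma_filt, S)

    _, (Sigma_preds, Sigma_filts, Ss) = jax.lax.scan(step, Sigma0, None, length=n_steps)
    return Sigma_preds, Sigma_filts, Ss
```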
Uncertainty estimation in a linear SSM does not depend on the observed data and is fully characterised by the known building blocks of the SSM, namely $\vF_t$, $\vH_t$, $Q_t$, and $R_t$. The uncertainty is considered misspecified if the EVC is estimated using components that differ from those of the true model, for example, if the EVC is computed using $\bar\vF_t \neq \vF_t$.
The autoregressive moving average model (ARMA) as a state-space model
We conclude this post by presenting a classical example of time-series forecasting: the unidimensional ARMA model in state-space form.4 As we will see, these models can be written in the form $$ \begin{aligned} \tag{ARMA.0} \Theta_t &= \vF\,\Theta_{t-1} + \vT\,E_{t-1},\\ Y_t &= \vH_t\,\Theta_t + E_t, \end{aligned} $$ with $U_{t-1} = \vT\,E_{t-1}$ the dynamics noise of the model.
For a full code example, see this notebook.
Sampling an SSM
To sample a step of the SSM above, we require the components $\vF$, $\vH$, $\vT$, and $R$. This can be written as a JAX step function, as sketched in the code below.
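The following is a minimal sketch (not the notebook's exact code); it assumes a scalar measurement with Gaussian noise for illustration, with $\vH$ a $1\times m$ row and $\vT$ an $m\times 1$ column, and the function names are illustrative:

```python
import jax
import jax.numpy as jnp
from functools import partial

def step(carry, key, F, H, T, R):
    """One step of (ARMA.0): theta_t = F theta_{t-1} + T e_{t-1}, y_t = H theta_t + e_t."""
    theta_prev, e_prev = carry
    e_t = jnp.sqrt(R) * jax.random.normal(key)     # scalar measurement noise (Gaussian, for illustration)
    theta_t = F @ theta_prev + T.ravel() * e_prev  # latent transition driven by e_{t-1}
    y_t = (H @ theta_t)[0] + e_t                   # scalar measurement
    return (theta_t, e_t), (theta_t, y_t)

def sample_ssm(key, F, H, T, R, theta0, n_steps):
    keys = jax.random.split(key, n_steps)
    carry0 = (theta0, jnp.zeros(()))               # no noise before the first step
    _, (thetas, ys) = jax.lax.scan(partial(step, F=F, H=H, T=T, R=R), carry0, keys)
    return thetas, ys
```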
Definition: a moving average (MA) process
A time series $Y_{1:T}$ is said to be an $m$-th order moving average process — MA($m$) — if there are known $a_1, \ldots, a_m$ coefficients such that $$ \tag{MA.1} \boxed{ Y_t = \sum_{k=1}^{m}a_k\,E_{t - k} + E_t, } $$ for uncorrelated zero-mean random variables $E_{1}, \ldots, E_T$ with common covariance $R = \var(E_t)$ for $t \in {\cal T}$, and initialising terms $E_{-1} = \ldots = E_{-m} = 0$.
The MA process as a state-space model
To write $({\rm MA.1})$ as an SSM, define the $m$-dimensional random vector $$ \Theta_{t} = \begin{bmatrix} E_{t-1}\\ \vdots\\ E_{t-m} \end{bmatrix}, $$
Next, define the $(m\times m)$ transition matrix $$ \vF_{\rm MA} = \begin{bmatrix} 0 & 0 & \cdots & 0 & 0\\ 1 & 0 & \cdots & 0 & 0\\ 0 & 1 & \cdots & 0 & 0\\ \vdots & \vdots & \ddots & \vdots & \vdots\\ 0 & 0 & \cdots & 1 & 0 \end{bmatrix}. $$ Let $$ U_{t-1} = {\cal T}_m\,E_{t-1}, $$ be the dynamics of the model with $$ {\cal T}_m = \begin{bmatrix} 1 \\ 0 \\ \vdots \\ 0 \end{bmatrix}. $$
Next, the projection matrix takes the form $$ \vH_{\rm MA} = \begin{bmatrix} a_1& a_2& \ldots& a_m \end{bmatrix}. $$
Finally, the SSM form for an MA($m$) process takes the form $$ \tag{MA.2} \boxed{ \begin{aligned} \Theta_t &= \vF_{\rm MA}\,\Theta_{t-1} + {\cal T}_m\,E_{t-1},\\ Y_t &= \vH_{\rm MA}\,\Theta_t + E_t. \end{aligned} } $$ The term $\vF_{\rm MA}\,\Theta_{t-1}$ maintains the most recent $m-1$ noise terms and $U_{t-1} = {\cal T}_m\,E_{t-1}$ includes the last error term.
Because the SSM components are fixed and specified at the beginning of the experiment, we can easily write a function that builds each of these components from the vector of coefficients $\vH_{\rm MA}$, as sketched below:
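A possible sketch of such a helper (the function name is an illustrative choice) is:

```python
import jax.numpy as jnp

def ma_components(coefs):
    """Build (F_MA, H_MA, T_m) of (MA.2) from the coefficients [a_1, ..., a_m]."""
    m = len(coefs)
    F = jnp.eye(m, k=-1)                     # shift matrix: keeps the last m-1 noise terms
    T = jnp.zeros((m, 1)).at[0, 0].set(1.0)  # injects the newest noise term E_{t-1}
    H = jnp.asarray(coefs)[None, :]          # projection row vector [a_1, ..., a_m]
    return F, H, T
```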
Example — MA($m$)
The following plot shows a sample of an MA($m$) process where $a_1 = a_2 = \ldots = a_m = 1$ and we vary $m$.
As we see, higher values of $m$ result in a smoother process, while lower values of $m$ result in a rougher process.
Next, we show the autocorrelation of the sample above.
Here, the autocorrelation is defined by
$$
\tag{ACorr.1}
\rho_{k}
= \frac{\cov(Y_t, Y_{t+k})}{\sqrt{\var(Y_t)}\,\sqrt{\var(Y_{t+k})}}.
$$
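For completeness, a simple empirical estimate of $({\rm ACorr.1})$ from a single sample (assuming stationarity; the function name and the biased normalisation are illustrative choices) might look like:

```python
import jax.numpy as jnp

def sample_autocorrelation(y, max_lag):
    """Empirical lag-k correlation between y_t and y_{t+k}, for k = 0, ..., max_lag."""
    y = jnp.asarray(y) - jnp.mean(y)
    denom = jnp.sum(y ** 2)
    n = y.shape[0]
    return jnp.array([jnp.sum(y[: n - k] * y[k:]) / denom for k in range(max_lag + 1)])
```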
Here, we observe that larger values of $m$ result in higher autocorrelation, which corresponds to a longer dependence between current and future values.
Definition: An autoregressive (AR) process
A time series $Y_{1:T}$ is said to be an $r$-th order autoregressive process—AR($r$)—if there are known $b_1, \ldots, b_r$ coefficients such that
$$ \tag{AR.1} \boxed{ Y_t = \sum_{k=1}^r b_k\,Y_{t - k} + E_t, } $$ with $\var(E_{1:T}) = R\,\vI_T$ and initial terms $Y_{-1} = \ldots = Y_{-r} = 0$.
The AR process as a state-space model
To write $({\rm AR.1})$ as an SSM, define the $r$-dimensional state vector $$ \Theta_t = \begin{bmatrix} Y_{t-1}\\ \vdots \\ Y_{t-r} \end{bmatrix}, $$ for $t \geq 0$. Observe that $\Theta_0 = \vzero$.
Next, define the $r\times r$ transition matrix that maintains the previous observations $y_{t-2}, \ldots, y_{t-r}$ and builds the noiseless observation $y_{t-1}$: $$ \vF_{\rm AR} = \begin{bmatrix} b_1 & b_2 & \cdots & b_{r-1} & b_r\\ 1 & 0 & \cdots & 0 & 0 \\ 0 & 1 & \cdots & 0 & 0 \\ \vdots & \vdots & \ddots & \vdots & \vdots\\ 0 & 0 & \cdots & 1 & 0 \\ \end{bmatrix}, $$ and $$ U_{t-1} = {\cal T}_r\,E_{t-1}. $$ Next, the projection matrix takes the form $$ \vH_{\rm AR} = \begin{bmatrix} b_1, \ldots, b_r \end{bmatrix}. $$
We write the AR($r$) process as an SSM as $$ \tag{AR.2} \boxed{ \begin{aligned} \Theta_t &= \vF_{\rm AR}\,\Theta_{t-1} + {\cal T}_r\,E_{t-1},\\ Y_t &= \vH_{\rm AR}\,\Theta_t + E_t. \end{aligned} } $$
The term $\vF_{\rm AR}\,\Theta_{t-1}$ maintains the most recent $(r-1)$ measurements and builds the noiseless observation at time $(t-1)$. Next, $U_{t-1} = {\cal T}_r\,E_{t-1}$ includes the noise term at time $(t-1)$.
Similar to the MA case, we can define a function that builds the components of the AR process, provided we are given $\vH_{\rm AR}$:
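A matching sketch (again, the function name is an illustrative choice) is:

```python
import jax.numpy as jnp

def ar_components(coefs):
    """Build (F_AR, H_AR, T_r) of (AR.2) from the coefficients [b_1, ..., b_r]."""
    coefs = jnp.asarray(coefs)
    r = coefs.shape[0]
    F = jnp.eye(r, k=-1).at[0, :].set(coefs)  # first row holds the AR coefficients
    T = jnp.zeros((r, 1)).at[0, 0].set(1.0)   # injects the noise term E_{t-1}
    H = coefs[None, :]                        # projection row vector [b_1, ..., b_r]
    return F, H, T
```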
Example — AR(1)
The following plot shows a sample of an AR($1$) process where $b_1 \in [0,1]$.
As we see,
- values of $b_1$ close to $0$ result in a mean-reverting process, while
- values of $b_1$ close to $1$ result in a trend-following (or momentum) process.
Next, we plot the autocorrelation of the AR($1$) sample above.
As expected,
- negative values of $b_1$ result in a mean-reverting process, whose autocorrelation oscillates, while
- positive values of $b_1$ result in a trend-following process, whose autocorrelation is positive.
Definition: An autoregressive moving average (ARMA) model.
A time series $Y_{1:T}$ is said to be an $r$-th order autoregressive $m$-th order moving average process — ARMA($r$, $m$) — if there are known $(b_1, \ldots, b_r)$ and $(a_1, \ldots, a_m)$ coefficients such that $$ \tag{ARMA.1} \boxed{ Y_t = \sum_{k=1}^{r}b_k\,Y_{t - k} + \sum_{j=1}^m\,a_j\,E_{t-j} + E_t } $$ with $\var(E_{1:T}) = R\,\vI_T$ and initial terms $Y_{-1} = \ldots = Y_{-r} = E_{-1} = \ldots = E_{-m} = 0$.
Following $({\rm AR.2})$ and $({\rm MA.2})$, we rewrite the ARMA($r$, $m$) process as the SSM $$ \tag{ARMA.2} \boxed{ \begin{aligned} \Theta_t &= \vF_{\rm ARMA}\,\Theta_{t-1} + {\cal T}_{r,m}\,E_{t-1},\\ Y_t &= \vH_{\rm ARMA}\,\Theta_t + E_t, \end{aligned} } $$
with components $$ \begin{aligned} \vF_{\rm ARMA} &= \begin{bmatrix} \vF_{\rm AR} & \vzero_{r,m} \\ \vzero_{m,r} & \vF_{\rm MA} \end{bmatrix},\\ \vH_{\rm ARMA} &= \begin{bmatrix} \vH_{\rm AR} & \vH_{\rm MA} \end{bmatrix},\\ {\cal T}_{r,m} &= \begin{bmatrix} {\cal T}_r \\ {\cal T}_m \end{bmatrix}. \end{aligned} $$
Programmatically, to define an ARMA process, we simply stack the building blocks defined above:
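A sketch of this stacking, reusing the hypothetical `ar_components` and `ma_components` helpers from above, is:

```python
import jax.numpy as jnp
from jax.scipy.linalg import block_diag

def arma_components(ar_coefs, ma_coefs):
    """Stack the AR and MA building blocks into the ARMA(r, m) components of (ARMA.2)."""
    F_ar, H_ar, T_r = ar_components(ar_coefs)
    F_ma, H_ma, T_m = ma_components(ma_coefs)
    F = block_diag(F_ar, F_ma)                 # block-diagonal transition matrix
    H = jnp.concatenate([H_ar, H_ma], axis=1)  # [H_AR  H_MA]
    T = jnp.concatenate([T_r, T_m], axis=0)    # [T_r; T_m]
    return F, H, T
```

For instance, `F, H, T = arma_components([0.6], [1.0, 1.0])` would build an ARMA(1, 2) model that can then be passed to a sampling step like the one sketched earlier.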
A note on misspecification
The form of the dynamics noise $U_t$ in $({\rm ARMA.0})$ is such that $$ \cov(U_{t}, E_s) = \cov(\vT\,E_{t}, E_{s}) = \vT\,\var(E_{t})\,\mathbb{1}(t = s), $$ which directly conflicts with assumption (A.6).
An alternative approach reformulates the SSM so that the only source of noise is the dynamics noise $U_t$, so that $R_t = 0$.5 This version, however, contradicts assumption (A.3), which stipulates that $R_t$ must be positive definite. Despite this contradiction, it remains possible to work with this alternative formulation, provided that $\vH_t\,\Sigma_{t|t-1}\,\vH_t^\intercal$ remains positive definite, which is necessary because we need to invert this quantity.
Closing thoughts: a computational argument for state-space models
In the previous post, we introduced a data-driven approach to filtering, prediction, and smoothing6. By decorrelating measurements in a pre-processing step, we obtained a linear (in-time) algorithm for estimating these quantities.
However, this approach relies on three limiting assumptions:
- the experiment’s length must be fixed and known,
- we need multiple simulations of the true signal, and
- the signal’s dimension must match that of the measurements.
These constraints are often impractical. In real-world scenarios, experiment durations are unknown (e.g., robot navigation or stock trading); we rarely have multiple observations of an underlying signal; and incorporating external data (e.g., news sentiment in stock trading) requires flexibility in dimensions.
SSMs overcome these computational challenges by:
- running indefinitely,
- requiring only one sample run for estimation, and
- decoupling the signal and measurement dimensions via a latent process.
In the next post, we will use the properties of SSMs derived here to develop the classical Kalman filtering algorithm.
Despite their computational advantages, SSMs have limitations. Notably, as we have seen, obtaining a well-specified estimate of uncertainty requires full knowledge of the system dynamics—a constraint that may not always hold in practical applications.
Appendix
Some helpful quantities
The latent variable at time $t$ can be decomposed into the initial state $\Theta_0$ and the noise terms $(U_{0}, \ldots, U_{t-1})$ as $$ \tag{ssm.aux.1} \begin{aligned} \Theta_t &= \vF_{t-1}\,\Theta_{t-1} + U_{t-1}\\ &= \left(\prod_{k=1}^{t} \vF_{t-k}\right)\Theta_0 + \sum_{j=1}^{t}\left(\prod_{k=1}^{j-1} \vF_{t-k}\right)U_{t-j}, \end{aligned} $$ with the convention $\prod_{k=a}^{b} \vF_{k} = \vI$ for $b < a$. As a consequence, the observation at time $t$ takes the form $$ \begin{aligned} \tag{ssm.aux.2} Y_t &= \vH_t\,\Theta_t + E_t\\ &= \vH_t\,\left(\prod_{k=1}^{t} \vF_{t-k}\right)\Theta_0 + \vH_t\,\sum_{j=1}^{t}\left(\prod_{k=1}^{j-1} \vF_{t-k}\right)U_{t-j} + E_t. \end{aligned} $$
Recall from (I.3) shown in filtering-notes-i that $Y_{1:T} = \vL\,{\cal E}_{1:T} \implies {\cal E}_t = \sum_{i=1}^t\vL_{t,i}^{-1}\,Y_{i}$, which follows from $\vL$ being a lower-triangular7 matrix. Furthermore, the Cholesky decomposition for the measurements takes the form $\var(Y_{1:T}) = \vL\,S\,\vL^\intercal$, with $S$ a diagonal matrix.
Proof of Proposition 1
Here we show Proposition 1
$$ \begin{aligned} \Theta_{t|j} &=\cov(\Theta_t, Y_{1:j})\var(Y_{1:j})^{-1}\,Y_{1:j}\\ &=\cov(\Theta_t, \vL{\cal E}_{1:j})\var(\vL{\cal E}_{1:j})^{-1}\,\vL{\cal E}_{1:j}\\ &= \cov(\Theta_t, {\cal E}_{1:j})\,\vL^\intercal\left[ \vL\var({\cal E}_{1:j}) \vL^\intercal\right]^{-1}\vL{\cal E}_{1:j}\\ &= \cov(\Theta_t, {\cal E}_{1:j})\,\vL^\intercal\vL^{-\intercal}\var({\cal E}_{1:j})^{-1}\vL^{-1}\vL{\cal E}_{1:j}\\ &= \cov(\Theta_t, {\cal E}_{1:j})\,{\rm diag}(S_1, \ldots, S_j)^{-1}\,{\cal E}_{1:j}\\ &= \begin{bmatrix} \cov(\Theta_t, {\cal E}_1) & \ldots & \cov(\Theta_t, {\cal E}_j) \end{bmatrix} \begin{bmatrix} S_1 & 0 & \ldots & 0\\ 0 & S_2 & \ldots & 0\\ \vdots & \vdots & \ddots & \vdots\\ 0 & 0 & \ldots & S_j\\ \end{bmatrix}^{-1} \begin{bmatrix} {\cal E}_1 \\ \vdots \\ {\cal E}_j \end{bmatrix}\\ &= \begin{bmatrix} \cov(\Theta_t, {\cal E}_1) & \ldots & \cov(\Theta_t, {\cal E}_j) \end{bmatrix} \begin{bmatrix} S_1^{-1}\,{\cal E}_1 \\ \vdots \\ S_j^{-1}{\cal E}_j \end{bmatrix}\\ &=\sum_{k=1}^j \cov(\Theta_t, {\cal E}_k)\,S_k^{-1}\,{\cal E}_k\\ &=\sum_{k=1}^j \vK_{t,k}\,{\cal E}_k, \end{aligned} $$ with $\vK_{t,k} = \cov(\Theta_t, {\cal E}_k)\,S_k^{-1}$, $S_k = \var({\cal E}_k)$, and ${\cal E}_j$ taking the form $({\rm I.1})$ in F.01, innovations.
Next, to derive the error variance-covariance (EVC) matrix, note that $$ \begin{aligned} \Sigma_{t|j} &= \var(\Theta_t - \Theta_{t|j})\\ &= \var\left(\Theta_t - \sum_{k=1}^j \vK_{t,k}\,{\cal E}_k\right)\\ &= \var(\Theta_t) + \var\left(\sum_{k=1}^j \vK_{t,k}\,{\cal E}_k\right) - \cov\left(\Theta_t,\,\sum_{k=1}^j \vK_{t,k}\,{\cal E}_k\right) - \cov\left(\sum_{k=1}^j \vK_{t,k}\,{\cal E}_k,\,\Theta_t\right). \end{aligned} $$
Here, $$ \begin{aligned} \var\left(\sum_{k=1}^j \vK_{t,k}\,{\cal E}_k\right) &= \cov\left(\sum_{k=1}^j \vK_{t,k}\,{\cal E}_k,\,\sum_{\ell=1}^j \vK_{t,\ell}\,{\cal E}_\ell\right)\\ &= \sum_{k=1}^j\sum_{\ell=1}^j \vK_{t,k}\,\cov({\cal E}_k,\, {\cal E}_{\ell})\,\vK_{t,\ell}^\intercal\\ &= \sum_{k=1}^j \vK_{t,k}\cov({\cal E}_{k},\,{\cal E}_{k})\,\vK_{t,k}^\intercal\\ &= \sum_{k=1}^j \vK_{t,k}\var({\cal E}_{k})\,\vK_{t,k}^\intercal, \end{aligned} $$ $$ \begin{aligned} \cov\left(\sum_{k=1}^j \vK_{t,k}\,{\cal E}_k,\,\Theta_t\right) &= \sum_{k=1}^j \vK_{t,k} \cov({\cal E}_k,\,\Theta_t)\\ &= \sum_{k=1}^j \vK_{t,k}\var({\cal E}_k)\var({\cal E}_k)^{-1}\cov({\cal E}_k,\,\Theta_t)\\ &= \sum_{k=1}^j \vK_{t,k}\var({\cal E}_k)\big(\cov(\Theta_t,\,{\cal E}_k)\var({\cal E}_k)^{-1}\big)^\intercal\\ &= \sum_{k=1}^j \vK_{t,k}\var({\cal E}_k)\,\vK_{t,k}^\intercal, \end{aligned} $$ and similarly $$ \cov\left(\Theta_t,\,\sum_{k=1}^j \vK_{t,k}\,{\cal E}_k\right) = \sum_{k=1}^j \vK_{t,k}\var({\cal E}_k)\,\vK_{t,k}^\intercal. $$
So that $$ \begin{aligned} \Sigma_{t|j} &= \var(\Theta_t) + \sum_{k=1}^j \vK_{t,k}\var({\cal E_k})\,\vK_{t,k}^\intercal - \sum_{k=1}^j \vK_{t,k}\var({\cal E_k})\,\vK_{t,k}^\intercal - \sum_{k=1}^j \vK_{t,k}\var({\cal E_k})\,\vK_{t,k}^\intercal\\ &= \var(\Theta_t) - \sum_{k=1}^j \vK_{t,k}\var({\cal E_k})\,\vK_{t,k}^\intercal. \end{aligned} $$
$$ \ \tag*{$\blacksquare$} $$
Proof of Proposition 2.1
Here we show the properties for an SSM outlined in Proposition 2.1.
Proof of (P.1)
Because an SSM is a signal plus noise model, it follows from F.01, Proposition 2 that innovations are uncorrelated.
Proof of (P.2)
The proof follows from $({\rm ssm.aux.1})$, and assumptions (A.6) and (A.7).
Let $(s,t) \in {\cal T}^2$, then $$ \begin{aligned} \cov\left(\Theta_s, E_t\right) &= \cov\left( \left(\prod_{k=1}^{s} \vF_{s-k}\right)\Theta_0 + \sum_{j=1}^{s}\left(\prod_{k=1}^{j-1} \vF_{s-k}\right)U_{s-j}, E_t \right)\\ &= \left(\prod_{k=1}^{s}\vF_{s-k}\right) \underbrace{\cov(\Theta_0, E_t)}_{{\rm (A.7)} = 0} + \sum_{j=1}^s\left(\prod_{k=1}^{j-1}\vF_{s-k}\right) \underbrace{\cov(U_{s-j}, E_t)}_{{\rm (A.6)} = 0}\\ &= 0. \end{aligned} $$ $$ \ \tag*{$\blacksquare$} $$
Proof of (P.3)
For all results below, we take $s \leq t$.
First, we show $\cov(\Theta_s, U_t) = 0$, i.e., the latent variable $\Theta_s$ is uncorrelated to future (and current) latent movements.
Following $({\rm ssm.aux.1})$, and assumptions (A.5) and (A.8), we obtain $$ \begin{aligned} \tag{P.3.1} \cov(\Theta_s, U_t) &= \cov\left( \left(\prod_{k=1}^{s} \vF_{s-k}\right)\Theta_0 + \sum_{j=1}^{s}\left(\prod_{k=1}^{j-1} \vF_{s-k}\right)U_{s-j}, U_t \right)\\ &= \left(\prod_{k=1}^{s}\vF_{s-k}\right) \underbrace{\cov(\Theta_0, U_{t})}_{{\rm (A.8)} = 0} + \sum_{j=1}^s\left(\prod_{k=1}^{j-1}\vF_{s-k}\right) \underbrace{\cov(U_{s-j}, U_t)}_{{\rm (A.5)} = 0}\\ &= 0 \end{aligned} $$
Second, we show $\cov(Y_s, U_t) = 0$, i.e., the measurement $Y_s$ is uncorrelated to future (and current) latent movements.
The result follows from $({\rm P.3.1})$ and assumption (A.6) $$ \begin{aligned} \tag{P.3.2} \cov(Y_s, U_t) &= \cov(\vH_s\,\Theta_s + E_s, U_t)\\ &= \vH_s\,\cov(\Theta_s, U_t) + \cov(E_s, U_t)\\ &= 0. \end{aligned} $$
Third, we show $\cov({\cal E}_s, U_t) = 0$, i.e., the innovation ${\cal E}_s$ is uncorrelated to future (and current) latent movements $U_t$.
Suppose $s=k$ and $t \geq k$. Then, $$ \begin{aligned} \cov({\cal E}_k, U_t) &= \cov\left(\sum_{i=1}^T\,\vL_{k,i}^{-1}\,Y_i, U_t\right)\\ &= \sum_{i=1}^T\vL_{k,i}^{-1}\,\cov(Y_i, U_t)\\ &= \sum_{i=1}^k\vL_{k,i}^{-1}\,\cov(Y_i, U_t)\\ &= 0. \end{aligned} $$ Here, the third equality follows from $\vL$ being lower-triangular (and hence $\vL^{-1}$ being lower-triangular), and the last equality follows from $({\rm P.3.2})$. $$ \ \tag*{$\blacksquare$} $$
Proof of (P.4)
The property (P.4) follows from property (P.2), assumption (A.4), and the decomposition of innovations shown in F.01, (I.3): $$ \begin{aligned} \cov({\cal E}_t, E_s) &= \cov\left(\sum_{i=1}^t\vL_{t,i}^{-1}\,Y_{i}, E_s\right)\\ &= \sum_{i=1}^t\vL_{t,i}^{-1}\,\cov\left(Y_{i}, E_s\right)\\ &= \sum_{i=1}^t\vL_{t,i}^{-1}\,\cov\left(\vH_i\,\Theta_i + E_i, E_s\right)\\ &= \sum_{i=1}^t\vL_{t,i}^{-1}\,\big(\vH_i\,\cov\left(\Theta_i, E_s\right) + \cov\left(E_i, E_s\right)\big). \end{aligned} $$ Finally, $\cov\left(\Theta_i, E_s\right) = 0$ by property (P.2) and $ \cov\left(E_i, E_s\right) = 0$ by assumption (A.4). $$ \ \tag*{$\blacksquare$} $$
Proof of Proposition 2.2
Here we show Proposition 2.2.
Following $({\rm BLUP.1})$, the BLUP of $\Theta_{t+\ell}$ given $Y_{1:t}$ takes the form $$ \Theta_{t+\ell|t} = \cov\left(\Theta_{t+\ell}, Y_{1:t}\right)\,\var\left(Y_{1:t}\right)^{-1}\,Y_{1:t}. $$ Next, by definition, $$ \Theta_{t+\ell} = \left(\prod_{k=1}^\ell \vF_{t+\ell-k}\right)\Theta_t + \sum_{j=1}^\ell\left(\prod_{k=1}^{j-1}\vF_{t+\ell-k}\right)\,U_{t+\ell-j}. $$ As a consequence, $$ \begin{aligned} &\cov\left(\Theta_{t+\ell}, Y_{1:t}\right)\\ &=\cov \left( \left(\prod_{k=1}^\ell \vF_{t+\ell-k}\right)\Theta_t + \sum_{j=1}^\ell\left(\prod_{k=1}^{j-1}\vF_{t+\ell-k}\right)\,U_{t+\ell-j},\, Y_{1:t} \right)\\ &=\left(\prod_{k=1}^\ell \vF_{t+\ell-k}\right)\cov\left(\Theta_t, Y_{1:t}\right) + \sum_{j=1}^\ell\left(\prod_{k=1}^{j-1}\vF_{t+\ell-k}\right)\cov\left(U_{t+\ell-j}, Y_{1:t}\right)\\ &=\left(\prod_{k=1}^\ell \vF_{t+\ell-k}\right)\cov\left(\Theta_t, Y_{1:t}\right), \end{aligned} $$ where the last line follows from $({\rm P.3.2})$, i.e., $\cov(Y_s, U_t) = 0$ for $s \leq t$. Finally, $$ \begin{aligned} \Theta_{t+\ell|t} &= \cov(\Theta_{t+\ell}, Y_{1:t})\,\var(Y_{1:t})^{-1}\,Y_{1:t}\\ &= \left(\prod_{k=1}^\ell \vF_{t+\ell-k}\right)\cov\left(\Theta_t, Y_{1:t}\right)\,\var(Y_{1:t})^{-1}Y_{1:t}\\ &= \left(\prod_{k=1}^\ell \vF_{t+\ell-k}\right)\Theta_{t|t} \end{aligned} $$ as desired. $$ \ \tag*{$\blacksquare$} $$
Proof of Proposition 2.3
Here we show Proposition 2.3
By definition of the latent process and Proposition 2.2, $$ \begin{aligned} \Theta_{t+\ell} &= \left(\prod_{k=1}^\ell \vF_{t+\ell-k}\right)\Theta_t + \sum_{j=1}^\ell\left(\prod_{k=1}^{j-1}\vF_{t+\ell-k}\right)\,U_{t+\ell-j},\\ \Theta_{t+\ell|t} &= \left(\prod_{k=1}^\ell \vF_{t+\ell-k}\right)\Theta_{t|t}. \end{aligned} $$ Next, the EVC matrix takes the form $$ \begin{aligned} \var\left(\Theta_{t+\ell} - \Theta_{t+\ell|t} \right) &= \var\left( \left(\prod_{k=1}^\ell \vF_{t+\ell-k}\right)\Theta_t + \sum_{j=1}^\ell\left(\prod_{k=1}^{j-1}\vF_{t+\ell-k}\right)\,U_{t+\ell-j} - \left(\prod_{k=1}^\ell \vF_{t+\ell-k}\right)\Theta_{t|t} \right)\\ &= \var\left( \left(\prod_{k=1}^\ell \vF_{t+\ell-k}\right)\left[\Theta_t - \Theta_{t|t}\right] + \sum_{j=1}^\ell\left(\prod_{k=1}^{j-1}\vF_{t+\ell-k}\right)\,U_{t+\ell-j} \right) \end{aligned} $$ By (P.3), $\cov(\Theta_s, U_t) = 0$ for $s \leq t$. Next, by (A.5), $\cov(U_t, U_s) = 0$ for all $t\neq s$. So that, $$ \begin{aligned} &\var\left(\Theta_{t+\ell} - \Theta_{t+\ell|t} \right)\\ &= \var\left( \left(\prod_{k=1}^\ell \vF_{t+\ell-k}\right)\left[\Theta_t - \Theta_{t|t}\right] \right) + \var\left( \sum_{j=1}^\ell\left(\prod_{k=1}^{j-1}\vF_{t+\ell-k}\right)\,U_{t+\ell-j} \right)\\ &= \var\left( \left(\prod_{k=1}^\ell \vF_{t+\ell-k}\right)\left[\Theta_t - \Theta_{t|t}\right] \right) + \sum_{j=1}^\ell\var\left( \left(\prod_{k=1}^{j-1}\vF_{t+\ell-k}\right)\,U_{t+\ell-j} \right)\\ &= \left(\prod_{k=1}^\ell \vF_{t+\ell-k}\right)\var\left( \Theta_t - \Theta_{t|t} \right) \left(\prod_{k=1}^\ell \vF_{t+\ell-k}\right)^\intercal + \sum_{j=1}^\ell \left(\prod_{k=1}^{j-1}\vF_{t+\ell-k}\right)\var\left(U_{t+\ell-j}\right) \left(\prod_{k=1}^{j-1}\vF_{t+\ell-k}\right)^\intercal\\ &= \left(\prod_{k=1}^\ell \vF_{t+\ell-k}\right) \Sigma_{t|t} \left(\prod_{k=1}^\ell \vF_{t+\ell-k}\right)^\intercal + \sum_{j=1}^\ell \left(\prod_{k=1}^{j-1}\vF_{t+\ell-k}\right)Q_{t+\ell-j} \left(\prod_{k=1}^{j-1}\vF_{t+\ell-k}\right)^\intercal. \end{aligned} $$ $$ \ \tag*{$\blacksquare$} $$
Proof of Proposition 2.4
Here we show Proposition 2.4.
The proof follows from the definitions of the innovation process, the state-space model (SSM), and the best linear unbiased predictor (BLUP), as well as property (P.4).
We begin by expanding the innovation: $$ \begin{aligned} {\cal E}_t &= Y_t - \sum_{k=1}^{t-1}\cov(Y_t, {\cal E}_k)\,\var({\cal E}_k)^{-1}\,{\cal E}_k\\ &= Y_t - \sum_{k=1}^{t-1}\cov(\vH_t\,\Theta_t + E_t, {\cal E}_k)\,\var({\cal E}_k)^{-1}\,{\cal E}_k\\ &= Y_t - \vH_t\, \underbrace{\sum_{k=1}^{t-1}\cov(\Theta_t, {\cal E}_k)\,\var({\cal E}_k)^{-1}\,{\cal E}_k}_{\Theta_{t|t-1}} - \sum_{k=1}^{t-1} \underbrace{\cov(E_t, {\cal E}_k)}_{=0 \text{ ({P.4})}}\,\var({\cal E}_k)^{-1}\,{\cal E}_k\\ &= Y_t - \vH_t\,\Theta_{t|t-1}. \end{aligned} $$
Next, because $Y_t = \vH_t\,\Theta_t + E_t$, the innovation takes the alternative form $$ \begin{aligned} {\cal E}_t &= Y_t - \vH_t\,\Theta_{t|t-1}\\ &= \vH_t\,\Theta_t + E_t - \vH_t\,\Theta_{t|t-1}\\ &= \vH_t\,(\Theta_t - \Theta_{t|t-1}) + E_t. \end{aligned} $$ $$ \ \tag*{$\blacksquare$} $$
Proof of Proposition 2.5
Here we show Proposition 2.5
Let $t \geq 2$. Following Proposition 2.4, properties (P.2) and (P.4), and the definition of the innovation, we obtain $$ \begin{aligned} \var({\cal E}_t) &= \var\Big(\vH_t(\Theta_t - \Theta_{t|t-1}) + E_t\Big)\\ &= \vH_t\,\var(\Theta_t - \Theta_{t|t-1})\,\vH_t^\intercal + \var(E_t) + \vH_t\,\cov(\Theta_t - \Theta_{t|t-1}, E_t) + \cov(E_t, \Theta_t - \Theta_{t|t-1})\,\vH_t^\intercal\\ &= \vH_t\,\var(\Theta_t - \Theta_{t|t-1})\,\vH_t^\intercal + \var(E_t)\\ &= \vH_t\,\Sigma_{t|t-1}\,\vH_t^\intercal + R_t, \end{aligned} $$ where the cross terms vanish because $\cov(\Theta_t, E_t) = 0$ by (P.2) and $\cov(\Theta_{t|t-1}, E_t) = 0$ by (P.4), since $\Theta_{t|t-1}$ is a linear combination of the innovations ${\cal E}_{1:t-1}$. $$ \ \tag*{$\blacksquare$} $$
Proof of Theorem 2.6
Here we show Theorem 2.6.
Begin by noting that, for $t,j \in {\cal T}$, $$ \begin{aligned} \cov(\Theta_t, {\cal E}_j) &= \cov(\Theta_t,\, \vH_j\,[\Theta_j - \Theta_{j|j-1}] + E_j)\\ &= \cov(\Theta_t,\, \Theta_j - \Theta_{j|j-1})\,\vH_j^\intercal + \cov(\Theta_t, E_j)\\ &= \cov(\Theta_t - \Theta_{t|j-1},\, \Theta_j - \Theta_{j|j-1})\,\vH_j^\intercal. \end{aligned} $$ The last equality follows from Property (P.2) and from the fact that $\Theta_{t|j-1}$ is a linear combination of ${\cal E}_{1:j-1}$, each of which is uncorrelated with the BLUP error $\Theta_j - \Theta_{j|j-1}$.8 We now need to determine $$ \cov(\Theta_t - \Theta_{t|j-1},\, \Theta_j - \Theta_{j|j-1}) $$ for (I) $j = t$, (II) $j \leq t -1$, and (III) $j \geq t+1$.
Case $j=t$
Take $j=t$, then $$ \cov(\Theta_t - \Theta_{t|t-1},\, \Theta_t - \Theta_{t|t-1}) = \var(\Theta_t - \Theta_{t|t-1}) = \Sigma_{t|t-1}. $$ So that $ \cov(\Theta_t, {\cal E}_t) = \Sigma_{t|t-1}\,\vH_t^\intercal $ as desired.
Case $j\leq t-1$
Take $t - j = \ell \geq 1$, then $j + \ell = t$. By definition of the latent process and Proposition 2.2, $$ \begin{aligned} \Theta_t &= \left(\prod_{k=1}^{t-j}\vF_{t-k}\right)\Theta_j + \sum_{k=1}^{t-j}\left(\prod_{m=1}^{k-1}\vF_{t-m}\right)\,U_{t-k},\\ \Theta_{t|j-1} &= \left(\prod_{k=1}^{t-j}\vF_{t-k}\right)\,\Theta_{j|j-1} \end{aligned} $$
Then, $$ \begin{aligned} &\cov(\Theta_t - \Theta_{t|j-1},\, \Theta_j - \Theta_{j|j-1})\\ =&\cov\left( \left(\prod_{k=1}^{t-j}\vF_{t-k}\right)\Theta_j + \sum_{k=1}^{t-j}\left(\prod_{m=1}^{k-1}\vF_{t-m}\right)\,U_{t-k} - \left(\prod_{k=1}^{t-j}\vF_{t-k}\right)\,\Theta_{j|j-1},\, \Theta_j - \Theta_{j|j-1} \right)\\ =&\cov\left( \left(\prod_{k=1}^{t-j}\vF_{t-k}\right)\Theta_j - \left(\prod_{k=1}^{t-j}\vF_{t-k}\right)\,\Theta_{j|j-1},\, \Theta_j - \Theta_{j|j-1} \right) + \cov\left(\sum_{k=1}^{t-j}\left(\prod_{m=1}^{k-1}\vF_{t-m}\right)\,U_{t-k},\, \Theta_j - \Theta_{j|j-1} \right)\\ =&\left(\prod_{k=1}^{t-j}\vF_{t-k}\right)\,\cov\left( \Theta_j - \Theta_{j|j-1},\, \Theta_j - \Theta_{j|j-1} \right) + \sum_{k=1}^{t-j}\left(\prod_{m=1}^{k-1}\vF_{t-m}\right)\,\cov(U_{t-k}, \Theta_j)\\ =&\left(\prod_{k=1}^{t-j}\vF_{t-k}\right)\,\var\left( \Theta_j - \Theta_{j|j-1} \right)\\ =&\left(\prod_{k=1}^{t-j}\vF_{t-k}\right)\,\Sigma_{j|j-1}, \end{aligned} $$ where $\cov(U_{t-k}, \Theta_j) = 0$ follows from property (P.3).
Hence $ \cov(\Theta_t, {\cal E}_j) = \left(\prod_{k=1}^{t-j}\vF_{t-k}\right)\Sigma_{j|j-1}\,\vH_j^\intercal $ as desired.
Case $j \geq t+1$
Take $j - t = \ell \geq 1$, then $j = t + \ell$. Following $({\rm BLUP.2})$, we have $$ \tag{BLUP.3} \begin{aligned} \Theta_{t|t-1} &= \sum_{k=1}^{t-1}\cov(\Theta_t, {\cal E}_k)\,\var({\cal E}_k)^{-1}\,{\cal E}_k,\\ \Theta_{t|t} &= \sum_{k=1}^{t-1}\cov(\Theta_t, {\cal E}_k)\,\var({\cal E}_k)^{-1}\,{\cal E}_k + \cov(\Theta_t, {\cal E}_t)\,\var({\cal E}_t)^{-1}\,{\cal E}_t\\ &= \Theta_{t|t-1} + \cov(\Theta_t, {\cal E}_t)\,\var({\cal E}_t)^{-1}\,{\cal E}_t. \end{aligned} $$ Next, denote $$ \vM_{t} = \vF_{t}[\vI - \Sigma_{t|t-1}\,\vH_t^\intercal\,S_t^{-1}\,\vH_t]. $$
The rest of the proof follows by induction.
Show $j=t+1$
First, suppose $j=t+1$. Then $$ \begin{aligned} &\cov(\Theta_t - \Theta_{t|t}, \Theta_{t+1} - \Theta_{t+1|t})\\ &=\cov(\Theta_t - \Theta_{t|t}, \vF_t\,\Theta_t + U_t - \vF_t\,\Theta_{t|t})\\ &= \cov(\Theta_t - \Theta_{t|t}, \Theta_t - \Theta_{t|t})\,\vF_t^\intercal, \end{aligned} $$ where we used Property (P.3). Next, following $({\rm BLUP.3})$, $$ \begin{aligned} &\cov(\Theta_t - \Theta_{t|t}, \Theta_t - \Theta_{t|t})\\ &= \var\Big(\Theta_t - \Theta_{t|t-1} - \cov(\Theta_t, {\cal E}_t)\,\var({\cal E}_t)^{-1}\,{\cal E}_t\Big)\\ &= \var\Big(\Theta_t - \Theta_{t|t-1}\Big)\\ &\quad + \var\Big(\cov(\Theta_t, {\cal E}_t)\,\var({\cal E}_t)^{-1}\,{\cal E}_t\Big)\\ &\quad - \cov\Big(\Theta_t - \Theta_{t|t-1},\, \cov(\Theta_t, {\cal E}_t)\,\var({\cal E}_t)^{-1}\,{\cal E}_t\Big)\\ &\quad - \cov\Big(\cov(\Theta_t, {\cal E}_t)\,\var({\cal E}_t)^{-1}\,{\cal E}_t,\,\Theta_t - \Theta_{t|t-1}\Big)\\ &=\var\Big(\Theta_t - \Theta_{t|t-1}\Big)\\ &\quad + \cov(\Theta_t, {\cal E}_t)\,\var({\cal E}_t)^{-1}\,\var({\cal E}_t)\,\var({\cal E}_t)^{-1}\,\cov({\cal E}_t, \Theta_t)\\ &\quad - \cov(\Theta_t, {\cal E}_t)\,\var({\cal E}_{t})^{-1}\,\cov({\cal E}_t, \Theta_t)\\ &\quad - \cov(\Theta_t, {\cal E}_t)\,\var({\cal E}_{t})^{-1}\,\cov({\cal E}_t, \Theta_t)\\ &=\var\Big(\Theta_t - \Theta_{t|t-1}\Big) - \cov(\Theta_t, {\cal E}_t)\,\var({\cal E}_{t})^{-1}\,\cov({\cal E}_t, \Theta_t)\\ &= \Sigma_{t|t-1} - \Sigma_{t|t-1}\,\vH_t^\intercal\,S_t^{-1}\,\vH_t\,\Sigma_{t|t-1}. \end{aligned} $$ From the last two equations, we have $$ \begin{aligned} &\cov(\Theta_t - \Theta_{t|t}, \Theta_{t+1} - \Theta_{t+1|t})\\ &=\cov(\Theta_t - \Theta_{t|t}, \Theta_t - \Theta_{t|t})\,\vF_t^\intercal\\ &= \Big(\Sigma_{t|t-1} - \Sigma_{t|t-1}\,\vH_t^\intercal\,S_t^{-1}\,\vH_t\,\Sigma_{t|t-1}\Big)\,\vF_t^\intercal\\ &= \Sigma_{t|t-1}\,\Big(\vI -\vH_t^\intercal\,S_t^{-1}\,\vH_t\,\Sigma_{t|t-1}\Big)\,\vF_t^\intercal\\ &= \Sigma_{t|t-1}\,\Big(\vF_t\,[\vI -\vH_t^\intercal\,S_t^{-1}\,\vH_t\,\Sigma_{t|t-1}]^\intercal\Big)^\intercal\\ &= \Sigma_{t|t-1}\,\Big(\vF_t\,[\vI - \Sigma_{t|t-1}\,\vH_t^\intercal\,S_t^{-1}\,\vH_t]\Big)^\intercal\\ &= \Sigma_{t|t-1}\,\vM_t^\intercal, \end{aligned} $$ so that $\cov(\Theta_t, {\cal E}_{t+1}) = \Sigma_{t|t-1}\,\vM_t^\intercal\,\vH_{t+1}^\intercal$, matching the case $j = t+1$ of the theorem.
Suppose true for $j=t+k$
Suppose, for $j=t+k$, that $$ \cov(\Theta_t - \Theta_{t|t+k-1}, \Theta_{t+k} - \Theta_{t+k | t+k-1}) = \Sigma_{t|t-1}\,\left(\prod_{\ell=0}^{k-1}\vM_{t+\ell}^\intercal\right), $$ so that $$ \cov(\Theta_t, {\cal E}_{t+k}) = \Sigma_{t|t-1}\,\left(\prod_{\ell=0}^{k-1}\vM_{t+\ell}^\intercal\right)\,\vH_{t+k}^\intercal. $$
Show for $j=t+k+1$
Take $j = t+k+1$, then $$ \begin{aligned} &\cov(\Theta_t - \Theta_{t|t+k}, \Theta_{t+k+1} - \Theta_{t+k+1|t+k})\\ &=\cov(\Theta_t - \Theta_{t|t+k}, \vF_{t+k}[\Theta_{t+k} - \Theta_{t+k|t+k}] + U_{t+k})\\ &=\cov(\Theta_t - \Theta_{t|t+k}, \Theta_{t+k} - \Theta_{t+k|t+k})\,\vF_{t+k}^\intercal, \end{aligned} $$ where the term involving $U_{t+k}$ vanishes by Property (P.3). Next, following $({\rm BLUP.3})$ we have $$ \begin{aligned} \Theta_{t|t+k} &= \Theta_{t|t+k-1} + \cov(\Theta_t, {\cal E}_{t+k})\,\var({\cal E}_{t+k})^{-1}\,{\cal E}_{t+k},\\ \Theta_{t+k|t+k} &= \Theta_{t+k|t+k-1} + \cov(\Theta_{t+k}, {\cal E}_{t+k})\,\var({\cal E}_{t+k})^{-1}\,{\cal E}_{t+k}.\\ \end{aligned} $$
So that, $$ \begin{aligned} &\cov(\Theta_t - \Theta_{t|t+k}, \Theta_{t+k} - \Theta_{t+k|t+k})\\ &=\cov\Big( \Theta_t - \Theta_{t|t+k-1} - \cov(\Theta_t, {\cal E}_{t+k})\,\var({\cal E}_{t+k})^{-1}{\cal E}_{t+k},\, \Theta_{t+k} - \Theta_{t+k|t+k-1} - \cov(\Theta_{t+k}, {\cal E}_{t+k})\,\var({\cal E}_{t+k})^{-1}\,{\cal E}_{t+k}\Big)\\ &= \cov(\Theta_t - \Theta_{t|t+k-1},\,\Theta_{t+k} - \Theta_{t+k|t+k-1}) - \cov(\Theta_t, {\cal E}_{t+k})\,S_{t+k}^{-1}\,\cov({\cal E}_{t+k}, \Theta_{t+k}). \end{aligned} $$
Then, by the induction hypothesis, $$ \begin{aligned} &\cov(\Theta_t - \Theta_{t|t+k}, \Theta_{t+k} - \Theta_{t+k|t+k})\\ &= \Sigma_{t|t-1}\,\left(\prod_{\ell=0}^{k-1}\vM_{t+\ell}^\intercal\right) - \Sigma_{t|t-1}\,\left(\prod_{\ell=0}^{k-1}\vM_{t+\ell}^\intercal\right)\,\vH_{t+k}^\intercal\,S_{t+k}^{-1}\,\vH_{t+k}\,\Sigma_{t+k|t+k-1}\\ &= \Sigma_{t|t-1}\,\left(\prod_{\ell=0}^{k-1}\vM_{t+\ell}^\intercal\right) \Big(\vI - \vH_{t+k}^\intercal\,S_{t+k}^{-1}\,\vH_{t+k}\,\Sigma_{t+k|t+k-1}\Big) \end{aligned} $$
Finally, $$ \begin{aligned} &\cov(\Theta_t - \Theta_{t|t+k}, \Theta_{t+k+1} - \Theta_{t+k+1|t+k})\\ &= \cov(\Theta_t - \Theta_{t|t+k}, \Theta_{t+k} - \Theta_{t+k|t+k})\,\vF_{t+k}^\intercal\\ &= \Sigma_{t|t-1}\,\left(\prod_{\ell=0}^{k-1}\vM_{t+\ell}^\intercal\right) \Big(\vI - \vH_{t+k}^\intercal\,S_{t+k}^{-1}\,\vH_{t+k}\,\Sigma_{t+k|t+k-1}\Big)\,\vF_{t+k}^\intercal\\ &= \Sigma_{t|t-1}\,\left(\prod_{\ell=0}^{k-1}\vM_{t+\ell}^\intercal\right) \Big(\vF_{t+k}\,[\vI - \vH_{t+k}^\intercal\,S_{t+k}^{-1}\,\vH_{t+k}\,\Sigma_{t+k|t+k-1}]^\intercal\Big)^\intercal\\ &= \Sigma_{t|t-1}\,\left(\prod_{\ell=0}^{k-1}\vM_{t+\ell}^\intercal\right) \Big(\vF_{t+k}\,[\vI - \Sigma_{t+k|t+k-1}\,\vH_{t+k}^\intercal\,S_{t+k}^{-1}\,\vH_{t+k}]\Big)^\intercal\\ &= \Sigma_{t|t-1}\,\left(\prod_{\ell=0}^{k}\vM_{t+\ell}^\intercal\right) \end{aligned} $$ as desired.
$$ \ \tag*{$\blacksquare$} $$
Proof of Proposition 3.1
Here we show Proposition 3.1.
By definition, $$ \begin{aligned} \Sigma_{1|0} &= \var(\Theta_1)\\ &= \var(\vF_0\,\Theta_0 + U_0)\\ &= \vF_0\,\var(\Theta_0)\,\vF_0^\intercal + \var(U_0), \end{aligned} $$ where the last equality follows from assumption (A.8)—$\cov(\Theta_0, U_t) = 0$ for all $t$. $$ \ \tag*{$\blacksquare$} $$
Proof of Proposition 3.2
Here we show Proposition 3.2.
Recall $\Sigma_{t|t-1} = \var(\Theta_t - \Theta_{t|t-1})$.
By definition $({\rm SSM.1})$ and Proposition 2.2., the difference between the latent variable and the predicted latent variable takes the form $$ \begin{aligned} \Theta_t - \Theta_{t|t-1} &= \vF_{t-1}\,\Theta_{t-1} + U_{t-1} - \vF_{t-1}\,\Theta_{t-1|t-1}\\ &= \vF_{t-1}\,\left(\Theta_{t-1} - \Theta_{t-1|t-1}\right) + U_{t-1}. \end{aligned} $$
Next, following Property (P.3) and Assumption (A.2), the EVC takes the form $$ \begin{aligned} &\Sigma_{t|t-1}\\ &=\var(\Theta_t - \Theta_{t|t-1})\\ &= \var\Big(\vF_{t-1}\,\left[\Theta_{t-1} - \Theta_{t-1|t-1}\right] + U_{t-1}\Big)\\ &= \vF_{t-1}\,\var(\Theta_{t-1} - \Theta_{t-1|t-1})\,\vF_{t-1}^\intercal + \var(U_{t-1})\\ &= \vF_{t-1}\,\Sigma_{t-1|t-1}\,\vF_{t-1}^\intercal + Q_{t-1}. \end{aligned} $$ $$ \ \tag*{$\blacksquare$} $$
Proof of Proposition 3.3
Here we show Proposition 3.3.
Following $({\rm EVC.2})$ in Proposition 1, we have $$ \begin{aligned} \Sigma_{t|t} &= \var(\Theta_t) - \sum_{k=1}^t \vK_{t,k}\,\var({\cal E}_k)\,\vK_{t,k}^\intercal\\ \Sigma_{t|t-1} &= \var(\Theta_t) - \sum_{k=1}^{t-1} \vK_{t,k}\,\var({\cal E}_k)\,\vK_{t,k}^\intercal, \end{aligned} $$ with $\vK_{t,j} = \cov(\Theta_t, {\cal E}_j)\,\var({\cal E}_j)^{-1}$.
Next, taking the difference between the predicted EVC and the EVC at time $t$, we obtain $$ \begin{aligned} \Sigma_{t|t} - \Sigma_{t|t-1} &= -\vK_{t,t}\,\var({\cal E}_t)\,\vK_{t,t}^\intercal. \end{aligned} $$
To obtain the desired form, let $\vK_t = \vK_{t,t} = \cov(\Theta_t, {\cal E}_t)\var({\cal E}_t)^{-1}$. Following Theorem 2.6 with $j=t$, we obtain $ \cov(\Theta_t, {\cal E}_t) = \Sigma_{t|t-1}\,\vH_t^\intercal $. So that $$ \begin{aligned} \vK_t &= \cov(\Theta_t, {\cal E}_t)\,\var({\cal E}_t)^{-1}\\ &= \Sigma_{t|t-1}\,\vH_t^\intercal\,S_t^{-1}. \end{aligned} $$ Finally, we obtain $$ \begin{aligned} \vK_t &= \Sigma_{t|t-1}\,\vH_t^\intercal\,S_{t}^{-1},\\ \Sigma_{t|t} &= \Sigma_{t|t-1} - \vK_t\,S_t\,\vK_t^\intercal, \end{aligned} $$ and alternatively $$ \Sigma_{t|t} = \Sigma_{t|t-1} - \Sigma_{t|t-1}\,\vH_t^\intercal\,S_t^{-1}\,\vH_t\,\Sigma_{t|t-1}. $$ $$ \ \tag*{$\blacksquare$} $$
References and further reading
- Svetunkov, Ivan. Forecasting and Analytics with the Augmented Dynamic Adaptive Model (ADAM), Section 8.1. United States, CRC Press, 2023.
- Nau, Robert. “Statistical forecasting: notes on regression and time series analysis. 2019.” URL https://people.duke.edu/~rnau/411arim.htm. (2019).
I like to think of SSMs as the equivalent of Plato’s cave for statistical models: our platonic object are the latent variables, which we never get to see, and the noisy observations are the distorted shadows of the object. ↩︎
Battin, Richard H. “Space guidance evolution-a personal narrative.” Journal of Guidance, Control, and Dynamics 5.2 (1982): 97-110. ↩︎
See https://en.wikipedia.org/wiki/Homoscedasticity_and_heteroscedasticity ↩︎
Eubank, Randall L. A Kalman filter primer. Chapman and Hall/CRC, 2005. ↩︎
See Zivot, Eric. “State Space Models and the Kalman Filter.” Apr 9 (2006): 1-8. ↩︎
We defined each of these quantities as the best linear unbiased predictor (BLUP) for the signal at time $t$, given measurements up to time $j$. ↩︎
See the inverse of a lower triangular matrix is lower triangular on math.stackexchange. ↩︎
See property 4 in https://en.wikipedia.org/wiki/Covariance#Covariance_of_linear_combinations ↩︎