Do Gaussian processes really need Bayes?
Introduction
I was first introduced to Gaussian Processes (GPs) as the quintessential example of Bayesian reasoning in machine learning. This perspective is cemented by the seminal textbook on the subject, which states in its foreword:
“Gaussian Processes for Machine Learning presents one of the most important Bayesian machine learning approaches.” 1
And yet, the formal definition of a GP in the same book makes no mention of Bayes at all. Separately, in Pattern Recognition and Machine Learning, Christopher Bishop offers an interesting observation:
“Kalman filters (KF) […] can be viewed as forms of Gaussian process models.” 2
This connection is revealing. A GP is mathematically equivalent to a KF under certain conditions,3 4 and the KF does not rely on Bayesian inference to be derived.5
This raises the question: is Bayesian reasoning essential for making predictions with Gaussian Processes?
Bayes GPs, generalised-Bayes GPs, and without-Bayes GPs
GPs in machine learning have traditionally been understood through a Bayesian lens, but recent advances in generalised Bayes (genBayes) methods have opened the door to alternative interpretations.
In a nutshell, a GP maps a sequence of seen features and target variables (the train set) to a sequence of latent, or unknown, target variables whose features are known (the test set). From a Bayesian point of view, predictions are obtained by leveraging assumptions of Gaussianity, a known covariance structure between the function values at those features, and the Bayesian posterior predictive. This process typically involves the specification of a prior (via the kernel), a likelihood function, and the computation of probability densities through conditioning and marginalisation.
However, as elegant as the Bayesian formulation is, it is not without severe drawbacks, such as sensitivity to model misspecification and significant computational demands. These limitations have sparked growing interest in developing methods from a genBayes perspective, which reformulates Bayesian inference as an optimisation problem. By relaxing the strict assumptions of traditional Bayesian methods, such as knowledge of a well-specified likelihood and prior, genBayes offers a more flexible approach that can adapt to a wider range of scenarios. 6 For example, a recent paper7 applies the principles of genBayes to derive a Gaussian process prediction method that is robust to outliers and has a computational cost similar to that of the classical GP.
In this post, we take a step further: removing Bayes altogether. The approach presented here relies neither on Bayes' rule (conjugacy / marginalisation) nor on Gaussianity assumptions. The only requirements are the choice of a loss function and a model for the first two moments of the target variables.
Disclaimers and a small literature review
There is a long and tangled history in statistics and probabilistic machine learning regarding the benefits and drawbacks of the Bayesian paradigm versus the frequentist paradigm.
In the Bayesian paradigm, the idea of placing a prior over functions and computing the posterior after observing data has been attributed to Poincaré in 1896, as noted in Persi Diaconis’s paper “Bayesian numerical analysis”. 8 However, the modern Bayesian formulation is more likely due to the work of George S. Kimeldorf and Grace Wahba in 1970. 9
In the frequentist paradigm, which is the approach we take in this post, the idea of finding the best linear predictor for a sequence of values given a training dataset can be traced back to the work of Kolmogorov in the 1940s.10 11 12 This frequentist approach avoids the need for distributional assumptions over functions, and simply works with random variables whose first and second moments are assumed to be known. A related frequentist work is by Dennis D. Cox on Best Unbiased Prediction for Gaussian and Log-Gaussian Processes.13 However, this work restricts itself to the case of a unidimensional feature-space, whereas we consider the general class of feature spaces found in GPs.
Some readers might recognise the terminology here as being reminiscent of Kriging. However, Kriging is primarily associated with spatial interpolation in two- to three-dimensional feature-spaces.14 15 It also differs from kernel ridge regression in that we do not require the kernel trick to derive this method.16
Finally, there are already excellent tutorials on GPs,17 18 and I don’t believe Bayes should be abandoned in GPs. This post is simply an exercise in curiosity.
Bayes-free Gaussian Processes
Signal-only: the interpolation case
Let $F: {\cal X}\subseteq \reals^m \to \reals$ be a (signal) stochastic process, meaning that for each $x \in {\cal X}$, $F(x)$ is a random variable.19 Next, suppose that for any two $(x_i, x_j)$ both in ${\cal X}$, $$ \begin{aligned} \mathbb{E}[F(x_i)] &= 0,\\ \cov(F(x_i),F(x_j)) &= k(x_i,\,x_j), \end{aligned} $$ where $k:{\cal X}\times{\cal X}\to\reals$ is a valid kernel (covariance function).20 For simplicity, denote $F_t = F(x_t)$.
Suppose we are given the time-ordered sequence of features $x_{1:t+j} = (x_1, \ldots, x_t, x_{t+1}, \ldots, x_{t+j})$ and past signals $F_{1:t} = (F_1, \ldots, F_t)$. Our goal is to find the best (in an $L_2$ sense) linear unbiased predictor (BLUP) for future signals $F_{t+1:t+j} = (F_{t+1}, \ldots F_{t+j})$.
To determine the BLUP for $F_{t+1:t+j}$, we first estimate the gain matrix $\vK_* \in \reals^{j\times t}$ which maps the past signals $F_{1:t}$ to the target signals $F_{t+1:t+j}$. Mathematically, the gain matrix is given by $$ \tag{GAIN.1} \vK_* = \arg\min_{\vK}\mathbb{E}[\|F_{t+1:t+j} - \vK\,F_{1:t}\|^2_2] = \cov(F_{t+1:t+j},F_{1:t})\,\var(F_{1:t})^{-1}. $$ For a derivation of $\vK_*$, see the Appendix.
Because $x_{1:t+j}$ are observed, and the covariance between any two $(F_i, F_j)$ depends only on $(x_i, x_j)$, it follows that $\vK_*$ is readily computed from $x_{1:t+j}$. We will see an example of this below.
From random processes to predictions
Suppose we are given a set of features $x_{1:t+j}$ and a sample of observations $f_{1:t}$ from $F_{1:t}$. We make use of $({\rm GAIN.1})$ to compute the best estimate for the future signals $F_{t+1:t+j}$ (BLUP), as well as the error variance-covariance (EVC) which quantifies the uncertainty about the estimated BLUP.
What this means in practice is that, having computed $\vK_*$, prediction and uncertainty quantification follows from simple linear algebra: $$ \tag{BLUP.1} \boxed{ \begin{aligned} \vK_* &= \cov(F_{t+1:t+j},F_{1:t})\,\var(F_{1:t})^{-1},\\ \mu_{*} &= \vK_*\,f_{1:t},\\ \Sigma_{*} &= \var(F_{t+1:t+j}) - \vK_*\,\var(F_{1:t})\,\vK_*^\intercal. \end{aligned} } $$ Here, $\mu_*$ is the BLUP and $\Sigma_{*} = \var(F_{t+1:t+j} - \vK_*\,F_{1:t})$ is the EVC.
Example: function interpolation
The special case where the test points coincide with the training points, i.e., $j=t$ and $F_{t+1:t+j} = F_{1:t}$, yields $$ \begin{aligned} \vK_* &= \cov(F_{t+1:t+j},F_{1:t})\,\var(F_{1:t})^{-1}\\ &= \cov(F_{1:t},F_{1:t})\,\var(F_{1:t})^{-1}\\ &= \var(F_{1:t})\,\var(F_{1:t})^{-1}\\ &= \vI_t. \end{aligned} $$ As a consequence, $$ \begin{aligned} \hat{F}_{t+1:t+j} &= \vK_*\,F_{1:t} = \vI_t\,F_{1:t} = F_{1:t},\\ \var(F_{t+1:t+j} - \hat{F}_{t+1:t+j}) &= \var(F_{t+1:t+j}) - \vK_*\,\var(F_{1:t})\,\vK_*^\intercal = \var(F_{1:t}) - \var(F_{1:t}) = 0. \end{aligned} $$ Intuitively, the BLUP interpolates already-seen values and sets their estimated error variance to zero.
To give a concrete example, consider the function $$ f(x) = \sin(4\,x) + \cos(2\,x) $$ and suppose we are given the sequence of values $(f_{1:T}, x_{1:T})$, where $x_t \sim {\cal U}[-4,4]$, $f_t = f(x_t)$, and $T=15$. We seek to estimate $\mu_*$ for future values $F_{\vx}$, with $\vx$ a uniform grid taking values in $[-4, 4]$.

As a modelling choice, take the covariance between two signals to be $$ \cov(F_i, F_j) = k(x_i, x_j) = \exp\left(-\frac{1}{2\,\gamma^2}(x_i - x_j)^2\right). $$
Then, the BLUP of $F_\vx$ and its error variance-covariance, given $f_{1:T}$, are $$ \begin{aligned} \vK_\vx &= \cov(F_\vx, F_{1:T})\,\var(F_{1:T})^{-1},\\ \mu_\vx &= \vK_\vx\,f_{1:T},\\ \Sigma_\vx &= \var(F_\vx) - \vK_\vx\,\var(F_{1:T})\,\vK_\vx^\intercal. \end{aligned} $$
Coding the predictions
To see $({\rm BLUP.1})$ in practice, consider the following Python code which implements prediction and uncertainty quantification using the example above.
First, define the train and test datasets, as well as the kernel of choice:
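A minimal sketch of this step in NumPy follows; the variable names (`x_train`, `f_train`, `x_test`), the random seed, and the length-scale `gamma = 1.0` are illustrative choices rather than values fixed by the method.

```python
import numpy as np

def kernel(x1, x2, gamma=1.0):
    # k(x_i, x_j) = exp(-(x_i - x_j)^2 / (2 * gamma^2)), evaluated pairwise
    return np.exp(-0.5 * (x1[:, None] - x2[None, :]) ** 2 / gamma ** 2)

def f(x):
    # true (unknown) signal f(x) = sin(4 x) + cos(2 x)
    return np.sin(4 * x) + np.cos(2 * x)

rng = np.random.default_rng(314)
T = 15
x_train = rng.uniform(-4, 4, size=T)   # x_t ~ U[-4, 4]
f_train = f(x_train)                   # noiseless samples f_{1:T}
x_test = np.linspace(-4, 4, 200)       # uniform grid over [-4, 4]
```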
Next, a prediction over $\vx$ for values $F_\vx$ is obtained following $({\rm BLUP.1})$.
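A sketch of the prediction step under the same assumptions; `np.linalg.solve` replaces the explicit inverse, and the small `jitter` term is only a numerical convenience for the noiseless case, not part of $({\rm BLUP.1})$.

```python
K_train = kernel(x_train, x_train)     # Var(F_{1:T})
K_cross = kernel(x_test, x_train)      # Cov(F_x, F_{1:T})
K_test = kernel(x_test, x_test)        # Var(F_x)

# Gain matrix K_x = Cov(F_x, F_{1:T}) Var(F_{1:T})^{-1};
# the tiny jitter keeps the solve numerically stable without noise
jitter = 1e-9 * np.eye(T)
gain = np.linalg.solve(K_train + jitter, K_cross.T).T

mu_pred = gain @ f_train                          # BLUP, mu_x
sigma_pred = K_test - gain @ K_train @ gain.T     # EVC, Sigma_x
std_pred = np.sqrt(np.clip(np.diag(sigma_pred), 0.0, None))  # pointwise sigma_x
```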
Having computed `mu_pred` ($\equiv \mu_\vx$) and `sigma_pred` ($\equiv \Sigma_\vx$), we can determine the expected value of the sequence at every point $x \in \vx$.
To be precise, take $x \in \vx \subseteq [-4,4]$ and let $\vK_x$ denote the row of $\vK_\vx$ associated with $x$. The mean and error-variance estimates at the point $x$ take the form $$ \begin{aligned} \mu_x &= \vK_x\,f_{1:T} \in \reals,\\ \sigma_x^2 &= k(x, x) - \vK_x\,\var(F_{1:T})\,\vK_x^\intercal \in \reals^+. \end{aligned} $$
The following plot shows the values $\mu_x$ and $\mu_x \pm 2\sigma_x$ for $x \in \vx$.
Intuitively, we see that the uncertainty increases as we move away from the observed points $x_{1:t}$ and decreases close to them.
Next, the following plot shows the BLUP estimation over $\vx \subseteq [-4,4]$ given $f_{1:t}$ for varying levels of $t$.

Signal-plus-noise: the regression case
Suppose now that we are presented with samples from the sequence of noisy measurements $Y_{1:t}$, which is composed of a signal component $F_{1:t}$ and a noise component $E_{1:t}$. We seek to use the samples $y_{1:t}$ of $Y_{1:t}$, together with $x_{1:t+j}$, to predict the value of the signal $F_{t+1:t+j}$ at the test points $x_{t+1:t+j}$.
Formally, let $E: {\cal X} \to \reals$ be a random process such that, for any two $(x_i, x_j)$ both in ${\cal X}$, $$ \begin{aligned} \mathbb{E}[E(x_i)] &= 0,\\ \cov(E(x_i),E(x_j)) &= g(x_i,\,x_j)\,\delta(i-j),\\ \cov(F(x_i), E(x_j)) &= 0, \end{aligned} $$ where $g:{\cal X}\times{\cal X}\to\reals$ is a valid kernel function 20 and $\delta(\cdot)$ is the Kronecker delta.
Next, define the signal-plus-noise (SPN) random process $Y: {\cal X} \to \reals$ such that, for $x_t \in {\cal X}$, $$ \begin{aligned} Y(x_t) = F(x_t) + E(x_t). \end{aligned} $$ Following our notation above, write $$ Y_t = F_t + E_t. $$
Now consider a sequence of measurements $Y_{1:t} = (Y_1, \ldots, Y_t)$ sampled at the points $x_{1:t}$. We seek to evaluate the BLUP for the signal process at the test points $x_{t+1:t+j}$.
In this case, the gain matrix takes the form $$ \begin{aligned} \vK_* &= \cov(F_{t+1:t+j}, Y_{1:t})\,\var(Y_{1:t})^{-1}\\ &= \cov(F_{t+1:t+j}, F_{1:t} + E_{1:t})\,\var(F_{1:t} + E_{1:t})^{-1}\\ &= \cov(F_{t+1:t+j}, F_{1:t})\,\big(\var(F_{1:t}) + \var(E_{1:t}) + \cov(F_{1:t}, E_{1:t}) + \cov(E_{1:t}, F_{1:t})\big)^{-1}\\ &= \cov(F_{t+1:t+j}, F_{1:t})\,\big(\var(F_{1:t}) + {\rm diag}[\var(E_{1}), \ldots, \var(E_t)]\big)^{-1}. \end{aligned} $$ In the special case where $\var(E_t) = \sigma^2$ for all $t$ (as is the case for i.i.d. noise), and we are given samples $y_{1:t}$ from $Y_{1:t}$, the BLUP for $F_{t+1:t+j}$ is given by $$ \tag{BLUP.2} \boxed{ \begin{aligned} \vK_* &= \cov(F_{t+1:t+j}, F_{1:t})\,\big(\var(F_{1:t}) + \sigma^2\,\vI_t\big)^{-1},\\ \mu_{*} &= \vK_*\,y_{1:t},\\ \Sigma_{*} &= \var(F_{t+1:t+j}) - \vK_*\,\big(\var(F_{1:t}) + \sigma^2\,\vI_t\big)\,\vK_*^\intercal, \end{aligned} } $$ where $\Sigma_* = \var(F_{t+1:t+j} - \vK_*\,Y_{1:t})$ is the EVC; note that the noise now enters the EVC through $\var(Y_{1:t}) = \var(F_{1:t}) + \sigma^2\,\vI_t$. These expressions correspond to the mean and variance of the Gaussian posterior predictive density found in standard GP textbooks.
Example: Gaussian process regression
Consider the function $f(x) = \sin(4\,x) + \cos(2\,x)$ and measurements given by $$ y(x) = f(x) + E(x), $$ with $E(x) \sim {\cal N}(0, 0.2)$. Similarly to before, suppose we are given the sequence of values $(y_{1:T}, x_{1:T})$, where $x_t \sim {\cal U}[-4,4]$, $y_t = y(x_t)$, and $T=15$. We seek to estimate $\mu_*$ for the signal values $F_{\vx}$, with $\vx$ a uniform grid taking values in $[-4, 4]$.
The following code defines the training and test datasets. Note that we reuse the kernel defined in the previous block of code.
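A minimal sketch under these choices, reading ${\cal N}(0, 0.2)$ as zero mean and variance $0.2$, and reusing `f`, `kernel`, `rng`, `T`, and `x_test` from the earlier blocks:

```python
sigma2 = 0.2                                       # noise variance Var(E_t)
x_train = rng.uniform(-4, 4, size=T)               # x_t ~ U[-4, 4]
y_train = f(x_train) + rng.normal(0.0, np.sqrt(sigma2), size=T)  # y_t = f(x_t) + e_t
```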
The following plot shows the true underlying unknown function as well as the noisy measurements.
Then, regression is performed following $({\rm BLUP.2})$. In code, this takes the form
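A sketch of this step, reusing the objects defined above; as before, the variable names are illustrative.

```python
K_train = kernel(x_train, x_train)                  # Var(F_{1:T})
K_cross = kernel(x_test, x_train)                   # Cov(F_x, F_{1:T})
K_test = kernel(x_test, x_test)                     # Var(F_x)

# Gain matrix from (BLUP.2): the noise enters through sigma^2 I_t
gain = np.linalg.solve(K_train + sigma2 * np.eye(T), K_cross.T).T

mu_pred = gain @ y_train                                              # BLUP
sigma_pred = K_test - gain @ (K_train + sigma2 * np.eye(T)) @ gain.T  # EVC
std_pred = np.sqrt(np.clip(np.diag(sigma_pred), 0.0, None))
```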
The following plot shows the expected value of the signal under the BLUP together with a band of two standard deviations, that is, $\mu_*$ and $\mu_* \pm 2\,\sigma_*$, where $\sigma_*^2$ denotes the diagonal of $\Sigma_*$.
Example: Gamma-like-process regression
Because $({\rm BLUP.2})$ is the best linear unbiased estimate whenever $\sigma^2$ and the kernel function $k$ are known, we can replace the Gaussian noise with any other distribution whose first moment is zero and whose variance is $\sigma^2$, and the predictor remains best under the expected squared-error loss.
As an example, suppose now that $E_t$ is sampled from a zero-mean-centred Gamma distribution with shape parameter $0.5$ and scale / rate parameter set to $1.0$, so that $\sigma^2 = 0.5$. In this scenario the predictions are no longer optimal among all predictors (the conditional mean is generally no longer linear in the measurements), but they remain the best among linear unbiased predictors under our choice of loss function and kernel. The figure after the code sketch below shows the estimated values and the uncertainty estimation.
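A rough sketch of this variation, reusing `x_train`, the Gram matrices, and `rng` from the earlier blocks; the centring of the Gamma draw is the only new ingredient.

```python
# Zero-mean-centred Gamma noise: shape 0.5 and scale 1.0 give mean 0.5 and
# variance 0.5, so subtracting the mean yields E[e_t] = 0 and Var(e_t) = 0.5
shape, scale = 0.5, 1.0
sigma2 = shape * scale ** 2                          # = 0.5
e_train = rng.gamma(shape, scale, size=T) - shape * scale
y_train = f(x_train) + e_train

# (BLUP.2) is unchanged: only sigma^2 and the kernel enter the formulas
gain = np.linalg.solve(K_train + sigma2 * np.eye(T), K_cross.T).T
mu_pred = gain @ y_train
sigma_pred = K_test - gain @ (K_train + sigma2 * np.eye(T)) @ gain.T
```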

Conclusion
In this post, we showed that Bayes' rule is not required to derive the mean and variance of the Gaussian process regression posterior predictive. Furthermore, we showed that, from a best-linear-unbiased-predictor point of view, the only assumptions needed to recover Gaussian process regression are a signal-plus-noise model with a known covariance function (kernel) for the signal and a zero-mean i.i.d. noise process with known variance; Gaussianity is not required for either component.
Acknowledgments
Thanks to Alex Shestopaloff for providing valuable comments and the reference to Kolmogorov’s work, to Fayçal Drissi for comments on an earlier version of this post, and to F-X Briol for providing the reference to Kimeldorf and Wahba.
Appendix
Information gain matrix
Here, we derive the gain matrix shown in the main text. $$ \vK_* = \arg\min_{\vK}\mathbb{E}[\|F_{t+1:t+j} - \vK\,F_{1:t}\|^2_2] = \cov(F_{t+1:t+j},F_{1:t})\,\var(F_{1:t})^{-1}. $$
Let $$ {\cal L}(\vK) = \mathbb{E}[\|F_{t+1:t+j} - \vK\,F_{1:t}\|^2_2]. $$ Because ${\cal L}$ is a convex quadratic in $\vK$ (strictly convex whenever $\var(F_{1:t})$ is positive definite), $\argmin_\vK {\cal L}(\vK)$ is unique. To find the minimiser, note that $$ \begin{aligned} {\cal L}({\vK}) &= \mathbb{E}\left[ (F_{t+1:t+j} - \vK\,F_{1:t})^\intercal\,(F_{t+1:t+j} - \vK\,F_{1:t}) \right]\\ &= \mathbb{E}\left[ F_{t+1:t+j}^\intercal\,F_{t+1:t+j} - F_{t+1:t+j}^\intercal\,\vK\,F_{1:t} - F_{1:t}^\intercal\,\vK^\intercal\,F_{t+1:t+j} + F_{1:t}^\intercal\,\vK^\intercal\,\vK\,F_{1:t} \right]\\ &= \mathbb{E}\left[ F_{t+1:t+j}^\intercal\,F_{t+1:t+j} - {\rm Tr}(\vK\,F_{1:t}\,F_{t+1:t+j}^\intercal) - {\rm Tr}(\vK^\intercal\,F_{t+1:t+j}\,F_{1:t}^\intercal) + {\rm Tr}(\vK^\intercal\,\vK\,F_{1:t}\,F_{1:t}^\intercal) \right], \end{aligned} $$ so that $$ \begin{aligned} \nabla_\vK{\cal L}(\vK) &= \mathbb{E}\left[ - \nabla_\vK {\rm Tr}(\vK\,F_{1:t}\,F_{t+1:t+j}^\intercal) - \nabla_\vK {\rm Tr}(\vK^\intercal\,F_{t+1:t+j}\,F_{1:t}^\intercal) + \nabla_\vK {\rm Tr}(\vK^\intercal\,\vK\,F_{1:t}\,F_{1:t}^\intercal) \right]\\ &= \mathbb{E}\left[ - F_{t+1:t+j}\,F_{1:t}^\intercal - F_{t+1:t+j}\,F_{1:t}^\intercal + \vK\,(F_{1:t}\,F_{1:t}^\intercal)^\intercal + \vK\,(F_{1:t}\,F_{1:t}^\intercal) \right]\\ &= \mathbb{E}\left[ - 2\,F_{t+1:t+j}\,F_{1:t}^\intercal + 2\,\vK\,(F_{1:t}\,F_{1:t}^\intercal) \right]\\ &= -2\,\mathbb{E}[F_{t+1:t+j}\,F_{1:t}^\intercal] +2\,\vK\,\mathbb{E}[F_{1:t}\,F_{1:t}^\intercal]\\ &= -2\,\cov(F_{t+1:t+j}, F_{1:t}) + 2\,\vK\,\var(F_{1:t}), \end{aligned} $$ where the second equality follows from equations $(100)$, $(103)$, and $(113)$ in the Matrix Cookbook,21 and the last equality uses that the processes are zero-mean, so $\mathbb{E}[F_{t+1:t+j}\,F_{1:t}^\intercal] = \cov(F_{t+1:t+j}, F_{1:t})$ and $\mathbb{E}[F_{1:t}\,F_{1:t}^\intercal] = \var(F_{1:t})$. Setting $\nabla_\vK{\cal L}(\vK)$ to zero and solving for $\vK$ yields $\vK_*$.
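As a quick numerical sanity check of $({\rm GAIN.1})$, the sketch below compares the analytic gain with an empirical least-squares fit on simulated data. The feature locations, kernel, and sample size are arbitrary choices, and the multivariate normal is used only as a convenient way to draw zero-mean vectors with the prescribed covariance.

```python
import numpy as np

rng = np.random.default_rng(0)

# Covariance model for (F_{1:t}, F_{t+1:t+j}) with t = 4 past and j = 2
# future signals, built from an RBF kernel over arbitrary feature locations
x_all = np.array([-1.5, -0.5, 0.3, 1.1, 1.8, 2.4])
t, j = 4, 2
cov_all = np.exp(-0.5 * (x_all[:, None] - x_all[None, :]) ** 2)

# Analytic gain: Cov(F_future, F_past) Var(F_past)^{-1}
gain_analytic = cov_all[t:, :t] @ np.linalg.inv(cov_all[:t, :t])

# Empirical gain: solve min_K E||F_future - K F_past||^2 by least squares
# over simulated zero-mean vectors with the prescribed covariance
samples = rng.multivariate_normal(np.zeros(t + j), cov_all, size=200_000)
F_past, F_future = samples[:, :t], samples[:, t:]
gain_empirical = np.linalg.lstsq(F_past, F_future, rcond=None)[0].T

print(np.abs(gain_analytic - gain_empirical).max())  # small; shrinks as the sample grows
```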
Williams, Christopher K. I., and Carl Edward Rasmussen. Gaussian Processes for Machine Learning. Cambridge, MA: MIT Press, 2006. ↩︎
Bishop, Christopher M., and Nasser M. Nasrabadi. Pattern Recognition and Machine Learning. New York: Springer, 2006. ↩︎
See e.g., slide 18 in Gaussian Processes for Sequential Prediction ↩︎
Reece, Steven, and Stephen Roberts. “An introduction to Gaussian processes for the Kalman filter expert.” 2010 13th International Conference on Information Fusion. IEEE, 2010. ↩︎
See e.g., Filtering notes (III): the Kalman filter ↩︎
See e.g., Generalising Bayes ↩︎
Altamirano, Matias, François-Xavier Briol, and Jeremias Knoblauch. “Robust and conjugate Gaussian process regression.” arXiv preprint arXiv:2311.00463 (2023). ↩︎
Diaconis, Persi. “Bayesian numerical analysis.” Statistical decision theory and related topics IV 1 (1988): 163-175. ↩︎
See Section 4 in Kimeldorf, George S., and Grace Wahba. “A correspondence between Bayesian estimation on stochastic processes and smoothing by splines.” The Annals of Mathematical Statistics 41.2 (1970): 495-502. ↩︎
Kolmogorov, Andrey. “Interpolation and extrapolation of stationary random sequences.” Izvestiya Rossiiskoi Akademii Nauk. Seriya Matematicheskaya 5 (1941): 3. ↩︎
Shiryayev, A. N. (Ed.). Selected Works of A. N. Kolmogorov: Volume II, Probability Theory and Mathematical Statistics. Springer, 1992. Chapter 28. URL: https://link.springer.com/book/10.1007/978-94-011-2260-3 ↩︎
Berlinet, Alain, and Christine Thomas-Agnan. Reproducing kernel Hilbert spaces in probability and statistics. Springer Science & Business Media, 2011. Page 75. ↩︎
Cox, Dennis D. “Best unbiased prediction for Gaussian and log-Gaussian processes.” Lecture Notes-Monograph Series (2004): 125-132. ↩︎
Suryasentana, S. K., and B. B. Sheil. “Demystifying the connections between Gaussian Process Regression and Kriging.” 9th International SUT OSIG Conference. 2023. ↩︎
Marinescu, Marius. “Explaining and Connecting Kriging with Gaussian Process Regression.” arXiv preprint arXiv:2408.02331 (2024). ↩︎
See e.g., Section 17.3.9 in Murphy, Kevin P. Probabilistic machine learning: an introduction. MIT press, 2022. ↩︎
Formally, the stochastic process should be written as $F(x, \omega)$, where $\omega \in \Omega$ is an outcome from a probability space. For each fixed $x$, $F(x, \cdot)$ is a real-valued random variable. In practice, we omit $\omega$ and simply write $F(x).$ ↩︎
Section 6.2 in Bishop, Christopher M., and Nasser M. Nasrabadi. Pattern Recognition and Machine Learning. New York: Springer, 2006. ↩︎ ↩︎
Petersen, Kaare Brandt, and Michael Syskind Pedersen. “The matrix cookbook.” Technical University of Denmark 7.15 (2008): 510. ↩︎