Impact Statement
This article addresses scalability issues with one of the core algorithms in numerical weather prediction systems. Solving these issues contributes to producing higher resolution and more frequently updated weather forecasts. Improved forecasts are an important tool for mitigating and adapting to climate change, with applications such as predicting the output of wind and solar power and providing warnings of extreme weather events.
1. Introduction
Data assimilation (DA) is a core component of numerical weather prediction (NWP) systems. The goal of DA is to provide the best estimate of the current state of the dynamical system (the atmosphere in NWP). This is achieved by combining the forecasted state of the system with measurements from, for example, satellites (tracks of which can be seen in Figure 1), ground-based sensors, and weather balloons. A corrected forecast is then produced using the resultant estimate as the initial condition. Due to the large number of observations and the high dimensionality of the state, DA consumes a large amount of computation time and memory.
In recent years, the scalability of DA approaches has been pushed to the limit, with operational weather centers launching programs to address the problem (Bauer et al., 2020). Several challenges have been identified. First, the amount of data that must be processed has grown dramatically due to the increasing availability of satellite observations and the higher resolution of forecasting models. Second, while the total computational power available continues to increase, this growth no longer comes from faster individual cores but from expanded parallelism. Thus, DA algorithms must be able to take full advantage of parallel and distributed hardware.
Four-dimensional variational assimilation (4D-Var) and its three-dimensional counterpart (3D-Var), together with their derivatives, are the DA algorithms used by many systems for weather forecasting and other applications, such as ocean simulation (Bonavita and Lean, 2021; Saulter et al., 2020). A standard approach to distributing these algorithms across many compute nodes is spatial parallelism using domain decomposition (D’Amore et al., 2014; Arcucci et al., 2015). The geographic area covered by the model is divided into overlapping subdomains; each subdomain is assigned to a single compute node that solves the assimilation problem in that subdomain. A key limitation of this approach is that the overlapping regions between subdomains must be carefully synchronized between nodes to ensure that the states computed on each node are physically consistent with each other. These synchronization steps introduce a bottleneck due to communication overheads, which can be exacerbated by any computational load imbalance between the subdomains. Although there have been attempts to relax the level of synchronization required (Cipollone et al., 2020), ideally synchronization requirements would be removed altogether.
In this article, we propose an alternative approach to DA that is designed with distributed computation in mind. We exploit the formulation of DA as a Bayesian inference problem (Evensen et al., 2022), which allows us to apply tools from the large-scale Bayesian inference literature. In particular, we treat DA as inference in a Gaussian Markov random field (GMRF) and develop a message-passing algorithm to perform inference on this field. This approach naturally supports domain decomposition across multiple nodes without requiring overlapping regions, and therefore no synchronization is required. Only a small amount of data must be communicated between subdomains, and this can be done asynchronously with the computation on each node. In this initial work, we consider only two-dimensional spatial inference problems, rather than the 3D or 3D-with-time problems commonly seen in applications. However, the framework we develop here is designed to be extended to the full 3D-with-time case. The contributions of this work are as follows:
• We propose an approach for expressing the DA problem as a message-passing algorithm.
• We develop a GPU-accelerated implementation for maximum a posteriori (MAP) inference, which naturally supports distributed computation for very large domains.
• We demonstrate the efficacy of our algorithm on surface temperature data, showing that our approach is a viable DA technique.
• We also include a fast, GPU-accelerated implementation of 3D-Var as a very strong baseline.
We release our message passing and 3D-Var implementations, and code to reproduce our experiments, at github.com/oscarkey/message-passing-for-da.
2. Background
In this section, we recall the Bayesian formulation of DA and discuss several inference algorithms. In particular, we introduce the message-passing algorithm that we apply in this work.
2.1. Data assimilation as Bayesian inference
DA comprises a set of techniques in NWP that aim to combine earth observations with assumptions about the state of the weather to produce an updated estimate of the weather. In this work, we simplify the full DA problem to spatial inference of a single variable. Let $ \boldsymbol{f}=\left({f}_1,\dots, {f}_n\right)\in {\mathrm{\mathbb{R}}}^n $ be a weather variable, such as surface temperature, that is discretized on a large 2D spatial grid consisting of points $ {\boldsymbol{x}}_1,\dots, {\boldsymbol{x}}_n $ . We also have $ m $ observations, $ \boldsymbol{y}\in {\mathrm{\mathbb{R}}}^m $ , which may either be derived from remote sensing products or direct observations of atmospheric variables. From a probabilistic perspective, DA can then be understood as a Bayesian inference problem: our assumptions about the weather can be encoded as a prior $ p\left(\boldsymbol{f}\right) $ , the observations as the likelihood $ p\left(\boldsymbol{y}|\boldsymbol{f}\right) $ , and our updated estimate as the posterior computed via Bayes’ theorem as $ p\left(\boldsymbol{f}|\boldsymbol{y}\right)\propto p\left(\boldsymbol{y}|\boldsymbol{f}\right)p\left(\boldsymbol{f}\right) $ . $ n $ is typically in the billions and $ m $ in the tens of millions (MetOffice 2024), making direct inference using Bayes’ theorem impossible. Thus, the choice of prior and likelihood is heavily influenced by the need to make inference efficient, and we discuss several options in the next section.
In operational DA systems, the problem is more complex, consisting of a 3D spatiotemporal grid and multiple weather variables at each grid point. However, inference in the simplified setting above is still challenging for large $ n $ and $ m $ . In Section 5, we discuss extensions to the full DA problem.
2.2. Existing methods for large-scale inference
2.2.1. Optimal interpolation
Optimal interpolation (Kalnay, 2003, Section 5.4.1) is virtually synonymous with Gaussian process (GP) regression from the machine learning literature (Rasmussen and Williams, 2006). It is the basis of all the DA methods discussed in this work, with subsequent methods being approximations to it. GP regression assumes a Gaussian prior $ p\left(\boldsymbol{f}\right)=\mathcal{N}\left({\boldsymbol{f}}_b,\boldsymbol{\Sigma}\right) $ and likelihood $ p\left(\boldsymbol{y}|\boldsymbol{f}\right)=\mathcal{N}\left(\boldsymbol{Hf},\boldsymbol{R}\right) $. Here, $ {\boldsymbol{f}}_b\in {\mathbb{R}}^n $ is the prior mean, and $ \boldsymbol{\Sigma}\in {\mathbb{R}}^{n\times n} $ is the prior covariance defined by a function $ k $ (known as a reproducing kernel), where $ {\boldsymbol{\Sigma}}_{i,j}=k\left({\mathbf{x}}_i,{\mathbf{x}}_j\right) $. $ \boldsymbol{H}\in {\mathbb{R}}^{m\times n} $ and $ \boldsymbol{R}\in {\mathbb{R}}^{m\times m} $ are the linear observation operator and the diagonal observation error covariance, which are either determined using prior knowledge or learned from data. Under this prior and likelihood, the posterior $ p\left(\boldsymbol{f}|\mathbf{y}\right) $ is also Gaussian and has a closed-form expression. Unfortunately, computing the desired posterior costs $ \mathcal{O}\left({l}^3+ nl\right) $, where $ l $ is the number of grid points at which there are observations. A modern solution to large-scale inference with GPs is to use inducing-point methods, which approximate the observations with a much smaller number of pseudo data points (Titsias, 2009). However, the estimates obtained may be too crude for practical use at the scale typically considered in numerical weather forecasting.
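For concreteness, the closed-form posterior mean is the familiar optimal interpolation update $ {\boldsymbol{f}}_a={\boldsymbol{f}}_b+\boldsymbol{\Sigma}{\boldsymbol{H}}^{\top}{\left(\boldsymbol{H}\boldsymbol{\Sigma}{\boldsymbol{H}}^{\top}+\boldsymbol{R}\right)}^{-1}\left(\boldsymbol{y}-\boldsymbol{H}{\boldsymbol{f}}_b\right) $. The following is a minimal dense-algebra sketch of this update, not our implementation: it scales cubically, is only feasible for small problems, and the variable names are illustrative.

```python
import jax.numpy as jnp

def oi_posterior_mean(f_b, Sigma, H, R, y):
    """Optimal interpolation / GP regression posterior mean (dense, cubic cost)."""
    d = y - H @ f_b            # innovation: observations minus background
    S = H @ Sigma @ H.T + R    # innovation covariance
    return f_b + Sigma @ H.T @ jnp.linalg.solve(S, d)
```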
2.2.2. Gaussian Markov random fields
Another solution to reducing the cubic cost is to use a prior $ p\left(\boldsymbol{f}\right) $ defined by a GMRF (Rue and Held, 2005). This approach makes use of the spatial interpretation of $ \boldsymbol{f} $ to make a Markovian assumption that each $ {f}_i $ only directly depends on the other $ {f}_j $ in its neighborhood. Additionally, the prior $ p\left(\boldsymbol{f}\right) $ and the inference process are expressed in terms of the inverse of the covariance matrix, known as the precision. Under the Markovian assumption, the precision matrix is sparse, allowing inference in GMRFs to scale as $ \mathcal{O}\left({n}^{3/2}\right) $ in 2D and $ \mathcal{O}\left({n}^2\right) $ in 3D (the complexity does not depend on the number of observations). In addition, the INLA (Integrated Nested Laplace Approximation) framework (Rue et al., 2009) makes it possible to handle nonlinear observation models and to infer the model hyperparameters from data, although this requires further approximations. A downside of the approach is that it is inherently sequential. Thus, it cannot make effective use of modern parallel computing hardware, such as GPUs, or be distributed across several compute nodes.
2.2.3. 3D-Var
Another alternative that reduces the cost of GP regression is to compute only the MAP estimate, rather than the full posterior, that is, to find $ {\boldsymbol{f}}_{\mathrm{MAP}}=\arg {\max}_{\boldsymbol{f}}\;p\left(\boldsymbol{f}|\mathbf{y}\right) $. This is the approach taken by 3D-Var, which is known as a “variational” method. The maximization can also be expressed as minimizing the cost function $ J\left[\boldsymbol{f}\right]=-\log p\left(\boldsymbol{f}|\mathbf{y}\right)=-\log p\left(\mathbf{y}|\boldsymbol{f}\right)-\log p\left(\boldsymbol{f}\right)+C $, where $ C $ is a constant that does not depend on $ \boldsymbol{f} $. Substituting in the GP prior and Gaussian likelihood, the cost function becomes
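$$ J\left[\boldsymbol{f}\right]=\frac{1}{2}{\left(\boldsymbol{f}-{\boldsymbol{f}}_b\right)}^{\top}{\boldsymbol{\Sigma}}^{-1}\left(\boldsymbol{f}-{\boldsymbol{f}}_b\right)+\frac{1}{2}{\left(\boldsymbol{y}-\boldsymbol{Hf}\right)}^{\top}{\boldsymbol{R}}^{-1}\left(\boldsymbol{y}-\boldsymbol{Hf}\right), $$

up to an additive constant.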
In practice, this is minimized using an optimizer, for example, L-BFGS (Liu and Nocedal, 1989), as in D’Amore et al. (2015), or a Krylov solver, as in Bauer et al. (2020). 3D-Var is more flexible than optimal interpolation, as it can handle nonlinear observation operators. However, being a MAP estimator, it has the downside of not providing uncertainty estimates. In weather forecasting applications, 3D-Var is usually extended to 4D-Var, which includes a time component.
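As an illustration of this minimization, the sketch below writes a precision-form 3D-Var objective in JAX and solves it with JAXopt’s L-BFGS. This is not our released implementation: the sparse prior precision `P`, the dense operator `H`, the diagonal `R_inv_diag`, and all other names are illustrative assumptions.

```python
import jax.numpy as jnp
from jaxopt import LBFGS

def make_3dvar_cost(f_b, P, H, R_inv_diag, y):
    """3D-Var cost: 0.5 (f - f_b)^T P (f - f_b) + 0.5 (y - Hf)^T R^{-1} (y - Hf)."""
    def cost(f):
        df = f - f_b
        prior_term = 0.5 * df @ (P @ df)          # background (prior) misfit
        resid = y - H @ f
        obs_term = 0.5 * jnp.sum(R_inv_diag * resid ** 2)  # observation misfit
        return prior_term + obs_term
    return cost

# Hypothetical usage, with f_b, P, H, R_inv_diag, y defined elsewhere:
# solver = LBFGS(fun=make_3dvar_cost(f_b, P, H, R_inv_diag, y), tol=1e-6, maxiter=500)
# f_map, state = solver.run(init_params=f_b)
```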
2.3. Factor graphs and inference with message passing
In this work, we perform inference using message passing (Kschischang et al., 2001), which computes the marginals of any joint probability model that can be expressed as a factor graph. We introduce message passing for a general continuous distribution $ g\left(\boldsymbol{f}\right)=g\left({f}_1,\dots, {f}_n\right) $ and specialize it to our posterior in Section 3.2. The distribution $ g\left(\boldsymbol{f}\right) $ can be expressed as a factor graph if it has a known decomposition
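$$ g\left(\boldsymbol{f}\right)\propto \prod_{i=1}^n{\phi}_i\left({f}_i\right)\prod_{i\sim j}{\phi}_{i,j}\left({f}_i,{f}_j\right), $$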
where $ {\phi}_i $ and $ {\phi}_{i,j} $, referred to as the nodewise and pairwise factors, respectively, are functions from a variable, or a pair of variables, to $ \mathbb{R} $. These do not have to be probability distributions. The factorization in (2) induces a sparse graph on the variables $ {\left\{{f}_i\right\}}_{i=1}^n $, where two nodes $ {f}_i $ and $ {f}_j $ are connected if and only if $ {\phi}_{i,j} $ is nonconstant. Given such a graph, the algorithm associates a pair of messages with each edge at each iteration $ t $: a message $ {m}_{ij}^t=\left({a}_{ij}^t,{b}_{ij}^t\right)\in \mathbb{R}\times \mathbb{R} $ from $ {f}_i $ to $ {f}_j $, and a similar message $ {m}_{ji}^t $ from $ {f}_j $ to $ {f}_i $. We use the message-passing variant introduced by Ruozzi and Tatikonda (2013), in turn a generalization of Wiegerinck and Heskes (2002). This is summarized in Algorithm 1 and illustrated for our application in Figure 2; Appendix A describes it in more detail.
Algorithm 1 Re-weighted message passing.
1: procedure MessagePassing( $ {\left\{{f}_i\right\}}_{i=1}^n $ , $ {\left\{{\phi}_i\right\}}_{i=1}^n $ , $ {\left\{{\phi}_{ij}\right\}}_{i,j=1}^n $ ) $ \vartriangleright $ input is the factor graph
2: $ {m}_{ij}^{t=0}=\left(0,{10}^{-8}\right) $ for all $ i $ , $ j $
3: for $ t\in \left\{1,\dots, T\right\} $ do
4: for $ {f}_i $ in the graph do
5: for $ {f}_j\in \left\{{f}_j\sim {f}_i\right\} $ do
6: $ {m}_{ij}^t=\textsc{ComputeOutgoingMessage}\left(c,{\phi}_i,\left\{\left({\phi}_{ki},{m}_{ki}^{t-1}\right):{f}_k\in \left\{{f}_k\sim {f}_i\right\}\setminus {f}_j\right\}\right) $
7: end for
8: end for
9: end for
10: return $ g\left({f}_i\right)=\textsc{ComputeMarginal}\left(\left\{{m}_{ki}^T:{f}_k\sim {f}_i\right\}\right) $ for all $ i $
11: end procedure
We use the notation $ \left\{{f}_k\sim {f}_i\right\} $ to denote the set of all variables $ {f}_k $ that are connected to $ {f}_i $. $ \textsc{ComputeOutgoingMessage}\left(\right) $ and $ \textsc{ComputeMarginal}\left(\right) $ are defined formally in Appendix B, $ T $ is the total number of iterations, and $ c\in \mathbb{R} $ is a hyperparameter that re-weights the contributions of the messages, which is necessary to aid convergence. Note that each iteration of the inner for loops is independent, and each node only reads and writes messages with the nodes directly connected to it. The algorithm is therefore very amenable to distributed computation.
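To make the structure of these updates concrete, the sketch below implements plain synchronous Gaussian belief propagation on a precision-form model. It omits the re-weighting constant $ c $, the damping, and the GPU vectorization of our actual implementation, so it is only a simplified stand-in for Algorithm 1; the function and variable names are ours and purely illustrative.

```python
import numpy as np

def gaussian_bp_means(P, h, num_iters=200):
    """Synchronous Gaussian belief propagation for p(f) ~ exp(-0.5 f^T P f + h^T f).

    P : (n, n) symmetric precision matrix (dense here for simplicity)
    h : (n,)   linear (shift) term
    Returns the marginal means; convergence requires a well-conditioned
    (e.g., walk-summable) precision matrix.
    """
    n = P.shape[0]
    neighbours = [np.flatnonzero((P[i] != 0) & (np.arange(n) != i)) for i in range(n)]
    a = np.zeros((n, n))  # a[i, j]: quadratic (precision) part of message i -> j
    b = np.zeros((n, n))  # b[i, j]: linear (shift) part of message i -> j
    for _ in range(num_iters):
        a_new, b_new = np.zeros_like(a), np.zeros_like(b)
        for i in range(n):
            for j in neighbours[i]:
                # "Cavity" parameters: node factor plus all incoming messages except j's.
                alpha = P[i, i] + a[:, i].sum() - a[j, i]
                beta = h[i] + b[:, i].sum() - b[j, i]
                a_new[i, j] = -P[i, j] ** 2 / alpha
                b_new[i, j] = -P[i, j] * beta / alpha
        a, b = a_new, b_new
    marginal_precision = np.diag(P) + a.sum(axis=0)
    return (h + b.sum(axis=0)) / marginal_precision
```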
While this instance of message passing can theoretically compute both the mean and the variance of the marginals, in practice the variance estimates are biased and do not provide useful estimates of the uncertainty (Weiss and Freeman, 1999). Thus, we only use message passing to compute the posterior mean which, because the posterior is Gaussian, coincides with the MAP estimate. This makes message passing an alternative to 3D/4D-Var.
3. DA with message passing
Our method begins by placing a Matérn GP prior over the domain, which is the de facto standard model choice in spatial geostatistics (Guttorp and Gneiting, 2006), and assuming a Gaussian likelihood. We discretize the prior to a GMRF and derive the corresponding factor graph. Then, we apply a message-passing algorithm to the graph and the observations to compute the marginal posterior means.
3.1. Derivation of the factor graph
The Matérn GP prior can be characterized as the solution to a stochastic partial differential equation (SPDE) of the form
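$$ {\left({\kappa}^2-\Delta \right)}^{\alpha /2}f\left(\boldsymbol{x}\right)=\mathcal{W}\left(\boldsymbol{x}\right), $$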
where $ f $ is the process, $ \Delta $ is the Laplacian operator, $ \kappa $ and $ \alpha $ are positive hyperparameters, and $ \mathcal{W} $ is a spatial white-noise process with spectral density $ {\sigma}^2q $, for hyperparameters $ \sigma, q\in \mathbb{R} $. Following Lindgren et al. (2011), we first derive a GMRF representation of the Matérn GP by discretizing this SPDE using finite differences (finite elements would also be possible). On a uniform 2D grid with step sizes $ \Delta x $ and $ \Delta y $ in the $ x $ and $ y $ directions, respectively, this yields a random matrix–vector system
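$$ \boldsymbol{L}\boldsymbol{f}=\boldsymbol{w}. $$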
Here, $ \boldsymbol{L}\in {\mathbb{R}}^{n\times n} $ is a matrix representing the operator $ \mathcal{L}:= {\left({\kappa}^2-\Delta \right)}^{\alpha /2} $ under discretization, which is guaranteed to be sparse if the exponent $ \alpha /2 $ is an integer (Lindgren et al., 2011), and $ \boldsymbol{f},\boldsymbol{w} $ are finite-dimensional vector representations of the random fields $ f $ and $ \mathcal{W} $ (see Appendix C for details of the discretization). Now, (4) implies that $ \boldsymbol{f} $ is a Gaussian random variable of the form
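$$ \boldsymbol{f}\sim \mathcal{N}\left(\mathbf{0},{\left(\gamma {\boldsymbol{L}}^{\top}\boldsymbol{L}\right)}^{-1}\right)\quad \text{for a constant }\gamma >0, $$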
which is a GMRF, since its precision matrix $ \boldsymbol{P}\hskip0.5em := \hskip0.5em \gamma {\boldsymbol{L}}^T\boldsymbol{L} $ is sparse (here, $ \boldsymbol{L} $ is sparse and banded). We can build a graph from this GMRF by the following simple rule: Take all of $ {f}_1,\dots, {f}_n $ as nodes in the graph and connect two nodes $ {f}_i $ and $ {f}_j $ by an edge if $ {\left[\boldsymbol{P}\right]}_{ij}\ne 0 $ . Then, we have
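$$ p\left(\boldsymbol{f}\right)\propto \exp \left(-\frac{1}{2}{\boldsymbol{f}}^{\top}\boldsymbol{P}\boldsymbol{f}\right)=\exp \left(-\frac{1}{2}\sum_{i=1}^n{\left[\boldsymbol{P}\right]}_{ii}{f}_i^2-\frac{1}{2}\sum_{i=1}^n\sum_{j\sim i}{\left[\boldsymbol{P}\right]}_{ij}{f}_i{f}_j\right), $$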
where $ {\Sigma}_{j\sim i}\left(\cdot \right) $ denotes the sum over all indices $ j $ that are adjacent to $ i $ in the graph. Setting
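$$ {\phi}_i\left({f}_i\right):= \exp \left(-\frac{1}{2}{\left[\boldsymbol{P}\right]}_{ii}{f}_i^2\right),\qquad {\phi}_{ij}\left({f}_i,{f}_j\right):= \exp \left(-\frac{1}{2}{\left[\boldsymbol{P}\right]}_{ij}{f}_i{f}_j\right), $$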
we have that the prior satisfies $ p\left(\boldsymbol{f}\right)\propto {\prod}_{i=1}^n{\phi}_i\left({f}_i\right){\prod}_{j\sim i}{\phi}_{ij}\left({f}_i,{f}_j\right) $, which is a factor graph representation.
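To illustrate how such a precision, and hence the graph, can be assembled in practice, the sketch below covers the $ \alpha =2 $ case using a standard five-point finite-difference Laplacian. It is not our implementation: the helper name, the trivial boundary handling, and passing the constant $ \gamma $ in directly are all simplifying assumptions.

```python
import scipy.sparse as sp

def matern_precision(nx, ny, kappa, dx=1.0, dy=1.0, gamma=1.0):
    """Sparse precision P = gamma * L^T L with L ~ (kappa^2 - Laplacian), alpha = 2."""
    # 1D second-difference operators in x and y.
    Dx = sp.diags([1.0, -2.0, 1.0], [-1, 0, 1], shape=(nx, nx)) / dx ** 2
    Dy = sp.diags([1.0, -2.0, 1.0], [-1, 0, 1], shape=(ny, ny)) / dy ** 2
    laplacian = sp.kron(sp.eye(ny), Dx) + sp.kron(Dy, sp.eye(nx))  # 2D Laplacian
    L = kappa ** 2 * sp.eye(nx * ny) - laplacian
    return gamma * (L.T @ L).tocsr()

# The factor graph then connects f_i and f_j whenever P[i, j] != 0:
# P = matern_precision(64, 64, kappa=0.5)
# rows, cols = P.nonzero()
# edges = [(i, j) for i, j in zip(rows, cols) if i != j]
```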
3.2. Computing the posterior mean with message passing
Having calculated the factor graph corresponding to the prior, we can now apply message passing (Algorithm 1) to combine it with the observations and compute the posterior marginals. We make several modifications to tailor the algorithm to DA, which we summarize below and detail in Appendix B.
3.2.1. Including observations
To add information about the observations $ \mathbf{y} $ to our factor graph, we modify the nodewise factor in (7) by $ {\phi}_i\left({f}_i\right)\ \mapsto\ p\left({y}_i|{f}_i\right){\phi}_i\left({f}_i\right) $. In our experiments, we assume that the weather variable is noisily observed at a subset of grid cells, where the noise variance $ {\sigma}_y^2\in \mathbb{R} $ is the same for all observations. Thus, we set $ p\left({y}_i|{f}_i\right)=\mathcal{N}\left({y}_i|{f}_i,{\sigma}_y^2\right) $ at grid points $ {\mathbf{x}}_i $ where there is an observation, and $ p\left({y}_i|{f}_i\right)=\mathcal{N}\left({y}_i|0,z\right) $, for very large $ z $, where there is not.
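In natural-parameter form this is a purely local, additive update to the nodewise factors. A small sketch under the Gaussian factor parameterization above (the function and array names are ours):

```python
import numpy as np

def add_observations(node_prec, node_shift, obs_idx, obs_vals, sigma_y):
    """Multiply phi_i(f_i) by N(y_i | f_i, sigma_y^2) at observed grid points.

    In natural parameters this adds 1/sigma_y^2 to the node's quadratic
    (precision) term and y_i/sigma_y^2 to its linear (shift) term.
    """
    node_prec = node_prec.copy()
    node_shift = node_shift.copy()
    node_prec[obs_idx] += 1.0 / sigma_y ** 2
    node_shift[obs_idx] += obs_vals / sigma_y ** 2
    return node_prec, node_shift
```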
3.2.2. Update damping
To improve the stability of the algorithm, we follow Pretti (2005) and dampen the updates of the messages, replacing line 6 of Algorithm 1 with
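$$ {m}_{ij}^t=\eta \cdot \textsc{ComputeOutgoingMessage}\left(c,{\phi}_i,\left\{\left({\phi}_{ki},{m}_{ki}^{t-1}\right):{f}_k\in \left\{{f}_k\sim {f}_i\right\}\setminus {f}_j\right\}\right)+\left(1-\eta \right){m}_{ij}^{t-1}, $$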
where $ \eta \in \left(0,1\right) $ is a hyperparameter, which we refer to as the learning rate.
3.2.3. Early stopping
To avoid specifying the total number of iterations as a hyperparameter, we choose a generic large $ T $ and stop when the change in message between iterations is smaller than a threshold.
3.2.4. Multigrid
We apply a multigrid technique to speed up the convergence of the algorithm. Multigrid techniques have been used extensively in the numerical solution of partial differential equations, and provide an efficient way of accelerating the convergence of iterative approaches when multiscale phenomena are being modeled, with grids at different resolutions capturing different spatial scales. In the case of message passing, we can intuitively view information as propagating from the observation locations across the graph. If the density of the observations is low, this can take many iterations. To address this, the multigrid approach starts with a low-resolution grid and iterates message passing until convergence, which is fast on a low-resolution grid. It then doubles the size of the grid, initializing the messages from the converged messages of the previous grid. This process is repeated until we reach the target grid size. Observations are defined on the target resolution and are introduced at each multigrid level only where the observation location coincides exactly with a grid point at that level. Figure 3 illustrates the procedure.
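The coarse-to-fine schedule can be summarized as the sketch below, in which every helper (`build_factor_graph`, `observations_on_grid`, `run_message_passing`, `upsample_messages`) is a hypothetical placeholder rather than part of our released code.

```python
def multigrid_message_passing(observations, base_size, target_size):
    """Run message passing coarse-to-fine, doubling the grid until target_size."""
    size = base_size
    messages = None  # default message initialization on the coarsest grid
    while True:
        graph = build_factor_graph(size)                   # hypothetical helper
        y = observations_on_grid(observations, size)       # only obs that land on grid points
        messages = run_message_passing(graph, y, init_messages=messages)
        if size >= target_size:
            return messages
        size *= 2                                          # double the resolution
        messages = upsample_messages(messages, new_size=size)  # warm-start the finer grid
```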
3.2.5. Computational efficiency
The Markovian assumption made by the GMRF prior, from which we derive the factor graph, results in a sparse graph in which each node is only connected to nodes in its local area. The connectivity is also very regular, with each node connected to the same set of relative positions except at the edges of the graph. These two properties make the graph highly amenable to GPU computation, using an approach similar to that of Zhou et al. (2022) but optimized for our particular graph structure.
The main advantage of the message-passing approach is that it can naturally be distributed by dividing the nodes of the graph into subdomains and splitting them between compute nodes. The only communication required between compute nodes is to update the messages on the borders of the subdomains, which is a small amount of data. Additionally, there is no requirement that the border messages are updated at every iteration, so they could be updated asynchronously with the computation within each subdomain.
4. Experiments
We evaluate the performance of message passing on both simulated data and a more realistic surface temperature DA problem. We compare against two baselines: the GMRF method, using the R-INLA implementation (Lindgren and Rue, 2015), and 3D-Var. Note that R-INLA computes the exact marginals of the posterior, while 3D-Var and message passing are approximate methods that compute only the mean. Thus, R-INLA provides a reference for the best-case error that the approximate methods could achieve.
GPU-accelerated 3D-Var implementation. The 3D-Var cost function is obtained by substituting in the same GMRF prior and Gaussian likelihood as used for message passing, and is then minimized using the L-BFGS optimizer. We use our own GPU-accelerated implementation built on the experimental support for sparse linear algebra in JAX (Bradbury et al., 2018) and JAXopt (Blondel et al., 2021). We expect this implementation to be significantly faster than the CPU implementations currently in deployment. We also implement message passing in JAX with GPU acceleration, while R-INLA runs on the CPU only.
Hyperparameters. We perform a grid search, reported in Appendix D.1, to select the message weighting and learning rate hyperparameters of message passing, and the early stopping thresholds for both message passing and 3D-Var. We note that selecting the message weighting and learning rate is quite easy: the algorithm converges for a wide range of choices and, if it does converge, the exact choice has only a small effect on the speed of convergence. Appendix E gives the remaining details of all experiments.
4.1. Effect of domain size and observation density
Our first set of experiments is performed on simulated data. We sample a square ground truth field from a GMRF prior, randomly select a given fraction of the grid points as observations, and perform inference against these observations using the same prior. Table 1 shows the RMSE between the posterior mean and the ground truth, and the time taken, as we vary the grid size and the observation density. In the message-passing runs, the multigrid approach is used, with a base grid size of $ 32\times 32 $ in all cases.
The results show that 3D-Var and message passing achieve similar RMSEs in all cases, although both have more error than R-INLA when only 1% of the grid is observed. When 5% and 10% of the grid is observed, 3D-Var and message passing take similar amounts of time; however, message passing is significantly slower for a larger grid when only 1% is observed. While the iteration time of message passing is independent of the number of observations, as the observation density falls it takes an increasing number of iterations for the information to propagate from the observed points across the grid, thus early stopping happens later. R-INLA is much slower than the other methods, because it is a sequential method that cannot take advantage of the GPU.
4.2. Large-scale example
For a more realistic use case, we consider the global surface temperature field. We take the ground truth data from a run of the Met Office’s Unified Model (Walters et al., 2019) at N1280 resolution, valid at 06UTC on 2020-01-01. To avoid issues with boundary conditions, we consider a clipped domain of dimensions $ 2500\times 1500=3.75M $ grid points. We use spherical polar coordinates. The observation locations are generated from the geographical positions (latitude, longitude) of weather-focused satellites calculated over a 3-hour window, which corresponds to $ \approx 8\% $ of the grid being observed. As the prior mean, we select a climatological mean of the global surface temperature calculated from ERA5 (Hersbach et al., 2020).
Figure 1 highlights the resultant mean estimates from the message-passing approach, while error plots are shown in Figure 7 in Appendix D. 3D-Var achieved an area-weighted RMSE of $ 2.33 $ K and took $ 16 $ seconds, while message passing achieved an area-weighted RMSE of $ 1.23 $ K and took $ 115 $ seconds. For comparison, the area-weighted RMSE calculated for the prior mean (ERA5) against the high-resolution temperature field is $ 2.78 $ K. R-INLA did not complete after $ >1 $ hour of processing time; thus, we do not include its results.
5. Conclusion
In this article, we present a new perspective on DA based on insights from the literature on large-scale Bayesian inference. We demonstrate that our message-passing approach is viable and can, in many scenarios, produce results competitive with a GPU-accelerated 3D-Var implementation. In the era of large heterogeneous computing systems, the scalability issues with state-of-the-art variational DA are well documented and the design of the proposed method should offer improved scalability. However, further research is required to determine if this design offers advantages for operational-scale problems.
We make several simplifying assumptions. First, we have only considered spatial inference in two dimensions. However, our approach can be extended to full spatiotemporal inference, as we describe in Appendix C.3. We have also only considered a single weather variable and linear observation operators. Multiple weather variables can be handled relatively easily within the current framework by assuming that they are independent under the prior; further work is required for variables that are coupled in the prior. It may also be possible to support nonlinear observation operators using iterative linearization techniques (Kamthe et al., 2022).
The primary limitation of our approach is that message passing only reliably computes the posterior mean; the obtained posterior marginal uncertainties are inherently biased when the graph is loopy (Weiss and Freeman, 1999). Thus, we cannot get an accurate estimate of the uncertainty in the assimilated state. The approach also requires a Gaussianity assumption in the prior, although for non-Gaussian priors arising from nonlinear stochastic PDEs, we may be able to handle this using an iterative linearization method (Anderka et al., 2024). Finally, because our uncertainty estimates are unreliable, we cannot make use of the marginal likelihood to learn the hyperparameters of the prior, for example, the kernel lengthscale. Currently, we handle this using cross-validation on held-out observations. We note, however, that all of these limitations are shared with popular large-scale DA methods, such as 3D-Var, and resolving them will be a significant step forward for future DA research.
Open peer review
To view the open peer review materials for this article, please visit http://doi.org/10.1017/eds.2024.47.
Supplementary material
The supplementary material for this article can be found at http://doi.org/10.1017/eds.2024.47.
Acknowledgments
The authors are grateful to Dr. Cyril Morcrette for technical assistance and for providing the Met Office’s Unified Model high-resolution temperature data. The authors are also grateful to Prof. Mohammad Emtiyaz Khan and Prof. Nicholas Ruozzi for useful discussions and feedback.
Author contribution
Conceptualization: ST and MD. Methodology: ST and MD. Software: OK and DG. Data curation: DG. Data visualization: DG and OK. Writing—original draft: OK. Writing—review and editing: OK, ST, DG, and MD. Supervision: ST and MD. All authors approved the final submitted draft.
Data availability statement
Our message passing and 3D-Var implementations, and code to reproduce our experiments, are available at https://github.com/oscarkey/message-passing-for-da. The code is also archived at https://doi.org/10.5281/zenodo.14176688. The data for the surface temperature experiments are taken from the Met Office’s Unified Model and thus cannot be publicly released. The data for the other experiments are generated automatically by the experiments.
Provenance
This article was accepted into the Climate Informatics 2024 (CI2024) Conference. It has been published in Environmental Data Science on the strength of the CI2024 review process.
Funding statement
OK acknowledges support from the Engineering and Physical Sciences Research Council with grant number EP/S021566/1. ST is supported by a Department of Defense Vannevar Bush Faculty Fellowship held by Prof. Andrew Stuart, and by the SciAI Center, funded by the Office of Naval Research (ONR), under Grant Number N00014-23-1-2729.
Competing interest
The authors declare none.
Ethical standard
The research meets all ethical guidelines, including adherence to the legal requirements of the study country.