
A locally time-invariant metric for climate model ensemble predictions of extreme risk

Published online by Cambridge University Press:  07 July 2023

Mala Virdee*
Affiliation:
Department of Computer Science and Technology, University of Cambridge, Cambridge, United Kingdom
Markus Kaiser
Affiliation:
Department of Computer Science and Technology, University of Cambridge, Cambridge, United Kingdom; Monumo Ltd., Cambridge, United Kingdom
Carl H. Ek
Affiliation:
Department of Computer Science and Technology, University of Cambridge, Cambridge, United Kingdom
Emily Shuckburgh
Affiliation:
Department of Computer Science and Technology, University of Cambridge, Cambridge, United Kingdom
Ieva Kazlauskaite
Affiliation:
Department of Engineering, University of Cambridge, Cambridge, United Kingdom
* Corresponding author: Mala Virdee; Email: [email protected]

Abstract

Adaptation-relevant predictions of climate change are often derived by combining climate model simulations in a multi-model ensemble. Model evaluation methods used in performance-based ensemble weighting schemes have limitations in the context of high-impact extreme events. We introduce a locally time-invariant method for evaluating climate model simulations with a focus on assessing the simulation of extremes. We explore the behavior of the proposed method in predicting extreme heat days in Nairobi and provide comparative results for eight additional cities.

Type
Methods Paper
Creative Commons
This is an Open Access article, distributed under the terms of the Creative Commons Attribution licence (http://creativecommons.org/licenses/by/4.0), which permits unrestricted re-use, distribution and reproduction, provided the original article is properly cited.
Copyright
© The Author(s), 2023. Published by Cambridge University Press

Impact Statement

Adaptation to climate change requires predictions of how the frequency and severity of extreme events will change in the future. Here, we consider the occurrence of extreme heat days in cities, which pose serious societal risks including exceedance of human heat stress thresholds. We propose a method for combining multiple climate model simulations that optimizes predictions of such extreme events, and demonstrate the advantages of this method for nine cities.

1. Introduction

1.1. Background

Climate change is increasing the frequency and severity of extreme weather events, including high-temperature extremes (Pörtner et al., 2022). The occurrence of heat extremes exceeding human heat stress thresholds is associated with increased mortality and morbidity, particularly in rapidly urbanizing developing economies (Tuholske et al., 2021). People particularly exposed and vulnerable to heat stress risk include the urban poor, those in informal housing, the elderly, those with chronic health conditions, and outdoor workers (Cardona et al., 2012). Reliable predictions of future changes in the frequency, intensity, and distribution of high-temperature extremes are particularly critical for cities: impacts are amplified by urban heat island effects and high population density, but city-scale adaptation measures have been demonstrated to significantly reduce risk (Estrada et al., 2017). Here, we predict extreme heat days, defined as days on which average temperature exceeds the 90th percentile of local historically observed temperatures, in accordance with several other analyses of changing extreme heat risk (Morak et al., 2013; Seneviratne et al., 2014). We introduce a method of evaluating the skill of climate models in simulating observed extreme heat days and derive a multi-model ensemble scheme with a focus on predicting these extremes.

The latest general circulation models (GCMs) effectively reproduce observed large-scale trends and provide robust predictions of global average changes, but exhibit significant uncertainty in the local regime and for prediction of extremes (Flato et al., 2014). A growing sector of “climate services” aims to bridge the gap between seasonal local weather forecasting and long-term mean climatology to provide decadal to multi-decadal predictions for use in impact assessment and development of adaptation strategies (Meehl et al., 2009). Deriving decision-relevant information from GCMs typically involves combining predictions from several models in a multi-model ensemble. The multi-model approach aims to provide more skillful and robust predictions by utilizing the various strengths of different models, as well as an estimate of structural model uncertainty (Stainforth et al., 2007).

The most straightforward and widely used method of combining predictions from multiple models is to calculate a multi-model mean (MMM), which has been found to outperform any individual model for a range of tasks (Weigel et al., 2008). However, an equal-weighted ensemble does not take into account model skill in simulating the historically observed quantity of interest, and assumes each model is an independent estimate. Modeling groups often share assumptions and biases, leading to overconfident predictions (Tebaldi and Knutti, 2007). Alternative methods involving unequal independence-based or skill-based weighting of ensemble members include reliability ensemble averaging (Giorgi and Mearns, 2003), independence-weighted mean (Bishop and Abramowitz, 2013), and Bayesian model averaging (BMA) (Raftery et al., 2005).

In Section 2, we introduce a novel method for the evaluation of model simulations which optimizes a skill-based weighting. Here, BMA is used to derive a predictive probability distribution of future temperatures by assigning skill-based weights to each climate model. The approach to model evaluation proposed here could be incorporated into any other skill-based model weighting approach; for a review of schemes used to derive probabilistic predictions from climate model projections for impact assessment and adaptation planning, see Brunner et al. (2020).

1.2. Related work

The problem addressed in this work is the measurement of similarity between a simulated and observed time series in the context of climate model evaluation. A similar problem is faced in many other contexts where comparing two time series either according to Euclidean distance or by comparing summary statistics is found to be insufficient. Dynamic time warping (DTW) algorithms allow non-linear alignment between series in contexts where an informative measure should consider the similarity between the “shape” of two signals rather than local temporal synchronization. For example, in the field of automatic speech recognition where DTW originates, a measure should register similarity between the same speech pattern spoken at different speeds (Rabiner and Juang, 1993). In climatology, DTW has been applied to measure similarity between local climates to develop a global climate classification scheme (Netzel and Stepinski, 2017).

As discussed further in Section 2, strictly order-preserving alignments such as DTW and its extensions lack flexibility. An alternative approach is to use a divergence metric to calculate a distance between two distributions; disregarding the temporal structure of data allows extremes to be compared. Optimal transport-based divergence methods have been applied to climate model evaluation, for instance, ranking models according to the Wasserstein distance from observed climate (Vissio et al., 2020).

Some existing methods share the motivation of the work presented in this paper toward developing a flexible time-series similarity measure that also incorporates temporal structure (Zhang et al., 2020). However, the authors are not aware of any application of such methods for climate model evaluation.

2. Methodology

2.1. Model evaluation

Since GCMs simulate climate, they are not expected to provide synchronous simulations of weather at a specific location under future climate conditions, but it is assumed that they can yield informative statistics of future weather at some aggregate scale. Pointwise evaluation of daily simulations against historical observations requires a climate model to predict weather. Given sequences of $ T $ daily simulations from a climate model $ A $ and historical observations $ B $ against which they are to be evaluated, we cannot expect the time series to match under a daily pointwise error measure such as the root-mean-squared error (RMSE): $ rmse\left(A,B\right):= \sqrt{\frac{1}{T}{\sum}_{t=0}^{T-1}{\left({A}_t-{B}_t\right)}^2}. $

We expect this error to be high even for GCMs that are skilled in reproducing large-scale patterns. RMSE implicitly assumes the two time series to be aligned by comparing the predictions for individual days $ t $ . Summary statistics aim to avoid this issue by introducing some degree of time invariance by binning data into monthly, seasonal, or longer periods. Model error can then be calculated per time bin, for instance, by comparing the simulated and observed average or variance for each period, or by counting-based error measures such as comparing simulated and observed histograms to assess simulated variability. Within each bin, simulations could be permuted freely in time to yield the same measure of skill. This enables models to be evaluated without placing the expectation that they should simulate weather.
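The time invariance obtained by binning can be illustrated with a small sketch. The function name `binned_mean_error`, the bin size of three days, and the temperature values are hypothetical and chosen only for illustration; the sketch compares per-bin means and shows that reordering the simulation inside each bin leaves the error unchanged.

```python
from statistics import mean

def binned_mean_error(sim, obs, bin_size):
    """Absolute difference between simulated and observed means,
    computed independently for each complete bin of `bin_size` steps."""
    errors = []
    for start in range(0, len(sim) - bin_size + 1, bin_size):
        sim_bin = sim[start:start + bin_size]
        obs_bin = obs[start:start + bin_size]
        errors.append(abs(mean(sim_bin) - mean(obs_bin)))
    return errors

# Hypothetical daily temperatures (degrees Celsius), two bins of three days.
obs = [20.0, 21.0, 23.0, 22.0, 19.0, 18.0]
sim = [21.0, 20.0, 22.0, 24.0, 18.0, 20.0]
# The same simulated values, reordered only within each bin.
sim_shuffled = [22.0, 21.0, 20.0, 20.0, 24.0, 18.0]

errors = binned_mean_error(sim, obs, 3)
errors_shuffled = binned_mean_error(sim_shuffled, obs, 3)
```

Both calls return the same list, because a per-bin mean does not depend on the order of values inside the bin. A reordering that moved a value across a bin boundary would change the result, which is the boundary effect discussed in the text.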

However, these summary statistics share two problems. First, they introduce artificial time boundaries by binning into periods to be evaluated independently. It is unclear how boundaries should be chosen optimally (e.g., at the start or middle of each month) and in a way that minimizes loss of accuracy introduced by rapidly changing weather conditions. Second, model precision is reduced. This method of introducing time invariance blurs the model outputs to the resolution of the bins. This is problematic for localized extreme event prediction tasks, where retaining precision may be important.

We propose an evaluation method to reduce the inaccuracy introduced through summary statistics while conserving the time invariance required to avoid the implicit requirement to predict weather. We assume a weather window size $ w $ of time steps within which we cannot expect simulated data points to be aligned with observed data points. To construct a metric that is locally time-invariant, we introduce a permutation $ {\pi}_w $ to the standard RMSE measure before calculating differences in

(1) $$ {\mathrm{\mathcal{L}}}_{\pi_w}^w\left(A,B\right):= \sqrt{\frac{1}{T}\sum \limits_{t=0}^{T-1}{\left({A}_t-{B}_{\pi_w(t)}\right)}^2}. $$

The permutation is constrained to locally reorder the time series within the weather window—that is, every $ {A}_t $ can be compared to the values between $ {B}_{t-w} $ and $ {B}_{t+w} $ . Note that this construction is symmetric with respect to $ A $ and $ B $ . The final metric $ {\mathrm{\mathcal{L}}}^w $ is given by choosing the locally constrained permutation that minimizes the RMSE in

(2) $$ {\mathrm{\mathcal{L}}}^w\left(A,B\right):= {\mathrm{\mathcal{L}}}_{\pi_w^{\ast}}^w\left(A,B\right)\quad \mathrm{with}\quad {\pi}_w^{\ast}\in \underset{\pi_w}{\arg \min}\hskip0.35em {\mathrm{\mathcal{L}}}_{\pi_w}^w\left(A,B\right). $$

Intuitively, we compare the simulations with observations under the assumption that the model was able to predict the weather as well as possible. See Figure 1 for a graphical representation of the algorithm. Since $ {\pi}_w $ is a permutation, the data points in either time series can only be used once, preventing the metric from inventing new data.
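As a minimal sketch of the definitions in Equations (1) and (2), the metric can be evaluated by exhaustive search over all permutations that move each index by at most $ w $ positions. This brute-force approach is an illustration only and is not the algorithm used in the paper; it is feasible only for very short series. The function names and the sample values are hypothetical.

```python
import itertools
import math

def rmse(a, b):
    """Pointwise RMSE between two equal-length sequences."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)) / len(a))

def local_perm_rmse_bruteforce(a, b, w):
    """Minimum RMSE over all permutations pi with |pi(t) - t| <= w.

    Exhaustive search over T! permutations, so only usable for tiny T.
    """
    T = len(a)
    best = math.inf
    for pi in itertools.permutations(range(T)):
        # Keep only permutations that stay inside the weather window.
        if all(abs(pi[t] - t) <= w for t in range(T)):
            total = sum((a[t] - b[pi[t]]) ** 2 for t in range(T))
            best = min(best, math.sqrt(total / T))
    return best

# Hypothetical series: the same values, swapped between neighbouring days.
sim = [20.0, 25.0, 21.0, 22.0]
obs = [25.0, 20.0, 22.0, 21.0]

pointwise = rmse(sim, obs)
window_0 = local_perm_rmse_bruteforce(sim, obs, 0)
window_1 = local_perm_rmse_bruteforce(sim, obs, 1)
```

With $ w=0 $ only the identity permutation is allowed, so the metric reduces to the standard RMSE. With $ w=1 $ the neighbouring swaps are allowed and the two series match exactly.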

Figure 1. The locally time-invariant skill metric $ \mathrm{\mathcal{L}} $ is used to compare a simulated time series $ A $ to a reference time series $ B $ . Instead of calculating the pairwise least-squares error, we propose adding a slack in either direction for reference points with which each simulated data point can be matched. In this illustration, a slack of one time step in either direction is added, as represented by the green shapes. We then find an optimal bipartite matching $ \pi $ that minimizes the sum of distances between the time series. On the left, data points are compared out of order in overlapping windows to calculate distance between $ A $ and $ B $ . On the right, we emphasize that the bipartite matching enforces the constraint that no data point in either time series can be used twice.

Introducing local time invariance through $ {\pi}_w $ solves a similar problem to calculating summary statistics by binning, but it is more precise: by design, there are no boundaries of bins since the whole time series can be considered at once. As a consequence, the weather window size $ w $ can be chosen to be smaller than a bin width since no effects at the boundaries or effects due to bad placement of boundaries need to be considered. We solve the minimization problem via bipartite matching with the following cost matrix, here illustrating the case $ w=1 $ , where pairs at distance greater than $ w $ are assigned infinite cost to prevent matching:

(3) $$ {C}^1\left(A,B\right)=\left(\begin{array}{ccccc}{\left({A}_0-{B}_0\right)}^2& {\left({A}_0-{B}_1\right)}^2& \infty & \infty & \dots \\ {\left({A}_1-{B}_0\right)}^2& {\left({A}_1-{B}_1\right)}^2& {\left({A}_1-{B}_2\right)}^2& \infty & \dots \\ \infty & {\left({A}_2-{B}_1\right)}^2& {\left({A}_2-{B}_2\right)}^2& {\left({A}_2-{B}_3\right)}^2& \dots \\ \vdots & & \ddots & & \vdots \\ \dots & & & {\left({A}_{T-1}-{B}_{T-2}\right)}^2& {\left({A}_{T-1}-{B}_{T-1}\right)}^2\end{array}\right). $$

For details of the bipartite matching algorithm used to solve the minimization problem, see Cormen et al. (2022). The metric is locally temporally invariant in the sense that within a time window $ w $ , we have relaxed the assumption that simulations must be meaningfully ordered or aligned with respect to observations. We note that local invariance properties are not preserved under composition of permutations, that is, $ {\pi}_{w_1}\left({\pi}_{w_2}\left(A,B\right)\right)\ne {\pi}_{w_1+{w}_2}\left(A,B\right) $ , and that the metric $ {\mathrm{\mathcal{L}}}^w $ is defined according to the single locally constrained permutation that minimizes error, as indicated in Equation (2).
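The matching step can be sketched in plain Python as follows. This is an illustration under stated assumptions rather than the authors' implementation: it builds the banded cost matrix of Equation (3) for a general $ w $ and solves the assignment problem with a textbook Hungarian (Kuhn-Munkres) routine. The function names are hypothetical, both series are assumed to have equal length, and $ w\ge 0 $ is assumed so that the identity matching is always feasible.

```python
import math

def cost_matrix(a, b, w):
    """Squared differences inside the weather window, infinite cost outside."""
    T = len(a)
    return [[(a[i] - b[j]) ** 2 if abs(i - j) <= w else math.inf
             for j in range(T)] for i in range(T)]

def hungarian(cost):
    """Minimum-cost perfect matching on a square matrix in O(n^3).

    Returns match, where match[i] is the column assigned to row i.
    Assumes at least one finite-cost perfect matching exists.
    """
    n = len(cost)
    u = [0.0] * (n + 1)          # row potentials (1-indexed)
    v = [0.0] * (n + 1)          # column potentials (1-indexed)
    p = [0] * (n + 1)            # p[j] = row matched to column j, 0 if free
    way = [0] * (n + 1)          # predecessor column on the augmenting path
    for i in range(1, n + 1):
        p[0] = i
        j0 = 0
        minv = [math.inf] * (n + 1)
        used = [False] * (n + 1)
        while True:
            used[j0] = True
            i0 = p[j0]
            delta = math.inf
            j1 = 0
            for j in range(1, n + 1):
                if not used[j]:
                    cur = cost[i0 - 1][j - 1] - u[i0] - v[j]
                    if cur < minv[j]:
                        minv[j] = cur
                        way[j] = j0
                    if minv[j] < delta:
                        delta = minv[j]
                        j1 = j
            for j in range(n + 1):
                if used[j]:
                    u[p[j]] += delta
                    v[j] -= delta
                else:
                    minv[j] -= delta
            j0 = j1
            if p[j0] == 0:       # reached a free column
                break
        while j0:                # flip the augmenting path
            j1 = way[j0]
            p[j0] = p[j1]
            j0 = j1
    match = [0] * n
    for j in range(1, n + 1):
        match[p[j] - 1] = j - 1
    return match

def local_time_invariant_rmse(a, b, w):
    """RMSE under the optimal window-constrained permutation."""
    cost = cost_matrix(a, b, w)
    match = hungarian(cost)
    total = sum(cost[i][match[i]] for i in range(len(a)))
    return math.sqrt(total / len(a))
```

For series of realistic length, a library assignment solver (for example SciPy's `linear_sum_assignment`) or a solver that exploits the banded structure of the cost matrix would likely be preferable to this dense $ O({T}^3) $ routine.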

2.2. Bayesian model averaging

Given an ensemble of $ K $ plausible models $ {M}_1,\dots, {M}_K $ predicting a quantity $ y $ , and training data $ {y}_T $ , BMA provides a method of conditioning on the entire ensemble of models rather than selecting a single “best” model. Here, following an established BMA approach for combining weather forecast models (Raftery et al., 2005), the predictive distribution for $ y $ is given by $ p(y)={\sum}_{k=1}^Kp\left(y|{M}_k\right)p\left({M}_k|{y}_T\right) $ , where $ p\left(y|{M}_k\right) $ is the predictive distribution of an individual model $ {M}_k $ and $ p\left({M}_k|{y}_T\right) $ is the posterior probability of $ {M}_k $ given the training data $ {y}_T $ . The BMA prediction is then a weighted average of individual model predictions with weights given by the posterior probability of each model, where $ {\sum}_{k=1}^Kp\left({M}_k|{y}_T\right)=1 $ .
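The weighted combination can be sketched for a single day as follows. This sketch assumes, purely for illustration, that each model's predictive distribution is Gaussian and that the posterior weights are already known; the text above does not specify how the weights are estimated, so the weights, means, and standard deviations below are hypothetical values. The mean and variance follow from the standard formulas for a finite mixture.

```python
def bma_mixture_moments(weights, means, sds):
    """Mean and variance of a weighted mixture of Gaussian components.

    weights: posterior model probabilities, assumed to sum to 1
    means, sds: per-model predictive means and standard deviations
    """
    if abs(sum(weights) - 1.0) > 1e-9:
        raise ValueError("weights must sum to 1")
    mixture_mean = sum(w * m for w, m in zip(weights, means))
    # E[y^2] of the mixture is the weighted sum of each component's E[y^2].
    second_moment = sum(w * (s ** 2 + m ** 2)
                        for w, m, s in zip(weights, means, sds))
    return mixture_mean, second_moment - mixture_mean ** 2

# Hypothetical three-model ensemble for one day (degrees Celsius).
weights = [0.5, 0.3, 0.2]
means = [20.0, 22.0, 25.0]
sds = [1.0, 1.0, 1.0]

bma_mean, bma_var = bma_mixture_moments(weights, means, sds)
```

The mixture variance exceeds the variance of any single component here because it also includes the spread between the model means, which is how the ensemble carries structural model uncertainty into the prediction.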

3. Data

Daily mean surface temperature simulations from the historical experiment of five GCMs from the latest phase of the Coupled Model Intercomparison Project (CMIP6) were used to demonstrate the method presented here: GFDL-ESM4, IPSL-CM6A-LR, MPI-ESM1-2-HR, MRI-ESM2-0, and UKESM1-0-LL. For details of the model variants used, see Appendix A of the Supplementary Material. These models were selected for the Inter-Sectoral Impacts Model Intercomparison Project, meeting criteria of structural independence, process representation, and historical simulation for a range of tasks. The subset was also found to span the range of climate sensitivity to atmospheric forcing exhibited in CMIP6 (Lange, 2021). ERA5, a high-resolution global gridded observational reanalysis, was used as a reference dataset (Hersbach et al., 2020). ERA5 hourly surface temperatures were resampled to provide daily mean temperatures.

A daily mean temperature time series for the grid cell containing Nairobi, Kenya was selected from each GCM and the ERA5 reference dataset. Studies have indicated increasing heat stress risk in East African cities in recent decades (Li et al., 2021); persistent CMIP model biases in simulating climate features in the region have also been noted (Ongoma et al., 2018), making understanding of model uncertainty in this region important. Data were split into a training period of January 1, 1979 to December 31, 1996 and a testing period of January 1, 1997 to December 31, 2014. For each model, a simple mean-shift bias correction (see footnote 1) was applied by calculating the mean error of simulated temperatures relative to ERA5 for the training period, and subtracting this error from all model data. For the additional experiments described in Section 4, the same approach was used to select GCM and ERA5 reference data for eight other cities: Paris, Chicago, Sydney, Tokyo, Kolkata, Kinshasa, Shenzhen, and Santo Domingo.
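The two preprocessing steps, mean-shift bias correction and identification of extreme heat days against a 90th-percentile threshold, can be sketched as below. The function names are hypothetical, the period used to estimate the bias is passed in by the caller, and the linearly interpolated percentile is an assumption, since the exact percentile convention is not stated in the text.

```python
from statistics import mean

def mean_shift_correct(sim, sim_ref, obs_ref):
    """Subtract the mean error of sim_ref relative to obs_ref from sim.

    sim_ref and obs_ref cover the reference period used to estimate the bias.
    """
    bias = mean(sim_ref) - mean(obs_ref)
    return [x - bias for x in sim]

def percentile(values, q):
    """Linearly interpolated percentile for q in [0, 100]."""
    ordered = sorted(values)
    pos = (len(ordered) - 1) * q / 100.0
    lower = int(pos)
    upper = min(lower + 1, len(ordered) - 1)
    return ordered[lower] + (ordered[upper] - ordered[lower]) * (pos - lower)

def count_extreme_days(series, threshold):
    """Number of days whose temperature exceeds the threshold."""
    return sum(1 for x in series if x > threshold)

# Hypothetical values (degrees Celsius).
obs_ref = [float(t) for t in range(1, 12)]      # 1.0, 2.0, ..., 11.0
threshold = percentile(obs_ref, 90)
corrected = mean_shift_correct([25.0, 26.0], [22.0, 23.0, 24.0], [20.0, 21.0, 22.0])
n_extreme = count_extreme_days([9.0, 10.0, 10.5, 12.0], threshold)
```

A mean shift changes only the location of the simulated distribution, so differences in spread or tail shape between a model and the reference remain after correction.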

A repository containing code to download GCM data, demonstrate the locally time-invariant permutation method, and reproduce the results presented here is made available and can be accessed online.

4. Results

The permutation-based method for model evaluation introduced in Section 2 was used to derive multi-model ensemble predictions of daily mean temperature in nine cities. For the training period, a permutation $ {\pi}_w $ was applied to each simulated time series with reference to ERA5. BMA was then applied to derive individual model weights and the expected value of the weighted BMA predictive distribution. This method was tested for $ w $  = 3, 15, and 30 (corresponding to matching intervals of $ 2w+1 $ = 7, 31, and 61 days). These three permutation-based BMA methods are denoted by BMA ( $ {\pi}_3 $ ), BMA ( $ {\pi}_{15} $ ), and BMA ( $ {\pi}_{30} $ ) in Table 1.

Table 1. Results from six multi-model ensemble methods for Nairobi, evaluated against ERA5 reference data. For each method, the predicted number of extreme heat days n in the train and test periods, and RMSE for daily mean temperature predictions for these extreme heat days are shown. The locally time-invariant skill $ {\mathrm{\mathcal{L}}}^{15} $ for predicted temperature for extreme heat days in the test period is also shown.

Abbreviations: BMA, Bayesian model averaging; MMM, multi-model mean.

Three baseline methods were also implemented for comparison: a simple MMM approach, standard BMA without permutation, and a modified BMA approach (denoted BMA [threshold]) where only the simulation of observed extreme heat days was considered when calculating the model weights. The results from each of these six methods for the city of Nairobi are shown in Table 1, which reports the evaluation of each method according to the predicted number of extreme heat days and the RMSE in predicted mean temperature for these days. Additionally, the locally time-invariant skill metric $ {\mathrm{\mathcal{L}}}^{15} $ for the extreme heat days in the test period is shown.

Figure 2a,c shows a short sample time series of the reference and ensemble simulation data from the test period, showing the predictions given by MMM, standard BMA (Figure 2a), and BMA ( $ {\pi}_{15} $ ) (Figure 2c), with a $ \pm $ 2-standard-deviation region shaded for the BMA methods. Figure 2b,d shows a cross-section for a single day from this time series indicating the BMA and BMA ( $ {\pi}_{15} $ ) predictive distributions as a combination of the weighted ensemble members. These experiments were repeated for eight other cities. The results from each of the six methods are compared for each city in Figure 3.

Figure 2. Left: Sample daily mean temperature time series from Nairobi for a period where observed daily average temperatures exceed the historical 90th-percentile threshold for several consecutive days, showing individual ensemble members, ERA5 reference, multi-model mean baseline, Bayesian model averaging (BMA) (a), and BMA ( $ {\pi}_{15} $ ) (c) predictions. The shaded region indicates $ \pm $ 2 standard deviations from the BMA predictions. The dotted vertical line indicates the date of the cross-section shown on the right. Right: Cross-section of BMA (b) and BMA ( $ {\pi}_{15} $ ) (d) predictive distributions and individual BMA-weighted ensemble members for 1 day. In this example, BMA ( $ {\pi}_{15} $ ) has assigned greater weight to a model that predicted a higher temperature.

Figure 3. Evaluation of six multi-model ensemble methods for experiments across nine cities, showing (a): RMSE for predicting daily average temperature for all days; and (b): RMSE for predicting daily average temperature for extreme heat days.

A summary plot ranking the best-performing method across these experiments is shown in Figure 4. The individual model weights calculated by each of the five BMA methods for each city are shown in Figure 5. To aid the interpretation of these model weights, the distribution of the data from each model alongside the ERA5 reference for the test period is shown in Figure 6. (Note that a mean-shift bias correction has been applied to the distributions as described in Section 3.)

Figure 4. Summary of rankings of six multi-model ensemble methods for experiments across nine cities, ranked by (a): RMSE in predicting daily average temperature for all days; (b): RMSE in predicting daily average temperature for extreme heat days; and (c): absolute error for predicting number of extreme heat days. In each case, the best-performing method is rank $ 1 $ .

Figure 5. Climate model weights calculated from five Bayesian model averaging methods for experiments from nine cities.

Figure 6. Distributions of daily average temperature for test period from each general circulation model simulation and the ERA5 reference for nine cities.

5. Discussion

The results from the six methods applied to derive multi-model ensemble predictions of daily average temperature for Nairobi (Table 1) indicate that all BMA approaches outperformed the MMM baseline both in predicting the number of extreme heat days $ n $ in the test period, and RMSE for these days. BMA after applying a $ {\pi}_{15} $ permutation was the best-performing method for predicting both $ n $ and RMSE of extreme heat days by a small margin. Experiments applying permutations across a range of window sizes $ {\pi}_w $ indicated that the window size $ w=15 $ (corresponding to an allowed matching interval of 31 days) tended to perform well consistently. Consequently, an additional evaluation of each ensemble method in terms of $ {\mathrm{\mathcal{L}}}^{15} $ is also shown in Table 1, indicating that BMA ( $ {\pi}_{15} $ ) performs best according to this metric.

Further experiments across eight other cities (Figure 3) indicate that while the MMM and standard BMA approaches performed well in predicting RMSE for all days in the test period, the ensemble methods more tailored toward predicting extreme heat days—BMA (threshold), BMA ( $ {\pi}_3 $ ), BMA ( $ {\pi}_{15} $ ), and BMA ( $ {\pi}_{30} $ )—outperformed these baselines across all locations for predicting the RMSE of extreme heat days. In general, we note that RMSE for predicting extreme heat days decreases with $ w $ up to a point, and then begins to increase for larger values (see BMA $ \left({\pi}_{30}\right) $ in Figure 3b). Comprehensive experiments into the effect of window size would be required to draw stronger conclusions regarding the optimal value for a given geographical location.

The rankings of each ensemble method (Figure 4) similarly indicate that while standard BMA and MMM approaches consistently performed well for predicting RMSE for all days, the permutation-based approaches and BMA (threshold), which considered only extreme heat days when assigning model weights, performed better for predicting RMSE for extreme heat days. The permutation-based methods outranked other methods for predicting the number of extreme heat days, including the BMA (threshold) approach, suggesting that the introduction of the local temporal invariance before the model evaluation has led to a better-informed model weighting. These results indicate that there is a need to customize multi-model ensemble schemes for the prediction of extremes. We note that the effect sizes in the results presented here are small, and the analysis of their consistency across other locations and test periods is an area for future study.

The weights assigned to individual models by the five BMA methods are shown in Figure 5. To aid the interpretation of these weights, the distribution of each simulation and the ERA5 reference for the test period is shown in Figure 6. For some cities, it is apparent that low model weights have been assigned where the simulated distribution differs significantly from the reference distribution; see, for example, MRI-ESM2-0 and MPI-ESM1-2-HR for Nairobi, and IPSL-CM6A-LR and MRI-ESM2-0 for Kinshasa.

In several cities, model weights vary substantially between standard BMA and the other approaches (see Shenzhen and Tokyo), again highlighting the need to modify model weighting schemes for optimal prediction of extremes. In general, it can be noted that while standard BMA assigns relatively even weightings to each model in the ensemble, the permutation-based approaches impose greater sparsity on the ensemble. Relaxing the assumption of temporal alignment during the model evaluation, therefore, allows a stronger distinction to be made regarding which models should be considered skillful for a particular location.

Repetition of these experiments using alternative realizations of each model (i.e., a different “run” of the same climate model using the same parameters but different initial conditions, simulating an alternative pathway given the inherent randomness of the climate system) yielded some variance in the assignment of model weights but broad consistency in the ranking of ensemble methods. Results for ensemble methods applied to an alternative set of model realizations for Nairobi are provided in Figure 7 in Appendix B of the Supplementary Material. The method has been demonstrated here for daily average temperature prediction; however, the same reasoning could also be extended to other simulated climate variables in future work.

6. Conclusion

We present a novel permutation-based method for the evaluation of climate model simulations that introduces local temporal invariance. This enables us to relax the assumption that simulated extremes should be temporally aligned or ordered without reducing the temporal precision of models. This evaluation method is tested within a BMA multi-model ensemble weighting scheme to derive probabilistic predictions of extreme heat days for nine cities. Our results highlight the need for model evaluation methods tailored for assessing the simulation of extremes when producing multi-model ensemble projections for impact assessment and adaptation planning. We find that the incorporation of the local temporal invariance during the model evaluation enables a more skillful model weighting to be derived, yielding improved prediction of the number of extreme heat days and RMSE for these days compared to standard BMA. We highlight directions for future work, including the advancement of the methodology presented here and approaches to tailor ensemble methods for the predictions of extreme events.

Author contribution

Conceptualization, methodology, and writing—original draft: all authors. All authors approved the final submitted draft.

Competing interest

The authors declare no competing interests exist.

Ethics statement

The research meets all ethical guidelines, including adherence to the legal requirements of the United Kingdom.

Funding statement

M.V. is funded via the UKRI Centre for Doctoral Training in Artificial Intelligence for Environmental Risk (Grant No. EP/S022961/1).

Provenance

This article is part of the Climate Informatics 2023 proceedings and was accepted in Environmental Data Science on the basis of the Climate Informatics peer-review process.

Supplementary material

The supplementary material for this article can be found at http://doi.org/10.1017/eds.2023.13.

Footnotes

This research article was awarded Open Data and Open Materials badges for transparent practices. See the Data Availability Statement for details.

1 Bias correction is not the focus of this work; for a critical discussion of bias correction of systematic errors in post-processing climate model outputs for impact assessment, see Ehret et al. (2012).

References

Bishop, CH and Abramowitz, G (2013) Climate model dependence and the replicate earth paradigm. Climate Dynamics 41(3), 885-900.
Brunner, L, McSweeney, C, Ballinger, AP, Befort, DJ, Benassi, M, Booth, B, Coppola, E, de Vries, H, Harris, G, Hegerl, GC, Knutti, R, Lenderink, G, Lowe, J, Nogherotto, R, O’Reilly, C, Qasmi, S, Ribes, A, Stocchi, P and Undorf, S (2020) Comparing methods to constrain future European climate projections using a consistent framework. Journal of Climate 33(20), 8671-8692.
Cardona, OD, van Aalst, MK, Birkmann, J, Fordham, M, McGregor, G, Perez, R, Pulwarty, RS, Schipper, ELF, Sinh, BT, Decamps, H, Keim, M, Davis, I, Ebi, KL, Lavell, A, Mechler, R, Pelling, M, Pohl, J, Oliver-Smith, A and Thomalla, F (2012) Determinants of risk: Exposure and vulnerability. In Managing the Risks of Extreme Events and Disasters to Advance Climate Change Adaptation: Special Report of the Intergovernmental Panel on Climate Change. Cambridge: Cambridge University Press, pp. 65-108.
Cormen, TH, Leiserson, CE, Rivest, RL and Stein, C (2022) Introduction to Algorithms. Cambridge, MA: MIT Press.
Ehret, U, Zehe, E, Wulfmeyer, V, Warrach-Sagi, K and Liebert, J (2012) HESS opinions “Should we apply bias correction to global and regional climate model data?” Hydrology and Earth System Sciences 16(9), 3391-3404.
Estrada, F, Botzen, WJ and Tol, RSJ (2017) A global economic assessment of city policies to reduce climate change impacts. Nature Climate Change 7(6), 403-406.
Flato, G, Marotzke, J, Abiodun, B, Braconnot, P, Chou, SC, Collins, W, Cox, P, Driouech, F, Emori, S, Eyring, V, Forest, C, Gleckler, P, Guilyardi, E, Jakob, C, Kattsov, V, Reason, C and Rummukainen, M (2014) Evaluation of climate models. In Climate Change 2013: The Physical Science Basis. Contribution of Working Group I to the Fifth Assessment Report of the Intergovernmental Panel on Climate Change. Cambridge: Cambridge University Press, pp. 741-866.
Giorgi, F and Mearns, LO (2003) Probability of regional climate change based on the reliability ensemble averaging (REA) method. Geophysical Research Letters 30(12), 1629. https://doi.org/10.1029/2003GL017130
Hersbach, H, Bell, B, Berrisford, P, Hirahara, S, Horányi, A, Muñoz-Sabater, J, Nicolas, J, Peubey, C, Radu, R, Schepers, D, Simmons, A, Soci, C, Abdalla, S, Abellan, X, Balsamo, G, Bechtold, P, Biavati, G, Bidlot, J, Bonavita, M, De Chiara, G, Dahlgren, P, Dee, D, Diamantakis, M, Dragani, R, Flemming, J, Forbes, R, Fuentes, M, Geer, A, Haimberger, L, Healy, S, Hogan, RJ, Hólm, E, Janisková, M, Keeley, S, Laloyaux, P, Lopez, P, Lupu, C, Radnoti, G, de Rosnay, P, Rozum, I, Vamborg, F, Villaume, S and Thépaut, J-N (2020) The ERA5 global reanalysis. Quarterly Journal of the Royal Meteorological Society 146(730), 1999-2049.
Lange, S (2021) ISIMIP3 bias adjustment fact sheet. Available at https://www.isimip.org/documents/413/ISIMIP3b_bias_adjustment_fact_sheet_Gnsz7CO.pdf.
Li, X, Stringer, LC and Dallimer, M (2021) The spatial and temporal characteristics of urban heat island intensity: Implications for East Africa’s urban development. Climate 9(4), 51.
Meehl, GA, Goddard, L, Murphy, J, Stouffer, RJ, Boer, G, Danabasoglu, G, Dixon, K, Giorgetta, MA, Greene, AM, Hawkins, E, Hegerl, G, Karoly, D, Keenlyside, N, Kimoto, M, Kirtman, B, Navarra, A, Pulwarty, R, Smith, D, Stammer, D and Stockdale, T (2009) Decadal prediction: Can it be skillful? Bulletin of the American Meteorological Society 90(10), 1467-1486.
Morak, S, Hegerl, GC and Christidis, N (2013) Detectable changes in the frequency of temperature extremes. Journal of Climate 26(5), 1561-1574.
Netzel, P and Stepinski, TF (2017) World climate search and classification using a dynamic time warping similarity function. In Advances in Geocomputation: Geocomputation 2015—the 13th International Conference. Springer International Publishing, pp. 181-195.
Ongoma, V, Chen, H and Gao, C (2018) Projected changes in mean rainfall and temperature over East Africa based on CMIP5 models. International Journal of Climatology 38(3), 1375-1392.
Pörtner, H-O, Roberts, DC, Tignor, M, Poloczanska, ES, Mintenbeck, K, Alegría, A, Craig, M, Langsdorf, S, Löschke, S, Möller, V, Okem, A and Rama, B (2022) Climate change 2022: Impacts, adaptation and vulnerability. IPCC Sixth Assessment Report. IPCC.
Rabiner, L and Juang, B-H (1993) Fundamentals of Speech Recognition. Englewood Cliffs, NJ: Prentice-Hall, Inc.
Raftery, AE, Gneiting, T, Balabdaoui, F and Polakowski, M (2005) Using Bayesian model averaging to calibrate forecast ensembles. Monthly Weather Review 133(5), 1155-1174.
Seneviratne, SI, Donat, MG, Mueller, B and Alexander, LV (2014) No pause in the increase of hot temperature extremes. Nature Climate Change 4(3), 161-163.
Stainforth, DA, Downing, TE, Washington, R, Lopez, A and New, M (2007) Issues in the interpretation of climate model ensembles to inform decisions. Philosophical Transactions of the Royal Society A: Mathematical, Physical and Engineering Sciences 365(1857), 2163-2177.
Tebaldi, C and Knutti, R (2007) The use of the multi-model ensemble in probabilistic climate projections. Philosophical Transactions of the Royal Society A: Mathematical, Physical and Engineering Sciences 365(1857), 2053-2075.
Tuholske, C, Caylor, K, Funk, C and Evans, T (2021) Global urban population exposure to extreme heat. Proceedings of the National Academy of Sciences 118(41), e2024792118.
Vissio, G, Lembo, V, Lucarini, V and Ghil, M (2020) Evaluating the performance of climate models based on Wasserstein distance. Geophysical Research Letters 47(21), e2020GL089385.
Weigel, AP, Liniger, MA and Appenzeller, C (2008) Can multi-model combination really enhance the prediction skill of probabilistic ensemble forecasts? Quarterly Journal of the Royal Meteorological Society: A Journal of the Atmospheric Sciences, Applied Meteorology and Physical Oceanography 134(630), 241-260.
Zhang, Z, Tang, P and Corpetti, T (2020) Time adaptive optimal transport: A framework of time series similarity measure. IEEE Access 8, 149764-149774.

Figure 1. The locally time-invariant skill metric $\mathcal{L}$ is used to compare a simulated time series $A$ to a reference time series $B$. Instead of calculating the pairwise least-squares error, we propose adding a slack in either direction for reference points with which each simulated data point can be matched. In this illustration, a slack of one time step in either direction is added, as represented by the green shapes. We then find an optimal bipartite matching $\pi$ that minimizes the sum of distances between the time series. On the left, data points are compared out of order in overlapping windows to calculate the distance between $A$ and $B$. On the right, we emphasize that the bipartite matching enforces the constraint that no data point in either time series can be used twice.
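
The matching described in this caption can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function name `windowed_matching_cost`, the use of the squared difference as the pointwise distance, and the restriction to equal-length series are assumptions made for illustration. The search is exponential in the slack, so it is only practical for short series and small slack values; a polynomial-time assignment algorithm would be the natural choice for real data.

```python
from functools import lru_cache
from math import inf


def windowed_matching_cost(a, b, slack):
    """Minimum total squared difference over one-to-one matchings of a to b
    in which a[i] may only be paired with b[j] when |i - j| <= slack.

    Illustrative sketch only: the squared difference is an assumed
    pointwise distance, and both series must have the same length.
    """
    if len(a) != len(b):
        raise ValueError("series must have equal length")
    n = len(a)

    @lru_cache(maxsize=None)
    def best(i, used):
        # `used` is a bitmask of the indices of b that are already matched.
        if i == n:
            return 0.0
        lo, hi = max(0, i - slack), min(n - 1, i + slack)
        if i - slack >= 0 and not (used >> lo) & 1:
            # b[i - slack] is out of reach for all later points of a,
            # so if it is still free it has to be matched now.
            candidates = [lo]
        else:
            candidates = [j for j in range(lo, hi + 1) if not (used >> j) & 1]
        total = inf
        for j in candidates:
            step = (a[i] - b[j]) ** 2
            total = min(total, step + best(i + 1, used | (1 << j)))
        return total

    return best(0, 0)


# Two series that differ only by a swap of adjacent values.
simulated = [1, 5, 2]
reference = [5, 1, 2]
print(windowed_matching_cost(simulated, reference, 0))  # 32.0 (pointwise error)
print(windowed_matching_cost(simulated, reference, 1))  # 0.0 (slack of one step)
```

With zero slack the cost reduces to the ordinary sum of squared pointwise errors. With a slack of one time step the swapped values are matched to each other and the cost drops to zero, while each reference point is still used at most once.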

Table 1. Results from six multi-model ensemble methods for Nairobi, evaluated against ERA5 reference data. For each method, the predicted number of extreme heat days $n$ in the train and test periods, and the RMSE for daily mean temperature predictions for these extreme heat days, are shown. The locally time-invariant skill $\mathcal{L}^{15}$ for predicted temperature for extreme heat days in the test period is also shown.

Figure 2. Left: Sample daily mean temperature time series from Nairobi for a period where observed daily average temperatures exceed the historical 90th quantile threshold for several consecutive days, showing individual ensemble members, the ERA5 reference, the multi-model mean baseline, and the Bayesian model averaging (BMA) (a) and BMA ($\pi_{15}$) (c) predictions. The shaded region indicates $\pm 2$ standard deviations around the BMA predictions. The dotted vertical line indicates the date of the cross-section shown on the right. Right: Cross-section of the BMA (b) and BMA ($\pi_{15}$) (d) predictive distributions and individual BMA-weighted ensemble members for 1 day. In this example, BMA ($\pi_{15}$) has assigned greater weight to a model that predicted a higher temperature.

Figure 3. Evaluation of six multi-model ensemble methods for experiments across nine cities, showing (a): RMSE for predicting daily average temperature for all days; and (b): RMSE for predicting daily average temperature for extreme heat days.

Figure 4. Summary of rankings of six multi-model ensemble methods for experiments across nine cities, ranked by (a): RMSE in predicting daily average temperature for all days; (b): RMSE in predicting daily average temperature for extreme heat days; and (c): absolute error in predicting the number of extreme heat days. In each case, the best-performing method is rank 1.

Figure 5. Climate model weights calculated from five Bayesian model averaging methods for experiments from nine cities.

Figure 6. Distributions of daily average temperature for the test period from each general circulation model simulation and the ERA5 reference for nine cities.
