Impact Statement
In this study, we show that a statistical prediction model for observed western US wintertime temperature trained on long dynamical model simulations outperforms a statistical model trained on observations alone. This encouraging result suggests that statistical subseasonal prediction models can be further improved by training on both dynamical model simulations and observations.
1. Introduction
Medium-range weather (up to 10 days) and long-range climate forecasts (months-to-seasons) have been used operationally for decades. While the performance of forecast systems targeting these timescales has steadily improved, until recently relatively little effort has been dedicated to advancing prediction capabilities at the intermediate subseasonal (2-week to 1-month) timescales (e.g., National Academy of Sciences, 2016). Nevertheless, there is evidence that forecasts are skillful on subseasonal timescales (Newman et al., 2003; Pegion and Sardeshmukh, 2011). In particular, state-of-the-art numerical forecast models demonstrate skill in subseasonal prediction, including regional precipitation and temperature, extreme events (heat waves, cold waves, likelihood of hurricane formation), and tornado and hail activity (DelSole et al., 2017; Vitart and Robertson, 2018).
Tremendous societal needs have driven improvements in subseasonal forecast capabilities. Warnings of weather hazards such as drought or cold temperature extremes 2-to-4 weeks in advance have the potential to save lives and mitigate changing demands on energy supplies, water resources, the agriculture sector, and fisheries (White et al., 2017). Given the far-reaching societal benefits, numerical modeling projects within the United States (SubX) and internationally (S2S) have been established with the goal of improving subseasonal forecast skill (Vitart et al., 2017; Pegion et al., 2019). Parallel to the establishment of numerical modeling initiatives, in 2016 the U.S. Bureau of Reclamation and the National Oceanic and Atmospheric Administration (NOAA) established the Subseasonal Climate Forecast Rodeo, a one-year forecast competition in which participants were tasked with developing statistical models for real-time prediction of western United States (US) temperature and precipitation. The inaugural winners, Hwang et al. (2019), developed a forecast system of nonlinear statistical models trained on a diverse set of observational predictors (e.g., soil moisture, geopotential heights). Their statistical forecast system was found to be more accurate than the operational US Climate Forecast System (CFSv2). The success of the Hwang et al. (2019) forecast system demonstrated unequivocally the utility of statistical methods for subseasonal prediction.
It is plausible that statistical models could be improved still further if they were trained on longer data sets. Unfortunately, observational data sets for subseasonal prediction are limited to 50 years or less. Moreover, the effective sample size is smaller than this because predictability mechanisms differ across seasons, so models need to be trained for each season separately, and because daily temperature is serially correlated. One approach to obtaining longer training sets is to use dynamical models to generate them. Of course, dynamical models are imperfect and may be less skillful than some statistical models. Nevertheless, dynamical models are based on the laws of physics and simulate many of the complex physical processes that impact subseasonal predictability, so it makes sense to use both observations and dynamical models to constrain the statistical fit. In this paper, we train statistical models on dynamical model simulations and then use the resulting statistical models to predict observational data. To mitigate the impact of model errors, particularly those in the subgrid parameterizations that often differ between dynamical models, we pursue a multi-model approach in which the outputs of several dynamical models are pooled together as training data. This leads to sample sizes orders of magnitude larger than observational data sets. Note that in this approach, observations are not used to estimate empirical coefficients, so any predictive skill arising from the resulting statistical models clearly comes from the dynamical models and demonstrates that the dynamical models simulate statistical relations relevant to subseasonal predictability. Presumably, the best statistical models are those that combine both dynamical model simulations and observational data. Such approaches are part of an active field of research in transfer learning (Zhuang et al., 2021). In this paper, we use dynamical models to estimate the regression coefficients of the statistical models and then use observations to select the tuning parameter in the statistical model. The resulting prediction model is then compared to those trained on observations.
The design of our forecast problem is similar to that of the Forecast Rodeo. Specifically, we predict the week 3–4 local temperature at a set of grid points over the western US. Each forecast model is estimated using the least absolute shrinkage and selection operator (lasso), a standard method in machine learning (Hastie et al., 2017). Our approach is similar to that of DelSole and Banerjee (2017) and Buchmann and DelSole (2021), except that here we predict many grid points, instead of just one local region (as in the former paper) or one large-scale pattern (as in the latter). We find that statistical models trained on dynamical model data perform better than those trained on observations, suggesting that sample size is indeed a limiting factor in the statistical prediction of subseasonal temperature.
We clarify that our goal is not to derive a statistical forecast system that outperforms that of Hwang et al. (2019). Rather, our goal is to show the potential for improving the skill of statistical models by incorporating information from dynamical model simulations. Several studies suggest that the strongest source of subseasonal predictability comes from sea surface temperatures (SSTs) in the tropical Pacific (Vitart, 2013; McKinnon et al., 2016; DelSole et al., 2017). Accordingly, we use a set of predictors that captures variations in SSTs. It is quite likely that performance could be improved further by expanding our predictor set to include other variables such as precipitation, sea ice concentration, and indices of the Madden–Julian Oscillation, as in Hwang et al. (2019), but the dynamical models do not simulate these variables well. Moreover, SSTs are the dominant source of predictability (Vitart, 2013). For these reasons, only SSTs are included as predictors.
This paper is organized as follows. In Section 2, we describe the data sets and methods used to derive a set of statistical models for predicting observed wintertime subseasonal temperature over the western US. The statistical models are trained either on observations or on preindustrial control runs from the Coupled Model Intercomparison Project phase 6 (CMIP6) archive. In Section 3, we present the results. We begin with an overall performance evaluation of the different statistical models and select for comparison the best-performing candidate models among the observation-trained and CMIP6-trained groups. We go on to demonstrate the advantages of using a large dataset such as CMIP6, as opposed to the relatively short observational record, to fit the predictive models. We then compare the spatial and temporal skill of these statistical models in predicting observed subseasonal temperatures. The paper concludes with a summary of our major results.
2. Data and Methods
2.1. Observations
The target data for our study are observed 2-week mean temperatures, obtained by averaging daily minimum and maximum 2-meter temperatures from the NOAA Climate Prediction Center Global Gridded Temperature dataset (https://psl.noaa.gov/data/gridded/data.cpc.globaltemp.html). The data are available from 1979 to the present, but we focus our analysis on the years 1982–2019 to avoid the large number of missing data at the start of the record. Daily SST data are obtained from the NOAA Optimum Interpolation SST dataset for the period 1982–2019 (Reynolds et al., 2007). All data are regridded to a 1 × 1 degree resolution.
2.2. CMIP6
We also use climate model simulations for training. The particular simulations we use are from the CMIP6 archive. To avoid the confounding effects of external forcings (i.e., greenhouse gases and anthropogenic aerosols) on predictability, we limit our selection to preindustrial control simulations. A collection of 13 models, with a total of 6889 years of daily data, is selected for analysis (see Table 1). These models were selected because they provide daily surface temperature data. Consistent with observations, the target data are 2-week mean temperatures estimated by averaging the minimum and maximum of daily 2-m temperature. Model SST data are also used. All data are regridded to a 1 × 1 degree resolution. A description of the CMIP6 experiments can be found in Eyring et al. (2016).
2.3. Data processing
The statistical forecast models are fit at the 499 grid points shown in Figure 1, using 2-week averages for the target and predictor variables; predictions target the winter months of December–February. The forecast year for a given winter corresponds to December, so a forecast for winter 2000 targets 2-week averages from December 2000 to February 2001. Two-week averages of surface temperature are computed for every start day in December–February. For instance, the 2-week average for December 1 covers the period December 1–14, and the corresponding predictors are averaged over November 4–17. Climatology for daily-to-weekly data is often estimated using local regression and polynomial or harmonic regression (DelSole et al., 2017; Pegion et al., 2019). Pegion et al. (2019) demonstrated that these different methods produce nearly identical climatologies. In this work, we estimate the climatology as a fifth-order polynomial fit across all 2-week means between December and February (e.g., DelSole et al., 2017). This order of polynomial is selected to ensure that the climatological signals are accurately estimated at the different geographical locations. We tested whether the performance of the statistical models is sensitive to the polynomial order and found no impact on skill (not shown). Anomalies are estimated with respect to this climatology, and observed anomalies are detrended to remove the impacts of anthropogenic forcings (e.g., Johnson et al., 2014).
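To make this preprocessing concrete, the sketch below computes 2-week means for every start day and removes a fifth-order polynomial climatology. It assumes the polynomial is fit as a function of start day within the season, pooling all years; the array names and shapes are illustrative and are not the code used in this study.

```python
import numpy as np

def two_week_means(daily_t2m, start_indices):
    """Average daily 2-m temperature over 14-day windows beginning at each start day.

    daily_t2m : (n_days, n_grid) array; start_indices : indices of window start days.
    Returns an array of shape (n_starts, n_grid).
    """
    return np.stack([daily_t2m[i:i + 14].mean(axis=0) for i in start_indices])

def remove_polynomial_climatology(two_week, day_of_season, order=5):
    """Remove a polynomial climatology fit as a function of start day, pooled over all years."""
    anomalies = np.empty_like(two_week)
    for g in range(two_week.shape[1]):
        coefs = np.polyfit(day_of_season, two_week[:, g], order)
        anomalies[:, g] = two_week[:, g] - np.polyval(coefs, day_of_season)
    return anomalies
```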
Climatically relevant variations of SST are large-scale and characterized by distinct patterns (e.g., Deser et al., 2010). We select predictors from the Pacific and Atlantic Oceans, where fluctuations in SSTs are known to impact global climate through teleconnections (Horel and Wallace, 1981). The respective domains of the predictors are shown in Figure 2 as the black and blue regions. No single pattern of SST variability drives all predictable variations in climate, so we use multiple patterns to represent large-scale variations of SSTs. Typically, such large-scale patterns are estimated using empirical orthogonal function (EOF) analysis. A drawback of EOF analysis is that the recovered patterns are data-dependent: the leading patterns of variability in one data set may not be the same as those recovered from a different data set. When comparisons between data sets are desired, it is therefore preferable to describe all data in a common basis set. For this reason, we isolate the large-scale variations by projecting the daily SST data onto the eigenvectors of the Laplacian operator (DelSole and Tippett, 2015). The Laplacian eigenvectors form a basis that depends only on the domain geometry and are orthogonal in space, with the first pattern associated with the spatial mean, the second a dipole, the third a tripole, and so forth. Projecting the data onto each Laplacian eigenvector yields a time series for each eigenvector. Examples of the leading Laplacians are shown in Figure 3. In this study, we represent large-scale SST variations in the Pacific and Atlantic using 50 Laplacians for each basin; including more Laplacian time series did not impact the performance of the lasso models.
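The projection step can be sketched as follows, assuming the Laplacian eigenvectors for a basin have already been computed (following DelSole and Tippett, 2015) and are supplied as gridded patterns; the cosine-latitude area weighting is an assumption of this sketch. Stacking the 50 Pacific and 50 Atlantic time series column-wise yields the 100 SST predictors used below.

```python
import numpy as np

def laplacian_timeseries(sst_anom, eigvecs, lat):
    """Project basin SST anomalies onto Laplacian eigenvectors.

    sst_anom : (n_time, n_grid) SST anomalies over the basin
    eigvecs  : (n_eig, n_grid) precomputed Laplacian eigenvectors on the same grid
    lat      : (n_grid,) latitude of each grid point (used for area weighting, an assumption)
    Returns  : (n_time, n_eig) time series, one per eigenvector.
    """
    w = np.cos(np.deg2rad(lat))                  # area weights
    numer = (sst_anom * w) @ eigvecs.T           # weighted inner products, (n_time, n_eig)
    norms = ((eigvecs ** 2) * w).sum(axis=1)     # weighted norms of the eigenvectors
    return numer / norms
```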
2.4. Building the statistical forecast systems
Our objective is to determine whether training on dynamical models can produce skillful prediction models. There is no unique configuration for statistical models, so we consider a range of reasonable choices. Specifically, we construct five distinct statistical forecast systems to predict observed western US subseasonal winter temperatures. Each forecast system comprises a set of statistical models fit at each grid point in the target region (see Figure 1), yielding a total of 499 statistical models per forecast system. The candidate forecast systems differ in how the grid-point statistical models are fit and in the data sets used to train them. A description of each statistical forecast system is provided below and summarized in Table 2.
2.4.1. Benchmark
The El Niño–Southern Oscillation (ENSO) is the greatest contributor to seasonal predictability over the US, and hence it is anticipated to be a strong contributor to subseasonal predictability (National Academy of Sciences, 2016). Accordingly, we construct a benchmark forecast in which 2-week average winter surface temperature anomalies are predicted at each grid point from the 2-week lagged Nino3.4 index, a commonly used measure of ENSO variability. The Nino3.4 index is the spatial average of SST anomalies between 5°S and 5°N over longitudes 120–170°W.
Let $ n $ and $ s $ denote the temporal and spatial indices, where $ n=1,\dots, N $ and $ s=1,\dots, S $. Let $ {Y}_{n,s} $ denote the target variable and $ {X}_{n,p} $ denote the predictors for $ p=1,\dots, P $. Let the Nino3.4 index be denoted by $ {z}_n $. The prediction based on the Nino3.4 index is a linear regression model, where the regression coefficients $ {\beta}_s=\left({\beta}_{0,s},{\beta}_{1,s}\right) $ are obtained by minimizing the cost function

$$ \sum_{n=1}^{N}{\left({Y}_{n,s}-{\beta}_{0,s}-{\beta}_{1,s}\,{z}_n\right)}^2. \qquad (1) $$
Note that the regression coefficients are chosen to minimize the cost function at each spatial location separately. These regression models are fit using a leave-one-out approach, such that the regression coefficients $ {\beta}_{0,s} $ and $ {\beta}_{1,s} $ for a given winter are estimated from all other winters from the observational dataset.
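A minimal sketch of the benchmark at one grid point, with leave-one-winter-out estimation of the two regression coefficients, is given below; variable names are illustrative.

```python
import numpy as np

def nino34_benchmark(y, nino34, winter_labels):
    """Leave-one-winter-out regression of 2-week temperature anomalies on the lagged Nino3.4 index.

    y, nino34     : (n_samples,) target anomalies at one grid point and matching Nino3.4 values
    winter_labels : (n_samples,) forecast year (winter) of each sample
    Returns predictions for every sample, each made with its own winter withheld.
    """
    preds = np.empty_like(y, dtype=float)
    for winter in np.unique(winter_labels):
        train = winter_labels != winter
        slope, intercept = np.polyfit(nino34[train], y[train], 1)
        preds[~train] = intercept + slope * nino34[~train]
    return preds
```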
2.4.2. Lasso
The performance of the benchmark models will be compared to predictions made by a suite of statistical models based on lasso. The different lasso models have the same set of target and predictor variables: the target variables are 2-week mean surface temperature anomalies, and the predictors are large-scale SST anomalies in the Pacific and Atlantic Oceans, represented by 50 Laplacian time series for each basin, giving a total of 100 SST predictors. The predictors lag the target variables by 2 weeks. We estimate the lasso coefficients by minimizing the cost function either locally or across all grid points. This distinction, and the difference between lasso and ordinary least squares (OLS), is discussed in more detail below.
Lasso is similar to OLS, except that the mean square error (MSE) is minimized subject to a constraint on the norm of the regression coefficients (Tibshirani, 1996). More precisely, the lasso coefficients $ {\beta}_{p,s} $ are obtained by minimizing the cost function

$$ \sum_{n=1}^{N}{\left({Y}_{n,s}-{\beta}_{0,s}-\sum_{p=1}^{P}{X}_{n,p}{\beta}_{p,s}\right)}^2+\lambda \sum_{p=1}^{P}\left|{\beta}_{p,s}\right|, \qquad (2) $$
where $ {X}_{n,p} $ denotes elements in the matrix of predictors, $ \mid \cdot \mid $ denotes the absolute value, and $ \lambda $ is a tuning parameter that determines the strength of the penalty term. As $ \lambda $ is increased, the lasso coefficients are reduced toward 0. Conversely, as $ \lambda $ goes to 0, the penalty term has less weight and the cost function approaches the traditional OLS form.
There is no closed-form solution to equation (2) and the minimization problem must be solved iteratively. We use the glmnet package to find lasso solutions as $ \lambda $ is varied (Friedman et al., 2010). Examples of how $ \lambda $ is selected for different lasso models are provided in Section 2.4.3.
The above formulation predicts each target variable separately. We call this formulation “single-task” lasso and we derive two lasso models from the above minimization problem: one trained on observations and one trained on CMIP6 data. These lasso models are called OBS-single-task and CMIP6-single-task, respectively.
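A sketch of the single-task fit is given below using scikit-learn's Lasso rather than the glmnet package used in this study; the $ \lambda $ grid, the package's scaling convention for the penalty, and the variable names are assumptions of the sketch.

```python
import numpy as np
from sklearn.linear_model import Lasso

def fit_single_task(X_train, Y_train, lambdas):
    """Fit a lasso path independently at every grid point.

    X_train : (n_samples, 100) Laplacian SST predictors (2-week lag)
    Y_train : (n_samples, n_grid) 2-week temperature anomalies
    Returns coefficients of shape (n_lambda, n_grid, n_pred).
    """
    n_grid, n_pred = Y_train.shape[1], X_train.shape[1]
    coefs = np.zeros((len(lambdas), n_grid, n_pred))
    for i, lam in enumerate(lambdas):
        for s in range(n_grid):
            coefs[i, s] = Lasso(alpha=lam, max_iter=10000).fit(X_train, Y_train[:, s]).coef_
    return coefs
```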
Weekly temperature is spatially correlated, so making use of information from neighboring grid points during the training stage may yield a better prediction model. One approach to doing this is "multi-task lasso," which was used by Hwang et al. (2019). The cost function for multi-task lasso is

$$ \sum_{n=1}^{N}\sum_{s=1}^{S}{\left({Y}_{n,s}-{\beta}_{0,s}-\sum_{p=1}^{P}{X}_{n,p}{\beta}_{p,s}\right)}^2+\lambda \sum_{p=1}^{P}\sqrt{\sum_{s=1}^{S}{\beta}_{p,s}^2}. \qquad (3) $$
In multi-task lasso, squared errors are summed over all targets, and the penalty term applies to each predictor's coefficients across all targets as a group: a given predictor is either included in the statistical models for all targets or excluded for all targets.
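The corresponding multi-task fit can be sketched with scikit-learn's MultiTaskLasso, which implements a grouped penalty of this form; again, glmnet was used in this study and the scaling of $ \lambda $ differs between packages.

```python
from sklearn.linear_model import MultiTaskLasso

def fit_multi_task(X_train, Y_train, lam):
    """Fit all grid points jointly; each predictor is kept or dropped for every target at once.

    X_train : (n_samples, 100); Y_train : (n_samples, n_grid)
    Returns coefficients of shape (n_grid, 100).
    """
    return MultiTaskLasso(alpha=lam, max_iter=10000).fit(X_train, Y_train).coef_
```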
We selected lasso regression to build our statistical forecast models because the method tends to reduce some of the regression coefficients to 0, thus performing predictor selection and aiding in the interpretability of the statistical models.
2.4.3. Selecting the lasso tuning parameter
For lasso models trained on observations, the first 18 years (1982–1999) of observational data are used to fit the regression model and to select $ \lambda $ using 10-fold cross-validation. When CMIP6 data are used for training, $ \lambda $ is selected to minimize the normalized mean-square error (NMSE) with respect to the same 18 years of observations (1982–1999). A summary of how the statistical models are trained and how the tuning parameter $ \lambda $ is selected is presented in Table 2.
To illustrate the $ \lambda $ selection process, Figure 4a–c shows curves of NMSE versus $ \lambda $ at three different locations for predictions made by the CMIP6-single-task model. The regression coefficients are estimated from CMIP6 simulations, yielding $ {\beta}_s\left(\lambda \right) $; then, based on these coefficients, $ {\mathrm{NMSE}}_s\left(\lambda \right) $ is evaluated at each location $ s $ using observations for the predictors and target variable. The $ \lambda $ that minimizes NMSE is denoted by a red asterisk. Two extreme cases are shown in Figure 4a,c, where the $ \lambda $ that minimizes NMSE is small (near zero) and large, respectively. For the case where $ \lambda \approx 0 $ (Figure 4a), the cost function approaches the traditional OLS form and all the predictors are included. Alternatively, when $ \lambda $ is large (Figure 4c), all the regression coefficients are set to zero and the forecast reduces to the climatological mean. The NMSE–$ \lambda $ curve shown in Figure 4b represents an intermediate scenario, where only some regression coefficients are set to 0.
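The selection step for the CMIP6-trained models can be sketched as follows: for each candidate $ \lambda $, the CMIP6-estimated coefficients are applied to the observed predictors for 1982–1999, and the $ \lambda $ that minimizes NMSE against the observed targets is kept at each grid point. Names are illustrative, and intercepts are omitted because the anomalies are centered (an assumption of the sketch).

```python
import numpy as np

def select_lambda(coef_path, X_obs, Y_obs):
    """At each grid point, pick the lambda whose CMIP6-trained coefficients minimize NMSE on observations.

    coef_path : (n_lambda, n_grid, n_pred) coefficients from the CMIP6 fit
    X_obs     : (n_samples, n_pred) observed predictors, 1982-1999
    Y_obs     : (n_samples, n_grid) observed target anomalies, 1982-1999
    Returns the index of the selected lambda at each grid point.
    """
    preds = np.einsum('lgp,np->lng', coef_path, X_obs)   # (n_lambda, n_samples, n_grid)
    mse = ((preds - Y_obs[None]) ** 2).mean(axis=1)      # (n_lambda, n_grid)
    nmse = mse / (Y_obs ** 2).mean(axis=0)               # reference: climatological (zero) forecast
    return nmse.argmin(axis=0)
```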
2.5. Skill metrics
Statistical model performance will be evaluated in terms of the similarity between forecast and verification data and in terms of prediction accuracy. The most widely used metrics for these purposes are temporal correlation, spatial correlation, and MSE (Jolliffe and Stephenson, 2012; Coelho et al., 2019).
For temporal similarity measures, we will use the standard Pearson correlation coefficient. We refer to this metric as the temporal correlation coefficient since it is estimated across time at each grid point as

$$ \rho (s)=\frac{\sum_{t=1}^{T}{F}^{\prime}\left(s,t\right){V}^{\prime}\left(s,t\right)}{\sqrt{\sum_{t=1}^{T}{F}^{\prime }{\left(s,t\right)}^2}\sqrt{\sum_{t=1}^{T}{V}^{\prime }{\left(s,t\right)}^2}}, \qquad (4) $$
where $ {F}^{\prime}\left(s,t\right) $ and $ {V}^{\prime}\left(s,t\right) $ are the matched pairs of centered forecast and verification data at location $ s $ for times $ t=1,\dots, T $, and there are $ S $ total grid-point forecasts.
Spatial similarity in the predicted and observed spatial patterns will be measured using the anomaly correlation or cosine similarity (Jolliffe and Stephenson, 2012). To avoid confusion, we will refer to this metric as the spatial correlation. Note that this is the only metric used in evaluating the machine learning forecast models of Hwang et al. (2019). The expression used to compute the spatial correlation is similar in form to $ \rho (s) $, with the exception that the spatial mean has not been removed and the summation is over space rather than time. Formally, the spatial correlation is

$$ \mathrm{SC}(t)=\frac{\sum_{s=1}^{S}{F}^{\prime}\left(s,t\right){V}^{\prime}\left(s,t\right)}{\sqrt{\sum_{s=1}^{S}{F}^{\prime }{\left(s,t\right)}^2}\sqrt{\sum_{s=1}^{S}{V}^{\prime }{\left(s,t\right)}^2}}. \qquad (5) $$
The spatial correlation measures the similarity of spatial patterns of 2-week temperature anomalies. Importantly, the spatial correlation can be computed for each 2-week period; that is, it is a time series. Forecast accuracy is often measured by MSE, formally expressed as

$$ \mathrm{MSE}(s)=\frac{1}{T}\sum_{t=1}^{T}{\left({F}^{\prime}\left(s,t\right)-{V}^{\prime}\left(s,t\right)\right)}^2. \qquad (6) $$
A standard approach is to compare MSE to some reference forecast, typically the climatological mean corresponding to $ {\overline{F}}^{\prime }(s)=0 $, which yields the $ \mathrm{NMSE}(s) $:

$$ \mathrm{NMSE}(s)=\frac{\mathrm{MSE}(s)}{\frac{1}{T}\sum_{t=1}^{T}{V}^{\prime }{\left(s,t\right)}^2}. \qquad (7) $$
A forecast with NMSE >1 has no skill, since its MSE is greater than that of the reference forecast. The NMSE can be decomposed into its constituent parts when expressed as the mean-square-error skill score (MSESS). Following Murphy (1988), the MSESS can be expanded as

$$ \mathrm{MSESS}(s)=1-\mathrm{NMSE}(s)=\rho {(s)}^2-{\left(\rho (s)-\frac{\sigma_{F^{\prime }}(s)}{\sigma_{V^{\prime }}(s)}\right)}^2-{\left(\frac{{\overline{F}}^{\prime }(s)-{\overline{V}}^{\prime }(s)}{\sigma_{V^{\prime }}(s)}\right)}^2, \qquad (8) $$
where $ \rho (s) $ is the temporal correlation (see equation (4)), and $ {\sigma}_{F^{\prime }}(s) $ and $ {\sigma}_{V^{\prime }}(s) $ are the standard deviations for the forecast and verification anomalies respectively. The first term in the MSESS is the square of the temporal correlation ( $ \rho {(s)}^2 $ ) and gives the maximum value of MSESS. The next two terms reduce the skill score and represent the amplitude and the mean biases, respectively. For this study, the forecast and verification data are centered and consequently the mean bias term is 0. We perform a decomposition of MSESS to evaluate the trade-off between correlation and amplitude bias impacts on statistical model performance.
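A small helper, under the same centering conventions, that evaluates the three terms of the decomposition at each grid point is sketched below; the function and argument names are illustrative.

```python
import numpy as np

def msess_terms(fcst, verif):
    """Return the correlation, amplitude-bias, and mean-bias terms of MSESS at each grid point.

    fcst, verif : (n_samples, n_grid) forecast and verification anomalies.
    MSESS = rho**2 - amplitude_bias - mean_bias.
    """
    fa, va = fcst - fcst.mean(axis=0), verif - verif.mean(axis=0)
    rho = (fa * va).mean(axis=0) / (fa.std(axis=0) * va.std(axis=0))
    amplitude_bias = (rho - fcst.std(axis=0) / verif.std(axis=0)) ** 2
    mean_bias = ((fcst.mean(axis=0) - verif.mean(axis=0)) / verif.std(axis=0)) ** 2
    return rho ** 2, amplitude_bias, mean_bias
```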
We use spatial averages of NMSE and temporal correlation to identify the best-performing statistical models. Following Wilks (2011), both spatial averages are computed over the grid points in the target region.
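For concreteness, a plausible area-weighted form of these two spatial averages, assuming cosine-latitude weights $ {w}_s=\cos \left({\phi}_s\right) $ at grid point $ s $ (the precise weighting used in the study is not reproduced here), is

$$ \left\langle \mathrm{NMSE}\right\rangle =\frac{\sum_{s=1}^{S}{w}_s\,\mathrm{NMSE}(s)}{\sum_{s=1}^{S}{w}_s},\qquad \left\langle \rho \right\rangle =\frac{\sum_{s=1}^{S}{w}_s\,\rho (s)}{\sum_{s=1}^{S}{w}_s}. $$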
2.6. Statistical significance of metrics
The data being analyzed are serially correlated, so standard methods of estimating uncertainty and statistical significance for performance metrics are not applicable. Accordingly, resampling methods are used to evaluate the uncertainty and statistical significance of the performance metrics.
2.6.1. Uncertainty of spatial averaged metrics
For forecasts trained on CMIP6 data, the regression coefficients are not reestimated when comparing forecast performance across the candidate statistical models, since they are very robust owing to the large training set. However, the regression coefficients depend on $ \lambda $. To quantify the uncertainty associated with $ \lambda $, we randomly select 18 distinct years from the observational record, use the corresponding matched pairs of target and predictors to determine $ \lambda $ (see Section 2.4.3), and then use the remaining 19-year record of observational data to evaluate the performance of the CMIP6-single-task and CMIP6-multi-task models. To preserve the serial correlation in the data, the paired target and predictors are sampled such that all sequential data for a given winter are kept together. This process is repeated 1,000 times to build up distributions of the spatially averaged NMSE and temporal correlation. The uncertainty is given as the 5th and 95th percentiles of these distributions.
For forecasts trained on observations, both $ \lambda $ and the regression coefficients have uncertainties that need to be quantified. To do this, a bootstrap sample of 18 distinct years from the observational record is used to train the statistical models and select $ \lambda $ through 10-fold cross-validation. The remaining 19 years of paired target variables and predictors are then used to evaluate the newly trained models. As before, the serial correlation in the data is preserved by selecting all sequential data for a given winter. This process is repeated 1,000 times for the OBS-single-task model to build distributions of the performance metrics. We use only 100 resamples for the OBS-multi-task model because of the excessive computational cost of retraining and evaluation. The uncertainty is given as the 5th and 95th percentiles of these distributions.
We estimate uncertainties for the benchmark Nino3.4 statistical model by randomly selecting 37 years of paired forecasts and verification data and estimating the performance metrics for the bootstrapped samples. Again, we preserve serial correlation by selecting sequential data for each winter. This process is repeated 1,000 times to build up distributions of the two metrics. The uncertainty for each metric is given as the 5th and 95th percentiles of the respective distributions.
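The winter-block resampling common to these procedures can be sketched as follows: whole winters are drawn so that the serial correlation within a winter is preserved, the chosen winters are used for tuning (or training), and the remainder are used for verification. The function and variable names are illustrative.

```python
import numpy as np

def split_by_winter(winter_labels, n_train=18, rng=None):
    """Randomly choose n_train distinct winters for tuning/training; the rest are used for verification.

    winter_labels : (n_samples,) winter (forecast year) of each 2-week sample.
    Returns boolean masks (train, verify); whole winters are kept together to preserve
    the serial correlation within a winter.
    """
    rng = rng or np.random.default_rng()
    winters = np.unique(winter_labels)
    train_winters = rng.choice(winters, size=n_train, replace=False)
    train = np.isin(winter_labels, train_winters)
    return train, ~train
```

Repeating such splits 1,000 times (100 for OBS-multi-task), re-selecting $ \lambda $ (and, for the observation-trained models, the coefficients) on the training winters and scoring on the verification winters, builds the distributions whose 5th and 95th percentiles are reported.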
The above uncertainty analyses aim to quantify the sampling effects of the observational data on estimates of statistical model performance. Given that the CMIP6 data are sufficiently long, we can also quantify the impacts of training set size on statistical model performance. To do so, we retrain the CMIP6-single-task model using a training data set that varies in length from 50 to 3000 years and evaluate model performance with respect to spatially averaged NMSE with the verification period fixed to 2000–2018. For instance, if the length of the training set is specified to be 3000 years, a training set of that length is found by sampling CMIP6 data with replacement, where each year is a set of consecutive 2-week forecasts for the target months of December–February. The data are sampled with replacement because there is not enough data to sample without replacement. To illustrate, if we wish to find 3000-year samples in the dataset, sampling without replacement would yield only two samples, which is insufficient for empirical uncertainty estimation. The process of selecting a training set, model retraining, and evaluation is repeated 60 times for each specified training set size. The uncertainty is given as the 5th and 95th percentiles of the respective distributions.
2.6.2. Statistical significance of similarity metrics
To assess the statistical significance of the temporal correlation, we use a permutation method (DelSole et al., 2017). Under the null hypothesis of no predictability, the forecasts and observations are independent. Thus, a permutation sample can be derived by separately permuting the year labels for the forecast and observed data. The permutation sample preserves the mean and variance of the forecasts and observations but temporally misaligns the forecast-observation pairing. For this particular problem, the data are permuted for winter forecasts targeting the months of December–February. We permute the winter forecasts and verification data 5,000 times to create 5,000 realizations of correlation maps under the null hypothesis of no skill (or more precisely, the null hypothesis of exchangeability). The temporal correlation between the local forecast and verification is computed for each grid point separately, and statistical significance is assessed locally by comparing the correlation value to the local 95th percentile from the permutation sample. In addition, field significance is assessed based on the count of positive correlations; negative correlations are not considered skillful and therefore are not included in the count. This process is repeated 10,000 times, and the resulting cumulative distribution function is used to determine the p-value of the observed number of positive correlations.
To assess the significance of the spatial correlation scores, we count the winters for which the number of positive scores exceeds the number of negative scores. This count is the total number of skillful forecast winters. Under the null hypothesis of no skill, a forecast has equal chances of producing positive or negative scores, and the count of skillful winters should follow a binomial distribution with $ p=.5 $. Although forecasts within a winter are serially correlated, forecasts from different winters are assumed independent, and therefore the binomial distribution can be applied to the count of skillful winters.
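The two significance tests can be sketched as follows; for simplicity, the permutation sketch assumes each winter contributes the same number of 2-week forecasts, and scipy's binomial survival function is used for the count test. Names are illustrative.

```python
import numpy as np
from scipy.stats import binom

def local_correlation_pvalues(fcst, verif, winter_labels, n_perm=5000, seed=0):
    """Permutation test of the grid-point temporal correlation under the null of no skill.

    fcst, verif : (n_samples, n_grid) anomalies; winters are permuted as blocks so that
    the forecast-verification pairing is misaligned while each winter stays intact.
    Returns the observed correlations and one-sided local p-values.
    """
    rng = np.random.default_rng(seed)
    winters = np.unique(winter_labels)

    def corr(f, v):
        f = f - f.mean(axis=0)
        v = v - v.mean(axis=0)
        return (f * v).sum(axis=0) / np.sqrt((f ** 2).sum(axis=0) * (v ** 2).sum(axis=0))

    observed = corr(fcst, verif)
    null = np.empty((n_perm, fcst.shape[1]))
    for i in range(n_perm):
        order = rng.permutation(winters)
        idx = np.concatenate([np.where(winter_labels == w)[0] for w in order])
        null[i] = corr(fcst[idx], verif)
    return observed, (null >= observed).mean(axis=0)

def winter_count_pvalue(n_skillful_winters, n_winters):
    """Binomial test (p = 0.5) for the count of winters with more positive than negative spatial correlations."""
    return binom.sf(n_skillful_winters - 1, n_winters, 0.5)
```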
3. Results
3.1. Statistical model selection
First, we evaluate the overall performance of the statistical models in terms of the spatial averages of NMSE and the temporal correlation coefficient to help isolate the best-performing statistical models. The confidence intervals of the two metrics are represented by the vertical bars in Figure 5a,b, where the bars capture the fluctuations of each metric when different segments of the observations are used in statistical model evaluation. Confidence intervals are computed as described in Section 2.6.1.
Referring to Figure 5a, we see that the confidence intervals for the spatially averaged NMSE are near or above 1 for all candidate models. This illustrates the challenge of subseasonal forecasting: spatially averaged measures of local skill tend to be indistinguishable from values expected from no-skill models. Overall, the statistical models trained on CMIP6 data have lower NMSE than statistical models trained on observations, including the benchmark Nino3.4 model. In particular, the spatially averaged temporal correlation shown in Figure 5b is low and generally negative for the statistical models trained on observational data. Only the CMIP6-single-task model consistently produces forecasts that are characterized by a positive spatially averaged temporal correlation. This result shows that training statistical models on dynamical model simulations can yield better forecasts than models trained on observations only. The OBS-multi-task model shows no improvement over the OBS-single-task model. To better understand why statistical model performance improves when trained on CMIP6 data, we will compare the skill of the CMIP6-single-task model (the best-performing model) with that of the similarly formulated OBS-single-task model. The statistical models are compared in terms of MSESS and its decomposition (see equation (8)) in Section 3.3.
3.2. Training set size and statistical model performance
A plausible explanation for why statistical models trained on CMIP6 simulations predict observations better than observation-trained models is that the training data set is much larger in the former case than in the latter. To test this hypothesis, we retrain the CMIP6-single-task model using training data sets that vary in length from 50 to 3000 years and evaluate model performance with respect to spatially averaged NMSE for the verification period 2000–2018. The range of uncertainty is reported as the 5th–95th percentiles of the 60 estimates of spatially averaged NMSE for each sample size and is shown as the vertical bars in Figure 6. The threshold for a skillful forecast is denoted by the horizontal black line, which marks an NMSE of 1. When the training set size is 50 years, which is more than double the number of years available for training with observations, the CMIP6-single-task model has no skill. As the sample size is increased up to 3000 years, the spatially averaged NMSE systematically drops, yielding reliably skillful forecasts (NMSE < 1) when the training set size is 2000 years or greater. DelSole and Banerjee (2017) arrive at a similar conclusion concerning the impact of training set size on statistical model performance (see their Figure 12). Note that the NMSEs of the CMIP6-single-task model differ between Figures 5a and 6. Although this difference is not critical to our study, the reason for it is that the observations are bootstrapped in Figure 5a whereas they are not in Figure 6. To elaborate, Figure 5a shows the variation in skill when different segments of the observational record are used for validation and verification with a fixed CMIP6 training set, while Figure 6 shows the variation in skill when different segments of the CMIP6 training set are used with a fixed observational verification set. In essence, Figure 5a quantifies uncertainty from verification randomness under a fixed training set, whereas Figure 6 quantifies uncertainty from training randomness under a fixed verification set.
3.3. Statistical model comparison
The map of MSESS estimates from the CMIP6-single-task model, shown in Figure 7a, is characterized by regions of positive values along the west coast, from northern California up through Washington and extending eastward into Idaho. Forecasts in the continental interior are characterized by negative MSESS. In contrast, MSESS values for the OBS-single-task model, shown in Figure 7d, are negative over much of the forecast region.
To diagnose the sources of these errors, we decompose the MSESS according to equation (8) into contributions from the squared temporal correlation coefficient and the amplitude bias. To aid interpretation, all subsequent analyses are based on the square root of these two terms. Recall that since the forecasts and verification data are centered, the mean bias term is 0. Maps of the temporal correlation for the CMIP6-single-task and OBS-single-task models are shown in Figure 7b,e, respectively. For the CMIP6-single-task predictions, positive correlations are found over much of the western US, with the largest correlations along the west coast and weaker, sometimes negative, correlations in the continental interior. These regions of positive correlation account for the positive MSESS values along the west coast shown in Figure 7a. Positive temporal correlation coefficients recovered from the OBS-single-task model are similarly concentrated in the Pacific Northwest. However, the OBS-single-task model has large areas of negative correlations in the southern portion of the forecast region. The positive correlations estimated for the OBS-single-task model correspond to regions of near-zero MSESS values shown in Figure 7d. For both sets of statistical models, the contribution of the temporal correlation to forecast skill is reduced by the amplitude bias, shown in Figure 7c,f. This analysis indicates that both statistical forecast systems underestimate the amplitude of the predicted temperature anomalies; however, the amplitude bias is markedly larger for predictions made by the OBS-single-task model.
Since the upper bound of the MSESS is determined by the temporal correlation, we quantify the statistical significance of this metric as described in Section 2.6.2. The percentage of grid-point forecasts that predict the correct sign of the temperature anomalies and the corresponding p-values are listed in the titles of Figure 7b,e. For the CMIP6-single-task model, 78% of the grid points are characterized by positive correlations, which has a p-value of .021 (i.e., it is field significant). In contrast, the OBS-single-task model has positive correlations over only 59% of the grid points, which has a p-value of .26 (i.e., not field significant).
The maps of temporal correlation indicate that forecasts produced by statistical models trained on CMIP6 data are generally skillful over large portions of the western US. However, this metric says nothing about how well the model performs for any given 2-week average. Figure 8a,b shows the distribution of the spatial correlation as a function of time for the CMIP6-single-task and OBS-single-task models, respectively. For a given winter, the vertical bar represents the 25th–75th percentile of the spatial correlation and the black asterisk gives the median value. The number of forecasts for a given winter varies between 90 and 91, depending upon leap year. The number listed next to each median gives the percentage of forecasts that predict correctly signed temperature anomalies. For the CMIP6-single-task model, 14 of the 19 forecast winters predominantly predict correctly signed temperature anomalies. Variations in predictive capability are also evident for the OBS-single-task model, where 13 of the 19 winter forecasts produce correctly signed anomalies. These results suggest that while the CMIP6-single-task model performs well overall in terms of individual grid points, the individual statistical forecast models do not consistently predict the large-scale pattern of temperature anomalies. Indeed, the overall skill in predicting the spatial pattern of temperature anomalies is indistinguishable from that of a no-skill forecast for both the CMIP6-single-task and OBS-single-task models. The study by Hwang et al. (2019), in which forecasts are evaluated only with respect to spatial correlation, also found that statistical predictions are characterized by a wide range of spatial correlation scores. It is worth pointing out that the performance of the statistical forecast system derived by Hwang et al. (2019) cannot be directly compared with the statistical models evaluated here, because the statistical models are trained on different datasets, target a different range of dates, and use a slightly different metric.
In terms of aggregate metrics, the CMIP6-single-task model provides the most skillful forecasts. Here we examine a pair of high- and low-skill forecasts made by this statistical forecast system. The high-skill forecast, shown in Figure 9b, tends to capture the overall spatial structure of the observed temperature anomalies shown in Figure 9a. Notably, the amplitude of the predicted anomalies is reduced compared to observations. Consistent with analysis presented in Figure 7, the forecast is skillful in terms of predicting the correct sign of temperature anomalies, but underestimates their amplitude. The low-skill forecast, shown in Figure 9d, similarly predicts temperature anomalies that vary with latitude. However, the sign of the anomalies is completely opposite of the observed temperature anomalies shown in Figure 9c. Notably, the individual CMIP6-single-task models are capturing coherent patterns, which may suggest a common forcing mechanism is driving predictable variations over the target region.
3.4. Predictor selection
Lastly, we examine differences in the frequency of predictor selection between the CMIP6-single-task and OBS-single-task models. Lasso assigns zero values to some regression coefficients, clearly indicating that the associated predictors are less important than the predictors with nonzero coefficients. Thus, we can use nonzero coefficients as a kind of predictor selection. Since lasso is fitted at each grid point, we can collect statistics of predictor selection across grid points. The frequency with which each predictor is selected using CMIP6 and observational training data is shown in Figure 10. Regardless of the forecast model, no predictor is selected consistently across all grid points. That said, a notable distinction between Figure 10a and b is the larger percentage of predictor selection for lasso models fit using CMIP6 data. Generally, Laplacian time series from the Pacific are selected by a large percentage of the individual CMIP6-single-task models. In contrast, the OBS-single-task model shows less agreement in predictor selection across locations. The robust selection of key predictors for the CMIP6-trained models can likely be attributed to improved estimates of the regression coefficients given the vast amount of data used in statistical model training.
4. Conclusion
This paper derives statistical models for predicting wintertime subseasonal temperature over the western US. Our goal was to show that statistical models trained on dynamical model data can be skillful, thereby demonstrating that dynamical models provide information relevant to subseasonal prediction. As a benchmark, we use simple linear regression to predict 2-week mean temperature at each grid point from the Nino3.4 index. This benchmark is compared to lasso-derived models with many more predictors. The lasso coefficients are estimated in two different ways, namely under a single-task or a multi-task formulation. In all cases, the forecasts are validated on observational data that were excluded from the statistical model construction.
With respect to spatial averages of NMSE and temporal correlation, the statistical models trained on CMIP6 data are more skillful than statistical models trained on observational data. Performance of the most skillful statistical model, CMIP6-single-task, is characterized by a spatially averaged NMSE below 1 and a regionally averaged temporal correlation that is positive. This is not the case for the OBS-single-task model, a similarly configured set of statistical models trained on observational data, for which the spatially averaged NMSE is greater than 1 and the averaged temporal correlation is negative. A direct comparison of the MSESS between the CMIP6-single-task and OBS-single-task models shows a greater portion of the forecast region with positive MSESS values for the CMIP6-single-task model, whereas the MSESS for OBS-single-task is characterized by mostly negative values. The positive MSESS identified for the CMIP6-single-task model can be attributed to the statistical model's ability to correctly predict the sign of the temperature anomalies (i.e., positive and field-significant temporal correlation coefficients) over large portions of the western US. In contrast, the OBS-single-task model's skill in predicting the correct sign of the temperature anomalies is largely limited to the Pacific Northwest. The greatest source of error for the two statistical models is the amplitude bias: forecasts from the OBS-single-task model underpredict the magnitude of the temperature anomalies over much of the target region, which in turn accounts for the negative MSESS values. We demonstrate that the size of the training set impacts the skill of the statistical models and conclude that the increase in sample size from using simulated data more than compensates for the limitations due to imperfections in dynamical models. These results are encouraging and suggest that the skill of statistical subseasonal prediction models can be further improved by using both dynamical model simulations and observations in statistical model training.
In general, the single-task models performed better than the multi-task models when evaluated in terms of spatial averages of NMSE and temporal correlation. Even so, the skill of the single-task models is nearly indistinguishable from no skill with regard to these two performance metrics. This low skill on a local basis does not necessarily mean there is no significant skill for large-scale patterns. Notably, the individual CMIP6-single-task models predict coherent large-scale temperature anomalies over the target region, suggesting that statistical model skill might be further improved by isolating predictable large-scale patterns. Lasso was initially chosen because of its potential for interpretability, but for our study this turned out not to be the case. In Trenary and DelSole (2022), we develop a statistical technique that is able to identify predictable large-scale patterns despite the limited local predictability.
Acknowledgments
We acknowledge the World Climate Research Programme’s Working Group on Coupled Modelling, which is responsible for CMIP, and we thank the climate modeling groups (listed in Table 1) for producing and making available their model output.
Author Contributions
L.T. is responsible for writing the original draft and for all formal analysis and investigation. T.D. is responsible for conceptualization of the project and project supervision. Both authors contributed to revising and editing the manuscript. All authors approved the final submitted draft.
Competing Interests
The authors declare no competing interests exist.
Data Availability Statement
For CMIP, the U.S. Department of Energy's Program for Climate Model Diagnosis and Intercomparison provides coordinating support and led the development of software infrastructure in partnership with the Global Organization for Earth System Science Portals; the data can be found at https://esgf-node.llnl.gov/projects/cmip6/. The observational data are provided by the National Oceanic and Atmospheric Administration Climate Prediction Center and can be downloaded directly from https://psl.noaa.gov/data/gridded/index.html.
Ethics Statement
The research meets all ethical guidelines, including adherence to the legal requirements of the United States.
Funding Statement
This research was supported primarily by the National Science Foundation (AGS 1822221). The views expressed herein are those of the authors and do not necessarily reflect the views of this agency.