Impact Statement
This paper addresses the lack of suitable metrics for assessing model performance at extreme events. We aim to develop a method that can identify whether the source of error is driven by poor performance around extremes or by performance on routine data. We propose a weighting function, derived from the observed data, to create a metric that enables practitioners to quantify this distinction; the metric’s utility is demonstrated in a standard hydrology problem. We hope that this metric can facilitate the design and identification of machine learning models better at predicting extreme events, such as flooding, storms, or heatwaves.
1. Introduction
1.1. Background
Often in real-world problems, the main body of data is of less concern than the outliers and extremes; meteorological phenomena, such as heatwaves, storms, or floods, can result in loss of human life, displacement of local populations, destruction of built environment infrastructure, and considerable economic disruption (Barriopedro et al., 2011; Changnon et al., 2000; Easterling, 2000; Fouillet et al., 2006; Jonkman et al., 2009). In the field of hydrology, for example, extensive flooding throughout the UK in 2012 and 2014, arising from successive extreme weather systems and subsequent rainfall, caused significant damage to infrastructure, with total economic damages estimated at approximately
£0.6 billion and £1.3 billion, respectively (Parry et al., 2013; The Environment Agency, 2013; Muchan et al., 2015; Chatterton et al., 2016). In 2012 alone, approximately 20% of the days of the year were classed as flood days. Being able to predict the hydrological behavior in such events is critical for developing adaptation and mitigation strategies. To develop hydrological models capable of predicting these singularities with a high degree of accuracy, we require robust tools to quantitatively assess their performance with regard to said singularities or extremes. However, many of the metrics used to assess and train machine learning models do not reflect the relative importance of these extremes.
For regression tasks, we commonly see metrics such as the root mean squared error (RMSE) or $ {R}^2 $ (Hastie et al., 2009). However, these metrics, whilst sensitive to the magnitude of discrepancy, are insensitive to the location of errors within the target domain (Dawson et al., 2006). Of course, one could choose to ignore certain subsets of the data and instead build metrics that are more sensitive to particular subsets of the target domain, for example, by applying a relevance mapping based on statistical thresholds (Torgo and Ribeiro, 2007; Ribeiro and Moniz, 2020), though the selection of such thresholds might be considered somewhat arbitrary.
The field of extreme value theory, which is concerned with the asymptotic distribution of maxima (Haan and Ferreira, 2006), has been used to develop modeling frameworks for predicting extreme events (Ding et al., 2019; Siffer et al., 2017; Boulaguiem et al., 2022). This research includes the development of an extreme loss function based on the Fisher–Tippett–Gnedenko theorem for neural models (Ding et al., 2019) and outlier detection using automatically set thresholds (Siffer et al., 2017). Although these methods allow for evaluating model performance on extreme values, our objective is to develop a framework that does not require event separation and that can be applied across the target domain more generally, covering local minima and maxima as well, such as those of Gaussian mixture distributions.
In this work, we develop a dimensionless metric for assessing the distribution of errors in the target domain more generally. Because the metric is dimensionless, it enables comparison between different instances of a problem in the same field. If we again use the hydrological context as a motivating example, we might wish to apply the same modeling approach to different rivers. However, their average discharge values will likely differ and, thus, the unnormalized mean errors arising from model predictions will be of different scales. Our metric enables quantification of relative performance around extremes compared with what we term routine data, being the data that are considered close to the peaks of probability density, with respect to the probability distribution that best fits the observed data. Our hope is that doing so can supplement analysis and enable the evaluation of model performance according to the distribution of the observations, with a view to more targeted model optimization. We begin by reviewing a commonly used error metric and from it derive a new error metric. We then apply this metric to synthetic, illustrative data and to a real-world application, one where extremes are of considerable importance.
2. Methodology
2.1. Root mean squared error
Our proposed metric is derived from the ubiquitous RMSE. The RMSE is computed from the squared differences between all paired observations, $ {y}_i $, and predictions, $ {\hat{y}}_i $, as in Equation 2.1; in a machine learning context, it is typically evaluated both during training, giving the performance of the fit over the training set, and when testing a model to evaluate its ability to generalize (Hastie et al., 2009).
$ \mathrm{RMSE}=\sqrt{\frac{1}{n}\sum_{i=1}^n{\left({y}_i-{\hat{y}}_i\right)}^2} $ (2.1)
The RMSE lies within the interval
$ \left[0,\infty \right) $
, where
$ 0 $
is a perfect score, and higher scores indicate lower model performance. Whilst this metric provides an overall assessment of the accuracy of a model, it has two flaws: (1) RMSE is not invariant to rescaling of the data, meaning that it does not allow for comparison of a single model archetype across application domains where the observations are of different scales; and (2) RMSE is not sensitive to the distribution of errors. The implication of the latter issue, which is common to other error metrics such as $ {R}^2 $, is that these metrics cannot be used to identify whether or not extremes are the main contributor to the error. For the real-world problems identified in Section 1, being able to assess predictive power around extremes is essential.
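As a point of reference, the RMSE of Equation 2.1 can be computed in a few lines; the following Python snippet is purely illustrative and the function name is our own.

```python
import numpy as np

def rmse(y_obs, y_pred):
    """Root mean squared error over paired observations and predictions."""
    y_obs, y_pred = np.asarray(y_obs), np.asarray(y_pred)
    return np.sqrt(np.mean((y_obs - y_pred) ** 2))
```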
2.2. Reflective error
We define our reflective error (RE) metric as the ratio between a weighted form of RMSE, such as that proposed by Ribeiro and Moniz (2020), and the standard RMSE. It alleviates both of the issues facing RMSE because (1) dividing the weighted form of RMSE by the standard form of RMSE is a normalization process that places all applications of RE onto the same scale and (2) the weighting function we utilize is derived by fitting an empirical distribution to the target outputs and, thus, provides local sensitivity by minimizing routine error relative to extreme error. For machine learning applications, this distribution would be fitted to the training data targets. Throughout this paper, we refer to this process as fitting the underlying probability distribution,
$ U(y) $
. From this fitted probability distribution,
$ U(y) $
, we define the reflective weighting function,
$ \Psi (y) $
, in Equation 2.2, with scaling factor
$ \kappa =\underset{y}{\max}\left(U(y)\right) $
.
$ \Psi (y)=\beta -\alpha \frac{U(y)}{\kappa } $ (2.2)
Where
$ \alpha =1 $
and
$ \beta =1 $
, such that the weighting function is applied to the error at any given
$ y $
and is a mapping
$ \Psi (y):Y\to \left[0,1\right]\;\forall\; y\in Y $
. Consequently,
$ \Psi (y)\to 1 $
for extremes and
$ \Psi (y)\to 0 $
for routine data. Thus, we formulate our metric in terms of
$ \Psi (y) $
and the squared error in Equation 2.3:
$ \mathrm{RE}=\sqrt{\frac{\sum_{i=1}^n\Psi \left({y}_i\right){\left({y}_i-{\hat{y}}_i\right)}^2}{\sum_{i=1}^n{\left({y}_i-{\hat{y}}_i\right)}^2}} $ (2.3)
For most applications, RE lies within the interval
$ \left[0,1\right] $
, though for instances where the total error is
$ 0 $
or
$ \infty $
, RE is undefined. Where RE is defined,
$ 0 $
would indicate that all error comes from the point of highest probability density and
$ 1 $
would indicate that all error comes from the most extreme data point; whilst these scores are unlikely to occur in real problems, the intuition remains that a high RE indicates the error is primarily extreme-driven and a low RE that it is routine-driven. This stationary expression of RE is applied to a fictitious dataset in Section 2.3.
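To make the construction concrete, the following Python sketch fits the underlying distribution $ U(y) $ with scipy, assumes the linear weighting form of Equation 2.2 with $ \alpha =\beta =1 $, and evaluates RE as the ratio of the weighted to the unweighted RMSE; the function names and the choice of a normal distribution are illustrative assumptions rather than the released implementation.

```python
import numpy as np
from scipy import stats

def reflective_weight(y, dist, alpha=1.0, beta=1.0):
    """Reflective weighting Psi(y) derived from a fitted distribution U(y).

    Assumes the linear form beta - alpha * U(y) / kappa, with
    kappa = max_y U(y), so that Psi -> 0 at the density peak and
    Psi -> 1 in the tails when alpha = beta = 1.
    """
    grid = np.linspace(dist.ppf(1e-6), dist.ppf(1 - 1e-6), 10_000)
    kappa = dist.pdf(grid).max()  # approximate peak density
    return beta - alpha * dist.pdf(y) / kappa

def reflective_error(y_obs, y_pred, dist):
    """RE as the ratio of the Psi-weighted RMSE to the standard RMSE."""
    sq_err = (np.asarray(y_obs) - np.asarray(y_pred)) ** 2
    psi = reflective_weight(np.asarray(y_obs), dist)
    return np.sqrt(np.sum(psi * sq_err) / np.sum(sq_err))

# Fit U(y) to the observations (here a normal distribution) and score a model.
rng = np.random.default_rng(0)
y_obs = rng.normal(3.5, 0.75, 500)
y_pred = y_obs + rng.normal(0, 0.1, 500)
dist = stats.norm(*stats.norm.fit(y_obs))
print(reflective_error(y_obs, y_pred, dist))
```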
Should the practitioner be dealing with a distribution with multiple tails, but where only one is of interest or the tails require separate treatment, then the weighting could be split piecewise about the global maximum probability density or other local maxima. For example, if considering a two-tailed distribution, such as a Gaussian, and the higher extremes were of more concern, then the following weighting in Equation 2.4 could be applied; a sketch of this variant is given after this paragraph.
$ {\Psi}_{S^{+}}(y)=\begin{cases}\Psi (y), & y\ge \underset{y}{\arg \max }\,U(y)\\ {}0, & \mathrm{otherwise}\end{cases} $ (2.4)
Where
$ \Psi (y) $
is the base weighting function, as used in Equation 2.3, and
$ {\Psi}_{S^{+}} $
is the piecewise form.
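Under the same assumptions, a one-tailed variant of the weighting can be sketched by zeroing the weight below the mode of the fitted distribution, reusing reflective_weight from the sketch above; the split point and function name are our own illustrative choices.

```python
def reflective_weight_upper(y, dist, alpha=1.0, beta=1.0):
    """One-tailed weighting: apply Psi only above the density peak.

    Values below the mode of the fitted distribution receive zero weight,
    so only errors in the upper tail contribute to the weighted RMSE.
    """
    grid = np.linspace(dist.ppf(1e-6), dist.ppf(1 - 1e-6), 10_000)
    mode = grid[np.argmax(dist.pdf(grid))]
    psi = reflective_weight(y, dist, alpha, beta)
    return np.where(np.asarray(y) >= mode, psi, 0.0)
```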
2.3. Stationary synthetic data
In order to demonstrate the sensitivity of RE to extreme versus routine errors, we present a one-dimensional problem, using a normal distribution, to which we fit a model with some error. We then proceed to magnify that error at routine data points and at extremes to show how RE varies whilst RMSE and $ {R}^2 $ are relatively unaffected.
We create an artificial, normally distributed dataset
$ X\sim \mathcal{N}\left(3.5,{0.75}^2\right) $
of 500 data points, which is of similar size to datasets commonly used to demonstrate statistical methods, such as the Iris petal or Ionosphere datasets (Sigillito et al., 1989; Bezdek et al., 1999). This dataset will form our observations, and our predictive “model” is the same data with the addition of some Gaussian noise term
$ {\varepsilon}_1\sim \mathcal{N}\left(0,{0.1}^2\right) $
. This model fits the data with
$ RMSE=0.203 $
,
$ {R}^2=0.924 $
, and
$ RE=0.542 $
.
For both of the error scenarios, we add some randomly generated error,
$ {\varepsilon}_2\sim \mathcal{N}\left(0,{1.0}^2\right) $
. For our routine error scenario, we add this error to the 100 data points closest to the mean, whilst for the extreme error scenario, we add this to the 100 data points furthest from the mean, in either direction. The predictions against observations for both of these scenarios, along with a histogram of the original dataset and accompanying relative weighting function,
$ \Psi (y) $
, are shown in Figure 1.

Figure 1. Normally distributed dataset of synthetic observations with fitted probability density and reflective weighting functions (left) with the predictions versus observations for the two perturbation scenarios (center and right).
For the routine error scenario, the
$ RMSE=0.48 $
,
$ {R}^2=0.60 $
, and
$ RE=0.31 $
; for the extreme error scenario, the
$ RMSE=0.49 $
,
$ {R}^2=0.59 $
, and
$ RE=0.80 $
. The values of the standard error metrics we used, the RMSE and $ {R}^2 $, for both of the scenarios are roughly equivalent. However, there is a significant discrepancy between the RE values; a high value of RE for the extreme scenario indicates the error is skewed towards extremes, and the opposite is true for the routine error scenario. Therefore, our RE metric is sensitive to the location of error and identifies the corresponding relative contribution of error as intended.
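The two perturbation scenarios can be reproduced along the following lines, reusing reflective_error from the sketch in Section 2.2; the random seed is arbitrary, so the exact values will differ slightly from those reported above.

```python
rng = np.random.default_rng(42)
y_obs = rng.normal(3.5, 0.75, 500)
base_pred = y_obs + rng.normal(0, 0.1, 500)        # baseline "model"
dist = stats.norm(*stats.norm.fit(y_obs))

# The 100 points closest to and furthest from the mean.
order = np.argsort(np.abs(y_obs - y_obs.mean()))
scenarios = {"routine": order[:100], "extreme": order[-100:]}

for name, idx in scenarios.items():
    y_pred = base_pred.copy()
    y_pred[idx] += rng.normal(0, 1.0, idx.size)    # magnified error
    rmse_val = np.sqrt(np.mean((y_obs - y_pred) ** 2))
    re_val = reflective_error(y_obs, y_pred, dist)
    print(f"{name}: RMSE={rmse_val:.2f}, RE={re_val:.2f}")
```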
3. Nonstationary methodology
3.1. Nonstationary RE
Our initial definition of RE holds assuming that
$ U(y) $
is stationary and does not change over time; however, for some practical problems, the stationary assumption is not appropriate and the above, stationary form of RE does not immediately apply. For example, if we consider the trend in mean global surface temperature over the past 150 years, then what was extreme in 1850 would not be considered extreme today (Masson-Delmotte et al., 2021); thus, our definition of extreme must change over time and so too must the weighting we apply. We therefore extend our metric to enable application to nonstationary problems, where the probability distribution of the target domain data and the associated parameters change over time, and we require local sensitivity.
If we take some subdomain,
$ \phi $
, of the whole target domain,
$ Y $
, such that the subdomain could be a series of points between two limits or at a single point in time, then we can determine a subdomain-specific weighting function,
$ {\Psi}_{\phi }(y) $
. For a range of points between two limits, such as for a step change, we would empirically fit a probability distribution,
$ {U}_{\phi }(y) $
, to the data between those limits; if, however, there were a trend in the probability distribution, then we could empirically derive parameters of the probability distribution at a given point according to some binning strategy. This weighting function,
$ {\Psi}_{\phi }(y) $
, is then expressed in terms of the subdomain’s probability distribution,
$ {U}_{\phi }(y) $
, as in Equation 2.2, with a scaling factor
$ \kappa =\underset{y}{\max}\left({U}_{\phi }(y)\right) $
.
$ \mathrm{RE}=\sqrt{\frac{\sum_{i=1}^n{\Psi}_{\phi}\left({y}_i\right){\left({y}_i-{\hat{y}}_i\right)}^2}{\sum_{i=1}^n{\left({y}_i-{\hat{y}}_i\right)}^2}} $ (3.1)
Where
$ {\Psi}_{\phi}\left({y}_i\right) $
is the weighting to be applied to the error at any given
$ y $
based on the underlying probability distribution corresponding to values
$ y $
within the local subdomain,
$ \phi $
, of the target domain,
$ Y $
. More concretely, for any
$ \phi \subseteq Y $
, we can define a weighting function
$ {\Psi}_{\phi }(y):\phi \to \left[0,1\right] $
. Note that for stationary cases, where the subdomain
$ \phi =Y $
, this simplifies to the form of RE, expressed in Equation 2.3.
For the set of subdomains,
$ \Phi $
, defining the temporal limits
$ \forall \phi \in \Phi $
impacts the resulting weighting functions,
$ {\Psi}_{\phi }(y) $
, and care must therefore be taken in identifying the evolution of nonstationary trends over time. Methods for establishing nonstationary trends (Bell, 1984; Box and Tiao, 1965; Wu et al., 2007) can be used to establish the nonstationarity of statistical parameters and guide the construction of
$ \Phi $
We also note that characteristic information or extrema may be missing from undersampled or poorly aliased data (Proakis and Manolakis, 2007), for example, those not complying with the Nyquist rate; in such cases, the probability distribution to be fitted to any subdomain,
$ \phi $
, would likely not be representative of the true signal, making the application of nonstationary RE difficult.
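A minimal sketch of the nonstationary form, reusing reflective_weight from Section 2.2 and assuming each observation has already been assigned to a subdomain $ \phi $ (for example, via a change-point or binning strategy) with a normal distribution as an adequate local fit, is given below; both assumptions are illustrative.

```python
def nonstationary_reflective_error(y_obs, y_pred, subdomain_ids):
    """Nonstationary RE: fit U_phi and Psi_phi separately per subdomain."""
    y_obs, y_pred = np.asarray(y_obs, float), np.asarray(y_pred, float)
    sq_err = (y_obs - y_pred) ** 2
    psi = np.empty_like(y_obs)
    for phi in np.unique(subdomain_ids):
        mask = subdomain_ids == phi
        dist_phi = stats.norm(*stats.norm.fit(y_obs[mask]))  # local U_phi
        psi[mask] = reflective_weight(y_obs[mask], dist_phi)
    return np.sqrt(np.sum(psi * sq_err) / np.sum(sq_err))
```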
3.2. Nonstationary fictitious data
To illustrate the need for local sensitivity, for cases where
$ {\phi}_n\subset Y $
, we present a fictitious, nonstationary signal with step changes, requiring the empirical derivation of the probability distribution before and after each step change. These observations are generated according to the piecewise function in Equation 3.2.

The probability distribution over the entire temporal domain is, therefore, given by a Gaussian mixture distribution. Similar to the process for the simple stationary experiment, we create our model by adding some Gaussian error to the observations,
$ \varepsilon \sim \mathcal{N}\left(0,{0.1}^2\right) $
. We then take a single prediction from the second subdomain, where
$ {t}_1\le t<{t}_2 $
and
$ {\mu}_2=1 $
, and add significant error to it such that it lies close to the mean,
$ {\mu}_3=-1 $
, of the third subdomain,
$ {t}_2\le t<{t}_3 $
, as shown in Figure 2. We will term this anomalous data point
$ {y}_{\delta } $
. If we were to use the stationary form of RE, then given that this error lies close to the maximum,
$ \Psi \left({y}_{\delta}\right)\to 0 $
; however, the nonstationary form of RE is such that
$ \Psi \left({y}_{\delta}\right)\to 1 $
and the effect of
$ {y}_{\delta } $
is maximized. The resulting
$ RE=0.45 $
if we assumed stationarity but, by accounting for temporal variation in the fitted probability distributions,
$ RE=0.83 $
, with the metric showing more skew towards errors around the extremes. Obviously, this dataset is fictitious but it does demonstrate a potentially desirable change in behavior in the face of nonstationarity.

Figure 2. Synthetic temporally variant dataset with predictions and observations shown about the mean function and within 2 standard deviations with anomalous observation highlighted (left); and the overall probability distribution fitted to the data with normalized histogram of the observations (right).
4. Application
As a real-world example, we present a hydrological modeling problem, using machine learning, where the extremes are of particular concern, given that these extremes have the potential to give rise to significant flooding (Faulkner, 2008; Kendon and McCarthy, 2015; Muchan et al., 2015). The use and potential of machine learning in hydrology has been covered in the literature (Abrahart et al., 2012; Besaw et al., 2010; Govindaraju et al., 2000), so we assume its application is likewise suitable here. In addition to RMSE, we will also use Nash-Sutcliffe efficiency (NSE; Nash and Sutcliffe, 1970); mathematically similar to
$ {R}^2 $
, NSE is a normalization that enables comparison across catchments. Whilst NSE and other related metrics, such as the Kling-Gupta Efficiency (Gupta et al., 2009), are often used in conjunction with RMSE (Ritter and Muñoz-Carpena, 2013), our metric will supplement this further through the attribution of error to extremes, as mentioned at the beginning of Section 2.
Using Rouse et al. (2025) as the experimental basis, we take mean gauged daily streamflow observations for a pair of river catchments, the River Avon at Bathford and the River Exe at Pixton, from the United Kingdom’s National River Flow Archive (NRFA), provided by the UK Centre for Ecology & Hydrology (2022). These two rivers have been selected from the database due to the relatively different modeling challenges they present, in that the Exe at Pixton is a much smaller, flashier catchment, with an area of 159.7 km² compared to 1552 km² for the Avon at Bathford, which will thus likely make the prediction of extremes harder. This is exacerbated by the very low permeability of the catchment, which is shown amongst other characteristics in Figure 3.

Figure 3. Maps of elevation, land use, and geology for the Avon at Bathford and Exe at Pixton (note that the catchments are shown at different scales for visibility). Keys corresponding to each map represent the proportion of each within respective subcategories. Adapted from the National River Flow Archive (UK Centre for Ecology & Hydrology, 2022).
Input meteorological data has been taken from ERA5, the fifth generation of global climate reanalysis modeling and data assimilation output produced by the European Centre for Medium-Range Weather Forecasts (ECMWF), which we assume to be congruent with observations (Hersbach et al., 2023, 2020; Tarek et al., 2020); in spite of this, we note that the resolution of ERA5 data, at a grid scale of 31 km, may not fully capture the spatial dynamics driving flow in a small catchment like the Exe at Pixton, but that downscaling the precipitation is out of scope for this work. We use 14 days’ worth of surface precipitation and daily average temperature, relative humidity, and resultant wind speed at a pressure level of 1000 hPa within the catchment, along with antecedent proxies for soil moisture, as per Rouse et al. (2025). Similarly, we adopt the model setup and training procedure for a simple artificial neural network (Rumelhart et al., 1986), with layer nodes of
$ 56\to 16\to 4\to 1 $
, the Sigmoid Linear Unit activation function (Ramachandran et al., 2017), and the Adam optimization algorithm (Kingma and Ba, 2017). The data, running from 1979 to 2019, is split into the same test, validation, and training subsets, with the training set containing the years 1979–2008 and the test set containing the years 2011–2019.
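For concreteness, the architecture described above can be written in PyTorch as follows; the optimizer settings, batching, and training loop details beyond what is stated in the text are illustrative assumptions rather than the exact published configuration.

```python
import torch
from torch import nn

# 56 -> 16 -> 4 -> 1 feed-forward network with SiLU activations.
model = nn.Sequential(
    nn.Linear(56, 16), nn.SiLU(),
    nn.Linear(16, 4), nn.SiLU(),
    nn.Linear(4, 1),
)
optimizer = torch.optim.Adam(model.parameters())
loss_fn = nn.MSELoss()

def train_step(x_batch, y_batch):
    """Single optimization step on one batch of (features, streamflow)."""
    optimizer.zero_grad()
    loss = loss_fn(model(x_batch).squeeze(-1), y_batch)
    loss.backward()
    optimizer.step()
    return loss.item()
```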
We have assumed that the streamflow,
$ Y $
, can be described by a lognormal distribution with parameters,
$ \mu $
and
$ \sigma $
, such that
$ Y\sim \mathrm{\mathcal{L}}\mathcal{N}\left(\mu, {\sigma}^2\right) $
. Using maximum likelihood estimation (MLE) fitted to a 365-day rolling window of streamflow, we determined that
$ \mu $
and
$ \sigma $
did not exhibit any significant temporal trend over the study period, as shown in Figure 4. Therefore, although climate is nonstationary, we believe the stationary assumption to be valid, and so we use stationary RE. For problems with clear nonstationarity, we would use the nonstationary formulation of Equation 3.1 and, given that we would expect a gradual rather than step change in statistical distribution, would apply a rolling window of appropriate timescale such that the weighting function,
$ {\Psi}_{\phi}\left({y}_i\right) $
, is likewise continuous. The 365-day rolling window used above would likely capture seasonality but could be set longer to capture longer-term climatic oscillations, such as the El Niño–Southern Oscillation, if pertinent.
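The rolling-window fit can be sketched as follows; because the two-parameter log-normal MLE reduces to the mean and standard deviation of the log-transformed flows, no explicit optimizer is required. The function name and use of pandas are our own illustrative choices.

```python
import numpy as np
import pandas as pd

def rolling_lognormal_params(flow: pd.Series, window_days: int = 365) -> pd.DataFrame:
    """MLE parameters (mu, sigma) of a log-normal fitted to a rolling window."""
    log_flow = np.log(flow)
    mu = log_flow.rolling(window_days).mean()
    sigma = log_flow.rolling(window_days).std(ddof=0)  # MLE uses 1/n
    return pd.DataFrame({"mu": mu, "sigma": sigma})
```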

Figure 4. Parameters for a log-normal probability density function fitted to a 365-day rolling window of streamflow observations for the River Avon and the River Exe.
Test set performance metrics are:
$ RMSE=9.35{m}^3{s}^{-1} $
,
$ NSE=0.87 $
, and
$ RE=0.92 $
for the River Avon at Bathford; and
$ RMSE=2.68{m}^3{s}^{-1} $
,
$ NSE=0.75 $
, and
$ RE=0.95 $
for the River Exe at Pixton. A subset of the observed and predicted streamflow, specifically for the year 2012 when significant streamflow events occurred, is shown in Figure 5. This example demonstrates the RMSE’s scaling issues, given that it is greater for the Avon at Bathford than for the Exe at Pixton due to the former’s higher capacity and average streamflow, yet the overall fit for the Avon at Bathford is better, as evidenced by the better NSE score. Our metric, RE, indicates that the driver of this error for both rivers is weaker model performance around the extremes, with this being more problematic for the River Exe, quantifying that relative inability to predict the Spring and Winter extremes in the year 2012.

Figure 5. Predicted and observed streamflow time series in the year 2012 for the River Avon and River Exe, with predictions generated using a basic artificial neural network model.
We have demonstrated in this paper the ability of RE to provide quantification of the shape of model error, with both idealized synthetic data and real-world data. In the real-world context, by utilizing RE in conjunction with existing error metrics, we obtain important quantification of the distribution of error relative to the observed dataset. This has the potential to enable more targeted model optimization. Furthermore, at the end of Section 2.2, we described the adaptability of RE to focus on one tail; in practice, hydrology is one area where this approach could be more suitable if investigating floods and droughts separately.
Although we did not test RE on discrete probability distributions, the extension to such situations is straightforward: the weighting function,
$ \Psi (y) $
, generates a series of discrete weights, rather than a continuous one. However, for uniformly distributed data, the value of
$ \Psi (y) $
is trivially either
$ 1 $
or
$ 0 $
and is, consequently, only sensitive to errors outside of the bounds of the distribution.
5. Reflective loss function
We further extend this framework to the loss function in a neural network. The mean squared error (MSE) is a commonly used loss function in neural networks (Hastie et al., 2009; Bishop, 2016) to which our modification, specifically the unnormalized form of RE, is easily applied. This loss function,
$ {L}_{RE} $
, which we propose for training parametric machine learning models in place of the often-used MSE, increases the relative penalty applied to extremes and is expressed in Equation 5.1.
$ {L}_{RE}=\frac{1}{n}\sum_{i=1}^n\left(\beta -\alpha \frac{U\left({y}_i\right)}{\kappa}\right){\left({y}_i-{\hat{y}}_i\right)}^2 $ (5.1)
Whereas for the error metric, we let
$ \alpha =1 $
and
$ \beta =1 $
, we now consider
$ \alpha $
and
$ \beta $
as tunable hyperparameters. Thus, we can adjust the penalty applied to routine and extreme data. In order to minimize the routine loss and maximize the extreme loss, the condition
$ \alpha \le \beta $
must be true.
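A PyTorch sketch of this loss is given below, assuming the same linear weighting form as Equation 5.1, with the density $ U\left({y}_i\right) $ of each training target precomputed from the distribution fitted to the training data; the function name and interface are hypothetical.

```python
import torch

def reflective_loss(y_pred, y_true, density, kappa, alpha=1.0, beta=2.0):
    """Psi-weighted mean squared error (unnormalized RE) as a training loss.

    `density` holds U(y_true) from the fitted target distribution and `kappa`
    is its maximum; with alpha <= beta the routine weight (beta - alpha)
    stays non-negative, and alpha = beta drives the routine penalty to zero.
    """
    psi = beta - alpha * density / kappa
    return torch.mean(psi * (y_true - y_pred) ** 2)
```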

To test the loss function, we utilize the same hydrological problem as described in the previous section, modifying the values for
$ \alpha $
and
$ \beta $
using a grid search method whilst keeping all other aspects of network architecture and training the same, such as the training and test set compositions. The results for different
$ \alpha $
and
$ \beta $
pairs are shown in Figure 6, in terms of the difference between
$ \beta $
and
$ \alpha $
.

Figure 6. Performance in terms of NSE and RE for different
$ \alpha $
and
$ \beta $
pairs in the Reflective Loss Function used for training a neural network on streamflow data from the Rivers Avon ((a) and (b)) & Exe ((c) and (d)).
Our objective is to identify values that improve NSE whilst also reducing RE. From the investigated pairs of
$ \alpha $
and
$ \beta $
, low values of
$ \alpha $
resulted in poor performance due to their adverse impact on the backpropagation algorithm. At suitably large
$ \alpha $
, where
$ \alpha \ge \frac{1}{2} $
, and for the parameter values where the loss function would minimize the error for routine data to 0, that is where
$ \alpha =\beta $
, the network achieves an improvement in RE for a minor reduction in the standard error metrics. However, for the cases where
$ \alpha <\beta $
, we achieve an improvement in RE without sacrificing overall performance. In other words, we achieve better extreme predictive capability relative to that for routine data. For most values of
$ \alpha $
, the improvement between different pairs of
$ \alpha $
and
$ \beta $
, for
$ \alpha <\beta $
, is more marginal but these combinations achieve better performance than where
$ \alpha =\beta $
. The best results are obtained for
$ 0.5\le \alpha \le 5 $
and at
$ 0.5\le \left(\beta -\alpha \right)\le 1 $
. Therefore, when optimizing neural models using RE as a loss function, values such as
$ \alpha =1 $
and
$ \beta =2 $
provide improved results. Using a hyperparameter optimization framework, such as Bayesian optimization (Snoek et al., 2012), might be a more expedient route to identifying optimal values for
$ \alpha $
and
$ \beta $
pairs than the grid search implemented here.
We further demonstrate this by showing improved predictive performance for some of the peaks in the 2012–2013 and 2013–2014 winter periods from the RE Loss model, with parameters
$ \alpha =1 $
and $ \beta =2 $
, compared to the MSE Loss model in Figure 7. Although the improvement in performance at extremes is noticeable, we remain cognizant of the fact that this method cannot address deficiencies in the representation of extremes within the training data. For example, instances such as the peaks in the 2012 period for the River Exe at Pixton are not represented within the training data at all, and addressing issues with both the spatial resolution of and bias in the precipitation data could further improve extreme performance; further analysis of the cause of these extremes is not presented within this study, but it does highlight the need for domain expertise in the construction of modeling frameworks.

Figure 7. Focused time series for winter periods for the River Avon showing observations and predictions generated from the Neural Network model using both the MSE Loss and RE Loss functions.
6. Conclusion
The RE metric we have presented here is not intended to supplant the existing error metrics; indeed, it does not directly provide quantification of the magnitude of error. Instead, our metric enables the quantification of how a model’s performance is distributed, such as around the extremes of a dataset, and supplements quantification of the magnitude of error. Therefore, we recommend that RE is used alongside existing error metrics to provide a more comprehensive view of model performance.
RE can help to provide a method of quantifying relative extreme performance through a single robust and coherent number in real-world data problems, such as the hydrological problem we presented. This could extend to all manner of fields where predicting the response of systems is of significantly more value around extremes, such as the treatment of the impacts from meteorological phenomena in a nonstationary climate, where the probability distributions are subject to change (see, for example, Bhatia et al., 2019; Murakami et al., 2020; Yoshida et al., 2017).
We also presented a framework for tailoring the optimization of machine learning models, specifically neural models, to enable better prediction of extremes. Again, this improvement should not come at the expense of other facets of machine learning model training, such as ensuring representation of extremes within the data, but it is an adaptation that can be included to help drive performance toward extremes at minimal additional complexity for the practitioner.
Open peer review
To view the open peer review materials for this article, please visit http://doi.org/10.1017/eds.2025.16.
Author contributions
Conceptualization: R.E.R. and H.M.; Data Curation: R.E.R.; Funding acquisition: R.E.R., S.H., A.M., and E.S.; Methodology: R.E.R. and H.M.; Project administration: R.E.R., S.H., A.M., and E.S.; Software: R.E.R.; Supervision: S.H., A.M., and E.S.; Visualization: R.E.R. and H.M.; Writing—Original draft: R.E.R. and H.M.; Writing—Review and editing: R.E.R. and H.M.
Competing interests
The authors declare no competing interests exist.
Data availability statement
The data required to reproduce these results are available at the National River Flow Archive (Dixon et al., 2013) using the data portal https://nrfa.ceh.ac.uk/data/search or API https://nrfaapps.ceh.ac.uk/nrfa/nrfa-api.html and from the Copernicus Climate Data Store https://doi.org/10.24381/cds.143582cf. We have made the code to run the model available at https://doi.org/10.5281/zenodo.14933256.
Ethical standard
The research meets all ethical guidelines, including adherence to the legal requirements of the study country.
Funding statement
This research was supported by a grant from the Engineering and Physical Sciences Research Council (EP/R512461/1). This research was supported by a Fellowship from the Royal Commission for the Exhibition of 1851.