Impact statement
This review addresses the gap that different types of ML methods for hydrological time series prediction in unmonitored sites are rarely compared in detail, leaving best practices unclear. We consolidate and synthesize state-of-the-art ML techniques for researchers and water resources managers, describing the strengths and limitations of different ML techniques to allow a more informed selection of existing ML frameworks and the development of new ones. We highlight open questions that require further investigation to encourage researchers to address specific issues such as training data and input selection, model explainability, and the incorporation of process-based knowledge.
1. Introduction
Environmental data for water resources often do not exist at the appropriate spatiotemporal resolution or coverage for scientific studies or management decisions. Although advanced sensor networks and remote sensing are generating more environmental data (Hubbard et al., Reference Hubbard, Varadharajan, Wu, Wainwright and Dwivedi2020; Reichstein et al., Reference Reichstein, Camps-Valls, Stevens, Jung, Denzler and Carvalhais2019; Topp et al., Reference Topp, Pavelsky, Jensen, Simard and Ross2020), the number of observations available will continue to be inadequate for the foreseeable future, notably for variables that are only measured at a few locations. For example, the United States Geological Survey (USGS) streamflow monitoring network covers less than 1% of stream reaches in the United States, with monitoring sites declining over time (Ahuja, Reference Ahuja2016; Konrad et al., Reference Konrad, Anderson, Restivo and David2022), and stream coverage is significantly lower in many other parts of the world. Similarly, just over 12,000 of the 185,000 lakes of at least 4 hectares in area in the conterminous US (CONUS) have at least one lake surface temperature measurement (Willard et al., Reference Willard, Read, Topp, Hansen and Kumar2022b), and less than 5% of those have 10 or more days with temperature measurements (Read et al., Reference Read, Carr, De Cicco, Dugan, Hanson, Hart, Kreft, Read and Winslow2017). Since observing key variables at scale is prohibitively costly (Caughlan and Oakley, Reference Caughlan and Oakley2001), models that use existing data and transfer information to unmonitored systems are critical to closing these data gaps. The problem of streamflow and water quality prediction in unmonitored basins in particular has been a longstanding area of research in hydrology due to its importance for infrastructure design, energy production, and management of water resources.
The need for these predictions has grown with changing climate, increased frequency and intensity of extreme events, and widespread human impacts on water resources (Blöschl et al., Reference Blöschl, Bloschl, Sivapalan, Wagener, Savenije and Viglione2013; Guo et al., Reference Guo, Zhang, Zhang and Wang2020b; Salinas et al., Reference Salinas, Laaha, Rogger, Parajka, Viglione, Sivapalan and Blöschl2013; Sánchez-Gómez et al., Reference Sánchez-Gómez, Martínez-Pérez, Sylvain, Sastre-Merlín and Molina-Navarro2023; Zounemat-Kermani et al., Reference Zounemat-Kermani, Batelaan, Fadaee and Hinkelmann2021).
A variety of models—process-based, machine learning (ML), and statistical—have been used to predict key ecosystem variables. These models can be applied to a few categories of applications where data are unavailable at the spatial and temporal scales needed for environmental decision-making. The first is based on data completeness, which could occur when (a) a site is not monitored at all; (b) a site is monitored, but the time series has large chunks of missing data or is available for a limited period; or (c) a site is monitored, but the time series has sporadic missing data. A second is based on data resolution, when (a) a site is monitored but at a lower resolution than desired, or (b) a site is not monitored but data for other covariates are available. In this paper, we define the problem of predictions in unmonitored sites, or the ‘unmonitored’ scenario, as specifically the cases where the locations have either no monitoring data at all for the variable of interest or monitoring data so sparse or low-resolution that they can effectively be considered unmonitored. In cases where data need to be gap-filled or extended forward or backward in time, a model can be trained on a time period within one site and then predictions are made for new time periods at the same site. This is often referred to as the monitored prediction scenario, or the gauged scenario in streamflow modeling. While temporal predictions in monitored sites are important, spatial extrapolation to unmonitored sites is even more crucial, because the vast majority of locations remain unmonitored for many environmental variables of interest.
Traditionally, water resources modeling in unmonitored sites has relied on the regionalization of process-based models. Regionalization techniques relate the parameter values of a model calibrated to the data of a monitored site to the inherent characteristics of the unmonitored site (Razavi and Coulibaly, Reference Razavi and Coulibaly2013; Seibert, Reference Seibert1999; Yang et al., Reference Yang, Magnusson and Xu2019b). However, large uncertainties and mixed success have prevented process-based model regionalization from being widely employed in hydrological analysis and design (Bastola et al., Reference Bastola, Ishidaira and Takeuchi2008; Prieto et al., Reference Prieto, Le Vine, Kavetski, Garcia and Medina2019; Wagener and Wheater, Reference Wagener and Wheater2006). A major issue that makes process-based model calibration and regionalization difficult is the complex relationships between model parameters (e.g., between soil porosity and soil depth in rainfall-runoff models) (Kratzert et al., Reference Kratzert, Herrnegger, Klotz, Hochreiter and Klambauer2019a; Oudin et al., Reference Oudin, Andreassian, Perrin, Michel and Le Moine2008), which leads to the problem of equifinality (Beven and Freer, Reference Beven and Freer2001) where different parameter values or model structures are equally capable of reproducing a similar hydrological outcome. Additionally, process models require significant amounts of site-specific data collection and computational power for calibration and benchmarking, which is expensive to generate across diverse regions of interest.
On the other hand, ML models built using data from large-scale monitoring networks perform regionalization implicitly, without depending on expert knowledge or pre-defined hydrological models, and often without any hydrological knowledge at all. Since ML models have significantly more flexibility in how parameters and connections between parameters are optimized, unlike process-based models where each parameter represents a specific system component or property, issues relevant to equifinality become largely irrelevant (S. Razavi et al., Reference Razavi, Hannah, Elshorbagy, Kumar, Marshall, Solomatine, Dezfuli, Sadegh and Famiglietti2022). In recent years, numerous ML approaches spanning a variety of methods and applications in hydrology and water resources engineering have been explored for environmental variable time series predictions in unmonitored locations. Most ML approaches for predictions in unmonitored regions focus on streamflow, but as data collection and modeling continue to advance, they are rapidly expanding to other variables such as soil moisture (Fang et al., Reference Fang, Pan and Shen2018), stream temperature (Rahmani et al., Reference Rahmani, Shen, Oliver, Lawson and Appling2021; Weierbach et al., Reference Weierbach, Lima, Willard, Hendrix, Christianson, Lubich and Varadharajan2022), lake temperature (Willard et al., Reference Willard, Read, Topp, Hansen and Kumar2022b), and river and lake water quality. ML models have repeatedly outperformed common process-based hydrological models in terms of both predictive performance and computational efficiency at large spatial scales (Kratzert et al., Reference Kratzert, Klotz, Herrnegger, Sampson, Hochreiter and Nearing2019b; Oğuz and Ertuğrul, Reference Oğuz and Ertuğrul2023; Read et al., Reference Read, Jia, Willard, Appling, Zwart, Oliver, Karpatne, Hansen, Hanson and Watkins2019; Sun et al., Reference Sun, Jiang, Mudunuru and Chen2021a).
Specifically, deep learning architectures like long short-term memory (LSTM) networks have been increasingly used for time series predictions due to their ability to model systems and variables that have memory, that is, where past conditions influence present behavior (e.g., snowpack depth; Lees et al., Reference Lees, Reece, Kratzert, Klotz, Gauch, De Bruijn, Kumar Sahu, Greve, Slater and Dadson2022). LSTMs have been shown to outperform both state-of-the-art process-based models and classical ML models (e.g., XGBoost, random forests, support vector machines) for applications like lake temperature (Daw et al., Reference Daw, Thomas, Carey, Read, Appling and Karpatne2020; Jia et al., Reference Jia, Willard, Karpatne, Read, Zwart, Steinbach and Kumar2021a; Read et al., Reference Read, Jia, Willard, Appling, Zwart, Oliver, Karpatne, Hansen, Hanson and Watkins2019), stream temperature (Feigl et al., Reference Feigl, Lebiedzinski, Herrnegger and Schulz2021; Weierbach et al., Reference Weierbach, Lima, Willard, Hendrix, Christianson, Lubich and Varadharajan2022), and groundwater dynamics (Jing et al., Reference Jing, He, Tian, Lancia, Cao, Crivellari, Guo and Zheng2022) prediction, among many others. Other deep learning architectures effective for time series modeling, but seen less often in hydrology, include the simpler gated recurrent unit (GRU) (Chung et al., Reference Chung, Gulcehre, Cho and Bengio2014), more recent innovations like the temporal convolution network (TCN) (Lea et al., Reference Lea, Flynn, Vidal, Reiter and Hager2017), and spatiotemporally aware process-guided deep learning models (Topp et al., Reference Topp, Barclay, Diaz, Sun, Jia, Lu, Sadler and Appling2023).
Recent advancements have also introduced transformer-based methods (Yin et al., Reference Yin, Guo, Zhang, Chen and Zhang2022), which are architecturally able to model long-term dependencies more effectively than LSTM (Wen et al., Reference Wen, Zhou, Zhang, Chen, Ma, Yan and Sun2022; Zeyer et al., Reference Zeyer, Bahar, Irie, Schlüter and Ney2019). Transformers have been recently shown to occasionally outperform other methods for streamflow prediction (Amanambu et al., Reference Amanambu, Mossa and Chen2022; Xu et al., Reference Xu, Lin, Hu, Wang, Wu, Zhang and Ran2023b; Yin et al., Reference Yin, Zhu, Zhang, Xing, Xia, Liu and Zhang2023). However, so far, these alternatives to LSTM have primarily focused on temporal predictions in well-monitored locations.
Understanding how to leverage state-of-the-art ML with existing observational data for prediction in unmonitored sites can lend insights into both model selection and training for transfer to new regions, as well as sampling design for new monitoring paradigms to optimally collect data for modeling and analysis. However, to date, ML-based approaches have not been sufficiently compared or benchmarked, making it challenging for researchers to determine which architecture to use for a given prediction task. In this paper, we provide a comprehensive and systematic review of ML-based techniques for time series predictions in unmonitored sites and demonstrate their use for different environmental applications. We also enumerate the gaps and opportunities that exist for advancing research in this promising direction. The scope of our study is limited to using ML for predictions in unmonitored scenarios as defined above. We do not cover the many statistical and ML-based efforts in recent years for regionalizing process-based hydrological models, a topic that is covered extensively in the recent review by Guo et al. (Reference Guo, Zhang, Zhang and Wang2020b). We also exclude remote sensing applications to estimate variables at previously unmonitored inland water bodies. This is a different class of problems, and there are significant challenges to increasing the scale and robustness of remote sensing applications, including atmospheric effects, measurement frequency, and insufficient resolution for smaller water bodies like rivers (Topp et al., Reference Topp, Pavelsky, Jensen, Simard and Ross2020), which are detailed in a number of reviews (Gholizadeh et al., Reference Gholizadeh, Melesse and Reddi2016; Giardino et al., Reference Giardino, Brando, Gege, Pinnel, Hochberg, Knaeps, Reusen, Doerffer, Bresciani and Braga2019; Odermatt et al., Reference Odermatt, Gitelson, Brando and Schaepman2012; Topp et al., Reference Topp, Pavelsky, Jensen, Simard and Ross2020).
We organize the paper as follows. Section 2 first describes different ML and knowledge-guided ML frameworks that have been applied for water resources time series predictions in unmonitored sites. Then, Section 3 summarizes and discusses overarching themes between methods, applications, regions, and datasets. Lastly, Section 3.1 analyzes the gaps in knowledge and lists open questions for future research.
2. Machine learning frameworks for predictions in unmonitored sites
In this section, we describe different ML methodologies that have been used for applications in water resources time series modeling for unmonitored sites. Generally, the process of developing these ML models first involves generating predictions for a set of entities (e.g., stream gauge sites, lakes) with monitoring data of the target variable (e.g., discharge, water quality). Then, the knowledge, data, or models developed on those systems are used to predict the target variable on entities with no monitoring data available. Importantly, for evaluation purposes, these models are often tested on pseudo-unmonitored sites, where data is withheld until the testing stage to mimic model building for real unmonitored sites.
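The pseudo-unmonitored evaluation described above amounts to holding out entire sites, rather than time periods, at the testing stage. A minimal sketch of such a site-wise split is shown below (the function name and parameters are illustrative, not taken from any study reviewed here):

```python
import random

def site_holdout_split(site_ids, test_fraction=0.2, seed=0):
    """Split sites (not time steps) into training and pseudo-unmonitored test sets.

    Holding out entire sites mimics prediction at truly unmonitored locations:
    no observations of the target variable from a test site are seen in training.
    """
    rng = random.Random(seed)
    sites = sorted(set(site_ids))
    rng.shuffle(sites)
    n_test = max(1, int(len(sites) * test_fraction))
    return set(sites[n_test:]), set(sites[:n_test])  # (train_sites, test_sites)
```

A model is then trained only on records whose site ID falls in the training set and evaluated on the held-out sites, mirroring the ungauged-basin setting.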
The most commonly used type of model for this approach is known as an entity-aware model (Ghosh et al., Reference Ghosh, Yang, Khandelwal, He, Renganathan, Sharma, Jia and Kumar2023; Kratzert et al., Reference Kratzert, Klotz, Shalev, Klambauer, Hochreiter and Nearing2019c, Reference Kratzert, Klotz, Shalev, Klambauer, Hochreiter and Nearing2019d), which attempts to incorporate inherent characteristics of different entities to improve prediction performance. These characteristics are also called attributes, traits, or properties across the literature. The concept is similar to trait-based modeling, which maps characteristics to function in ecology and other earth sciences (Zakharova et al., Reference Zakharova, Meyer and Seifan2019). The underlying assumption is that the input data used for prediction consist of both dynamic physical drivers (e.g., daily meteorology) and site-specific characteristics of each entity, such as their geomorphology, climatology, land cover, or land use. Varied ML methodologies have been developed that differ both in how these characteristics are used to improve performance and in how entities are selected and used for modeling. These approaches are described further below and include building a single model using all available entities or subgroups of entities deemed relevant to the target unmonitored sites (Section 2.1), transfer learning of models from well-monitored sites to target sites (Section 2.2), and a cross-cutting theme of integrating ML with domain knowledge and process-based models (Section 2.3).
2.1. Broad-scale models using all available entities or a subgroup of entities
Typically, process-based models have been applied and calibrated to specific locations, which is fundamentally different from the ML approach of building a single regionalized model on a large number of sites (henceforth referred to as a broad-scale model) that inherently differentiates between the dynamic behaviors and characteristics of different sites (Golian et al., Reference Golian, Murphy and Meresa2021; Guo et al., Reference Guo, Zhang, Zhang and Wang2021). The objective of broad-scale modeling is to learn and encode these differences such that differences in site characteristics translate into appropriately heterogeneous hydrologic behavior. Usually, the choice is made to include all possible sites or entities in building a single broad-scale model. However, using the entirety of available data is not always optimal. Researchers may also consider selecting only a subset of entities for training for a variety of reasons: (1) the entire dataset may be imbalanced such that performance diminishes on minority system types (Wilson et al., Reference Wilson, Close, Abraham, Sarris, Banasiak, Stenger and Hadfield2020), (2) some types of entities may be noisy, contain erroneous or outlier data, or have varying amounts of input data, or (3) to save on the computational expense of building a broad-scale model. Traditionally in geoscientific disciplines like hydrology, stratifying a large domain of entities into multiple homogeneous subgroups or regions that are “similar” is common practice. This is based on evidence in process-based modeling that grouping heterogeneous sites for regionalization can negatively affect performance when extrapolating to unmonitored sites (Hosking and Wallis, Reference Hosking and Wallis1997; Lettenmaier et al., Reference Lettenmaier, Wallis and Wood1987). Therefore, it remains an open question whether using all the available data is the optimal approach for building training datasets for predictions in unmonitored sites.
Copious research has investigated various homogeneity criteria to find the best way to group sites for regionalization with process-based modeling (Burn, Reference Burn1990a, Reference Burn1990b; Guo et al., Reference Guo, Zhang, Zhang and Wang2021), and many recent approaches also leverage ML for clustering sites (e.g., using k-means (Aytac, Reference Aytac2020; Tongal and Sivakumar, Reference Tongal and Sivakumar2017)) prior to parameter regionalization (Guo et al., Reference Guo, Zhang, Zhang and Wang2021; Sharghi et al., Reference Sharghi, Nourani, Soleimani and Sadikoglu2018).
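As a concrete illustration of this clustering step, a minimal k-means sketch over static site attributes follows (a generic implementation assuming numeric attributes, not the exact procedure of any cited work):

```python
import numpy as np

def kmeans_site_clusters(attributes, k=2, n_iter=50, seed=0):
    """Group sites into k clusters by their static attributes (e.g., drainage
    area, elevation, aridity index) prior to building per-cluster models."""
    rng = np.random.default_rng(seed)
    X = np.asarray(attributes, dtype=float)
    # Standardize so attributes with large units do not dominate the distances.
    X = (X - X.mean(axis=0)) / (X.std(axis=0) + 1e-9)
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Assign each site to its nearest cluster center.
        labels = np.argmin(((X[:, None, :] - centers[None, :, :]) ** 2).sum(-1), axis=1)
        for j in range(k):
            if np.any(labels == j):  # keep old center if a cluster empties
                centers[j] = X[labels == j].mean(axis=0)
    return labels
```

A separate broad-scale model would then be trained on the sites within each cluster, and an unmonitored site assigned to its nearest cluster by attributes alone.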
Many studies use subgroups of sites when building broad-scale models using ML. For example, Araza et al. (Reference Araza, Hein, Duku, Rawlins and Lomboy2020) demonstrate that a principal components analysis-based clustering of 21 watersheds in the Luzon region of the Philippines outperforms an entity-aware broad-scale model built on all sites together for daily streamflow prediction. Furthermore, Weierbach et al. (Reference Weierbach, Lima, Willard, Hendrix, Christianson, Lubich and Varadharajan2022) found that an ML model combining two regions of data in the United States for stream temperature prediction did not perform better than building models for each individual region. Chen et al. (Reference Chen, Zhu, Jiang and Sun2020) cluster weather stations by mean climatic characteristics when building LSTM and temporal convolution network models for predicting evapotranspiration in out-of-sample sites, finding that models performed better under similar climatic conditions. Additionally, for stream water level prediction in unmonitored sites, Corns et al. (Reference Corns, Long, Hale, Kanwar and Vanfossan2022) group sites based on the distance to upstream and downstream gauges to include proximity to a monitoring station as a criterion for input data selection. The water levels from the upstream and downstream gauges are also used as input variables. The peak flood prediction model described later in Section 2.1.1 divides the models and data across the 18 hydrological regions in the conterminous US as defined by the USGS (U.S. Geological Survey, 2016).
However, it remains to be seen how selecting a subgroup of entities as opposed to using all available data fares in different prediction applications, because much of this work does not compare the performances of both these cases. When viewed through the lens of modern data-driven modeling, evidence suggests deep learning methods in particular may benefit from pooling large amounts of heterogeneous training data. Fang et al. (Reference Fang, Kifer, Lawson, Feng and Shen2022) demonstrate this effect of “data synergy” on both streamflow and soil moisture modeling in gauged basins, showing that deep learning models perform better when fed a diverse training dataset spanning multiple regions as opposed to a homogeneous dataset from a single region, even when the homogeneous data are more relevant to the testing dataset and the training datasets are the same size. A recent opinion piece by Kratzert et al. (Reference Kratzert, Gauch, Klotz and Nearing2024) also makes an argument against building deep learning models, specifically LSTM models, on streamflow data from small homogeneous sets of watersheds, especially for predicting unmonitored areas and for extreme events. Moreover, in Willard (Reference Willard2023), regional LSTM models of stream temperature in the United States perform worse than the LSTM model built on all sites in the CONUS for 15 out of 17 regions, and single-site trained models transferred to the testing sites generally performed worse except when pre-trained on the global model.
Overall across broad-scale modeling efforts, studies differ in how the ML framework leverages the site characteristics. The following subsections describe different approaches to incorporating site characteristics into broad-scale models that use all available entities or a subgroup, covering direct concatenation of site characteristics and dynamic features, encoding of characteristics using ML, and the use of graph neural networks to encode dependencies between sites.
2.1.1. Direct concatenation broad-scale model
When aggregating data across many sites for an entity-aware broad-scale model, it is common to append site characteristics directly to the input forcing data before feeding it to the ML model. Shown visually in Figure 1, this is a simple approach that does not require a novel ML architecture and is therefore very accessible for researchers. Although some characteristics can change over time, many applications treat these characteristics as static values repeated at every timestep through this concatenation process, even though commonly used recurrent neural network-based approaches like LSTM are not built to incorporate static inputs (Li et al., Reference Li, Lyons, Klaus, Gage, Kollef and Lu2021a; Lin et al., Reference Lin, Zhang, Ivy, Capan, Arnold, Huddleston and Chi2018; Rahman et al., Reference Rahman, Yuan, Xie and Sha2020). In a landmark result for temporal streamflow predictions, Kratzert et al. (Reference Kratzert, Klotz, Shalev, Klambauer, Hochreiter and Nearing2019c, Reference Kratzert, Klotz, Shalev, Klambauer, Hochreiter and Nearing2019d) used an LSTM with directly concatenated site characteristics and dynamic inputs built on 531 geographically diverse catchments within the Catchment Attributes and Meteorology for Large-sample Studies (CAMELS) dataset, and were able to predict more accurately on unseen data at the same 531 test sites than state-of-the-art process-based models calibrated to each basin individually. Given the success of the model, that study was expanded to the scenario of predicting unmonitored stream sites (Kratzert et al., Reference Kratzert, Herrnegger, Klotz, Hochreiter and Klambauer2019a), where they found the accuracy of the broad-scale LSTM with concatenated features in ungauged basins was comparable to calibrated process-based models in gauged basins. Arsenault et al. (Reference Arsenault, Martel, Brunet, Brissette and Mai2023) and Jiang et al.
(Reference Jiang, Zheng and Solomatine2020) further show a similar broad-scale LSTM can outperform the state-of-the-art regionalization of process-based models for predictions in ungauged basins in the United States, and similar results are seen in Russian (Ayzel et al., Reference Ayzel, Kurochkina, Kazakov and Zhuravlev2020), Brazilian (Nogueira Filho et al., Reference Nogueira Filho, Souza Filho, Porto, Vieira Rocha, Sousa Estácio and Martins2022), and Korean (Choi et al., Reference Choi, Lee and Kim2022) watersheds. More recently, attention-based transformer models have been used in Yin et al. (Reference Yin, Zhu, Zhang, Xing, Xia, Liu and Zhang2023) for streamflow prediction on the CAMELS dataset showing improved performance over multiple kinds of LSTM models for both prediction in individual ungauged sites and entire ungauged regions. Broad-scale models have also been used for the prediction of other environmental variables like continental-scale snow pack dynamics (Wang et al., Reference Wang, Gupta, Zeng and Niu2022), monthly baseflow (Xie et al., Reference Xie, Liu, Tian, Wang, Bai and Liu2022), dissolved oxygen in streams (Zhi et al., Reference Zhi, Feng, Tsai, Sterle, Harpold, Shen and Li2021), and lake surface temperature (Willard et al., Reference Willard, Read, Topp, Hansen and Kumar2022b).
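Mechanically, the direct concatenation scheme in Figure 1 reduces to tiling the static characteristics across time steps and joining them with the dynamic forcings before the sequence enters the LSTM. A minimal sketch (array shapes and names are illustrative assumptions):

```python
import numpy as np

def concat_static_dynamic(dynamic, static):
    """Build the per-site input sequence for direct concatenation.

    dynamic: (T, D) array of daily forcings (e.g., precipitation, temperature).
    static:  (S,)   array of site characteristics (e.g., area, slope, land use).
    Returns a (T, D + S) array where the characteristics repeat at every step.
    """
    T = dynamic.shape[0]
    tiled = np.repeat(static[None, :], T, axis=0)  # (T, S), constant over time
    return np.concatenate([dynamic, tiled], axis=1)
```

The resulting array is what a recurrent model consumes; for an unmonitored site, only its (known) characteristics and meteorological forcings are required to form this input.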
The previously mentioned approaches in most cases focus on predicting mean daily values, but accurate predictions of extremes (e.g., very high flow events or droughts) remain an outstanding and challenging problem in complex spatiotemporal systems (J. Jiang et al., Reference Jiang, Huang, Grebogi and Lai2022). This is a longstanding fundamental challenge in catchment hydrology (Salinas et al., Reference Salinas, Laaha, Rogger, Parajka, Viglione, Sivapalan and Blöschl2013), where typically the approach has been to subdivide the study area into fixed, contiguous regions that are used to regionalize predictions for floods or low flows from process-based models for all catchments in a given area. At least for process-based models, this has been shown to be more successful than global regionalizations (Salinas et al., Reference Salinas, Laaha, Rogger, Parajka, Viglione, Sivapalan and Blöschl2013). As recent ML and statistical methods are shown to outperform process-based models for the prediction of extremes (Frame et al., Reference Frame, Kratzert, Klotz, Gauch, Shelev, Gilon, Qualls, Gupta and Nearing2022; Viglione et al., Reference Viglione, Parajka, Rogger, Salinas, Laaha, Sivapalan and Blöschl2013), opportunities exist to apply broad-scale entity-aware methods in the same way as daily averaged predictions. Challenges facing ML models for extreme prediction include replacing common loss functions like mean squared error which tend to prioritize average behavior and may not adequately capture rare and extreme events (Mudigonda et al., Reference Mudigonda, Ram, Kashinath, Racah, Mahesh, Liu, Beckham, Biard, Kurth and Kim2021), and dealing with the common scenario of extreme data being sparse (Zhang et al., Reference Zhang, Alexander, Hegerl, Jones, Tank, Peterson, Trewin and Zwiers2011). Initial studies using broad-scale models with concatenated inputs for peak flood prediction show that these methods can also be used to predict extremes. For instance, Rasheed et al. 
(Reference Rasheed, Aravamudan, Sefidmazgi, Anagnostopoulos and Nikolopoulos2022) built a peak flow prediction model that combines a “detector” LSTM, which determines if the meteorological conditions pose a flood risk, with an entity-aware ML model for peak flow prediction that is applied if there is a risk. They show that building a model only on peak flows and combining it with a detector model improves performance over the broad-scale LSTM model trained to predict mean daily flows (e.g., Kratzert et al. (Reference Kratzert, Klotz, Herrnegger, Sampson, Hochreiter and Nearing2019b)). Though initial studies like this show promise, further research is required to compare techniques that deal with imbalanced data (extreme events are often rare outliers), different loss functions and evaluation metrics for extremes, and different ML architectures.
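One simple way to address the loss-function issue noted above is to up-weight errors above a high-flow quantile. The sketch below is an illustrative peak-weighted squared error, not the loss used by Rasheed et al. or any other cited study; the quantile and weight values are assumptions:

```python
import numpy as np

def peak_weighted_mse(y_true, y_pred, quantile=0.9, weight=5.0):
    """Mean squared error that penalizes errors on high flows more heavily.

    Observations at or above the given flow quantile receive a larger weight,
    counteracting the tendency of plain MSE to prioritize average behavior.
    """
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    threshold = np.quantile(y_true, quantile)
    w = np.where(y_true >= threshold, weight, 1.0)
    return float(np.mean(w * (y_true - y_pred) ** 2))
```

Under this loss, the same absolute error costs more when it occurs on a flood peak than on baseflow, nudging training toward capturing extremes.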
Based on these results, it appears as though site characteristics can contain sufficient information to differentiate between site-specific dynamic behaviors for a variety of prediction tasks. This challenges a longstanding hydrological perspective that transferring models and knowledge from one basin to another requires that they be functionally similar (Fang et al., Reference Fang, Kifer, Lawson, Feng and Shen2022; Guo et al., Reference Guo, Zhang, Zhang and Wang2021; Razavi and Coulibaly, Reference Razavi and Coulibaly2013), since these broad-scale models are built on a large number of heterogeneous sites. A recent study by Li et al. (Reference Li, Khandelwal, Jia, Cutler, Ghosh, Renganathan, Xu, Tayal, Nieber and Duffy2022) also uses random values as a substitute for site characteristics in a direct concatenation broad-scale LSTM to improve performance and promote entity-awareness in the case of missing or uncertain characteristics.
2.1.2. Concatenation of encoded site characteristics for broad-scale models
Though recurrent neural network models like the LSTM have been used with direct concatenation of static and dynamic features, other methods have been developed that encode watershed characteristics as static features to improve accuracy or increase efficiency. As shown in Figure 2, one approach is to use two separate neural networks, where the first learns a representation of the “static” characteristics using an encoding neural network (e.g., an autoencoder), and the second takes that encoded representation at each time-step along with dynamic time-series inputs to predict the target using a time series ML framework (e.g., LSTM). This has been shown to be effective mostly in healthcare data applications (Esteban et al., Reference Esteban, Staeck, Baier, Yang and Tresp2016; Li et al., Reference Li, Lyons, Klaus, Gage, Kollef and Lu2021a; Lin et al., Reference Lin, Zhang, Ivy, Capan, Arnold, Huddleston and Chi2018), but also in lake temperature prediction in Tayal et al. (Reference Tayal, Jia, Ghosh, Willard, Read and Kumar2022). The idea is to extract the information from characteristics that accounts for data heterogeneity across multiple entities. This extraction process is independent of the LSTM or similar time series model handling the dynamic input and therefore can be flexible in how the two components are connected. Examples to improve efficiency include: (1) static information may not be needed at every time step and can be applied only at the time step of interest (Lin et al., Reference Lin, Zhang, Ivy, Capan, Arnold, Huddleston and Chi2018), or (2) the encoding network can be used to reduce the dimension of static features prior to connecting with the ML framework doing the dynamic prediction (Kao et al., Reference Kao, Liou, Lee and Chang2021).
In terms of performance, works from multiple disciplines have found these types of approaches improve accuracy over the previously described direct concatenation approach (Lin et al., Reference Lin, Zhang, Ivy, Capan, Arnold, Huddleston and Chi2018; Rahman et al., Reference Rahman, Yuan, Xie and Sha2020; Tayal et al., Reference Tayal, Jia, Ghosh, Willard, Read and Kumar2022).
In water resources applications, Tayal et al. (Reference Tayal, Jia, Ghosh, Willard, Read and Kumar2022) demonstrate this in lake temperature prediction using an invertible neural network in the encoding step, showing slight improvement over the static and dynamic concatenation approach. Invertible neural networks have the ability to model forward and backward processes within a single network in order to solve inverse problems. For example, their model uses lake characteristics and meteorological data to predict lake temperature, but can also attempt to derive lake characteristics from lake temperature data. It has also been shown in streamflow prediction that this type of encoder network can be used either on the site characteristics (S. Jiang et al., Reference Jiang, Zheng and Solomatine2020) or on partially available soft data like soil moisture or flow duration curves (Feng et al., Reference Feng, Lawson and Shen2021). In Jiang et al. (Reference Jiang, Zheng and Solomatine2020), a feed-forward neural network processes static catchment-specific attributes separately from dynamic meteorological data prior to prediction with a physics-informed neural network model. However, it is not directly compared with a model using the static features without any processing in a separate neural network, so the added benefit is unclear. Feng et al. (Reference Feng, Lawson and Shen2021) further use an encoder network to encode soil moisture data, when available, prior to predicting streamflow with an LSTM model, but show limited benefit over not including the soil moisture data.
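The encoder-then-concatenate pattern of Figure 2 can be sketched as a small feed-forward encoder whose low-dimensional embedding is repeated across time steps before being joined with the dynamic inputs. Layer sizes, tanh activations, and the function names below are illustrative assumptions, not a cited architecture:

```python
import numpy as np

def encode_static(static, W1, b1, W2, b2):
    """Two-layer feed-forward encoder compressing raw site characteristics
    into a low-dimensional embedding (playing the role of the encoding
    network's bottleneck)."""
    h = np.tanh(static @ W1 + b1)
    return np.tanh(h @ W2 + b2)

def build_encoded_input(dynamic, static, params):
    """Concatenate the static embedding onto the dynamic inputs at each step."""
    W1, b1, W2, b2 = params
    z = encode_static(static, W1, b1, W2, b2)  # (E,) embedding
    T = dynamic.shape[0]
    return np.concatenate([dynamic, np.repeat(z[None, :], T, axis=0)], axis=1)
```

In practice the encoder weights are trained jointly with (or separately from) the downstream time series model; the dimensionality reduction is what distinguishes this from direct concatenation of the raw characteristics.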
2.1.3. Broad-scale graph neural networks
The majority of works in this study treat entities as systems that exist independently from each other (e.g., different lakes and different stream networks). However, many environmental and geospatial modeling applications exhibit strong dependencies and coherence between systems (Reichstein et al., Reference Reichstein, Camps-Valls, Stevens, Jung, Denzler and Carvalhais2019). These dependencies can be real physical interactions, or a coherence in dynamics arising from shared similarities even when the entities do not interact. For example, water temperature in streams is affected by a combination of natural and human-influenced processes including meteorology, interactions between connected segments within stream networks, and water management practices such as timed reservoir releases. Similar watersheds, basins, or lakes may also exhibit dependencies and coherence based on characteristics or climatic factors (George et al., Reference George, Talling and Rigg2000; Huntington et al., Reference Huntington, Hodgkins and Dudley2003; Kingston et al., Reference Kingston, McGregor, Hannah and Lawler2006; Magnuson et al., Reference Magnuson, Benson and Kratz1990). Popular methods like the previously described broad-scale models using direct concatenation of inputs (Section 2.1.1) offer no intuitive way to encode interdependencies between entities (e.g., in a connected stream network) and often ignore these effects. Researchers are beginning to explore different ways to encode these dependencies explicitly by using graph neural networks (GNNs) for broad-scale modeling of many entities. GNNs allow the modeling of complex relationships and interdependencies between entities, something traditional feed-forward or recurrent neural networks cannot do (Wu et al., Reference Wu, Pan, Chen, Long, Zhang and Philip2020). 
GNNs have seen a surge in popularity in recent years for many scientific applications and several extensive surveys of GNNs are available in the literature (Battaglia et al., Reference Battaglia, Hamrick, Bapst, Sanchez-Gonzalez, Zambaldi, Malinowski, Tacchetti, Raposo, Santoro and Faulkner2018; Bronstein et al., Reference Bronstein, Bruna, LeCun, Szlam and Vandergheynst2017; Wu et al., Reference Wu, Pan, Chen, Long, Zhang and Philip2020; Zhou et al., Reference Zhou, Cui, Hu, Zhang, Yang, Liu, Wang, Li and Sun2020). Hydrological processes naturally have both spatial and temporal components, and GNNs attempt to exploit the spatial connections, causative relations, or dependencies between similar entities analogous to the way that the LSTM architecture exploits temporal patterns and dependencies. Recent work has attempted to encode stream network structure within GNNs to capture spatial and hydrological dependencies for applications like drainage pattern recognition (Yu et al., Reference Yu, Ai, Yang, Huang and Yuan2022), groundwater level prediction (Bai and Tahmasebi, Reference Bai and Tahmasebi2022), rainfall-runoff or streamflow prediction (Feng et al., Reference Feng, Sha, Ding, Yan and Yu2022c; Kazadi et al., Reference Kazadi, Doss-Gollin, Sebastian and Silva2022; Kratzert et al., Reference Kratzert, Klotz, Gauch, Klingler, Nearing and Hochreiter2021; Sit et al., Reference Sit, Demiray and Demir2021; Sun et al., Reference Sun, Jiang, Mudunuru and Chen2021a; Zhao et al., Reference Zhao, Zhu, Shu, Wan, Yu, Zhou and Liu2020), lake temperature prediction (Stalder et al., Reference Stalder, Ozdemir, Safin, Sukys, Bouffard and Perez-Cruz2021), and stream temperature prediction (Bao et al., Reference Bao, Jia, Zwart, Sadler, Appling, Oliver and Johnson2021; Chen et al., Reference Chen, Appling, Oliver, Corson-Dosch, Read, Sadler, Zwart and Jia2021a, Reference Chen, Zwart and Jia2022).
In hydrology, there are three intuitive methods for constructing the graph itself. The first is geared towards non-interacting entities, building the graph from pairwise similarity between entities, whether between site characteristics (Sun et al., Reference Sun, Jiang, Mudunuru and Chen2021a), spatial locations (e.g., latitude/longitude) (Sun et al., Reference Sun, Yao, Bi, Huang, Zhao and Qiao2021b; Zhang et al., Reference Zhang, Li, Frery and Ren2021), or both (Xiang and Demir, Reference Xiang and Demir2021). The second is geared more toward physically interacting entities, for example, the upstream and downstream connections between different stream segments in a river network (Jia et al., Reference Jia, Zwart, Sadler, Appling, Oliver, Markstrom, Willard, Xu, Steinbach, Read and Kumar2021b) or connections between reservoirs with timed water releases to downstream segments (Chen et al., Reference Chen, Appling, Oliver, Corson-Dosch, Read, Sadler, Zwart and Jia2021a). The third starts with an a priori connectivity matrix like the previous type, but lets the GNN learn an adaptive connectivity matrix during training based on the sites’ dynamic inputs, attributes, or location (Sun et al., Reference Sun, Jiang, Yang, Xie and Chen2022). Relying solely on characteristics or location for graph construction in the non-interacting case more easily allows for broad-scale modeling because spatially disconnected entities can be modeled; however, it introduces no new information (e.g., physical connectivity) beyond what the previously described direct concatenation-based methods use, since the static characteristics would be the same. Performance could still improve, and interpretations of encodings within a graph framework could yield new scientific discoveries, since pairwise encodings between entities can be directly extracted. 
Graphs built using real physical connections between entities (e.g., stream segments in a stream network), on the other hand, allow the model to learn how information is routed through the graph and how different entities physically interact. So far, this has only been seen in stream modeling using stream network graphs (Bindas et al., Reference Bindas, Shen and Bian2020; Jia et al., Reference Jia, Zwart, Sadler, Appling, Oliver, Markstrom, Willard, Xu, Steinbach, Read and Kumar2021b; Kratzert et al., Reference Kratzert, Klotz, Gauch, Klingler, Nearing and Hochreiter2021; Topp et al., Reference Topp, Barclay, Diaz, Sun, Jia, Lu, Sadler and Appling2023). The third type is useful when combining the physical connectivity between sites with similarity in inputs, and also in cases where the inputs are at a different scale than the target variable, for example, when meteorological variables are at kilometer scale and streamflow is at point scale.
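The contrast between similarity-based and physically connected graph construction can be sketched in numpy. This is a hedged illustration: the Gaussian kernel, the inverse-reach-length weighting, and the helper names are assumptions for this sketch, and published studies differ in their exact kernel and normalization choices.

```python
import numpy as np

def similarity_adjacency(chars, sigma=1.0):
    """Adjacency for non-interacting entities: Gaussian kernel on the
    pairwise Euclidean distance between site characteristic vectors."""
    d2 = ((chars[:, None, :] - chars[None, :, :]) ** 2).sum(-1)
    A = np.exp(-d2 / (2 * sigma ** 2))
    np.fill_diagonal(A, 0.0)  # no self-loops
    return A

def stream_network_adjacency(n, edges):
    """Adjacency for physically interacting entities: directed
    upstream -> downstream links, weighted by inverse reach length
    (an illustrative weighting), then row-normalized."""
    A = np.zeros((n, n))
    for upstream, downstream, length in edges:
        A[downstream, upstream] = 1.0 / length
    row_sums = A.sum(axis=1, keepdims=True)
    return np.divide(A, row_sums, out=np.zeros_like(A), where=row_sums > 0)

chars = np.array([[1.0, 0.2], [1.1, 0.1], [5.0, 3.0]])  # toy site attributes
A_sim = similarity_adjacency(chars)
# Segments 0 and 1 both flow into segment 2, with reach lengths 2 and 4.
A_net = stream_network_adjacency(3, [(0, 2, 2.0), (1, 2, 4.0)])
print(A_net[2])
```

In the similarity graph, the two sites with near-identical characteristics end up strongly linked regardless of physical connectivity, whereas the network graph encodes only the actual upstream relationships.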
There are two different classes of GNN models, transductive and inductive, which differ in how the graph is incorporated into the learning process. Depending on how the graphs are constructed, one is more natural than the other; a conceptual depiction of both is shown in Figure 3. The key aspect of transductive GNNs is that both training and testing entities must be present in the graph during training. A prerequisite for this approach is that the test data (e.g., input features at unmonitored sites) is available during model training, and a key consequence is that the model must be completely re-trained upon the introduction of new test data. Even if the training data is unchanged prior to re-training, introducing new test nodes into the graph can affect how information is diffused to each training node during optimization (Ciano et al., Reference Ciano, Rossi, Bianchini and Scarselli2021). This type of approach is generally preferred for river network modeling, given the often unchanging spatial topology of the sub-basin structure, which is known a priori (Jia et al., Reference Jia, Zwart, Sadler, Appling, Oliver, Markstrom, Willard, Xu, Steinbach, Read and Kumar2021b; Moshe et al., Reference Moshe, Metzger, Elidan, Kratzert, Nevo and El-Yaniv2020; Sit et al., Reference Sit, Demiray and Demir2021). Graph connections from the test nodes to the training nodes in a transductive setting can be used in the training phase, the prediction phase, or both (Rossi et al., Reference Rossi, Tiezzi, Dimitri, Bianchini, Maggini and Scarselli2018). Inductive GNNs, on the other hand, are built using only training entities and allow new entity nodes to be integrated during testing. For applications that continuously need to predict on new test data, inductive approaches are preferable. 
New entity nodes can be incorporated because inductive frameworks also learn an information aggregator that transfers the necessary information from similar or nearby nodes to predict at nodes unseen during training. However, this also means that connections involving test nodes appear only at test time and are unseen during model training, as opposed to transductive approaches, where they are included. As shown in Figure 3, inductive graph learning can be done either on nodes that connect with training-set nodes in the graph or on nodes that are disconnected from them. Inductive GNNs can be understood as being in the same class as more standard supervised ML models like LSTMs or feed-forward neural networks, in that they can continuously predict on new test data without re-training.
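The learned-aggregator idea can be sketched with a GraphSAGE-style mean aggregator, an illustrative choice rather than the aggregator used in any cited work: an unmonitored node's representation is built from its own features plus the mean of its neighbors' learned representations, so predicting at a new node requires no re-training.

```python
import numpy as np

def inductive_embed(h_neighbors, x_new, W_self, W_agg):
    """GraphSAGE-style mean aggregation: combine a new node's own
    features with the averaged representations of its (gauged)
    neighbors. The weights W_self and W_agg were learned on training
    nodes only; applying them to an unseen node is what makes the
    scheme inductive."""
    if len(h_neighbors):
        agg = h_neighbors.mean(axis=0)
    else:  # disconnected test node: fall back to its own features
        agg = np.zeros(W_agg.shape[0])
    return np.tanh(x_new @ W_self + agg @ W_agg)

rng = np.random.default_rng(1)
W_self, W_agg = rng.normal(size=(4, 6)), rng.normal(size=(6, 6))
h_neighbors = rng.normal(size=(3, 6))  # representations of nearby gauged sites
x_new = rng.normal(size=4)             # features of an unmonitored site
h_new = inductive_embed(h_neighbors, x_new, W_self, W_agg)
print(h_new.shape)
```

The disconnected-node branch mirrors the Figure 3 distinction between test nodes that do and do not connect to the training graph.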
A few studies use GNNs for prediction in unmonitored sites for water resources applications. Sun et al. (Reference Sun, Jiang, Mudunuru and Chen2021a) use different types of spatiotemporal GNNs, including three transductive GNN methods, two variants of the ChebNet-LSTM (Yan et al., Reference Yan, Wang, Yu, Jin and Zhang2021) and a Graph Convolutional Network LSTM (GCN-LSTM) (Seo et al., Reference Seo, Defferrard, Vandergheynst and Bresson2018), compared with GraphWaveNet (Wu et al., Reference Wu, Pan, Long, Jiang and Zhang2019), a GNN that can be used as either transductive or inductive. In all cases, the graph is initially constructed as an adjacency matrix containing the pairwise Euclidean distances between stream sites computed from site characteristics. Importantly, all four models simplify to the direct concatenation-based models described in Section 2.1 if the graph convolution-based components are removed (see Figure S2 in Sun et al. (Reference Sun, Jiang, Mudunuru and Chen2021a) for a visualization). For ChebNet-LSTM and GCN-LSTM, removing these components would effectively simplify the architecture to a traditional LSTM, and for GraphWaveNet, to a gated temporal convolution network (TCN). They found that in the transductive case, both ChebNet-LSTM variants and the GCN-LSTM performed worse in terms of median performance across basins than the standard LSTM, and GraphWaveNet was the only one that performed better. GraphWaveNet, the only GNN also capable of inductive learning, likewise outperformed the standard LSTM in the inductive case. Jia et al. (Reference Jia, Zwart, Sadler, Appling, Oliver, Markstrom, Willard, Xu, Steinbach, Read and Kumar2021b) take a different spatiotemporal GNN approach for stream temperature temporal predictions, constructing their graph from stream reach lengths with upstream and downstream connections to form a weighted adjacency matrix. 
They found their GNN pre-trained on simulation data from the PRMS-SNTemp process-based model (Markstrom, Reference Markstrom2012) outperformed both a non-pre-trained GNN and a baseline LSTM model. Based on these results, we see that encoding dependencies within GNNs, whether based on site characteristics or on physical interactions and stream connections, can improve performance over existing deep learning models like the feed-forward artificial neural network (ANN) or LSTM.
Some studies have explored different ways of constructing the adjacency matrix based on the application and available data. An example of a domain-informed method for graph construction can be seen in Bao et al. (Reference Bao, Jia, Zwart, Sadler, Appling, Oliver and Johnson2021) for stream temperature predictions in unmonitored sites, where they leverage partial differential equations of underlying heat transfer processes to estimate the graph structure dynamically. This graph structure is combined with temporal recurrent layers to improve prediction performance beyond existing process-based and ML approaches. Dynamic temporal graph structures like this are common in other disciplines like social media analysis and recommender systems but have not been widely used in geosciences (Longa et al., Reference Longa, Lachi, Santin, Bianchini, Lepri, Lio, Scarselli and Passerini2023).
2.2. Transfer learning
Transfer learning is a powerful technique for applying knowledge learned in one problem domain to another, typically to compensate for missing, non-existent, or unrepresentative data in the new problem domain. The idea is to transfer knowledge from an auxiliary task, that is, the source system, where adequate data is available, to a new but related task, that is, the target system, often where data is scarce or absent (Pan and Yang, Reference Pan and Yang2009; Weiss et al., Reference Weiss, Khoshgoftaar and Wang2016). Situations where transfer learning may be more desirable than broad-scale modeling include when (1) a set of highly tuned and reliable source models (ML, process-based, or hybrid) is already available, (2) local source models are computationally more feasible or more accurate than broad-scale models when applied to unmonitored systems, or (3) a broad-scale model needs to be transferred and fine-tuned to a given region or system type more similar to an unmonitored system. In the context of geoscientific modeling, transfer learning for ML is analogous to calibrating process-based models in well-monitored systems and transferring the calibrated parameters to models of unmonitored systems, which has shown success in hydrological applications (Kumar et al., Reference Kumar, Samaniego and Attinger2013; Roth et al., Reference Roth, Nigussie and Lemann2016). Deep learning is particularly amenable to transfer learning because it can make use of massive datasets from related problems, alleviating the data paucity issues common in applying data-hungry deep neural networks to environmental applications (Naeini and Uwaifo, Reference Naeini and Uwaifo2019; Shen, Reference Shen2018). 
Transfer learning using deep learning has shown recent success in water applications such as flood prediction (Kimura et al., Reference Kimura, Yoshinaga, Sekijima, Azechi and Baba2019; Zhao et al., Reference Zhao, Pang, Xu, Cui, Wang, Zuo and Peng2021), soil moisture (Li et al., Reference Li, Wang, Shangguan, Li, Yao and Yu2021b), and lake and estuary water quality (Tian et al., Reference Tian, Liao and Wang2019; Willard et al., Reference Willard, Read, Appling and Oliver2021a).
Transfer learning can also be a capable tool for predictions in unmonitored sites (Tabas and Samadi, Reference Tabas and Samadi2021), although most applications assume that some data is available in the target system for fine-tuning a model, which is often referred to as few-shot learning with sparse data (Weiss et al., Reference Weiss, Khoshgoftaar and Wang2016; Zhuang et al., Reference Zhuang, Qi, Duan, Xi, Zhu, Zhu, Xiong and He2020). The specific case of transferring to a system or task without any training data is known as “zero-shot learning” (Romera-Paredes and Torr, Reference Romera-Paredes and Torr2015), where only the inputs or a high-level description may be available for a testing domain that contains no target variable values. This is a significantly more challenging problem because taking a pre-trained model from a data-rich source system and fine-tuning it on the target system is not possible; instead, other contextual data about the source and target systems must be used. In the case of unmonitored prediction, often only the dynamic forcing data and the characteristics of the target system (site) are available. The following subsections cover different ways researchers have addressed the zero-shot transfer learning problem for water resources prediction.
2.2.1. Choosing which model to transfer
A central challenge in zero-shot transfer learning is determining which model to transfer from a related known task, or how to build a transferable model. Previous work on streamflow prediction has based this purely on expert knowledge. For example, Singh et al. (Reference Singh, Mishra, Pingale, Khare and Thakur2022) operate under the assumption that the model must be trained on other basins in the same climatic zone, and that at least some of the source basin’s geographical area must have meteorological conditions similar to the target basin. Other work has transferred models from data-rich regions to data-poor regions without any analysis of the similarity between the source and target regions. For example, Le et al. (Reference Le, Kim, Adam, Do, Beling and Lakshmi2022) transfer ML streamflow models built on North America (987 catchments), South America (813 catchments), and Western Europe (457 catchments) to data-poor South African and Central Asian regions. They transfer these models as-is, without using any of the sparse data in the data-poor regions or the similarity between regions, and find that local models trained on minimal data outperform the models from data-rich regions. Attempts have also been made to use simple expert-created distance-based metrics (e.g., Burn and Boorman (Reference Burn and Boorman1993)) computed from the site characteristic values (Vaheddoost et al., Reference Vaheddoost, Safari and Yilmaz2023). However, it is reasonable to expect that model selection could instead be informed in a data-driven way by both the entity’s characteristics and past modeling experience.
The idea of building or selecting a model by leveraging preexisting models is a type of meta-learning (Brazdil, Reference Brazdil2009; Lemke et al., Reference Lemke, Budka and Gabrys2015). More broadly, meta-learning is the concept of algorithms learning from other algorithms, often for selecting a model or for learning how best to combine predictions from different models in ensemble learning. One meta-learning strategy for model selection is to build a metamodel that learns from both the model parameters of known tasks (with ground truth observations) and the correlation of known tasks to zero-shot tasks (Pal and Balasubramanian, Reference Pal and Balasubramanian2019). For example, in lake temperature modeling, Willard et al. (Reference Willard, Read, Appling, Oliver, Jia and Kumar2021b) use meta-learning in a model selection framework where a metamodel learns to predict the error of transferring a model built on a data-rich source lake to an unmonitored target lake. A diagram of the approach is shown in Figure 4. A variety of contextual data is used to make this prediction, including (1) characteristics of the lake (e.g., maximum depth, surface area, clarity), (2) meteorological statistics (e.g., averages and standard deviations of air temperature, wind speed, and humidity), (3) simulation statistics from an uncalibrated process-based model applied to both the source and target (e.g., differences in simulated lake stratification frequency), and (4) general observation statistics (e.g., the number of training data points available at the source, the average lake depth of measured temperatures). 
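The selection step can be sketched as follows: a metamodel is fit to pairs of (source-to-target context features, observed transfer error) and then ranks candidate source models for a new unmonitored target. For illustration only, an ordinary least-squares regressor stands in for the metamodel, and both the meta-features and the transfer errors are synthetic; the cited work uses richer contextual features and a more flexible learner.

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical meta-training set: each row holds context features for
# one source->target transfer (characteristic differences, met
# statistics, etc.); y_rmse is the error observed when that transfer
# was actually performed on a gauged target.
n_pairs, n_feats = 200, 6
X_meta = rng.normal(size=(n_pairs, n_feats))
true_w = rng.normal(size=n_feats)
y_rmse = X_meta @ true_w + 0.1 * rng.normal(size=n_pairs)  # synthetic errors

# Metamodel: least-squares regression with an intercept term.
design = np.column_stack([X_meta, np.ones(n_pairs)])
w, *_ = np.linalg.lstsq(design, y_rmse, rcond=None)

def select_source(candidate_feats):
    """Predict the transfer error for each candidate source model and
    pick the one expected to perform best on the unmonitored target."""
    preds = candidate_feats @ w[:-1] + w[-1]
    return int(np.argmin(preds)), preds

candidates = rng.normal(size=(10, n_feats))  # 10 candidate source systems
best, preds = select_source(candidates)
print(best)
```

The same ranking can instead feed an ensemble, weighting several low-predicted-error sources rather than choosing a single one.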
They show significantly improved performance predicting temperatures in 305 target lakes treated as unmonitored in the Upper Midwestern United States relative to the uncalibrated process-based General Lake Model (Hipsey et al., Reference Hipsey, Bruce, Boon, Busch, Carey, Hamilton, Hanson, Read, de Sousa, Weber and Winslow2019), the previous state of the art for broad-scale lake thermodynamic modeling. This was expanded to a streamflow application by Ghosh et al. (Reference Ghosh, Li, Tayal, Kumar and Jia2022) with numerous methodological adaptations. First, instead of using the site characteristics as-is, they use a sequence autoencoder to create embeddings for all stream locations by combining input time series data with simulated data generated by a process-based model. This adaptation alleviated a known issue in the dataset, where the site characteristics were commonly incomplete or inaccurate. They also add a clustering loss function term to the sequence autoencoder to guide the model transfer, where source systems are selected from the available source systems within a given cluster of sites, as opposed to building an ensemble from a set number of source sites. The clustering loss term allows the model to learn a latent space that correctly clusters river streams that transfer accurately to one another. They show on streams within the Delaware River Basin that this outperforms the simpler meta-transfer learning framework based on Willard et al. (Reference Willard, Read, Appling, Oliver, Jia and Kumar2021b). Willard (Reference Willard2023) expands on Willard et al. (Reference Willard, Read, Appling, Oliver, Jia and Kumar2021b) by building a meta-transfer learning framework that pre-trains each source model on CONUS-scale data, aiming to combine the benefits of broad-scale modeling and site-specific transfer learning for stream temperature prediction. 
They find a small performance improvement over the existing direct concatenation approach that builds a single model on all stream entities in the CONUS.
2.2.2. Fine-tuning models with sparse data
A common hydrologic prediction scenario is one in which broad-scale data and models are available but a target site has inadequate or sparse data, especially in remote, inaccessible, or under-monitored regions. Given a model pre-trained on broad-scale data or on simulated process-based model outputs, fine-tuning by adjusting parameters in a second training stage has the potential to improve the accuracy and relevance of the model for specific local conditions. Pre-training on process-based model outputs and fine-tuning on minimal sparse data has been shown to be effective in lake temperature (Jia et al., Reference Jia, Willard, Karpatne, Read, Zwart, Steinbach and Kumar2021a; Willard et al., Reference Willard, Read, Appling, Oliver, Jia and Kumar2021b) and stream temperature (Jia et al., Reference Jia, Willard, Karpatne, Read, Zwart, Steinbach and Kumar2021a) prediction using as little as 0.1% of the available data, simulating a common scenario in which only a few measurements of the target variable exist, and yields a substantial increase in performance over an uncalibrated process-based model. Furthermore, in soil moisture prediction, Li et al. (Reference Li, Wang, Shangguan, Li, Yao and Yu2021b) show that pre-training on the large-scale process-based ERA5-Land reanalysis dataset (Muñoz-Sabater et al., Reference Muñoz-Sabater, Dutra, Agustí-Panareda, Albergel, Arduini, Balsamo, Boussetta, Choulga, Harrigan and Hersbach2021) and fine-tuning on the smaller SMAP dataset (O’Neill et al., Reference O’Neill, Entekhabi, Njoku and Kellogg2010) increases the explained variance by over 20% compared to the non-fine-tuned version.
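The two-stage recipe can be illustrated with a toy linear model standing in for the neural networks of the cited studies. All data here are synthetic and the setup is an assumption for this sketch: the "simulation" data carry a deliberate systematic bias (as process-model output often does), and a handful of "observations" are used to fine-tune.

```python
import numpy as np

rng = np.random.default_rng(3)

def train(X, y, w, lr=0.01, steps=500):
    """Plain gradient descent on a linear model (a stand-in for the
    neural networks used in the cited studies)."""
    for _ in range(steps):
        w = w - lr * 2 * X.T @ (X @ w - y) / len(y)
    return w

n_feats = 4
true_w = np.array([1.0, -2.0, 0.5, 3.0])  # "reality" generating the obs

# Stage 1: pre-train on abundant simulated output. The simulation is
# systematically biased relative to reality.
X_sim = rng.normal(size=(1000, n_feats))
y_sim = X_sim @ (true_w + 0.5)
w_pre = train(X_sim, y_sim, np.zeros(n_feats))

# Stage 2: fine-tune on a handful of real observations (sparse data).
X_obs = rng.normal(size=(10, n_feats))
y_obs = X_obs @ true_w
w_fine = train(X_obs, y_obs, w_pre, lr=0.05)

err_pre = np.abs(w_pre - true_w).mean()
err_post = np.abs(w_fine - true_w).mean()
print(err_post < err_pre)
```

Pre-training supplies a physically plausible starting point, and the few observations correct the systematic bias, which is the mechanism these studies exploit.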
Another fine-tuning strategy in geoscientific modeling is to localize a larger-scale or more data-rich regional or global model to a specific location or subregion. This variant of transfer learning has seen success in deep learning models for applications like soil spectroscopy (Padarian et al., Reference Padarian, Minasny and McBratney2019; Shen et al., Reference Shen, Ramirez-Lopez, Behrens, Cui, Zhang, Walden, Wetterlind, Shi, Sudduth and Baumann2022) and snow cover prediction (Guo et al., Reference Guo, Chen, Liu and Zhao2020a; Wang et al., Reference Wang, Yuan, Shen, Liu, Li, Yue, Shi and Zhang2020a). However, these strategies have seen mixed success in hydrological applications. Wang et al. (Reference Wang, Gupta, Zeng and Niu2022) show that localizing an LSTM predicting continental-scale snowpack dynamics to individual regions across the United States had insignificant benefits over the continental-scale LSTM. Xiong et al. (Reference Xiong, Zheng, Chen, Tian, Liu, Han, Jiang, Lu and Zheng2022) show a similar result for the prediction of stream nitrogen export, where individual models for 7 distinct regions across the conterminous United States, transferred to each other, did not outperform the continental-scale model using all the data. Also, Lotsberg (Reference Lotsberg2021) showed that streamflow models trained on CAMELS-US (United States) transfer to CAMELS-GB (Great Britain) about as well as a model trained on the combined US and GB data, and that models trained on CAMELS-GB transfer to CAMELS-US about as well as a model using the combined data. They also show that the addition of site characteristics is not beneficial in transfer learning tasks, but acknowledge this could be due to the way the data is normalized prior to training. 
Based on these results, it is possible that the entity-aware broad-scale model using all available data is already learning to differentiate between different regions or types of sites on its own, and fine-tuning to more similar sites based on expert knowledge may be less useful. However, this remains to be demonstrated for most hydrological and water resources prediction tasks. Other studies have also continued the practice of pre-training a model on a data-dense region like the United States and fine-tuning on data-sparse regions like China (Ma et al., Reference Ma, Feng, Lawson, Tsai, Liang, Huang, Sharma and Shen2021; Xu et al., Reference Xu, Lin, Hu, Wang, Wu, Zhang and Ran2023b) or Kenya (Oruche et al., Reference Oruche, Egede, Baker and O’Donncha2021).
2.2.3. Unsupervised domain adaptation
Domain adaptation methods are a subset of transfer learning algorithms that attempt to answer the question: how can a model both learn from a source domain and learn to generalize to a target domain? Domain adaptation often seeks to minimize the risk of making errors on the target data, not necessarily on the source data as in traditional supervised learning. Unsupervised domain adaptation (UDA), in particular, focuses on the zero-shot case in which the target domain is devoid of labeled data. Similar to the types of graph neural networks mentioned in Section 2.1.3, review papers have divided transfer learning algorithms into three categories: (1) inductive transfer learning, where the source and target tasks are different and at least some labeled data from the target task is required to induce a model; (2) transductive transfer learning, where the source and target tasks are the same but come from different feature space domains and no labeled data is available from the target domain; and (3) unsupervised transfer learning, where no labeled data is available in either the source or target domain (S. Niu et al., Reference Niu, Liu, Wang and Song2020; Pan and Yang, Reference Pan and Yang2009). UDA lies specifically in the transductive transfer learning scenario and usually involves using the input data from the target or testing task during the training process, in addition to the source data. This aspect differentiates UDA from the previously described methods in this section. Researchers can employ different UDA methods to account for differences between the source and target tasks and datasets. Commonly, UDA methods account for shifts in the input feature distributions between source and target, but other methods account for differences in the distributions of labeled data. 
This differs from previous approaches we have mentioned, such as broad-scale models that generally ignore input data from testing sites, meta-transfer learning that uses test data inputs during model selection but not during training, and localized regional models that use available data from regions containing the test sites but not any data from the test sites themselves. UDA has seen success in many disciplines including computer vision (Csurka, Reference Csurka2017; Patel et al., Reference Patel, Gopalan, Li and Chellappa2015), robotics (Bousmalis et al., Reference Bousmalis, Irpan, Wohlhart, Bai, Kelcey, Kalakrishnan, Downs, Ibarz, Pastor and Konolige2018; Hoffman et al., Reference Hoffman, Wang, Yu and Darrell2016), natural language processing (Blitzer et al., Reference Blitzer, Dredze and Pereira2007), and fault diagnostics (Shi et al., Reference Shi, Ying and Yang2022), but applications of UDA in hydrology are limited. In the only current hydrological example, Zhou and Pan (Reference Zhou and Pan2022) introduce a UDA framework for unmonitored flood forecasting that involves a two-stage adversarial learning approach. The model is first pre-trained on a large-sample source dataset; then adversarial domain adaptation is performed using an encoder that maps the source and target inputs to the same feature space and learns the difference between the source and target datasets. They show this method is effective in flood forecasting across the Tunxi and Changhua flood datasets spanning Eastern China and Taiwan. Currently, UDA that accounts for a shift in label distribution (real or synthetic) has not been attempted in hydrological prediction, and future research on UDA in hydrology will need to consider whether to account for input or label distribution shift between entities and systems.
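To make the idea of mapping source and target inputs to a shared feature space concrete without the full adversarial machinery, the sketch below uses CORAL-style second-order moment matching, a simpler, non-adversarial UDA technique (an illustrative substitute, not the method of the cited work): source features are whitened and then re-colored with the target covariance, using target inputs but no target labels.

```python
import numpy as np

def sqrtm(C):
    """Symmetric matrix square root via eigendecomposition."""
    vals, vecs = np.linalg.eigh(C)
    return vecs @ np.diag(np.sqrt(np.clip(vals, 0, None))) @ vecs.T

def coral_align(Xs, Xt, eps=1e-6):
    """CORAL-style alignment: whiten the source features, then
    re-color them with the target covariance so the two input
    distributions match in their first and second moments.
    Only target *inputs* are needed, never target labels."""
    Cs = np.cov(Xs, rowvar=False) + eps * np.eye(Xs.shape[1])
    Ct = np.cov(Xt, rowvar=False) + eps * np.eye(Xt.shape[1])
    A = np.linalg.inv(sqrtm(Cs)) @ sqrtm(Ct)
    return (Xs - Xs.mean(0)) @ A + Xt.mean(0)

rng = np.random.default_rng(4)
# Toy "forcing" data with different scales and offsets per domain.
Xs = rng.normal(size=(500, 3)) * np.array([1.0, 5.0, 0.5])        # source basins
Xt = rng.normal(size=(500, 3)) * np.array([2.0, 1.0, 1.0]) + 3.0  # target basins
Xs_aligned = coral_align(Xs, Xt)
print(np.abs(np.cov(Xs_aligned, rowvar=False) - np.cov(Xt, rowvar=False)).max())
```

A model trained on the aligned source features then sees inputs statistically resembling the target domain, which is the same goal the adversarial encoder pursues with a learned, non-linear mapping.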
2.3. Cross-cutting theme: knowledge-guided machine learning
There is a growing consensus that solutions to complex non-linear environmental and engineering problems will require novel methodologies that integrate traditional process-based modeling approaches with state-of-the-art ML techniques, known as Knowledge-guided machine learning (KGML) (Karpatne et al., Reference Karpatne, Kannan and Kumar2022) (also known as physics-guided or physics-informed machine learning (Karpatne et al., Reference Karpatne, Atluri, Faghmous, Steinbach, Banerjee, Ganguly, Shekhar, Samatova and Kumar2017a; Muther et al., Reference Muther, Dahaghi, Syed and Van Pham2022; Willard et al., Reference Willard, Jia, Xu, Steinbach and Kumar2022a)). These techniques have been demonstrated to improve prediction in many applications including lake temperature (Jia et al., Reference Jia, Willard, Karpatne, Read, Zwart, Steinbach and Kumar2021a; Read et al., Reference Read, Jia, Willard, Appling, Zwart, Oliver, Karpatne, Hansen, Hanson and Watkins2019), streamflow (Bhasme et al., Reference Bhasme, Vagadiya and Bhatia2022; Herath et al., Reference Herath, Chadalawada and Babovic2021; Hoedt et al., Reference Hoedt, Kratzert, Klotz, Halmich, Holzleitner, Nearing, Hochreiter and Klambauer2021), groundwater contamination (Soriano et al., Reference Soriano, Siegel, Johnson, Gutchess, Xiong, Li, Clark, Plata, Deziel and Saiers2021), and water cycle dynamics (Ng et al., Reference Ng, Samadi, Wang and Bao2021), among others. Willard et al. (Reference Willard, Jia, Xu, Steinbach and Kumar2022a) divide KGML methodologies into four classes: (i) physics-guided loss functions, (ii) physics-guided initialization, (iii) physics-guided design of architecture, and (iv) hybrid physics-ML modeling. Many of these methods are helpful for prediction in unmonitored sites, since known physics or existing models can be exploited in the absence of observed target data. 
Note that KGML is a cross-cutting theme, as its principles can be integrated into either of the previously described broad-scale modeling and transfer learning approaches. The benefits of KGML as a class of standalone techniques can also help address resource efficiency issues in building both broad-scale entity-aware models and source models for transfer learning, while maintaining high predictive performance, training data efficiency, and interpretability relative to traditional ML approaches (Willard et al., Reference Willard, Jia, Xu, Steinbach and Kumar2022a).
The field of KGML is rapidly advancing, and given the numerous applications we see for its use in hydrology, we include the following discussion on the different ways of harnessing KGML techniques for a physical problem that has traditionally been simulated using process-based models. The following three subsections are divided based on whether KGML techniques are used to replace, augment, or recreate an existing process-based model. Section 3.1.4 further expands on this discussion by addressing the role of KGML in the future of unmonitored prediction and the open questions that remain.
2.3.1. Guiding ML with domain knowledge: KGML loss functions, architecture, and initialization
Traditional process-based models for simulating environmental variables in complex systems do not capture all the processes involved, which can lead to incomplete model structure (e.g., from simplified or missing physics). Though a key benefit of pure ML is its flexibility to fit virtually any dataset without being beholden to the causal structure imposed on process-based models, its inability to make use of process-based knowledge can lead to negative effects like sample inefficiency, inability to generalize to out-of-sample scenarios, and physically inconsistent solutions. When building an ML model as a replacement for a process-based model, there are at least three ways to guide the ML model with domain knowledge for improved predictive performance: KGML loss function terms, architecture, and initialization.
KGML loss function terms can constrain model outputs such that they conform to existing physical laws or governing equations. In dynamical systems modeling and the solving of partial differential equations, this technique is known as physics-informed neural networks (PINNs), pioneered by Raissi et al. (Reference Raissi, Perdikaris and Karniadakis2019). Steering ML predictions towards physically consistent outputs has numerous benefits. For prediction in unmonitored or data-sparse scenarios, the major benefit of informed loss function terms is that their computation often requires little to no observation data; optimizing for such a term therefore allows unlabeled data, which is often the only data available, to be included in training. Other benefits include regularization by physical constraints, which can reduce the parameter search space, enable learning from fewer labeled examples, and ensure consistency with physical laws during optimization. Models trained with KGML loss function terms have also been shown to generalize better to out-of-sample scenarios (Read et al., Reference Read, Jia, Willard, Appling, Zwart, Oliver, Karpatne, Hansen, Hanson and Watkins2019), and thus become more acceptable for use by domain scientists and stakeholders in water resources applications.
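As a minimal illustration of the concept (our own sketch, not drawn from any cited study), the snippet below adds a label-free physics penalty to a standard data-fit loss for a predicted lake temperature-depth profile, penalizing density inversions (water density decreasing with depth), which are physically implausible. The function names and the fixed weight `lam` are illustrative assumptions; the density approximation is a commonly used empirical formula.

```python
def water_density(temp_c):
    """Approximate density of water (kg/m^3) as a function of temperature (C)."""
    return 1000.0 * (1 - (temp_c + 288.9414) * (temp_c - 3.9863) ** 2
                     / (508929.2 * (temp_c + 68.12963)))

def physics_penalty(predicted_profile):
    """Penalize density decreasing with depth (a physical inconsistency).

    `predicted_profile` lists predicted temperatures from surface to bottom.
    No observations are needed, so this term can use unlabeled sites."""
    densities = [water_density(t) for t in predicted_profile]
    return sum(max(0.0, d_upper - d_lower)
               for d_upper, d_lower in zip(densities, densities[1:]))

def supervised_mse(preds, obs):
    """Standard data-fit term (mean squared error against observations)."""
    return sum((p - o) ** 2 for p, o in zip(preds, obs)) / len(obs)

def total_loss(preds, obs, lam=1.0):
    # Combined objective: data fit plus weighted physics-based soft constraint.
    return supervised_mse(preds, obs) + lam * physics_penalty(preds)
```

Because `physics_penalty` never touches observations, it can be evaluated on unlabeled sites and added to the training objective alongside whatever supervised loss the labeled sites provide.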
Loss function terms corresponding to physical constraints are applicable across many different types of ML frameworks and objectives; however, most of these applications have been in the monitored prediction scenario (e.g., lake temperature (Jia et al., Reference Jia, Zwart, Sadler, Appling, Oliver, Markstrom, Willard, Xu, Steinbach, Read and Kumar2021b; Karpatne et al., Reference Karpatne, Watkins, Read and Kumar2017b; Read et al., Reference Read, Jia, Willard, Appling, Zwart, Oliver, Karpatne, Hansen, Hanson and Watkins2019), lake phosphorous (Hanson et al., Reference Hanson, Stillman, Jia, Karpatne, Dugan, Carey, Stachelek, Ward, Zhang and Read2020), and subsurface flow (Wang et al., Reference Wang, Zhang, Chang and Li2020b)). We also see applications of PINNs in hydrology for solving PDEs for transmissivity (Guo et al., Reference Guo, Zhao, Lu and Luo2023), solute transport (Niu et al., Reference Niu, Xu, Qiu, Li and Dong2023), soil moisture (Bandai and Ghezzehei, Reference Bandai and Ghezzehei2022), groundwater flow (Cuomo et al., Reference Cuomo, De Rosa, Giampaolo, Izzo and Di Cola2023), and the shallow water equations (D. Feng et al., Reference Feng, Tan and He2023; Nazari et al., Reference Nazari, Camponogara and Seman2022). In this survey, we find only one work using informed loss function terms within a meta-transfer learning framework for lake temperature modeling (Willard et al., Reference Willard, Read, Appling and Oliver2021a, Reference Willard, Read, Appling, Oliver, Jia and Kumar2021b), which incorporates conservation of energy relating the ingoing and outgoing thermal fluxes of the lake.
Another direction is to use domain knowledge to directly alter a neural network’s architecture to implicitly encode physical consistency or other desired physical properties. In contrast to KGML loss function terms, which act as soft constraints whose effect depends on optimization and loss weighting, KGML-driven architecture optimizing for physical consistency is usually understood as a hard constraint, since the consistency is hardcoded into the model. KGML-driven model architectures share other benefits with KGML loss function terms, including a reduced search space and better out-of-sample generalizability. They have shown success in hydrology; however, applications have been limited to temporal predictions at monitored sites. Examples include S. Jiang et al. (Reference Jiang, Zheng and Solomatine2020), who show that a rainfall-runoff process model can be embedded as special recurrent neural layers in a deep learning architecture, Daw and Karpatne (Reference Daw and Karpatne2019), who show a physical intermediate neural network node as part of a monotonicity-preserving structure in the LSTM architecture for lake temperature, and further examples in the Willard et al. (Reference Willard, Jia, Xu, Steinbach and Kumar2022a) KGML survey. However, there is nothing preventing these approaches from being applied in the unmonitored scenario.
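To make the hard-constraint idea concrete (our own minimal sketch, not the architecture of any cited work), the output head below maps unconstrained network outputs to a profile that is non-decreasing with depth by construction: each step adds a strictly positive increment, so the monotonicity constraint holds no matter what values the trained weights produce.

```python
import math

def softplus(x):
    """Smooth, always-positive transform: log(1 + e^x)."""
    return math.log1p(math.exp(x))

def monotonic_head(raw_outputs, base_value):
    """Build a monotonically increasing profile from unconstrained outputs.

    `raw_outputs` are arbitrary real-valued network activations;
    `base_value` is the value at the top of the profile (e.g., surface
    density). Because softplus is strictly positive, the returned
    profile always increases with depth: a hard architectural constraint."""
    profile = [base_value]
    for raw in raw_outputs:
        profile.append(profile[-1] + softplus(raw))
    return profile
```

A soft (loss-based) constraint could still be violated after training; this construction cannot be, which is the distinction drawn above between hard and soft constraints.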
Lastly, if process-based model output is already available, such as the National Water Model streamflow outputs (NOAA, 2016), FLake model lake surface temperature outputs within ERA5 (Muñoz-Sabater et al., Reference Muñoz-Sabater, Dutra, Agustí-Panareda, Albergel, Arduini, Balsamo, Boussetta, Choulga, Harrigan and Hersbach2021), or PRMS-SNTemp simulated stream temperature (Markstrom, Reference Markstrom2012), these data can be used to pre-train an ML model, which is known as KGML initialization. In the unmonitored prediction scenario, pre-training can be done on process-based model simulations of sites with no monitoring data. This is arguably the most accessible KGML method since it requires no direct alteration of the ML approach. Through pre-training, the ML model learns to emulate the process-based model before seeing observations, which can accelerate or improve the primary training. Numerous studies in water resources perform KGML-based model initialization by making use of process-based model output to inform ML model building, either to create site-specific embeddings used for similarity calculation in meta transfer learning (Ghosh et al., Reference Ghosh, Li, Tayal, Kumar and Jia2022), as a pre-training stage for source models in meta transfer learning (Willard et al., Reference Willard, Read, Appling and Oliver2021a, Reference Willard, Read, Appling, Oliver, Jia and Kumar2021b), or as a pre-training stage for entity-aware broad-scale models (Koch and Schneider, Reference Koch and Schneider2022; Noori et al., Reference Noori, Kalin and Isik2020).
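The two-stage workflow can be sketched with a deliberately tiny example (a one-parameter linear model fit by gradient descent; all names and data are illustrative assumptions, with `sim_targets` standing in for a process-based model's simulations at unmonitored sites):

```python
def fit(weight, inputs, targets, lr=0.01, epochs=200):
    """One-parameter least-squares fit via plain gradient descent."""
    for _ in range(epochs):
        grad = sum(2 * (weight * x - y) * x for x, y in zip(inputs, targets))
        weight -= lr * grad / len(inputs)
    return weight

# Stage 1 (KGML initialization): pre-train on abundant process-model
# simulations, available even where no observations exist.
sim_inputs = [1.0, 2.0, 3.0, 4.0, 5.0]
sim_targets = [2.1 * x for x in sim_inputs]   # process-model output
pretrained = fit(0.0, sim_inputs, sim_targets)

# Stage 2: fine-tune the pre-trained weight on the few observations
# available, starting near a physically plausible solution.
obs_inputs, obs_targets = [2.5], [5.5]        # sparse monitoring data
finetuned = fit(pretrained, obs_inputs, obs_targets, epochs=50)
```

The fine-tuning stage starts from a weight that already emulates the process model, which is the mechanism by which pre-training accelerates or improves the primary training described above.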
Beyond these traditional KGML approaches, there is also the concept of neural operators, which have emerged as a powerful class of ML models capable of generalizing across different scenarios and scales. Unlike traditional neural networks that learn mappings between inputs and outputs with fixed dimensions, neural operators map between infinite-dimensional functional spaces (Li et al., Reference Li, Kovachki, Azizzadenesheli, Liu, Bhattacharya, Stuart and Anandkumar2020). While neural operators have not yet been directly applied to ungauged or unmonitored hydrologic time series prediction, recent studies demonstrate their potential in surrogate modeling of dynamical systems for flood inundation (Sun et al., Reference Sun, Li, Lee, Huang, Scanlon and Dawson2023), geological carbon storage (Tang et al., Reference Tang, Kong and Morris2024), and groundwater flow (Taccari et al., Reference Taccari, Wang, Goswami, De Florio, Nuttall, Chen and Jimack2023). They also have the capability to increase computational efficiency within transformer architectures for scaling to high-resolution or high-dimensional data, specifically for vision transformers in Guibas et al. (Reference Guibas, Mardani, Li, Tao, Anandkumar and Catanzaro2021) and Pathak et al. (Reference Pathak, Subramanian, Harrington, Raja, Chattopadhyay, Mardani, Kurth, Hall, Li and Azizzadenesheli2022).
2.3.2. Augmenting process models with ML using hybrid process-ML models
In many cases, certain aspects of process-based models may be sufficient, but researchers seek to use ML in conjunction with an operating process-based model to address key issues. Examples include situations where (1) process-based model outputs or intermediate variables are useful inputs to the ML model, (2) a process-based model simulates certain intermediate variables well while others could benefit from ML, or (3) optimal performance involves choosing between process-based models and ML models in real time based on prediction circumstances. Using both an ML model and a process-based model simultaneously is known as a hybrid process-ML model, and this is the most commonly used KGML technique for unmonitored prediction. In the Willard et al. (Reference Willard, Jia, Xu, Steinbach and Kumar2022a) survey of KGML methods, hybrid models are defined as either process and ML models working together on a prediction task, or a subcomponent of a process-based model being replaced by an ML model. This type of KGML method is also very accessible for domain scientists since it requires no alterations to existing ML frameworks. In this work, we do not cover the large body of work on ML prediction of process-based model parameters, since these methods have been outpaced by pure ML in predictive performance and tend to extrapolate poorly to new locations (Nearing et al., Reference Nearing, Kratzert, Sampson, Pelissier, Klotz, Frame, Prieto and Gupta2021), but summaries can be found in Reichstein et al. (Reference Reichstein, Camps-Valls, Stevens, Jung, Denzler and Carvalhais2019) or Xu and Liang (Reference Xu and Liang2021).
The most common form of hybrid process-ML model in hydrological and water resources engineering is known as residual modeling. In residual modeling, a data-driven model is trained to predict a corrective term for the biased output of a process-based or mechanistic model. This concept goes by other names such as error-correction modeling, model post-processing, error prediction, and compensation prediction. Correcting these residual errors and biases has been shown to improve the skill and reliability of streamflow forecasting (Cho and Kim, Reference Cho and Kim2022; Regonda et al., Reference Regonda, Seo, Lawrence, Brown and Demargne2013), water level prediction (López López et al., Reference López López, Verkade, Weerts and Solomatine2014), and groundwater prediction (Xu and Valocchi, Reference Xu and Valocchi2015). When applying residual modeling to unmonitored prediction, the bias-correcting ML model must be trained on either a large number of sites or sites similar to the target site. Hales et al. (Reference Hales, Sowby, Williams, Nelson, Ames, Dundas and Ogden2022) demonstrate a framework to build a residual model for stream discharge prediction with the GEOGloWS ECMWF Streamflow Model that selects similar sites based on the dynamic time warping and Euclidean distance time series similarity metrics. For unmonitored sites, they substitute simulated data for observed data and show a substantial reduction in model bias in ungauged subbasins.
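The residual-modeling pattern can be reduced to a few lines (our own toy sketch: the "process model" is a stand-in linear function with a deliberate bias, and the corrector is the simplest possible data-driven model, a fitted mean residual; real applications replace it with an ML regressor):

```python
def process_model(rainfall_mm):
    """Stand-in process-based streamflow estimate with a systematic bias."""
    return 0.5 * rainfall_mm + 3.0   # over-predicts by ~3 flow units

def fit_mean_residual(rainfall, observed_flow):
    """Fit the corrector to residuals (observed minus simulated).

    Here the corrector is just the mean residual; in practice this is
    where an ML model trained on many gauged sites would go."""
    residuals = [obs - process_model(r)
                 for r, obs in zip(rainfall, observed_flow)]
    return sum(residuals) / len(residuals)

def corrected_prediction(rainfall_mm, correction):
    """Hybrid prediction: process-model output plus learned correction."""
    return process_model(rainfall_mm) + correction

rain = [0.0, 10.0, 20.0]
obs = [0.1, 5.0, 10.2]            # true relationship is roughly 0.5 * rain
corr = fit_mean_residual(rain, obs)
```

For unmonitored sites, as in the Hales et al. framework described above, the residuals used for fitting would come from similar gauged sites (or, in their substitution, from simulated data) rather than from the target site itself.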
A slight alteration to the residual model is a hybrid process-ML model in which the output of a process-based model is added as an additional input to the ML model. This adds a degree of flexibility compared to the standard residual model, as the residual error is not modeled explicitly and multiple process-based model outputs can be used at once. Karpatne et al. (Reference Karpatne, Watkins, Read and Kumar2017b) showed that adding the simulated output of a process-based lake temperature model as an input to an ML model, alongside the drivers used to run the physics-based model, can improve predictions, and a similar result was seen in Yang et al. (Reference Yang, Sun, Gentine, Liu, Wang, Yin, Du and Liu2019a), who augmented a flood simulation model with outputs from prior global flood prediction models. This hybrid modeling approach has recently been applied to unmonitored prediction as well, with Noori et al. (Reference Noori, Kalin and Isik2020) using the output of SWAT (Soil & Water Assessment Tool (Arnold et al., Reference Arnold, Srinivasan, Muttiah and Williams1998)) as an input to a feed-forward neural network for monthly nutrient load prediction in unmonitored watersheds. They find that the hybrid process-ML model has greater predictive skill at unmonitored sites than the SWAT model calibrated at each individual site.
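Structurally, this variant is just feature concatenation: the simulation becomes one more column in the ML model's input, which the learner is free to exploit or ignore. A minimal sketch (feature names and values are illustrative assumptions):

```python
def build_features(drivers, process_outputs):
    """Concatenate per-timestep driver vectors with the process-model
    simulation, so the ML model receives both as inputs."""
    return [d + [sim] for d, sim in zip(drivers, process_outputs)]

drivers = [[12.0, 0.3], [15.0, 0.1]]      # e.g., [air temp, precipitation]
sims = [8.2, 9.1]                         # process-model predictions
features = build_features(drivers, sims)  # each row gains a simulation column
```

Unlike residual modeling, nothing here forces the learned mapping to treat the simulation as a baseline to be corrected; several simulations from different process models could be appended as additional columns in the same way.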
Another simple way to combine process-based models with ML models is through multi-model ensemble approaches that combine the predictions of two or more types of models. Ensembles can both provide more robust predictions and allow quantification and reduction of uncertainty. Multiple studies in hydrology have shown that using two or more process-based models with different structures improves performance and reduces prediction uncertainty in ungauged basins (Cibin et al., Reference Cibin, Athira, Sudheer and Chaubey2014; Waseem et al., Reference Waseem, Ajmal and Kim2015). Razavi and Coulibaly (Reference Razavi and Coulibaly2016) show that an ensemble of both ML models and process-based models for streamflow prediction further reduces prediction uncertainty and outperforms the individual models. However, this study is limited to building a model for an ungauged stream site using only the three most similar and closely located watersheds, as opposed to more comprehensive datasets like CAMELS.
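In its simplest form, such an ensemble averages the member predictions and reads the member spread as a crude uncertainty estimate. A minimal sketch (the member values are illustrative; real ensembles often use weighted or skill-based combination rather than a plain mean):

```python
def ensemble_mean(predictions):
    """Combine member predictions (process-based and ML) by averaging."""
    return sum(predictions) / len(predictions)

def ensemble_spread(predictions):
    """Standard deviation across members: a crude uncertainty proxy."""
    mean = ensemble_mean(predictions)
    return (sum((p - mean) ** 2 for p in predictions)
            / len(predictions)) ** 0.5

members = [4.8, 5.2, 5.0]        # e.g., two process models and one LSTM
flow = ensemble_mean(members)
spread = ensemble_spread(members)
```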
Comparisons between different types of hybrid models are not commonly seen, as most studies tend to use only one method. In one study highlighting different hybrid models, Frame et al. (Reference Frame, Kratzert, Raney, Rahman, Salas and Nearing2021) compare three approaches: (1) LSTM residual models correcting the National Water Model (NWM), (2) a hybrid process-ML model using an LSTM that takes the output of the NWM as an additional input, and (3) a broad-scale entity-aware LSTM as described in Section 2.1. They find that in the unmonitored scenario the third approach performed best, which leads to the conclusion that the output from the NWM actually impairs the model and prevents it from learning generalizable hydrological relationships. In many KGML applications, the underlying assumption is that the process-based model is capable of reasonably good predictions and adds value to the ML approach. Additional research is required to determine when hybrid modeling is beneficial for unmonitored prediction, since there are often numerous process-based models and different ways to hybridize modeling for a given environmental variable.
2.3.3. Building differentiable and learnable process-based models
Numerous efforts have been made to build KGML models that have equal or greater accuracy than existing ML approaches but with increased interpretability, transparency, and explainability using the principles of differentiable process-based (DPB) modeling (Feng et al., Reference Feng, Liu, Lawson and Shen2022b; Khandelwal et al., Reference Khandelwal, Xu, Li, Jia, Stienbach, Duffy, Nieber and Kumar2020; Shen et al., Reference Shen, Appling, Gentine, Bandai, Gupta, Tartakovsky, Baity-Jesi, Fenicia, Kifer and Li2023). The main idea of DPB models is to keep an existing geoscientific model’s structure but replace the entirety of its components with differentiable units (e.g., ML). From an ML point of view, it can be viewed as a domain-informed structural prior resulting in a modular neural network with physically meaningful components. This differs from the previously described hybrid process-ML methods that include non-differentiable process-based models or components. One recent example is shown in hydrological flow prediction by Feng et al. (Reference Feng, Liu, Lawson and Shen2022b), though similar models have been used in other applications like earth system models (Gelbrecht et al., Reference Gelbrecht, White, Bathiany and Boers2022) and molecular dynamics (AlQuraishi and Sorger, Reference AlQuraishi and Sorger2021). The DPB model proposed by Feng et al. (Reference Feng, Liu, Lawson and Shen2022b) starts with a simple backbone hydrological model (Hydrologiska Byråns Vattenbalansavdelning model (Bergström, Reference Bergström1976)), replaces parts of the model with neural networks, and couples it with a differentiable parameter learning framework (see Figure 1 in Feng et al. (Reference Feng, Liu, Lawson and Shen2022b) for a visualization). 
Specifically, the process model structure is implemented as a custom neural network architecture that connects units in a way that encodes the key domain process descriptions, and an additional neural network is appended to the aforementioned process-based neural network model to learn the physical parameters. The key concept is that the entire framework is differentiable from end to end, and the authors further show that the model has nearly identical performance in gauged flow prediction to the record-holding entity-aware LSTM while exhibiting interpretable physical processes and adherence to physical laws like conservation of mass. A simpler implementation is seen in Khandelwal et al. (Reference Khandelwal, Xu, Li, Jia, Stienbach, Duffy, Nieber and Kumar2020), also for streamflow, where intermediate RNN models are used to predict important process model intermediate variables (e.g., snowpack, evapotranspiration) prior to the final output layer. In both of these implementations, a major advantage of the DPB model is its ability to output an entire suite of environmental variables in addition to the target streamflow variable, including baseflow, evapotranspiration, water storage, and soil moisture. The DPB approach has been further demonstrated for unmonitored prediction of hydrological flow in Feng et al. (Reference Feng, Beck, Lawson and Shen2022a), showing better performance than the entity-aware LSTM for mean flow and high flow predictions but slightly worse performance for low flow. The results of DPB models in both unmonitored and monitored scenarios challenge the notion that the structural rigidity of process-based models is necessarily undesirable compared to the high flexibility of neural networks, suggesting that elements of both can be beneficial when performance is near-identical in these specific case studies.
3. Summary and discussion
We see that many variations of the three classes of ML methodologies discussed in Section 2 have been used for predictions in unmonitored sites (Table 1). So far, entity-aware broad-scale modeling through direct concatenation of features remains the dominant approach for hydrological applications. It remains to be seen how these different methods stack up against each other when predicting different environmental variables, since most of the current studies focus on streamflow prediction. The evidence so far suggests that combining data from heterogeneous regions, when available, should be strongly considered. In Section 2.1, we saw many applications in which using all available data across heterogeneous sites was the preferred method for training ML models, as opposed to fitting to individual sites or subsets of sites. Many recent studies continue the traditional practice of developing unsupervised, process-based, and data-driven functional similarity metrics and homogeneity criteria when selecting either specific sites or subgroups of sites on which to build models to be transferred to unmonitored sites. Notably, some of these works show that models built on subgroups of sites outperform models using all available sites. Additionally, the results from Frame et al. (Reference Frame, Kratzert, Raney, Rahman, Salas and Nearing2021) suggest that a broad-scale entity-aware ML model combining data from all regions is preferable to two different hybrid process-ML frameworks that harness a well-known process-based model, the NWM. Similarly, the results from Fang et al. (Reference Fang, Kifer, Lawson, Feng and Shen2022) suggest that deep learning models perform better when fed a diverse training dataset spanning multiple regions, as opposed to a homogeneous dataset from a single region, even when the homogeneous data are more relevant to the testing dataset and the training datasets are the same size.
This can likely be attributed to the known tendency of ML models to perform better when trained on diverse or slightly perturbed data (e.g., data augmented with adversarial perturbations), which enables them to learn the distinctions in underlying processes (see Hao and Tao, Reference Hao and Tao2022 for an example in hydrology).
Abbreviations: DCBS: direct concatenation broad-scale; TL: transfer learning; ANN: artificial neural network (feed-forward multilayer perceptron); GNN: graph neural network; LSTM: long short-term memory neural network; MARS: multivariate adaptive regression splines; MLR: multilinear regression; GBR: gradient boosting regression; GRU: gated recurrent unit; PDE: partial differential equation; RF: random forest; SVR: support vector regression; TCN: temporal convolution network; XGB: extreme gradient boosting.
It is also clear that the LSTM model remains by far the most prevalent neural network architecture for water resources time series prediction due to its natural ability to model sequences, its memory structure, and its ability to capture cumulative system status. We see that 30 of the 40 reviewed studies in Table 1 use LSTM. This aligns with existing knowledge and studies that have consistently found that LSTM is better suited for environmental time series prediction than traditional architectures without explicit cell memory (Fan et al., Reference Fan, Jiang, Xu, Zhu, Cheng and Jiang2020; Zhang et al., Reference Zhang, Zhu, Zhang, Ye and Yang2018). Even though we see the traditional ANN sometimes performing nearly as well or better (Chen et al., Reference Chen, Zhu, Jiang and Sun2020; Nogueira Filho et al., Reference Nogueira Filho, Souza Filho, Porto, Vieira Rocha, Sousa Estácio and Martins2022), the LSTM has the advantage of not requiring the choice of time-delayed inputs, a critical hyperparameter, because its recurrent structure already incorporates many previous timesteps. We find that other neural network architectures suitable for temporal data, like transformers (Vaswani et al., Reference Vaswani, Shazeer, Parmar, Uszkoreit, Jones, Gomez, Kaiser and Polosukhin2017) and temporal convolution networks (TCN) (Lea et al., Reference Lea, Flynn, Vidal, Reiter and Hager2017), are not used much for unmonitored water resources applications compared to other disciplines doing sequential modeling, such as natural language processing and bioinformatic sequence analysis, where these methods have largely replaced LSTM. This is likely due to their more recent development compared to LSTM, and possibly also due to their lack of inclusion in major deep-learning software packages like PyTorch and Keras.
One recent study (Yin et al., Reference Yin, Zhu, Zhang, Xing, Xia, Liu and Zhang2023) suggests that transformers outperform LSTM for rainfall-runoff prediction in the United States, but the vast majority of transformer applications in hydrology remain in the context of prediction at monitored sites (Liu et al., Reference Liu, Liu and Mu2022; Liu et al., Reference Liu, Bian and Shen2023; Wang and Tang, Reference Wang and Tang2023; Wei et al., Reference Wei, Wang, Schmalz, Hagan and Duan2023; Xu et al., Reference Xu, Fan, Luo, Li, Jeong and Xu2023a; Yin et al., Reference Yin, Guo, Zhang, Chen and Zhang2022). How transformers fare in the unmonitored scenario will be an important research direction, because results in the monitored scenario have been mixed when compared to LSTM, with some studies showing improvement (e.g., Liu et al. (Reference Liu, Liu and Mu2022), Yin et al. (Reference Yin, Guo, Zhang, Chen and Zhang2022)) and some not (e.g., Liu et al. (Reference Liu, Bian and Shen2023), Wei et al. (Reference Wei, Wang, Schmalz, Hagan and Duan2023)).
We also find that most studies focus on daily predictions, although a few predict at monthly, annual, or hourly time scales based on desired output resolution, data availability, or computational constraints. For instance, monthly predictions may be desirable over daily ones due to the ability to use more interpretable, computationally efficient bootstrap ensembles and easy-to-implement classical ML models (Weierbach et al., Reference Weierbach, Lima, Willard, Hendrix, Christianson, Lubich and Varadharajan2022). Increased computational efficiency can also enable running a large number (e.g., millions) of model trainings or evaluations for parameter sensitivity or uncertainty analysis.
Spatially, the majority of studies (27 out of 40) cover the United States. Fifteen of these span the entire conterminous United States, while 10 cover specific regions. The remaining studies are specific to certain countries and span Asia (seven studies), South America (one study), Europe (three studies), and North America outside the United States (two studies), and two studies cover multiple continents. The strong focus on the United States may be due to its large land area and extensive river network, alongside the economic capacity to maintain advanced monitoring stations whose data are freely available for study worldwide.
We also see the prevalence of the CAMELS dataset in streamflow studies; it is used in 9 out of the 40 studies in Table 1. CAMELS serves as a transformative continental-scale benchmark dataset for data-driven catchment science, with its combination of high-quality streamflow measurements spanning 671 catchments, climate-forcing data, and catchment characteristics like land cover and topography. However, we note that it is limited to “unimpaired” catchments that are not influenced by human management via dams. In addition to dam-managed catchments, CAMELS also excludes catchments close to or within urban areas, which are more likely to be impacted by roadways or other infrastructure. There are over 800,000 dammed reservoirs affecting rivers around the world, including over 90,000 in the United States (International Rivers, 2007; US Army Corps of Engineers, 2020). The effect of dammed reservoirs on downstream temperature is further complicated by variable human-managed depth releases and changing demands for water and energy that affect decision making (Risley et al., Reference Risley, Constantz, Essaid and Rounds2010). These limitations may hamper the ability of current models to extrapolate to real-world scenarios, where many catchments of high economic and societal value are either strongly human-impacted or data-sparse.
3.1. Open questions for further research
Though the works reviewed in this survey encompass many techniques and applications, there are still many open issues to be addressed as the water resources scientific community increasingly adopts ML approaches for unmonitored prediction. Here we highlight questions for further research that are widely applicable and agnostic to any specific target environmental variable and should be considered as the field moves forward.
3.1.1. Is more data always better?
We have seen that deep learning models in particular benefit from large datasets of heterogeneous entities, challenging the longstanding notion that transferring models between systems requires that they be functionally similar (Guo et al., Reference Guo, Zhang, Zhang and Wang2021; Razavi and Coulibaly, Reference Razavi and Coulibaly2013). Further research is needed to develop robust frameworks to discern how many sites need to be selected for training, what notion of similarity should be leveraged to do so, and whether excluding sites or regions can benefit broad-scale ML models for different environmental variable prediction tasks. We hypothesize that excluding sites deemed dissimilar often limits the spectrum of hydrological heterogeneity, and that utilizing all available stream sites ensures a more comprehensive understanding of the system by allowing the model to learn from a wide range of hydrological behaviors and more effectively generalize to unseen scenarios. This is supported by work in streamflow modeling that has explicitly analyzed the effect of merging data from heterogeneous entities on prediction performance; Fang et al. (Reference Fang, Kifer, Lawson, Feng and Shen2022) provide a compelling example of one step in deciding between using all available data and using a subset of functionally similar entities. Furthermore, in stream temperature modeling, Willard (Reference Willard2023) also finds that using more data is beneficial for nearly all regions in the United States, and that both regional modeling and single-site modeling can benefit from pre-training on all available data. Moving forward, we expect the use of a maximal amount of training data to become the default approach, especially given advancements in computational power and the fact that hydrology has comparatively smaller datasets than other fields where deep learning models are commonly used, like natural language processing, social media, and e-commerce.
3.1.2. How do we select optimal training data and input features for prediction?
If it is not feasible or desirable to use all available data, this begs the further question of how to optimally select functionally similar entities to construct a training dataset that minimizes target site prediction error. Many approaches exist to derive an unsupervised similarity between sites, including using network science (Ciulla et al., Reference Ciulla, Willard, Weierbach and Varadharajan2022), using meta-learning to select training data (e.g., active learning-based data selection (Al-Shedivat et al., Reference Al-Shedivat, Li, Xing and Talwalkar2021)), or comparing existing expert-derived metrics like hydrological signatures (McMillan, Reference McMillan2021). Methods also exist to combine training for large-scale entity-aware modeling while specifying a target region or class of similar sites (further explained in Section 3.1.3), which is another example of where functional similarity could be applied.
Approaches also exist to use ML frameworks like neural networks to develop the similarity encodings themselves, which could be used to select subgroups of sites. Kratzert et al. (Reference Kratzert, Klotz, Shalev, Klambauer, Hochreiter and Nearing2019c, Reference Kratzert, Klotz, Shalev, Klambauer, Hochreiter and Nearing2019d) demonstrate a custom LSTM architecture that delineates static and dynamic inputs, feeding the former to the LSTM input gate and the latter to the remaining gates. The idea is to use the input gate nodes to encode the functional similarity between stream gauge locations based on the site characteristics alone, and they show that this reveals interpretable hydrological similarity that aligns with existing hydrological knowledge. This framework as-is will not exclude any sites directly, but it still offers insight into the usefulness of embedded functional similarity. We also see the static feature encoding from Section 2.1.2, which differs from the previously mentioned method by using a separate ANN for static features as opposed to different gates in the same LSTM. Future research in developing these similarity encodings can also extend into adversarial-based ML methods that could discern valuable training entities.
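The gating idea above can be sketched conceptually (our own simplification, not the full architecture of the cited work): the gate that scales new information into the cell is computed from static catchment attributes only, so catchments with similar attributes receive similar gate activations, and the gate vector itself becomes a learned similarity embedding. The weights and attribute values below are illustrative constants rather than trained parameters.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def static_input_gate(static_attrs, weights, bias):
    """Per-unit gate computed from static attributes alone.

    Because the gate depends only on static catchment attributes, two
    catchments with identical attributes always produce identical gate
    vectors, regardless of the dynamic (meteorological) inputs."""
    return [sigmoid(sum(w * a for w, a in zip(row, static_attrs)) + bias)
            for row in weights]

def gated_update(cell_candidate, gate):
    """Dynamic candidate values scaled element-wise by the static gate."""
    return [g * c for g, c in zip(gate, cell_candidate)]
```

In the cited architecture, the dynamic inputs drive the remaining gates of the LSTM; comparing gate vectors across sites is what yields the interpretable similarity described above.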
Numerous factors beyond functional similarity can be considered when deciding whether to include entities in a training dataset. First, the training data should be representative of all types of entities relevant to the prediction task, and not overly biased toward a particular region or type of site, which can correspondingly bias results. When building a model to transfer to a particular set of unmonitored sites, it must be considered whether the training data is representative of those target sites, because the environmental monitoring priorities of the past that shaped the dataset may not be in line with current ones. Another consideration is data quality, where some sites may have higher-quality data than others, which may have highly uncertain characteristics. In such cases, uncertainty quantification methods can be used to increase the reliability of predictions (Abdar et al., Reference Abdar, Pourpanah, Hussain, Rezazadegan, Liu, Ghavamzadeh, Fieguth, Cao, Khosravi and Acharya2021), or different weights can be assigned to entities based on uncertainty metrics or on what the training dataset needs to represent. It has also been shown that assigning a vector of random values as a surrogate for catchment physical descriptors can be sufficient in certain applications (Li et al., Reference Li, Khandelwal, Jia, Cutler, Ghosh, Renganathan, Xu, Tayal, Nieber and Duffy2022).
Furthermore, hydrologic prediction problems often contain a vast array of possible input features and input feature combinations, spanning both dynamic forcing data like daily meteorology and static site characteristics. The process of feature selection aims to find the optimal subset of input features that (1) contains sufficient predictive information for an accurate model, and (2) excludes redundant and uninformative features for better computational efficiency and model interpretability (Dhal and Azad, Reference Dhal and Azad2022). Notably, the majority of works reviewed in this study do not incorporate data-driven or statistical feature selection methods, and instead explicitly or implicitly rely on expert domain knowledge to select inputs. This contrasts with many disciplines applying ML regression where feature selection is standard practice and often deemed necessary (e.g., medical imaging (Remeseiro and Bolon-Canedo, Reference Remeseiro and Bolon-Canedo2019), multi-view learning (R. Zhang et al., Reference Zhang, Nie, Li and Wei2019), finance (Khan et al., Reference Khan, Ghazanfar, Azam, Karami, Alyoubi and Alfakeeh2020)). However, modern large-sample hydrology datasets offer a wealth of watershed, catchment, and individual site-specific characteristics and metrics that could serve as an opportunity to apply feature selection methods. For instance, the StreamCat dataset (Hill et al., Reference Hill, Weber, Leibowitz, Olsen and Thornbrugh2016) contains over 600 metrics for 2.65 million stream segments across the United States, and the Caravan dataset (Kratzert et al., Reference Kratzert, Nearing, Addor, Erickson, Gauch, Gilon, Gudmundsson, Hassidim, Klotz and Nevo2023) contains 70 catchment attributes for 6830 catchments across the world.
Feature selection methods span three primary categories. Filter methods rank variables based on their statistical properties with respect to the target variable, without considering the ML model itself. Popular filter techniques base rankings on correlation coefficients, mutual information, and information gain per feature. These methods have low computational cost compared to the other categories, but have the drawback of not considering interactions with the underlying ML model’s performance. Wrapper methods, on the other hand, assess the quality of variables by evaluating the performance of a specific ML model on subsets of features. Common wrapper methods include forward selection, backward elimination, Boruta (Kursa and Rudnicki, Reference Kursa and Rudnicki2010), and recursive feature elimination. These methods have the advantage of considering the interaction between variables and the model’s performance, but they are more computationally expensive due to the often large number of model trainings and evaluations required, especially for datasets with a large number of candidate input features. Embedded (or intrinsic) methods perform feature selection automatically during model training. Techniques like Least Absolute Shrinkage and Selection Operator (LASSO) and Elastic Net regularization penalize the coefficients of irrelevant features during training, encouraging their removal. Additionally, random forests and similar decision tree methods contain embedded feature selection, as splits on irrelevant features are rarely selected for the trees.
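As a toy contrast between the filter and wrapper categories (a sketch on synthetic data; real applications would use cross-validated scores and an actual hydrologic model), the following ranks features by absolute correlation with the target and by greedy forward selection with a least-squares model:

```python
import numpy as np

def filter_rank(X, y):
    """Filter method: rank features by absolute Pearson correlation with y."""
    Xc = X - X.mean(axis=0)
    yc = y - y.mean()
    corr = (Xc * yc[:, None]).sum(axis=0) / (
        np.linalg.norm(Xc, axis=0) * np.linalg.norm(yc) + 1e-12)
    return np.argsort(np.abs(corr))[::-1]

def forward_select(X, y, n_keep):
    """Wrapper method: greedy forward selection with a least-squares
    linear model, scored by in-sample MSE (a toy stand-in for
    cross-validated error with a real ML model)."""
    selected = []
    remaining = list(range(X.shape[1]))
    for _ in range(n_keep):
        scores = []
        for j in remaining:
            cols = selected + [j]
            A = np.column_stack([X[:, cols], np.ones(len(y))])
            coef, *_ = np.linalg.lstsq(A, y, rcond=None)
            scores.append(((y - A @ coef) ** 2).mean())
        best = remaining[int(np.argmin(scores))]
        selected.append(best)
        remaining.remove(best)
    return selected
```

The filter ranking costs one pass over the data, while the wrapper refits a model for every candidate feature at every step, which illustrates the computational tradeoff described above.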
The size and dimensionality of the hydrological dataset play a significant role in selecting a feature selection method. For large datasets with hundreds or thousands of possible features, filter methods can provide a computationally efficient initial screening. In contrast, wrapper methods such as recursive feature elimination or forward selection are suitable for smaller datasets with fewer predictors, as they explicitly consider the regression model’s performance. As hydrological modeling increasingly incorporates deep learning, embedded methods may be less applicable, since they are generally associated with classical ML models. There is room for the hydrology community to develop standard processes to select optimal features for a given target variable and set of modeling sites. Furthermore, methods are needed to combine datasets where, for example, site-specific characteristics that need to be considered in a feature selection framework exist across multiple data sources.
3.1.3. How should site characteristics be used in machine learning models for unmonitored prediction?
We have seen that the generalization of ML models to unmonitored sites requires the availability of site characteristics (Kratzert et al., Reference Kratzert, Herrnegger, Klotz, Hochreiter and Klambauer2019a; Xie et al., Reference Xie, Liu, Tian, Wang, Bai and Liu2022), but best practices for how to use them remain unclear. The entity-aware models listed in this study tend to exhibit performance increases when such characteristics are included. For example, Rasheed et al. (Reference Rasheed, Aravamudan, Sefidmazgi, Anagnostopoulos and Nikolopoulos2022) find that site characteristics like soil porosity, forest fraction, and potential evapotranspiration all exhibit significant importance for flood peak prediction, and Xie et al. (Reference Xie, Liu, Tian, Wang, Bai and Liu2022) find that catchment characteristics combined make up 20% of the total feature importances for a continental-scale baseflow prediction model. However, the result from Li et al. (Reference Li, Khandelwal, Jia, Cutler, Ghosh, Renganathan, Xu, Tayal, Nieber and Duffy2022), showing that random values substituted for site characteristics can still improve performance in the temporal prediction scenario, needs to be further investigated and compared in other applications. Many methods in this survey use site characteristics in different ways, and an open question remains of how best to add site characteristics to an ML model for a given task.
Throughout this review, we see several ways to incorporate site characteristics into ML model architectures and frameworks. The most common is the entity-aware model using concatenated input features as seen in Section 2.1.1, presumably based on landmark results from the streamflow modeling community. However, it has also been demonstrated that a graph neural network approach using these site characteristics to determine the similarity between sites can slightly outperform the concatenated input approach (Sun et al., Reference Sun, Jiang, Mudunuru and Chen2021a). Site characteristics have also been used to build a metamodel that predicts the performance of different local models when transferred to an unmonitored site (Ghosh et al., Reference Ghosh, Li, Tayal, Kumar and Jia2022; Willard et al., Reference Willard, Read, Appling and Oliver2021a, Reference Willard, Read, Appling, Oliver, Jia and Kumar2021b). Other works mentioned in Section 2.1.2 demonstrate the effectiveness of learning ML-based encodings of site characteristics as opposed to using them as-is (Ghosh et al., Reference Ghosh, Li, Tayal, Kumar and Jia2022; Tayal et al., Reference Tayal, Jia, Ghosh, Willard, Read and Kumar2022). However, these approaches have not been tested against the concatenated input entity-aware approach commonly seen in other works, which is needed to assess their role in modeling unmonitored sites.
Furthermore, water management stakeholders, decision-makers, and forecasters often seek to prioritize specific individual locations that are unmonitored but whose site characteristics are known. Many of the broad-scale approaches mentioned in this survey are built without any knowledge of the specific testing sites they will be applied to. While training without any knowledge of the testing data is common practice in supervised machine learning, efforts to predict in unmonitored sites may benefit from including information on specific test sites during training. For example, characteristics from the test sites are used in the meta-transfer learning framework described in Section 2.2 to select source models to apply to the target or test system. Surveys on transfer learning (Niu et al., Reference Niu, Liu, Wang and Song2020; Pan and Yang, 2010) have described this distinction as the difference between inductive transfer learning, where the goal is to find generalizable rules that apply to completely unseen data, and transductive transfer learning, where the input data for the target or test system is known and can be used in the transfer learning framework. Transductive transfer learning methods like meta-transfer learning have been proposed, but there is a lack of transductive methods that can harness the power of the highly successful entity-aware broad-scale models. In the same way that transfer learning has facilitated the pre-training of ML models in hydrology on data-rich watersheds to be transferred and fine-tuned efficiently with little data in a new watershed, for example in flood prediction (Kimura et al., Reference Kimura, Yoshinaga, Sekijima, Azechi and Baba2019), we imagine there could be ways to harness the benefits of large-scale entity-aware modeling and also fine-tune those same models to a specific region or class of sites with known site characteristics.
For example, the entity-aware models using all available data described in Section 2.1 could be fine-tuned to specific relevant subgroups, or the individual source models described in transfer learning approaches in Section 2.2 could be pre-trained using all available data (Willard, Reference Willard2023).
There is also the issue of the non-stationary nature of many site characteristics. These characteristics are typically derived from synthesized data products that treat them as static values, such as the Geospatial Attributes of Gages for Evaluating Streamflow (Falcone, Reference Falcone2011), containing basin topography, climate, land cover, soil, and geology; StreamCat (Hill et al., Reference Hill, Weber, Leibowitz, Olsen and Thornbrugh2016); and the dataset in Willard et al. (Reference Willard, Read, Appling and Oliver2021a), containing lake characteristics like bathymetry, surface area, stratification indices, and water clarity estimates. Though this treatment of site characteristics as static is intuitive for properties that do not evolve quickly (e.g., geology), in reality, properties such as land cover, land use, or even climate are dynamic in nature and evolve at different time scales. This can affect prediction performance in cases where the dynamic nature of certain characteristics treated as static is vital to prediction. For example, land use is a key dynamic predictor for river water quality in areas undergoing urbanization (Yao et al., Reference Yao, Chen, He, Cui, Mo, Pang and Chen2023), but is treated as static in most hydrological ML models. In lake temperature modeling, water clarity is treated as static in Willard et al. (Reference Willard, Read, Appling, Oliver, Jia and Kumar2021b) but realistically has a notable dynamic effect on water column temperatures (Rose et al., Reference Rose, Winslow, Read and Hansen2016). Though this problem exists in both monitored and unmonitored scenarios, characteristics are particularly important in unmonitored site prediction since they are often the only knowledge available about a location.
As data collection from environmental sensors continues to improve, there is a growing need for new geospatial datasets and methods that represent dynamic characteristics at multiple time points (e.g., the National Land Cover Database (Homer et al., Reference Homer, Fry and Barnes2012)).
3.1.4. How can we leverage process understanding for prediction in unmonitored sites?
The success of ML models in achieving better prediction accuracy than process-based models across many hydrological and water resources variables has led to the question posed by Nearing et al. (Reference Nearing, Kratzert, Sampson, Pelissier, Klotz, Frame, Prieto and Gupta2021): “What role will hydrological science play in the age of machine learning?”. Given that the relevant works reviewed in this study show mixed results when comparing KGML approaches using process understanding with domain-agnostic black box approaches, more research is required to clarify the role of domain knowledge in prediction for unmonitored sites. From Section 2.3, we see that graph neural networks have the potential to encode spatial context relevant for predictions and improve over existing methods, but also that hybrid models have not been as effective as their domain-agnostic entity-aware LSTM counterparts. A key research direction will be finding which context is relevant to encode in graphs or other similarity- or distance-based structures, whether spatial or based on expert domain knowledge. A preferable alternative to existing hybrid process-ML models may be the DPB models explained in Section 2.3.3, which exhibit side benefits like outputting accurate intermediate variables and demonstrating interpretability, though the performance achieved remains similar to existing process-agnostic models like the entity-aware LSTMs. There is potential to further develop these DPB approaches; for instance, they stand to benefit from assimilating multiple data sources since they simulate numerous additional variables.
KGML modeling techniques, like informed loss functions, informed model architecture, and hybrid modeling can be considered during method development. For example, knowledge-guided loss function terms can impose structure on the solution search space in the absence of labeled target data by forcing model output to conform to physical laws (e.g., conservation of energy or mass). Examples of successful implementations of knowledge-guided loss functions to improve temporal prediction include the conservation of energy-based term to predict lake temperature (Read et al., Reference Read, Jia, Willard, Appling, Zwart, Oliver, Karpatne, Hansen, Hanson and Watkins2019), power-scaling law-based term to predict lake phosphorous concentration (Hanson et al., Reference Hanson, Stillman, Jia, Karpatne, Dugan, Carey, Stachelek, Ward, Zhang and Read2020), and advection–dispersion equation-based terms to predict subsurface transport states (He et al., Reference He, Barajas-Solano, Tartakovsky and Tartakovsky2020). These results show that informed loss functions can improve the physical realism of the predictions, reduce the data required for good prediction performance, and also improve generalization to out-of-sample scenarios. Since loss function terms are generally calculated on the model output and do not require target variable data, they can easily be transferred from temporal predictions to the unmonitored prediction scenario.
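As a schematic of a knowledge-guided loss term (an illustrative sketch, not the exact formulation of the cited studies), a lake temperature loss might penalize predicted density inversions: density computed from a standard empirical temperature-density relation should not decrease with depth in a stable water column, and the penalty requires no observed temperatures to evaluate:

```python
import numpy as np

def density(temp_c):
    """Water density (kg/m^3) from temperature (deg C) via a standard
    empirical relation (assumed here; check the relation for your domain)."""
    return 1000.0 * (1 - (temp_c + 288.9414) * (temp_c - 3.9863) ** 2
                     / (508929.2 * (temp_c + 68.12963)))

def physics_guided_loss(y_pred_depths, y_true_depths, lam=1.0):
    """MSE plus a hinge penalty whenever predicted density decreases with
    depth (an unstable density profile), averaged over adjacent depth pairs.
    y_pred_depths / y_true_depths: (n_depths,) temperatures, surface first."""
    mse = np.mean((y_pred_depths - y_true_depths) ** 2)
    rho = density(y_pred_depths)
    # density should be non-decreasing with depth; penalize inversions
    violation = np.maximum(0.0, rho[:-1] - rho[1:])
    return mse + lam * violation.mean()
```

Because the penalty term depends only on model output, it can be evaluated at unmonitored sites where no labels exist, which is what makes such terms transferable to the unmonitored prediction scenario.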
Knowledge-guided architecture can similarly make use of the domain-specific characteristics of the problem being solved to improve prediction and impose constraints on model output, but has not been applied in the unmonitored scenario. As opposed to the soft constraints imposed by a loss function term, architectural modifications can impose hard constraints. Successful examples of modified neural network architectures for hydrological prediction include a modified LSTM with monotonicity constraints for lake temperatures at different depths (Daw and Karpatne, Reference Daw and Karpatne2019), mass-conserving modified LSTMs for streamflow prediction (Hoedt et al., Reference Hoedt, Kratzert, Klotz, Halmich, Holzleitner, Nearing, Hochreiter and Klambauer2021), and an LSTM architecture that includes auxiliary intermediate processes connecting weather drivers to streamflow (Khandelwal et al., Reference Khandelwal, Xu, Li, Jia, Stienbach, Duffy, Nieber and Kumar2020). Many hydrological prediction tasks involve governing equations, such as conservation laws or equations of state, that could be leveraged in similar ways to improve ML performance in unmonitored sites.
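As a toy illustration of a hard architectural constraint (hypothetical, not one of the cited architectures), an output head can guarantee monotonically non-decreasing predictions, e.g., with depth, by accumulating non-negative increments, so the constraint holds for any parameter values rather than merely being encouraged by a penalty:

```python
import numpy as np

def softplus(x):
    """Numerically stable softplus, always strictly positive."""
    return np.log1p(np.exp(-np.abs(x))) + np.maximum(x, 0.0)

def monotone_head(raw_outputs):
    """Map unconstrained network outputs (n_depths,) to a profile that is
    guaranteed non-decreasing: the first value is free, and each subsequent
    value adds a non-negative (softplus) increment. This is a hard
    constraint -- it holds regardless of the learned parameters."""
    first = raw_outputs[:1]
    increments = softplus(raw_outputs[1:])
    return np.concatenate([first, first + np.cumsum(increments)])
```

The same cumulative-increment idea underlies many monotonicity-constrained designs; the tradeoff versus a loss penalty is that hard constraints can never be violated but may restrict the function class the model can represent.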
We also see from Section 2.3 that hybrid process and ML models are another tool to consider for ungauged and unmonitored prediction. However, comparisons between different types of hybrid models are not commonly seen, as most studies we noted tend to use only one method. Different types should nevertheless be considered based on the context of the task. For example, if multiple process-based models are available, then a multi-model ensemble or using multiple process-based outputs as inputs to an ML model can be considered. Or, if part of the physical process is well understood and well modeled relative to more uncertain components, researchers can consider replacing only the uncertain components of the process-based model with ML components.
3.1.5. How do we perform uncertainty quantification for predictions in unmonitored sites?
Uncertainties in ML efforts for prediction in unmonitored sites can arise from various sources, including model structure and input data quality. Through uncertainty quantification (UQ) techniques, decision-makers can understand the limitations of the predictions and make informed decisions. UQ also enables model refinement, identification of data gaps, and prioritization of monitoring efforts in ungauged basins. Various UQ techniques exist for ML (Abdar et al., Reference Abdar, Pourpanah, Hussain, Rezazadegan, Liu, Ghavamzadeh, Fieguth, Cao, Khosravi and Acharya2021), including Bayesian deep learning (Wang and Yeung, Reference Wang and Yeung2020), dropout-based methods (Gal and Ghahramani, Reference Gal and Ghahramani2016), Gaussian processes, and ensemble techniques. The concept of Bayesian deep learning is to incorporate prior knowledge and uncertainty by defining a full probability distribution over neural network parameters as opposed to a point estimate, which allows for the estimation of posterior distributions. These posterior distributions capture the uncertainty in the predictions and can be used to generate probabilistic forecasts in time series modeling. Gaussian processes similarly perform Bayesian inference, but over functions directly rather than over the parameters of a deep neural network. Dropout-based methods approximate Bayesian ML by using a common regularization technique that randomly sets a fraction of the parameters to zero, effectively “dropping them out” for a particular forward pass; this allows an ensemble of models to be created from a single model.
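A minimal sketch of the dropout-based approach (Monte Carlo dropout) follows, using a toy fixed network in place of a trained hydrologic model; the spread of the stochastic forward passes serves as the approximate predictive uncertainty. The network, weights, and input below are placeholders, not any published model:

```python
import numpy as np

rng = np.random.default_rng(0)

# A tiny fixed one-hidden-layer network standing in for a trained model;
# positive weights guarantee active hidden units for the positive toy input
W1 = np.abs(rng.normal(size=(8, 3)))
b1 = np.zeros(8)
W2 = rng.normal(size=8)

def predict_mc_dropout(x, n_samples=200, p_drop=0.2):
    """Monte Carlo dropout: keep dropout active at prediction time and
    average many stochastic forward passes; the sample spread approximates
    the predictive uncertainty."""
    preds = []
    for _ in range(n_samples):
        h = np.maximum(W1 @ x + b1, 0.0)        # ReLU hidden layer
        mask = rng.random(h.shape) >= p_drop    # random dropout mask
        h = h * mask / (1.0 - p_drop)           # inverted dropout scaling
        preds.append(W2 @ h)
    preds = np.array(preds)
    return preds.mean(), preds.std()
```

The appeal of this approach is that it requires no change to training: any dropout-regularized network can produce an uncertainty estimate simply by leaving dropout on at inference time.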
Using ensembles of models for prediction is a longstanding technique in hydrology that spans both process-based models (Thielen et al., Reference Thielen, Schaake, Hartman and Buizza2008; Troin et al., Reference Troin, Arsenault, Wood, Brissette and Martel2021) and more recently ML models (Zounemat-Kermani et al., Reference Zounemat-Kermani, Batelaan, Fadaee and Hinkelmann2021). Ensemble learning is a general meta-approach to model building that combines the predictions from multiple models for both UQ and better predictive performance. In traditional water resources prediction, models in the ensemble will ideally differ with respect to the meteorological input dataset (e.g., He et al., Reference He, Wetterhall, Cloke, Pappenberger, Wilson, Freer and McGregor2009), process-based model parameters (e.g., Seibert and Beven, Reference Seibert and Beven2009), or process-based model structures (e.g., Moore et al., Reference Moore, Mesman, Ladwig, Feldbauer, Olsson, Pilla, Shatwell, Venkiteswaran, Delany and Dugan2021). Different techniques are seen across ensemble learning in the ML community more generally, with common approaches including (1) bagging, where many models are fit on different samples of the same dataset and their predictions averaged, (2) stacking, where different model types are fit on the same data and a separate model learns how to combine the predictions, and (3) boosting, where ensemble members are added sequentially to correct the predictions of previous models. The main advantages of model ensembles in both cases are that the uncertainty in the predictions can be easily estimated and predictions become more robust, leading them to be ubiquitous within many forecasting disciplines. Diversity in models is key, as model skill generally improves more from model diversity than from a larger ensemble (DelSole et al., Reference DelSole, Nattala and Tippett2014).
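A minimal bagging sketch (with simple least-squares members standing in for trained ML models) shows how bootstrap resampling yields both an averaged prediction and an across-member spread usable as an uncertainty estimate:

```python
import numpy as np

rng = np.random.default_rng(42)

def bagged_predict(X_train, y_train, x_new, n_models=25):
    """Bootstrap aggregation (bagging): fit each ensemble member on a
    bootstrap resample of the training data, then return the mean
    prediction and the across-member standard deviation at x_new.
    Members here are linear least-squares fits with an intercept."""
    n = len(y_train)
    preds = []
    for _ in range(n_models):
        idx = rng.integers(0, n, size=n)            # bootstrap resample
        A = np.column_stack([X_train[idx], np.ones(n)])
        coef, *_ = np.linalg.lstsq(A, y_train[idx], rcond=None)
        preds.append(np.append(x_new, 1.0) @ coef)
    preds = np.array(preds)
    return preds.mean(), preds.std()
```

Swapping the least-squares member for any model class (e.g., an LSTM trained per resample) gives the broad-coverage resampling strategy discussed above, at the cost of training the model once per ensemble member.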
There are key differences in ensemble techniques in process-based modeling versus ML. For instance, expert-calibrated parameters have very specific meanings in process-based models whereas the analogous parameters in ML (usually known as weights) are more abstract and characteristic of a black box. When tweaking parameters between models to assemble an ensemble, physical realism is important in the process-based model case. Parameterization has a rich history in process-based models and the work can be very domain-specific, whereas ML ensemble techniques are often done using existing code libraries through a domain-agnostic process. Furthermore, ML ensemble techniques usually do not modify input datasets, though they could through adding noise (Brownlee, Reference Brownlee2018) or by using different data products (e.g., for meteorology).
We see that most ML applications reviewed in this work do not attempt to use UQ techniques, even though the few that do see positive results (e.g., the use of ensembles for stream temperature (Weierbach et al., Reference Weierbach, Lima, Willard, Hendrix, Christianson, Lubich and Varadharajan2022), streamflow (Feng et al., Reference Feng, Lawson and Shen2021), and water level (Corns et al., Reference Corns, Long, Hale, Kanwar and Vanfossan2022)). A recent survey by Zounemat-Kermani et al. (Reference Zounemat-Kermani, Batelaan, Fadaee and Hinkelmann2021) finds that ensemble ML strategies demonstrate “absolute superiority” compared to regular (individual) ML model learning in hydrology, and this result has also been seen in the machine learning community more generally for neural networks (Hansen and Salamon, Reference Hansen and Salamon1990). Many opportunities exist to develop ensemble frameworks in water resource prediction that harness numerous diverse ML models. In the same way that the hydrology community often uses ensembles of different process-based model structures, the many different architectures and hyperparameters of deep learning networks can achieve a similar diversity. Given the common entity-aware broad-scale modeling approach seen widely throughout this review, the opportunity exists to use resampling techniques like bootstrap aggregation (Breiman, Reference Breiman1996) to vary training data while maintaining broad coverage, as seen in Weierbach et al. (Reference Weierbach, Lima, Willard, Hendrix, Christianson, Lubich and Varadharajan2022) for stream temperature. Other ensemble methods, like that of Feng et al. (Reference Feng, Lawson and Shen2021), vary which site characteristics are used as inputs to LSTMs for streamflow prediction.
3.1.6. What is the role of explainable AI in predictions for unmonitored sites?
Historically, the difference between ML methods and more process-based or mechanistic methods has been described as a tradeoff between “predictive performance” and “explainability” (Lipton, Reference Lipton2018). However, there has been a deluge of advances in recent years in the field of explainable AI (XAI) (Arrieta et al., Reference Arrieta, Diaz-Rodriguez, Del Ser, Bennetot, Tabik, Barbado, Garcia, Gil-Lopez, Molina and Benjamins2020) and applications of these are increasingly being seen in geosciences (Başağaoğlu et al., Reference Başağaoğlu, Chakraborty, Lago, Gutierrez, Şahinli, Giacomoni, Furl, Mirchi, Moriasi and Şengor2022; Mamalakis et al., Reference Mamalakis, Barnes and Ebert-Uphoff2023). For example, recent work has shown how XAI can help to calibrate model trust and provide meaningful post-hoc interpretations (Toms et al., Reference Toms, Barnes and Ebert-Uphoff2020), identify how to fine-tune poor-performing models (Ebert-Uphoff and Hilburn, Reference Ebert-Uphoff and Hilburn2020), and also accelerate scientific discovery (Mamalakis et al., Reference Mamalakis, Barnes and Ebert-Uphoff2022). This has led to a change in the narrative of the performance and explainability tradeoff as calls are increasingly made for the water resources community to adopt ML as a complementary or primary avenue toward scientific discovery (Shen et al., Reference Shen, Laloy, Elshorbagy, Albert, Bales, Chang, Ganguly, Hsu, Kifer and Fang2018). 
Though the majority of work using XAI in water resources time series prediction has been seen in the temporal prediction scenario (e.g., Kratzert et al., Reference Kratzert, Herrnegger, Klotz, Hochreiter and Klambauer2019a; Lees et al., Reference Lees, Reece, Kratzert, Klotz, Gauch, De Bruijn, Kumar Sahu, Greve, Slater and Dadson2022), analysis of how ML models are able to learn and transfer hydrologic understanding for predictions in unmonitored sites can help address one of the most fundamental problems of “transferability” in hydrology.
We find that many water resources studies still use classical ML models like random forest or XGBoost in part due to their ease of interpretability. Initial investigations of the interpretability of deep learning frameworks have mostly addressed simple questions like feature attribution and sensitivity (e.g., Potdar et al., Reference Potdar, Kirstetter, Woods and Saharia2021; Sun et al., Reference Sun, Jiang, Mudunuru and Chen2021a). The concept of DPB models discussed in Section 2.3.3 shows potential to take this further and make an end-to-end interpretable model mimicking environmental processes but with the trainability and flexibility of deep neural networks. DPB models can provide more extensive interpretability compared to simpler feature attribution methods by being able to represent intermediate process variables explicitly in the neural network with the capability of extracting their relationship to the inputs and outputs.
Future work on XAI for unmonitored site prediction can pose research questions that harness the existing highly successful ML models to both refine theoretical underpinnings and add to current hydrologic or other process understanding surrounding regionalization to unmonitored sites. For example, methods like layerwise relevance propagation, integrated gradients, or Shapley additive explanations (SHAP) (Molnar, Reference Molnar2020) could be used to explore attributions and potential causal drivers of observed variability in situations where ML predicts more accurately than existing process-based regionalization approaches. Both temporal and spatial attributes can be considered; for example, applying methods like SHAP to an LSTM yields attributions along the input sequence that show how far back in time the LSTM uses its memory to perform predictions, and in GNNs attributions can show where in space knowledge is being drawn from for prediction (Ying et al., Reference Ying, Bourgeois, You, Zitnik and Leskovec2019).
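As a simpler relative of these attribution methods, a permutation-importance sketch (a generic baseline, not SHAP or integrated gradients themselves) illustrates the basic idea of measuring how much predictive skill depends on each input:

```python
import numpy as np

rng = np.random.default_rng(7)

def permutation_importance(model_fn, X, y, n_repeats=10):
    """Attribution baseline: the increase in MSE when one input feature is
    randomly permuted, breaking its relationship with the target while
    preserving its marginal distribution. Larger increases indicate more
    influential features. model_fn maps an (n, p) array to predictions."""
    base_mse = np.mean((model_fn(X) - y) ** 2)
    importances = np.zeros(X.shape[1])
    for j in range(X.shape[1]):
        for _ in range(n_repeats):
            Xp = X.copy()
            Xp[:, j] = rng.permutation(Xp[:, j])
            importances[j] += np.mean((model_fn(Xp) - y) ** 2) - base_mse
    return importances / n_repeats
```

Being model-agnostic, the same routine applies equally to an LSTM, a GNN, or a random forest, though unlike SHAP it attributes importance at the dataset level rather than per prediction.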
4. Conclusion
The use of ML for unmonitored environmental variable prediction is an important research topic in hydrology and water resources engineering, especially given the urgent need to monitor the effects of climate change and urbanization on our natural and man-made water systems. In this article, we review the latest methodological advances in ML for unmonitored prediction using entity-aware deep learning models, transfer learning, and knowledge-guided ML models. We summarize the patterns and extent of these different approaches and enumerate questions for future research. Addressing these questions sufficiently will likely require the training of interdisciplinary water resources ML scientists and also the fostering of interdisciplinary collaborations between ML and domain scientists. As the field of ML for water resources progresses, we see many of these open questions can also augment domain science understanding in addition to improving prediction performance and advancing ML science. We hope this survey can provide researchers with state-of-the-art knowledge of ML for unmonitored prediction, offer the opportunity for cross-fertilization between ML practitioners and domain scientists, and provide guidelines for the future.
Acronyms/Abbreviations
- AI: Artificial intelligence
- ANN: Artificial neural network (feed forward)
- CAMELS: Catchment Attributes and Meteorology for Large-sample Studies
- DCBS: Direct concatenation broad-scale
- DPB: Differentiable process-based
- GNN: Graph neural network
- GRU: Gated recurrent unit
- KGML: Knowledge-guided machine learning
- LSTM: Long short-term memory
- MARS: Multi-adaptive regression splines
- ML: Machine learning
- NWM: National Water Model
- RF: Random forest
- XGB/XGBoost: Extreme gradient boosting
- SHAP: SHapley Additive exPlanations
- SVR: Support vector regression
- TCN: Temporal convolutional network
- XAI: eXplainable artificial intelligence
Acknowledgements
We are grateful for the editorial assistance of Somya Sharma, Kelly Lindsay, and Rahul Ghosh. We also acknowledge the helpful comments from the anonymous reviewers, which helped improve this manuscript.
Author contributions
Jared Willard: Writing – Original Draft Preparation (lead); Conceptualization (equal); Data (literature) Curation (lead); Investigation (equal); Methodology (equal). Charuleka Varadharajan: Project Administration (supporting); Writing – Review & Editing (equal); Supervision (equal); Funding Acquisition (equal). Xiaowei Jia: Conceptualization (supporting); Writing – Review & Editing (supporting). Vipin Kumar: Conceptualization (equal); Investigation (equal); Project Administration (lead); Writing – Review & Editing (equal); Supervision (lead); Methodology (equal); Funding Acquisition (equal).
Competing interest
Vipin Kumar is on the advisory board for the Environmental Data Science journal.
Data availability statement
Data sharing is not applicable to this article as no new data were created or analyzed in this study.
Funding statement
This research is funded, in part, by NSF grants numbers 2313174, 2147195, 2239175, 2316305, 1934721 (HDR program), NSF LEAP Science and Technology Center award #2019625, and National AI Research Institutes Competitive Award no. 2023-67021-39829. Additional support was provided by the U.S. Department of Energy, Office of Science, Biological and Environmental Research Program for the iNAIADS DOE Early Career Award under contract no. DE-AC02-05CH11231. This research used resources of the National Energy Research Scientific Computing Center (NERSC), a U.S. Department of Energy Office of Science User Facility located at Lawrence Berkeley National Laboratory, operated under Contract No. DE-AC02-05CH11231 under the NESAP for Learning program. The U.S. Government retains, and the publisher, by accepting the article for publication, acknowledges, that the U.S. Government retains a non-exclusive, paid-up, irrevocable, world-wide license to publish or reproduce the published form of this manuscript, or allow others to do so, for U.S. Government purposes.
Ethical standards
The research meets all ethical guidelines, including adherence to the legal requirements of the study country.