Impact statement
In water and environmental engineering, data-driven models have gained considerable traction over the years. While the adoption of these advanced frameworks is an ongoing process, the predominant focus has traditionally centered on refining the models themselves and their internal computational architecture, a perspective encapsulated by the model-centric approach. While such efforts are fundamental to reaching a more profound understanding of what these models are capable of, they often overlook a fundamental tenet: the reliability, correctness, and accessibility of the data underpinning these models. An alternative approach, advocating for a paradigm shift, prioritizes elevating data to the forefront. Emphasizing the systematic enhancement of existing datasets and the formulation of frameworks to optimize data collection schemes, this perspective advocates a move toward a more data-centric paradigm in water and environmental engineering. However, this transformative shift is not without its challenges, requiring a nuanced strategy for smart data collection. Equally critical is the ethical and accurate handling of data, ensuring universal availability while upholding the rights of individuals and other legal entities involved in the process. This article underscores the significance of embracing a data-centric perspective, anticipating its far-reaching impact on shaping the future trajectory of water and environmental engineering practices.
Introduction
Data-driven frameworks, including machine-learning (ML) models, have emerged as a prominent focus and a topical subject in various engineering disciplines, notably in water and environmental engineering (Solomatine and Ostfeld, 2008; Giustolisi and Savic, 2009; Araghinejad, 2013). Whether it involves devising a more efficient optimization algorithm (e.g., Jalili et al., 2023; Wu et al., 2023), employing meticulous data mining methods (e.g., Aslam et al., 2022; Beig Zali et al., 2023; Zolghadr-Asli et al., 2023), developing sophisticated ML models (e.g., Ray et al., 2023; Sun et al., 2023), or, more recently, utilizing large language models such as ChatGPT (e.g., Foroumandi et al., 2023; Halloran et al., 2023), the core premise of this sub-discipline, often referred to as hydroinformatics within the domain of water and hydrology-related science, lies in the potential of computational intelligence (CI) and, possibly, artificial intelligence (AI) to reshape the future of this field (Makropoulos and Savić, 2019; Loucks, 2023). In essence, hydroinformatics can be viewed as a management philosophy enabled by CI/AI technology, and its primary objective is to establish a systematic approach to representing and comprehending the intricate and multidimensional phenomena prevalent in water management.
On that note, it is often believed that these technologies hold the promise of offering alternative perspectives on existing challenges, enabling more efficient problem-solving, and devising economically and environmentally sustainable solutions. Prime examples include detecting leakage (e.g., Rajasekaran and Kothandaraman, 2024), elucidating the underlying causes of abnormal hydro-climatological behaviors (e.g., Zolghadr-Asli et al., 2023), facilitating a better understanding of the impacts of extreme events such as floods (Adnan et al., 2023), and predicting droughts (Piri et al., 2023), among others. This subject remains topical and rapidly evolving, with numerous researchers continually exploring novel approaches to leverage the potential of these frameworks within the context of water-related sciences.
When it comes to water-related challenges, a brief overview of the most current and trending topics in hydroinformatics reveals a significant focus on adopting and fine-tuning sophisticated models (e.g., Bozorg-Haddad et al., 2018; Yaseen et al., 2019) and/or comparing the performance of these models (e.g., Chen et al., 2020; Yaghoubzadeh-Bavandpour et al., 2022), that is, the model-centric approach. In theory, these model-centric efforts have yielded promising results (e.g., Sun and Scanlon, 2019; Aliashrafi et al., 2021; Ghobadi and Kang, 2023). Often, such approaches place significant emphasis on the ‘model’ component within the CI/AI-based frameworks, primarily concentrating on improving or comparing such models. While this focus is commendable in itself and offers valuable insights, it tends to overlook another pivotal element – the ‘data.’ This dichotomy gives rise to two distinct schools of thought regarding the perception and utilization of hydroinformatics. One approach is predominantly oriented toward the role and structure of models (i.e., model-centric), while an alternative perspective is mostly geared toward the data side of the equation (i.e., data-centric). This paper aims to examine the differences between these two schools of thought and to highlight the long-term implications of an overreliance on model-centric approaches. Importantly, we explore how the alternative, or perhaps complementary, viewpoint of a data-centric approach can reshape the current paradigm of utilizing CI/AI-based frameworks in the context of water-related sciences.
Model-centric vs. data-centric paradigms
The widespread accessibility of computing power, particularly of cloud computing resources, has led to a substantial increase in the deployment of CI/AI-based models, which have garnered recognition for their efficacy across various domains. These models have demonstrated noteworthy advantages, featuring significantly reduced computation times and proving effective in addressing real-world challenges. Their applications span diverse fields, ranging from medicine (e.g., Rajpurkar et al., 2022) and economics (e.g., Qian et al., 2023) to water-related issues (e.g., Ray et al., 2023). Broadly speaking, one prevailing paradigm emphasizes the model-centric approach, placing a paramount focus on the model aspect of the equation. One of the foundational assumptions underpinning studies geared toward the model-centric paradigm is the reliability, correctness, and accessibility of the data used to construct data-driven models. While it can be argued that this assumption has been implicit in all models, including conceptual and physics-based ones, data-driven models take this reliance to a heightened level, where the model’s configuration (i.e., structure and parametrization) and overall performance can vary significantly with different datasets (e.g., Beig Zali et al., 2023; Liu et al., 2024). This in-built adaptability of data-driven models is not inherently problematic, but it raises a more profound question regarding the significance of data availability and data quality. Ultimately, it is essential to note that these models are only as reliable and effective as the data they are fed.
Furthermore, their application beyond the confines of research papers depends heavily on the existence of reliable and factual datasets, which are lacking in most practical cases (Li et al., 2023).
The solution may seem straightforward: invest in collecting and preparing more reliable and comprehensive datasets, that is, adopt a data-centric approach (DeepLearningAI, 2021; Liu et al., 2023). The primary distinction between the model-centric and data-centric paradigms lies not in the models themselves but in their perceived role. The model-centric approach seeks to leverage the computational structures of models to generate more accurate and applicable outcomes. In contrast, the data-centric paradigm emphasizes the crucial role of data in obtaining reliable results from such models.
In contrast to the model-centric paradigm, data-centric approaches emphasize the entire data value chain (e.g., data acquisition, analysis, curation, and storage) independently of its application. This allows for leveraging more information from existing datasets and promotes efficiency in expanding, using, and re-using such datasets. Here, the focus is not on modifying the model’s internal architecture to produce general results but rather on systematically producing and altering datasets and data collection procedures to enhance the overall performance of the models, aiming for accurate and meaningful outcomes. The essence of this paradigm is to facilitate the establishment of a reliable and comprehensive dataset. It advocates for consistent and accurate data collection, coupled with a robust data quality-monitoring scheme tailored to the specific problem at hand. Table 1 summarizes the advantages and disadvantages of model-centric and data-centric paradigms.
Table 1. Comparison of data-centric and model-centric paradigms
The central premise of the data-centric paradigm within the context of water and environmental engineering seems easily attainable. However, the practical implementation of this idea is far more challenging (Larsen et al., 2019; Pandeya et al., 2021). Both public and private water and environmental management organizations often face budgetary constraints that hinder their ability to create or acquire such datasets for their projects. This limitation stems from the fact that these endeavors do not immediately translate into revenue generation. The primary objective of prioritizing enhanced data is to establish more robust and dependable models. In the industry, unfortunately, investing in these datasets often faces resistance, particularly in smaller organizations, owing to substantial cost and legal implications. In addition, larger organizations may also be hesitant due to potential public relations issues that could arise down the road. It is worth noting that real-world data tend to suffer from quality issues and undesirable flaws, such as missing values, erroneous readings, incorrect labels, and anomalies (Zha et al., 2023). Improvement of existing datasets and the adoption of data-centric approaches represent a paradigm shift from model design to data quality and reliability.
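To make the data quality issues mentioned above more concrete, the following is a minimal sketch of how a series of sensor readings might be screened for missing values, physically implausible readings, and anomalies. It is an illustration under stated assumptions rather than a method from the cited works: the function name `screen_readings`, the plausible range, and the use of a median-based modified z-score with a cut-off of 3.5 are all hypothetical choices made for this example.

```python
from statistics import median


def screen_readings(readings, lower, upper, mad_threshold=3.5):
    """Flag missing, out-of-range, and anomalous readings by index.

    readings: list of floats, with None marking a missing value.
    lower, upper: plausible physical range (assumed, site-specific).
    mad_threshold: cut-off on the modified z-score (assumed convention).
    """
    missing = [i for i, r in enumerate(readings) if r is None]
    out_of_range = [
        i for i, r in enumerate(readings)
        if r is not None and not (lower <= r <= upper)
    ]

    # Anomaly check on the remaining in-range values, using the median and
    # the median absolute deviation (MAD), which are robust to outliers.
    valid = [
        (i, r) for i, r in enumerate(readings)
        if r is not None and lower <= r <= upper
    ]
    anomalies = []
    if valid:
        values = [r for _, r in valid]
        med = median(values)
        mad = median([abs(v - med) for v in values])
        if mad > 0:
            anomalies = [
                i for i, v in valid
                if 0.6745 * abs(v - med) / mad > mad_threshold
            ]

    return {"missing": missing, "out_of_range": out_of_range, "anomalies": anomalies}


# Hypothetical flow readings: one gap, one negative value, one spike.
flow = [2.1, 2.3, None, 2.2, -5.0, 2.4, 9.8, 2.2]
print(screen_readings(flow, lower=0.0, upper=50.0))
```

A report of this kind does not repair the data; it only makes the flaws visible so that a domain expert can decide whether a flagged reading is a sensor fault or a genuine event.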
Another fundamental pillar of data-centric thinking is to move toward smarter, rather than excessive, data collection. Clearly, collecting data can be financially burdensome and, as discussed earlier, is not without its challenges. Collecting excessive data without a clear idea of their use is arguably more harmful than having fewer data, as this approach drains financial resources that could otherwise have been directed toward better use. Overemphasis on collecting potentially irrelevant data can mislead the modeler and overwhelm the model. Other challenges with using data in data-driven models, for example, unjustified splitting of data into training, validation, and testing sets, indicate the need for educating modelers at the boundary of hydroinformatics, science, and engineering (Wagener et al., 2021). The reason for training individuals who are well-versed in both computer science and a targeted discipline, such as water and environmental engineering, as opposed to pure statisticians and applied mathematicians, is to provide the former group with a more in-depth understanding of the subtleties and nuances of the discipline. This insider knowledge enables them to adopt the most suitable computational model for a given problem. This emphasis on the data itself, characteristic of the data-centric paradigm, rewards investments in the underlying structure of the data over the architecture of the models.
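As an illustration of the data-splitting concern raised above, the sketch below shows one way a split might be justified for time-ordered records: keeping it chronological, so that observations from later periods are not used to train a model that is then evaluated on earlier ones. This is an assumption about what a defensible split could look like for such data, not a procedure prescribed by the cited work; the function name `chronological_split` and the 70/15/15 proportions are hypothetical.

```python
def chronological_split(records, train_pct=70, val_pct=15):
    """Split time-ordered records into train/validation/test without shuffling.

    records: a list already sorted from oldest to newest.
    train_pct, val_pct: integer percentages; the remainder goes to testing.
    """
    if train_pct <= 0 or val_pct < 0 or train_pct + val_pct >= 100:
        raise ValueError("percentages must leave a non-empty share for testing")

    n = len(records)
    # Integer arithmetic keeps the cut points exact and reproducible.
    train_end = n * train_pct // 100
    val_end = n * (train_pct + val_pct) // 100
    return records[:train_end], records[train_end:val_end], records[val_end:]


# Hypothetical example: 20 consecutive daily observations, indexed 0..19.
days = list(range(20))
train, val, test = chronological_split(days)
print(len(train), len(val), len(test))
```

Whether a chronological, blocked, or random split is appropriate depends on the process being modeled, which is precisely where the domain knowledge discussed above comes in.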
As a final note on this topic, one should remember that while these two paradigms offer different viewpoints on leveraging CI/AI-based modeling, they are not mutually exclusive and should not be seen as undermining one another. The fundamental premise is that an accurate, representative, and comprehensive dataset is indispensable for capturing the underlying structure of a phenomenon – a focal point of the data-centric paradigm. Nevertheless, the utility of such data is significantly enhanced when coupled with a reliable model, aligning with the objectives of the model-centric approach. In this context, it is important to emphasize that a sophisticated model does not obviate the need for a thorough and clean dataset. Similarly, focusing on high-quality data does not remove the need for a reliable and robust model. In essence, the synergistic interplay between a capable model and a comprehensive dataset is vital to achieving reliable results. Therefore, the optimal perspective on these two paradigms is to appreciate their potential for complementarity, forming a framework where insights from one paradigm inform and enhance the other, thereby fostering the development of more robust strategies in the context of water and environmental engineering.
Concluding remarks
Due to the rapid development of AI/ML tools (e.g., large language models such as ChatGPT), the future of data-driven models, notably ML models, remains uncertain but extremely exciting. Regardless of the outcomes, it is crucial to shift the perception among engineering professionals and scholars to emphasize the pivotal role of reliable datasets in the broader water industry. This paradigm shift spotlights the data rather than the models, highlighting the benefit of investing in improving our current datasets and systematically enhancing the data value chain, as opposed to arbitrarily tampering with the model’s architecture to achieve marginal improvements. This should not undermine the benefits of a more capable model; instead, it underscores the idea that a model is only as good and reliable as its input data. Meanwhile, it is equally vital to ensure that the data is collected intelligently, ethically, and accurately, is available to everyone, and safeguards the rights of individuals and other legal entities involved in the process. These aims are also addressed by the objectives of the FAIR (Findable, Accessible, Interoperable, Reusable) and SQUARE (Supporting, QUality, Action, and REsearch) data principles (Cudennec et al., 2020). Achieving these objectives may necessitate new legislative initiatives and increased investments from the public sector to establish the necessary framework for responsible data collection. Considering the current and future landscape of this field, one can anticipate increased investment, not only from the academic sector but also from the water industry, in furthering data-centric approaches. Additionally, it is hoped that both public and private companies will increasingly invest in smart data collection and monitoring protocols to ensure that data is not only reliable, but also repeatable, accurate, and readily available to relevant consumers.
Open peer review
To view the open peer review materials for this article, please visit http://doi.org/10.1017/wat.2024.5.
Data availability statement
All used data have been presented in the paper.
Author contribution
All authors have contributed equally to the conceptualization of the paper.
Financial support
Dragan Savic has received funding from the European Research Council (ERC) under the European Union’s Horizon 2020 research and innovation program (grant agreement No. 951424).
Competing interest
The authors have no relevant financial or non-financial interests to disclose.