Policy Significance Statement
The significance of this article for policymakers is threefold. First, it provides an overview of the challenges of privacy, availability, and applicability for predictive simulations that are associated with the data used for city digital twins. Second, it discusses how synthetic data can address these challenges, drawing on examples from other domains of application. Third, it proposes a new methodology for urban mobility data generation, which can address the aforementioned challenges and serve as a tool for citizen engagement.
1. Introduction
The growing complexity of the world requires new analytical tools to be deployed for harnessing the potential benefits of evidence-based policy. City digital twins (CDTs) create opportunities for city officials to embrace the notion of simulation governance and expand the reach of contemporary planning techniques. The simulative notion of the digital twin (DT) allows it to be turned into something more than a digital copy of the city, as the virtual nature of this approach to modeling provides the necessary infrastructure for a wider spectrum of imaginative ideas to be applied to the design of smart cities.
However, while the spectrum of possible applications of this technology is vast, especially in the context of policymaking where numerous scenarios can be tested in a virtual environment of a DT before the actual policy intervention is conducted, the empirical research on the applications of this technological tool to policy domains is scarce. There are some prominent examples of empirical research in this field (Nochta et al., 2021), but overall, the challenges of utilizing CDTs in the process of policymaking from a less engineering-oriented perspective are rarely discussed in the academic literature.
This article discusses the technology of CDTs and their potential utilization by policymakers. It also offers critical remarks that need to be taken into account when dealing with CDTs and proposes a methodology for data generation that can be applied to CDTs. The article proceeds as follows.
The first part of the article discusses the history and conceptual underpinnings of this technology and analyzes the capabilities of one of the most advanced projects in the field, Virtual Singapore. The second part discusses the limitations of the current generation of CDTs. Concerns related to data privacy, availability, and applicability for predictive simulations are discussed (Bektas and Schumann, 2019; Kieu et al., 2020) alongside the role that synthetic data can play in addressing these concerns (Nikolenko, 2019). The final section of the article proposes an alternative task-based approach to urban mobility data generation. This approach is informed by the practices of data labeling, urban data games, and games with a purpose. Under this proposal, city authorities can establish services responsible for asking people to conduct certain activities in an urban environment in order to create data for possible policy interventions for which no useful historical data exist. This can potentially allow governments to collect unique data about possible behavioral responses to certain interventions without compromising privacy, as the data generated through this approach will not be representative of any single individual.
2. CDTs of the Current Generation
2.1. DT concept
A DT is a virtual representation of the characteristics and behaviors of a physical object. The purpose of the creation of a DT is to model and predict the lifecycle of a system (Jones et al., 2020). The concept of DT originated in the works of Michael Grieves and John Vickers at NASA in 2003 (Grieves and Vickers, 2017). It was not very specific at the time with only three key characteristics of the concept being articulated: physical product, virtual product, and connections between them (Tao et al., 2019; Jones et al., 2020).
However, the practice of creating a duplicate of a system had been explored at NASA even earlier, starting in the 1960s, when the so-called “first digital twin” (Boschert and Rosen, 2016; Enders and Hoßbach, 2019) was created to simulate the conditions on board Apollo 13 in real time. It was the first instance in which a duplicate of a real system was used to mimic the system’s conditions in real time to avoid the failure of the mission (Ferguson, 2020). In 2012, the concept of DT was redefined at NASA to be understood as an integrated multiphysics, multiscale, probabilistic simulation of a system, which mirrors the life of a corresponding physical artifact based on historical data, physical model, and real-time sensing (Glaessgen and Stargel, 2012).
The concept of DT has become a popular research topic (Tao et al., 2019) with different approaches to the issue. Tao and Zhang (2017) propose to expand the three-dimensional model of a DT, which comprises a physical part, a virtual part, and the connections between them, by adding data and service dimensions. Despite the debates about the definition of DTs, the core concept behind them is a system that couples physical and virtual entities to leverage the benefits of both (Jones et al., 2020). The main feature of this approach is the ability to predict the future state of a system when tested under different conditions based on data-driven analytics and simulations in real or hypothetical conditions (Cioara et al., 2021).
According to Grieves and Vickers (2017), under the DT model the creation of a physical object starts in the virtual space through the creation of a DT prototype. Once the modeling and simulations of a system are conducted and potential obstacles are understood, the physical object is built, and the data connections between the virtual and the physical domains are established. DTs enable control over physical entities via digital objects without human intervention (Enders and Hoßbach, 2019). This is a unique feature of the approach and distinguishes it from a digital shadow, in which a change in the physical object causes a change in the virtual object, but not vice versa (Kritzinger et al., 2018).
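The distinction between a digital twin and a digital shadow can be made concrete with a minimal sketch. The class name `Coupling`, the function `classify`, and the fallback label "neither" are hypothetical and introduced only for illustration; the two rules themselves restate the distinction drawn in the text and are not an implementation taken from the cited works.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Coupling:
    """Which automated data flows exist between a physical object and its virtual counterpart."""
    physical_to_virtual: bool  # a change in the physical object updates the virtual object
    virtual_to_physical: bool  # a change in the virtual object updates the physical object


def classify(coupling: Coupling) -> str:
    """Label a coupling using the twin/shadow distinction described in the text."""
    if coupling.physical_to_virtual and coupling.virtual_to_physical:
        return "digital twin"
    if coupling.physical_to_virtual:
        return "digital shadow"
    # Hypothetical label for couplings that the text does not discuss.
    return "neither"


print(classify(Coupling(physical_to_virtual=True, virtual_to_physical=True)))
print(classify(Coupling(physical_to_virtual=True, virtual_to_physical=False)))
```

Under these assumptions, a system with a sensor feed into the model but no automated channel back to the physical object would be labeled a digital shadow rather than a digital twin.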
DTs have gained popularity in the manufacturing literature (Malik, 2021; Rožanec et al., 2021) due to their simulation capabilities, which some researchers argue to be the key feature of this approach (Kuts et al., 2019). DTs are also widely used in the aerospace engineering industry (Ríos et al., 2015) because they can replicate extreme conditions which cannot be physically performed in the laboratory setting (Sharma et al., 2020). A similar approach is exercised in the automotive industry, where DTs which comprise the whole car, its mechanics, electrics, software, and physical behavior are used to simulate most steps in the product’s lifecycle in a virtual environment without the need to build numerous physical prototypes (Dharmani and Lulla, 2020). After the tests are over, the physical prototype is built and the connections between the model and the physical object are established. Other projects include the creation of a behavioral DT of a driver, through which warnings and instructions about safer driving are provided in real time (Chen et al., 2018).
With the advancement of this approach, it is now applied to the modeling of more complex systems. There are attempts to apply DTs in a smart energy grid; however, replicating the behavior of a physical energy asset is still problematic (Cioara et al., 2021). This approach is also being applied to socio-technical systems such as manufacturing plants (Park et al., 2019), where, for example, the creation of the DT for the whole plant can be the next step in the development of the DT for the artifact (such as a vehicle in the automotive industry) (Biesinger and Weyrich, 2019). Simulations allow for gaining insights into complex manufacturing systems in the virtual domain before translating them into the real world (Mourtzis, 2020). Computer-rendered avatars of humans are also used in these simulations in order to test their ergonomic viability (Malik, 2021). Livestock farming is another domain where a proposal for the utilization of DTs has been made, as sensors that measure environmental factors in real time can feed simulations on which decisions about the optimal CO2 and temperature levels can be based in near real time (Jo et al., 2018). This approach has been applied to even more complex socio-technical systems such as cities, arguably one of the most complex artifacts created by humans.
2.2. City DTs
The growth of computing capabilities, ubiquitous data flows, and general interest in the application of DTs has expanded the application of this methodology from modeling of physical objects to the modeling of complex socio-technical systems. As this approach has gained prominence, DTs of smart cities are now created in order to analyze the data about urban complexities across time and scale (Francisco et al., 2020).
A CDT is a virtual replica of a city that is continuously informed by the processes that take place in the real city via real-time data connections (Mohammadi and Taylor, 2017). CDTs can also be realized as systems of interconnected twins created on the scale of a block or a district, with a large number of observations interacting in the mathematical model. Due to the volume of the data and the unknown relationships within them, machine learning approaches can be useful for simulations (Ruohomäki et al., 2018). Pilot projects of smart CDTs are being developed in numerous cities across the globe, including Singapore, Glasgow, Helsinki, and Boston, with Helsinki expanding this notion to provide virtual tourism services using virtual reality technologies. The approach introduced by Grieves and Vickers (2017) is being utilized in a newly built Indian city, Amaravati, which is the first attempt to build a city starting from its DT, thus allowing everything to be modeled and simulated before construction begins (smartcitiesworld.com, 2020).
The academic discussion about CDTs has also been vibrant recently, which can be attributed to the growing urbanization rates and the rise of the Internet of Things (IoT) and data analytics (Fuller et al., 2020), as well as the need for the introduction of digital services to city management (Soe, 2017). The creation of CDTs is dependent on the availability of massive urban data generated every day via different sensors installed in the city and on a technical foundation such as IoT and 5G (Deren et al., 2021). Deng et al. (2021) argue that the creation of DTs is an inevitable goal of the digital transformation of a city. This view is countered by Nochta et al. (2021), who contend that it is too early to claim that CDTs bring a paradigm shift to urban modeling.
Deren et al. (2021) argue that the four major characteristics of CDTs include accurate mapping (modeling of the physical aspects of the city), virtual–real interaction (aspects of the physical environment can be observed in a virtual environment), software definition (simulations are conducted on software platforms), and intelligent feedback (early warning of potential adverse effects and potential dangers). A systematic analysis of the CDT literature reveals that the potential for its applications can be broadly divided into five themes: data management, visualization, situational awareness, planning and prediction, and integration and collaboration (Shahat et al., 2021).
In this typology, data management becomes the crucial pillar for the development of the CDT, as the ability to manage and process data coming from different sources in the city is dependent on the ability to integrate this data through various data standardization and data sharing programs (Ruohomäki et al., 2018). The visualization of the city processes is another key feature of CDTs, where visualization of social processes is among the key challenges (Shahat et al., 2021). Situational awareness allows for the monitoring and analysis of the activities taking place in the city, which include monitoring the health of the citizens based on the data from personal health devices (Laamarti et al., 2020), monitoring daily building energy consumption (Francisco et al., 2020), and detecting motion (Dou et al., 2020). There are also experiments on how to use CDTs in the process of planning future city operations scenarios, for example, traffic behavior in the city under different conditions (Dembski et al., 2019). Integration of different aspects of CDTs in one platform presents a challenge in the field (Shahat et al., 2021), with some proposals arguing that the development of several DT models for one city can be more feasible than developing a single model to capture all the complexities of the city’s operations (Wan et al., 2019). Other approaches include a combination of DTs of different scale built for various purposes with different approaches for capturing the complexities of a nation’s infrastructure (Lu et al., 2020).
Crowdsourced visual data used to estimate the geospatial information of vulnerable objects in cities can be integrated with a 3D virtual city model for immersive visualization, which can be used for making more informed decisions about infrastructure management and for running what-if simulations of disaster situations (Ham and Kim, 2020). CDTs can be used as a tool to run simulations for different disaster management scenarios, where, based on the data collected from the sensors installed in the city, a realistic reaction of a physical space to a certain natural disaster can be simulated. Such simulations can be combined with a natural language processing-enabled system for monitoring activity on social media (Dogan et al., 2021). However, in times of urgent crises such as the Covid-19 pandemic, the high-quality long-term data crucial for CDTs are unavailable. A collaborative CDT model based on a federated learning methodology, in which different CDTs learn a shared model while keeping all the training data local, can be leveraged in situations of emergency (Pang et al., 2020). A model for the incorporation of the mental features of decision-making in CDTs has been proposed (Klebanov et al., 2019), as has the integration of artificial intelligence (AI) algorithms for situation assessment in emergency situations for disaster management (Fan et al., 2021).
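The federated learning idea mentioned above, in which several CDTs learn a shared model while their training data stay local, can be illustrated with a minimal sketch. The code below is a generic federated-averaging loop for a one-variable linear model and is not the method of Pang et al. (2020); the synthetic data, the function names, the learning rate, and the number of rounds are all assumptions chosen for illustration.

```python
import random


def local_update(weights, data, lr=0.1, epochs=20):
    """Run gradient steps for y = w*x + b using only one city's local data."""
    w, b = weights
    n = len(data)
    for _ in range(epochs):
        grad_w = sum(2 * (w * x + b - y) * x for x, y in data) / n
        grad_b = sum(2 * (w * x + b - y) for x, y in data) / n
        w -= lr * grad_w
        b -= lr * grad_b
    return (w, b)


def federated_average(updates, sizes):
    """Combine local parameters, weighting each city by its dataset size."""
    total = sum(sizes)
    w = sum(u[0] * s for u, s in zip(updates, sizes)) / total
    b = sum(u[1] * s for u, s in zip(updates, sizes)) / total
    return (w, b)


def make_city_data(rng, n):
    """Hypothetical local dataset: noisy samples of y = 2x + 1."""
    data = []
    for _ in range(n):
        x = rng.uniform(0, 1)
        data.append((x, 2 * x + 1 + rng.gauss(0, 0.05)))
    return data


rng = random.Random(0)
cities = [make_city_data(rng, n) for n in (30, 50, 80)]

global_weights = (0.0, 0.0)
for _ in range(50):
    # Each city trains locally; only model parameters leave the city.
    updates = [local_update(global_weights, data) for data in cities]
    global_weights = federated_average(updates, [len(d) for d in cities])

print(global_weights)
```

In this sketch only the pairs `(w, b)` are exchanged between the cities and the coordinating step, which is the property that makes the approach attractive when raw data cannot be shared.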
The technological approach utilized in the CDTs allows for the integration of real-time data into a 3D model of a city (Shahat et al., 2021), which can be used for the analysis of various city domains (Dou et al., 2020). The integration of real-time data flows into the 3D model can be used for tracking the spread of information and determining vulnerable objects during disasters (Ham and Kim, 2020; Fan et al., 2021). This type of information can inform city officials about what areas to evacuate first (White et al., 2021). Another domain where real-time data inflow can be utilized is traffic control and congestion pricing, though this approach is prone to the privacy critique (Bliss, 2019). Near real-time efficiency of a building in terms of energy consumption can be used to determine variations from conventional benchmarks for real-time energy management (Francisco et al., 2020). Certain difficulties still exist in this domain, however: the ability of a CDT to be updated in near real time with information received from the physical object has been proven, but the inverse connection and data transfer from the virtual to the physical are still challenging (Shahat et al., 2021).
Although CDTs are positioned to grow from experimental playgrounds for city planners (Hemetsberger, 2020) to an effective tool for city management (Wong et al., 2020), currently, CDTs only abstract a small fraction of the processes that shape the way in which the social and economic functions of the city work (Batty, 2018). Nochta et al. (2021) argue that CDTs can be useful at the early stages of policymaking, where they can be utilized for identifying inconsistencies between sectorial policies, supporting interdisciplinary policy design, and improving the efficiency and effectiveness of modeling. Other applications of CDTs in policymaking support include simulations of traffic scenarios (Dembski et al., 2019), identification of vulnerable physical objects in the city under “what if” simulations of disasters (Ham and Kim, 2020), and measuring effects of potential urban planning interventions on climate (Schrotter and Hürzeler, 2020).
2.3. Virtual Singapore
Virtual Singapore serves as a good example in the discussion about CDTs, as it is one of the most advanced projects in the field (Geddie and Aravindan, 2018). The project was granted $73 million by the National Research Foundation of Singapore and was developed by the French software company Dassault Systemes. It is a dynamic 3D model of the whole city of Singapore as well as the necessary technical infrastructure to turn it into a collaborative data platform (National Research Foundation Singapore, 2018). Other stakeholders involved in the project include Singapore Land Authority and Government Technology Agency (Nativi et al., 2020), with the former providing static data about the aboveground structures in the city (Guerrini, 2016).
The static data from the governmental agencies are not limited to the 3D models of the aboveground structures in the city. They also capture detailed information about the built environment of the city down to the scale of the individual building, including its geometry and the materials it is made of (Koçer, 2020), as well as data about the flora of the island (Gobeawan et al., 2018). The data ecosystem also includes real-time data obtained via IoT sensors deployed in the city (Guerrini, 2016). The data generated by other city-scale platforms, such as OpenMap Singapore, are also added to the DT model of the city (National Research Foundation Singapore, 2018). The CDT has recently been updated with data about the underground structures of the city, with this part of the project being titled “Digital Underground” (Yan et al., 2019, 2021). With the addition of the data about the belowground systems of the city, the CDT of Singapore now includes 20 core datasets: 3D airspace, vegetation, 3D road, 3D building models, cadastre, land use, administrative bodies, waterbody, geodetic control, orthophoto, 3D coastline, digital terrain model, point cloud, 3D reality mesh, imagery, positioning infrastructure, 3D address, building information modeling (BIM), 3D underground asset, and 3D geology (Schrotter, 2020).
Being a 3D model of the city, in comparison to the usual 2D models, Virtual Singapore is not only able to incorporate more data layers in the model, including the data about water bodies, vegetation, and transportation infrastructure, but also capable of showing the information about the curbs, stairs, or the steepness of the hill in the city (Dassault Systemes, 2018). Possible use cases of such information are identification of barrier-free routes for the elderly and disabled people, sharing of the information about wild animals in the city, tracking elderly people who have dementia, and detecting frequently used paths and spots in the city (Singapore Land Authority, 2014).
The applications of the Virtual Singapore model can include the visualization of the 3G/4G internet coverage areas in the city in order to determine which parts of the city suffer from poor connection; agent-based simulations of crowd dispersion, pedestrian movements, and transport flows; collaborative design of the pathways around new amenities in the city—the way such pathways will be built will influence the pedestrian flows, which can be simulated in the DT environment to determine the right intervention; and examination of the most efficient way of installing solar panels on the roofs to achieve higher energy production (National Research Foundation Singapore, 2018).
The project positions itself as a collaborative tool where different stakeholders can potentially conduct virtual experiments in the urban environment (National Research Foundation Singapore, 2018). However, some officials consider the system to be too dangerous to allow everyone to experiment with it, because, for example, militants or terrorists can use the information about the height, location, and the view from the buildings to plan their attacks (Geddie and Aravindan, 2018). For these reasons, Virtual Singapore will not be connected to the worldwide web, and the model has not yet been made publicly available, so the citizens are unable to interact with it (Geddie and Aravindan, 2018; White et al., 2021).
A common critique of CDTs like Virtual Singapore and of the simulations that can be conducted in these environments is their overreliance on historical data, which limits their predictive capabilities for extreme situations for which no data are available (GeoTwin, 2020), as is the case for the Virtual Singapore project (Holstein, 2015). Other critical takes on the current generation of CDTs include the lack of mechanisms for citizen engagement, interaction, and feedback reporting, as well as the underutilization of urban mobility data (White et al., 2021). Concerns related to the privacy protection of the micro-level data about individuals used for agent-based simulations in CDTs are also voiced (Bektas and Schumann, 2019; Bliss, 2019).
3. Shortcomings of the Current Generation of DTs
3.1. Data for DTs
With the digitalization of cities, different types of data can be generated within the urban boundaries (White et al., 2021), providing detailed information about transportation (Menouar et al., 2017), water supply (Parra et al., 2015), waste management (Medvedev et al., 2015), and power generation (Oldenbroek et al., 2017). However, meaningful knowledge discovery from such distinctively different data remains problematic due to noise and missing values in the data and the difficulty of linking structured and unstructured data (Mohammadi and Taylor, 2020). The nonexistence of certain types of data (Hemetsberger, 2020), the lack of data standards, and the unwillingness of different institutions to share data are also challenges that may slow down the development of CDTs (Wong et al., 2020).
According to Wong et al. (2020), the data that can be integrated into the CDT model can be represented as a 2 × 2 matrix, with the 2D/3D and Dynamic/Static dichotomies. In such a model, 2D dynamic data include navigation, remote facility monitoring, pandemic management and tracking, crowd and traffic control, climate diagram, and city dashboards. The 2D static data include maps, administrative boundary, and building plans. 3D dynamic data include navigation (3D), building management system, maintenance of underground pipes, microclimate and airflow analysis, command and control center/crisis management, whereas 3D static data include building information modeling, geographic information system, and design visualization (interior/venue).
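The 2 × 2 matrix described above can be written down as a simple lookup structure. The following is a minimal sketch: the entries are copied from the text, while the enum names, the dictionary name `CDT_DATA_MATRIX`, and the helper `quadrant_of` are hypothetical and introduced only for illustration.

```python
from enum import Enum


class Dimension(Enum):
    TWO_D = "2D"
    THREE_D = "3D"


class Dynamics(Enum):
    STATIC = "static"
    DYNAMIC = "dynamic"


# Each quadrant of the matrix maps to the data types listed in the text.
CDT_DATA_MATRIX = {
    (Dimension.TWO_D, Dynamics.DYNAMIC): [
        "navigation",
        "remote facility monitoring",
        "pandemic management and tracking",
        "crowd and traffic control",
        "climate diagram",
        "city dashboards",
    ],
    (Dimension.TWO_D, Dynamics.STATIC): [
        "maps",
        "administrative boundary",
        "building plans",
    ],
    (Dimension.THREE_D, Dynamics.DYNAMIC): [
        "navigation (3D)",
        "building management system",
        "maintenance of underground pipes",
        "microclimate and airflow analysis",
        "command and control center/crisis management",
    ],
    (Dimension.THREE_D, Dynamics.STATIC): [
        "building information modeling",
        "geographic information system",
        "design visualization (interior/venue)",
    ],
}


def quadrant_of(data_type):
    """Return the (Dimension, Dynamics) quadrant for a data type, or None if it is not listed."""
    for quadrant, data_types in CDT_DATA_MATRIX.items():
        if data_type in data_types:
            return quadrant
    return None


print(quadrant_of("maps"))
print(quadrant_of("microclimate and airflow analysis"))
```

A structure of this kind makes it easy to check which quadrant a new data source would fall into before it is integrated into a CDT.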
Fuller et al. (2020) argue that the number and quality of connection of IoT devices installed in the city are crucial for gathering enough relevant data for CDTs. Although feeding a lot of different data (Big Data) into the DT and then applying machine learning techniques for making predictions is considered to be a viable way of utilizing the power of IoT in the DT (Qi and Tao, 2018), some researchers question these techniques because some level of verification of the validity of the prediction is required, which is not always possible (Boje et al., 2020).
The high demand for city data from CDTs raises concerns about the safety and privacy of the data because certain applications such as real-time traffic flow simulations would require an excessive amount of individualized information (Bliss, 2019). Although not all CDTs of the current generation include urban mobility data (White et al., 2021), the potential issues related to city-level data aggregation infrastructure raise significant concerns among various stakeholders, affecting the level of trust in the technology (Ismagilova et al., 2020).
3.2. Human aspects of CDTs
The digitalization of city services and ubiquitous computing provide numerous channels through which up-to-date human mobility data can be aggregated at various temporal and spatial scales (Luca et al., 2020). The ways in which mobility data can be aggregated include GPS trackers embedded in mobile phones (Zheng et al., 2010) or vehicles (Pappalardo et al., 2013), the connection of the mobile phone to the cellular network (González et al., 2008), and geo-tagged posts on social media (Blanford et al., 2015).
There are numerous research projects that focus on the mining of human mobility trajectory data (Jiang et al., 2013) and finding statistical patterns in them (Barbosa-Filho et al., 2017). An abundance of this type of data allows for the application of new techniques to its analysis in order to predict the next location an individual will visit based on the historical data (Burbey and Martin, 2012; Wu et al., 2018) or forecast the flows of people in a geographic region (Ebrahimpour et al., 2019). The applications of these methodologies for policymaking are vast and include potential public emergency detection, public safety and traffic management, and land use management (Luca et al., 2020).
While most DT projects replicate physical environments and systems, there are projects that are trying to replicate human cognitive processes in urban environments (Du et al., 2020). Such projects are attempting to invent methodologies that would allow modelers to create realistic simulations of human behaviors, that is, to create DTs of humans. However, capturing human behavior via IoT sensors deployed in the city can be a challenging process, from both ethical and technical points of view, because the behavior of a human DT needs to be based on user feedback and recorded patterns, not simply on measured data (Graessler and Poehler, 2017). In this light, the Covid-19 pandemic experience opens a forum for a heated discussion, because fine-grained data recorded by contact-tracing applications provide a rich repository of behavioral data on which models of human behavior can be based, yet questions of ethics and data ownership make the usage of these data complicated.
The notion of human DTs is mainly discussed in the healthcare literature (Liu et al., 2019), where human DTs are understood as “representations of an individual that dynamically reflect molecular status, psychological status, and lifestyle over time” (Bruynseels et al., 2018). Some pilot projects have focused on creating human DTs with the usage of wearable fitness bracelets SmartFit, through which the behavioral data about activities, food consumption, and mood has been aggregated (Barricelli et al., 2020). Other work has focused on employing human DTs in industrial settings (Amenyo, 2018; Sparrow et al., 2019).
Another proposal for the human DT framework is focused on human–computer interaction (Hafez, Reference Hafez, Bi, Bhatia and Kapoor2020), where the human DT is understood as a meta-model that navigates the behavioral patterns of human interaction with numerous smart machines. The focus on mimicking behavioral patterns of humans can become another milestone in human DT creation as it can potentially help to not only model anatomical and physiological processes but also enrich such models with the addition of cognitive simulations (Kawamura, Reference Kawamura2019).
Urban mobility simulation is a part of the current generation of CDTs (Dembski et al., Reference Dembski, Wössner, Letzgus, Ruddat and Yamu2020), where agent-based models are used to simulate the mobility behavior of city inhabitants based on mobile phone data (Wu et al., Reference Wu, Liu, Yu, Peng, Jiao and Niu2019). As mentioned before, human mobility data come via three main channels: mobile phone or vehicle GPS data, cellular network connection data, and geo-tagged social media posts (Luca et al., Reference Luca, Barlacchi, Lepri and Pappalardo2020). Agent-based modeling of human mobility in the city requires fine-grained micro-level data as input; however, this type of data is often unavailable for numerous reasons, including privacy concerns (Bektas and Schumann, Reference Bektas and Schumann2019). Another drawback of agent-based modeling approaches is that they are not properly predictive (GeoTwin, 2020) and are unable to incorporate real-time data flows for short-term predictions (Kieu et al., Reference Kieu, Malleson and Heppenstall2020).
Another way to gather human mobility data is to deploy IoT sensors in the city, which allows for the passive collection of massive amounts of data (Rathore et al., Reference Rathore, Paul, Hong, Seo, Awan and Saeed2018). This type of data would not be representative of individual travel behaviors; rather, it would provide a continuously updated picture of the travel behavior of the inhabitants of a whole region of a city (Lim et al., Reference Lim, Kim and Maglio2018). The typical way to analyze urban Big Data is to deploy machine learning approaches, particularly deep learning (Toch et al., Reference Toch, Lerner, Ben Zion and Ben-Gal2019). Machine learning approaches have very strong predictive capabilities, as they tend to learn the general rules of the data and derive predictions from them (Ebel et al., Reference Ebel, Göl, Lingenfelder and Vogelsang2020). The natural limitation of this approach, however, is that these models may not be well suited to predicting situations that have not occurred before, because they reflect the features of the historical data on which they were trained (GeoTwin, 2020).
Thus, despite the abundance of urban Big Data, the predictive power of machine learning models trained on historical data is limited to events that have occurred before, which makes these models inapplicable when a completely new intervention or an emergency situation is being simulated. Agent-based modeling approaches, while better at providing plausible scenarios for situations that have not occurred before, are still far less effective in their predictive power and require sensitive data for their simulations.
4. The Potential of Synthetic Data Application in CDTs
One way of approaching this challenge is to generate synthetic data. Broadly understood, synthetic data are artificial data that have characteristics similar to real data. Although mainly used in cases where the real data are sensitive, synthetic data can also be used to replace missing data and to augment datasets by combining artificial and real data (Kaloskampis, Reference Kaloskampis2019). While the possible applications of this technology are vast, it is currently used chiefly to generate useful private data without violating data privacy laws (Bellovin et al., Reference Bellovin, Dutta and Reitinger2019).
Fully synthetic data can be understood as artificial data that are statistically similar to the original data (Park et al., Reference Park, Mohammadi, Gorde, Jajodia, Park and Kim2018) but contain no identifiable information about their origin (Dankar and Ibrahim, Reference Dankar and Ibrahim2021). This approach breaks the links between the original data and the new synthetic data so that reidentification is no longer meaningful (Taub et al., Reference Taub, Elliot, Pampaka, Smith, Domingo-Ferrer and Montes2018). It is believed to be secure for individual privacy, as the synthetic data do not map back to information about real individuals (Hu, Reference Hu2018).
While real behavior data are valuable and sensitive, in many domains, this type of data is unavailable, which makes the usage of synthetic data solutions attractive (Nikolenko, Reference Nikolenko2019). In most cases, the models trained on the synthetic data are almost as effective as models trained on the real data (Hittmeir et al., Reference Hittmeir, Ekelhart and Mayer2019). Synthetic data approaches are popular in healthcare, because of the obvious concerns for data sensitivity and privacy, where the models trained on synthetic data show only small decreases in accuracy when compared to the models trained on the real data (Rankin et al., Reference Rankin, Black, Bond, Wallace, Mulvenna and Epelde2020).
The general working logic of the process of synthetic data generation using available synthetic data generators is that the generator takes a dataset with private information as an input, constructs a statistical model of the statistical properties of the data, and then uses this model to generate synthetic datasets that are statistically similar to the original dataset but have no identifiable information from the original data (Dankar and Ibrahim, Reference Dankar and Ibrahim2021).
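The generator logic described above can be illustrated with a minimal sketch. It assumes a deliberately simple statistical model (a bivariate Gaussian over two numeric columns) and a handful of invented trip records; the names `REAL_TRIPS`, `fit_gaussian_model`, and `sample_synthetic` are hypothetical and do not correspond to any generator cited here. Actual generators typically rely on richer models and on explicit privacy safeguards that this sketch omits.

```python
import math
import random
import statistics

# Invented "private" records for illustration: (trip distance in km, duration in min).
REAL_TRIPS = [
    (1.2, 9.0), (3.5, 18.0), (0.8, 7.0), (5.1, 25.0),
    (2.4, 14.0), (7.9, 33.0), (4.2, 21.0), (1.9, 12.0),
]

def fit_gaussian_model(rows):
    """Step 1: summarise the input by means, standard deviations and correlation."""
    xs = [r[0] for r in rows]
    ys = [r[1] for r in rows]
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    sx, sy = statistics.stdev(xs), statistics.stdev(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (len(rows) - 1)
    return {"mx": mx, "my": my, "sx": sx, "sy": sy, "rho": cov / (sx * sy)}

def sample_synthetic(model, n, seed=0):
    """Step 2: draw new records from the fitted model only, not from the input rows."""
    rng = random.Random(seed)
    rho = model["rho"]
    residual = math.sqrt(max(0.0, 1.0 - rho ** 2))
    records = []
    for _ in range(n):
        z1, z2 = rng.gauss(0, 1), rng.gauss(0, 1)
        x = model["mx"] + model["sx"] * z1
        # Mixing z1 and z2 reproduces the fitted correlation between the columns.
        y = model["my"] + model["sy"] * (rho * z1 + residual * z2)
        records.append((x, y))
    return records

model = fit_gaussian_model(REAL_TRIPS)
synthetic_trips = sample_synthetic(model, 1000)
```

The synthetic records share the means, spreads, and correlation of the input up to sampling noise, while no input row is copied into the output. Because a Gaussian is unbounded, a fuller version would also need to handle implausible draws such as negative distances.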
Another key feature of synthetic data is that it allows the creation of data about hypothetical events that have not yet happened in real life (Nikolenko, Reference Nikolenko2019). This approach is currently used in the autonomous vehicle industry, where developers use synthetic data to test very unlikely yet possible scenarios that a vehicle can encounter in the city. For example, “a reflective flatbed crossing a highway at dusk, with the sun’s glare rendering it unintelligible to visual sensors trained only on daylight at noon” (Atherton, Reference Atherton2019). Another use of synthetic data for training machine learning algorithms in the autonomous vehicle domain involves collecting data from computer games and combining it with real data, which saves developers both money and time (Richter et al., Reference Richter, Vineet, Roth, Koltun, Leibe, Matas, Sebe and Welling2016).
Learning to drive a vehicle is a challenging process, which requires the application of reinforcement learning (RL) techniques so that an agent can learn from interacting with the environment, and this makes real-life experiments in this domain expensive and impractical (Nikolenko, Reference Nikolenko2019). This problem is actively tackled with synthetic data, where the models are trained in computer-generated 3D environments instead of the real world. Importantly, the performance difference between algorithms trained in the real world and in the virtual world is almost nonexistent (Gaidon et al., Reference Gaidon, Wang, Cabon and Vig2016).
Johnson-Roberson et al. (Reference Johnson-Roberson, Barto, Mehta, Sridhar, Rosaen and Vasudevan2017) used data from the video game Grand Theft Auto V, which has sufficiently realistic graphics, to train a model from scratch on synthetic data; the model, trained in a rich virtual world, was then able to recognize and classify real objects. A similar approach was taken by Ros et al. (Reference Ros, Sellart, Materzynska, Vazquez and Lopez2016), who recreated New York City in the Unity platform and then rendered more than 200,000 synthetic images of the city, which can be used for autonomous vehicle training. This approach has been further applied to other cities, for example, San Francisco (Hernandez-Juarez et al., Reference Hernandez-Juarez, Schneider, Espinosa, Vázquez, López, Franke, Pollefeys and Moure2017). Li et al. (Reference Li, Li, You and Barnes2017) used the same approach to create a dataset of synthetic images of foggy conditions in the city.
Datasets with 3D synthetic data of outdoor environments are less common (Nikolenko, Reference Nikolenko2019); however, some attempts have been made at adding a LiDAR simulation to the Grand Theft Auto V video game, and synthetic data have been generated via this method (Wu et al., Reference Wu, Ning, Chakraborty, Vreeken, Tatti and Ramakrishnan2018). A similar approach was taken by Forensic Architecture (2018), which created synthetic images of a tank in different environments in order to train a machine learning model to look for tank appearances in videos posted on social media, a sign of a potential human rights violation.
Video games have been historically used as a prominent source of virtual environments for the development of RL and other AI techniques (Schaul et al., Reference Schaul, Togelius and Schmidhuber2011; Justesen et al., Reference Justesen, Bontrager, Togelius and Risi2019). Games like Doom (Bhatti et al., Reference Bhatti, Desmaison, Miksik, Nardelli, Siddharth and Torr2016) and Minecraft (Oh et al., Reference Oh, Chockalingam, Singh and Lee2016) have been used to train the algorithms for robotic navigation in complex synthetic environments (Nikolenko, Reference Nikolenko2019). Real-time strategy games, such as StarCraft, have also been previously used for the generation of synthetic datasets (Lin et al., Reference Lin, Gehring, Khalidov and Synnaeve2017), while autonomous driving agents have been trained in different racing game simulators (Sulkowski et al., Reference Sulkowski, Bugiel and Izydorczyk2018).
However, the agents that are being trained in the virtual environments of video games usually are not transferred directly to the real world. “There is usually no goal to transfer, say, an RL (reinforcement learning) agent playing StarCraft to a real armed conflict (thankfully)” (Nikolenko, Reference Nikolenko2019), although with the advancement of technology such transferring will, most likely, become easier.
Synthetic data generation approaches have also been used in the context of urban management, where synthetic trajectories with realistic mobility patterns have been generated (Feng et al., Reference Feng, Yang, Xu, Yu, Wang and Li2020). Synthetic trajectories are crucial for urban planning (Kang et al., Reference Kang, Liu, Zhao and Ma2021), computational epidemiology (Cárcamo et al., Reference Cárcamo, Vogel, Terwilliger, Leidig and Wolffe2017), and other types of tasks that require “what if” simulation, for example, changes in mobility patterns in the presence of new infrastructure or terrorist attacks (Luca et al., Reference Luca, Barlacchi, Lepri and Pappalardo2020). Synthetic data approaches to urban mobility data have also proved effective in protecting the privacy of trajectory micro-data (Fiore et al., Reference Fiore, Katsikouli, Zavou, Cunche, Fessant, Hello, Aivodji, Olivier, Quertier and Stanica2020).
Agent-based modeling techniques are often applied to the social layer of urban models, as they allow behavior to be simulated at the individual level, where the behavior of an agent is determined by its attributes, on which its predictive capabilities depend (Chapuis and Taillandier, Reference Chapuis and Taillandier2019). These agents form the synthetic population of the model, that is, a simplified microscopic representation of a real population (Antoni et al., Reference Antoni, Vuidel and Klein2017), which keeps only the attributes that are of interest to the model (Ziemke et al., Reference Ziemke, Nagel and Moeckel2016). As in the broader discussion of synthetic data, a synthetic population matches the aggregated statistical measures of the real population without reproducing every single unit of the population (Lenormand and Deffuant, Reference Lenormand and Deffuant2013).
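The idea of a synthetic population that matches aggregate measures without copying individuals can be sketched as follows. This is a simplified illustration under stated assumptions: the attribute names and shares in `MARGINALS` are invented, and each attribute is sampled independently, which reproduces the marginal shares but not any dependence between attributes. Methods used in practice, such as iterative proportional fitting, are designed to preserve joint distributions as well.

```python
import random
from collections import Counter

# Hypothetical aggregate shares; not taken from any real census or survey.
MARGINALS = {
    "age_group": {"18-34": 0.35, "35-64": 0.45, "65+": 0.20},
    "main_mode": {"public_transport": 0.5, "car": 0.3, "bicycle": 0.2},
}

def synthesize_population(marginals, size, seed=0):
    """Create agents whose attributes are drawn from the aggregate shares."""
    rng = random.Random(seed)
    population = []
    for _ in range(size):
        agent = {}
        for attribute, shares in marginals.items():
            categories = list(shares)
            weights = list(shares.values())
            agent[attribute] = rng.choices(categories, weights=weights, k=1)[0]
        population.append(agent)
    return population

def observed_shares(population, attribute):
    """Measure how often each category occurs in the generated population."""
    counts = Counter(agent[attribute] for agent in population)
    return {category: count / len(population) for category, count in counts.items()}

agents = synthesize_population(MARGINALS, 10_000)
mode_shares_in_population = observed_shares(agents, "main_mode")
```

Comparing `mode_shares_in_population` with `MARGINALS["main_mode"]` shows the property described above: the population as a whole matches the aggregate figures, while no agent corresponds to a real person.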
Synthetic population approaches have been shown to be effective in epidemic simulation applications (Wu et al., Reference Wu, Luo, Shao, Tian and Peng2018) and in deriving synthetic populations from country-wide censuses to protect the privacy of individuals (Wickramasinghe et al., Reference Wickramasinghe, Singh and Padgham2020). This approach also has vast policy implications, because by using synthetic populations, policymakers can evaluate city-scale built environment policies (He et al., Reference He, Zhou, Ma, Chow and Ozbay2020), analyze the risk for cardiovascular disease (Krauland et al., Reference Krauland, Frankeny, Lewis, Brink, Hulsey, Roberts and Hacker2020), and evaluate electricity consumption in a neighborhood (van Dam et al., Reference van Dam, Bustos-Turu, Shah, Jager, Verbrugge, Flache, de Roo, Hoogduin and Hemelrijk2017).
This wide spectrum of applications makes the derivation of synthetic populations from micro-level urban mobility data a potentially effective addition to CDTs, as it allows for simulation with individual-level data without violating privacy laws and provides an instrument for “what-if” simulations of events for which no data are currently available. This can be especially relevant for policymakers working with CDTs, as data may simply not yet exist for some of the policy proposals that they want to test in a virtual urban environment (AImultiple, 2018). Creating artificial models of human behavior in the city can be an important next step in the utilization of this technology, allowing policymakers to avoid falling into the trap of assuming that people will react to new initiatives similarly to how they reacted in the past (Levina and Duerk, Reference Levina and Duerk2018).
However, the effectiveness of a model built on synthetic data depends strongly on how accurately the synthetic data represent the attributes of the real data (Chapuis and Taillandier, Reference Chapuis and Taillandier2019). As mentioned before, for certain potential policy interventions which may be of interest to policymakers working with CDTs, data about the behavioral response to these interventions may not exist. Thus, purely synthesized data may not suffice for developing a model that predicts the outcome of a certain intervention.
5. A Task-Based Approach to Urban Mobility Data Generation
To tackle this issue, a new approach to data generation is proposed. This approach could be utilized in the Virtual Singapore project discussed in detail earlier in this article. The proposal is based on the idea that the understanding of synthetic behavioral data should be expanded from artificial data generated by a computer, which retain the values of the real data, to artificial data generated by humans, which mimic real data as if they had been generated by naturally occurring behavior. This approach can be applied in situations where data are needed about the potential behavioral response to events that have not yet happened and for which no data are available.
The crucial question in this proposal, however, is what methodologies can be applied for the generation of such fine-grained behavioral data, based on which synthetic models of human behaviors could be made, combined, and turned into synthetic populations of CDTs.
The approach that we propose in this article is informed by the data-labeling practices (Murgia, Reference Murgia2019) currently exercised by technological corporations. Data labeling service providers ask workers to carry out simple tasks that cannot be executed by computers, in order to create the data on which machine learning models will be trained. The tasks include comparing two images, search results, or interface designs; judging how well search results match a search query; searching for information on the Internet; and field tasks, such as checking whether a business is still in operation or secretly buying a product from a store and writing a review. These services constitute the backbone of the AI revolution (Gahntz, Reference Gahntz2018), as they are crucial to the training of any AI system and employ millions (Reese and Heath, 2016) of blue-collar workers (Reese, 2016) across a planetary-scale labor market (Graham, Reference Graham2018).
Similar methodologies are discussed in the literature about games with a purpose, where manipulations with the real data are an integral part of the gameplay, as well as in urban data games, where the game content is based on real-world data. Data games refer to the games in which the game content is based on real data and the gameplay revolves around exploration and learning from the data (Friberger et al., Reference Friberger, Togelius, Borg Cardona, Ermacora, Mousten, Møller Jensen, Tanase and Brøndsted2013). The design of CDTs provides a perfect ground for the design of data games, as the data used for the creation of the model of the city is the real-world data that was compiled into a model. Urban data games have been previously used for increasing the data literacy among urban populations (Wolff et al., Reference Wolff, Kortuem and Cavero2015; Reference Wolff, Valdez, Barker, Potter, Gooch, Giles, Miles and Nijholt2017), which can be relevant in the context of CDTs, where urban data games can be created in order to involve the citizens in the process of understanding the data layers the city consists of.
The methodology of games with a purpose, that is, games in which people, as a side effect of play, create data or perform manipulations of data that computers cannot (von Ahn and Dabbish, Reference von Ahn and Dabbish2008), can become another reference model for this proposal. In this sense, the act of playing should be treated as an act of data generation. These ideas can be extrapolated and taken outside of web-service optimization to the domain of public policy data generation.
Taking these ideas further, a Task-Based Approach to Urban Mobility Data Generation is proposed. Under this approach, city inhabitants would be asked to conduct certain activities in the city in order to generate data about activities that have not yet happened but could happen. While the potential application of this methodology is broad, urban mobility data generation seems to be the domain in which it can be applied first. Such services can ask people to go from place A to place B using only certain types of transportation, to imagine how they might behave in an emergency situation with a given set of parameters, or to generate behavioral data in locations in which other sources of such data (GPS, cellular, or geo-tagged social media posts) are scarce. This approach can potentially become a valuable source of data in situations where reliance on historical data does not provide accurate predictions about certain hypothetical events, as well as in situations where micro-level individual data are needed for agent-based modeling but cannot be utilized due to privacy complications. This approach to urban mobility data generation does not violate privacy legislation, because the data generated by a person describe a hypothetical behavioral reaction to an intervention; since the data are generated via specific tasks, they do not represent the behavior of a real individual, as the recorded behavior did not occur naturally.
Synthetic data are currently used mainly in cases where the real data are very sensitive, and the methodology anonymizes data by deriving data points that resemble the real data. Simulation games can provide an opportunity for anonymization on two levels. First, because the scenarios are hypothetical, the responses do not closely reflect any single individual. Second, the usual process of anonymization through synthetic data creation can be applied to these data to create second-order synthetic behavioral data, in which the data values aggregated through simulation games are approximated again.
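The second level of anonymization can be sketched in a few lines. The example assumes a hypothetical response format (chosen mode, travel time in minutes) and invented values; `TASK_RESPONSES`, `summarise`, and `second_order_sample` are illustrative names only, and a per-mode Gaussian is used purely as a stand-in for whatever generator would actually be applied.

```python
import random
import statistics
from collections import defaultdict

# First-order data: invented responses to one hypothetical task,
# recorded as (mode chosen, travel time in minutes).
TASK_RESPONSES = [
    ("bicycle", 22.0), ("bicycle", 25.5), ("bicycle", 19.0),
    ("car", 15.0), ("car", 17.5), ("car", 14.0), ("car", 16.0),
    ("public_transport", 30.0), ("public_transport", 34.0), ("public_transport", 28.5),
]

def summarise(responses):
    """Reduce the responses to per-mode share, mean and standard deviation."""
    by_mode = defaultdict(list)
    for mode, minutes in responses:
        by_mode[mode].append(minutes)
    return {
        mode: {
            "share": len(values) / len(responses),
            "mean": statistics.fmean(values),
            "sd": statistics.stdev(values),
        }
        for mode, values in by_mode.items()
    }

def second_order_sample(summary, n, seed=0):
    """Draw new records from the summary alone; individual responses are not reused."""
    rng = random.Random(seed)
    modes = list(summary)
    weights = [summary[mode]["share"] for mode in modes]
    records = []
    for _ in range(n):
        mode = rng.choices(modes, weights=weights, k=1)[0]
        minutes = max(0.0, rng.gauss(summary[mode]["mean"], summary[mode]["sd"]))
        records.append((mode, minutes))
    return records

summary = summarise(TASK_RESPONSES)
second_order_records = second_order_sample(summary, 500)
```

Only the aggregate `summary` is passed to the sampling step, so the second-order records are one step further removed from any participant than the task responses themselves.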
Another potential benefit of this approach is that it allows the creation of a medium through which city inhabitants can be engaged in the process of urban policymaking, as they will be providing the data responses to certain proposals for intervention. This data response can serve as a medium through which the citizens can engage in the discussion about urban policy and design and express their opinion through their data responses. The approach can potentially be valuable for both policymakers and city residents, where the former will be provided with the data responses for the potential interventions for which the historical data does not suffice, while the latter will have a channel through which they can express their attitude toward the proposals.
Citizen upskilling and labor provision can be another way in which this approach can contribute to city development because, following the model of data-labeling services, city-level services for task-based data generation can be established. These services can become a vehicle for delivering public value through more-informed policies and citizen engagement.
In the case of Virtual Singapore, currently, the predictive capability of this CDT is limited by overreliance on historical data. As has been previously discussed, there are types of problems for which neither agent-based simulations nor machine learning approaches are helpful either because of the lack of data or because of privacy concerns associated with this data. Modeling of realistic human mobility and urban behavior in the times of the Covid-19 pandemic is one of these issues.
The utilization of the approach introduced in this article within the simulative environment of the CDTs, like Virtual Singapore, can allow for the creation of synthetic datasets representative of human behavioral response to hypothetical conditions. In terms of urban mobility data, under the conditions of the pandemic, this approach can be used for the approximation of the preferable mode of transportation in unusual situations for which the historical data is unavailable.
As argued in the McKinsey report (Hattrup-Silberberg et al., Reference Hattrup-Silberberg, Hausler, Heineke, Laverty, Möller, Schwedhelm and Wu2021), reducing the risk of infection has become the most important reason for choosing the mode of transportation in the city during the pandemic (time to the destination was the prime reason before the pandemic). Different cities had different experiences with the way the pandemic affected urban mobility patterns (Gragera et al., Reference Gragera, Albalate, Bel, Schaj, Cañas, Aquilué, Helder, Espindola, Mósca, Edelstam, Marti, Shetty, Barton, Riegebauer, Filohn and Urbano2021). For example, Singapore has experienced a cycling boom (Abdullah, Reference Abdullah2020) and a subsequent spike in cycling accidents (Abdullah, Reference Abdullah2021).
While estimating the true effects of extraordinary events like the Covid-19 pandemic and designing a response strategy to them is very hard, the task-based approach introduced in this article can be helpful for gathering behavioral data about unlikely yet possible scenarios. In the case of urban mobility during the pandemic, this approach could have been utilized by asking respondents how they would get from point A to point B under social distancing. This behavioral response could have provided data about the population’s preferences in terms of the mode of transportation by showing whether people would become more reliant than usual on driving personal cars, cycling, or using electric scooters (in circumstances where previously they would have taken public transport), as well as people’s attitudes toward using car-sharing and bike-sharing services at a time when personal hygiene is a priority.
The insights generated from analyzing these behavioral responses can be relevant when designing a governmental strategy for urban mobility during the pandemic, as they show to which areas resources should be allocated and which risks (such as a spike in bicycle accidents) can be expected and potentially mitigated. While it is obviously impossible to gather data responses for all possible situations, the application of this approach is still much more realistic in terms of the resources needed than running a full-scale policy experiment on an urban scale.
There are several potential challenges associated with the utilization of this approach, which will need to be addressed. First, an incentive scheme needs to be established, at least at the early stages, in order to encourage people’s participation. Second, it is highly likely that some population groups will be more interested and active in participating than others. Third, some participants can have biased views on certain situations, which can affect the quality of the data. Fourth, it is possible that certain tasks will attract more participants than others, leaving less information gathered for the remaining tasks. Finally, there are potential challenges in reaching out to and including minority communities in order to guarantee that their opinions are represented through this approach.
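As a rough illustration of how such responses could be analyzed, the sketch below compares each respondent's usual mode with the mode they report for the social-distancing scenario and computes the change in mode shares. The paired answers in `RESPONSES` are invented, and the function names are hypothetical.

```python
from collections import Counter

# Invented paired answers: (usual mode, mode chosen under social distancing).
RESPONSES = [
    ("public_transport", "bicycle"),
    ("public_transport", "car"),
    ("public_transport", "public_transport"),
    ("car", "car"),
    ("bicycle", "bicycle"),
    ("public_transport", "e_scooter"),
    ("car", "car"),
    ("public_transport", "bicycle"),
]

def mode_shares(modes):
    """Return the fraction of answers that name each mode."""
    counts = Counter(modes)
    total = sum(counts.values())
    return {mode: count / total for mode, count in counts.items()}

def share_shift(responses):
    """Change in share per mode between the usual case and the scenario."""
    usual = mode_shares(usual_mode for usual_mode, _ in responses)
    scenario = mode_shares(scenario_mode for _, scenario_mode in responses)
    all_modes = sorted(set(usual) | set(scenario))
    return {mode: scenario.get(mode, 0.0) - usual.get(mode, 0.0) for mode in all_modes}

shift = share_shift(RESPONSES)
```

With these invented answers, `shift` is negative for public transport and positive for cycling, cars, and e-scooters, which is the kind of signal that could point to where cycling infrastructure or safety measures might be needed.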
6. Conclusion
This article investigates the potential that the technology of CDTs has for simulating policy interventions. As policy responses to contemporary challenges require both planning and fast reaction, new tools for envisioning the future are needed. CDTs have the characteristics which can provide adequate platforms for such scenario planning.
However, as can be seen from the example of the Virtual Singapore project, current CDTs are mainly made for manipulating the built environment, thus providing a toolset predominantly aimed at the physicality of the city. While simulations of the activities conducted by city populations in urban environments are part of CDT models, they are usually based on historical data, which limits their predictive capabilities in unique situations for which data are unavailable and poses certain threats to the privacy of the data collected about individuals. For these reasons, the usage of fine-grained micro-level behavioral data for governmental simulations is complicated. Synthetic data are believed to be a potential answer to these challenges.
In this article, the idea of synthetic data is expanded from artificial data generated by a computer, which retain the values of the real data, to artificial data generated by humans, which mimic real data as if they had been generated by naturally occurring behavior. Based on this notion, a Task-Based Approach to Urban Mobility Data Generation is proposed. This approach is informed by the practices of data labeling, urban data games, and games with a purpose. Under this approach, the government may ask city inhabitants to conduct certain tasks in the city in order to generate fictional data, which would provide insights into how people would react to a hypothetical policy intervention for which no data exist because such an intervention has never happened before. Based on these behavioral patterns, the government can create a synthetic population for the CDT that would provide information about a realistic behavioral response to a hypothetical policy intervention without raising legal or privacy concerns, as the data generated through this method will not represent any real individual in society.
Abbreviations
- AI: artificial intelligence
- NASA: National Aeronautics and Space Administration
Acknowledgments
This article is an extended version of the paper presented at the Data for Policy Conference 2020. The conference paper can be accessed via the link: https://zenodo.org/record/3967284#.X-Vx--kzbly. The authors would also like to thank Artem Nikitin, Svetlana Gorlatova, and Igor Sladoljev for the discussions at the Strelka Institute that originated some of the ideas from this article.
Funding Statement
This work received no specific grant from any funding agency, commercial, or not-for-profit sectors.
Competing Interests
The authors declare no competing interests exist.
Author Contributions
Conceptualization, G.P. and M.Y.; Methodology, G.P.; Writing – original draft, G.P.; Writing – review and editing, G.P. and M.Y.; Supervision, M.Y.
Data Availability Statement
Data availability is not applicable to this article as no new data were created or analyzed in this study.