Policy Significance Statement
The open-access EUMigraTool (EMT) reports represent a pioneering effort to enhance communication and transparency for policymakers in the field of migration governance. Published every 6 months, these reports offer concise, insightful analyses derived from the EMT’s comprehensive research. By focusing on patterns of origin and transit countries, the reports provide a holistic view of the migration life cycle. Informed by extensive literature reviews, they encourage long-term policy provisions, fostering sustainable and effective migration governance. The reports serve as a bridge between research and policymaking, enabling evidence-based decision-making. Through dissemination on our website, social media, and engagement events, the EMT reports facilitate informed discussions, encouraging strategic, forward-thinking policies in Europe. This initiative represents a significant step towards bridging the gap between research insights and policy implementation, promoting a more comprehensive and proactive approach to migration management.
1. Introduction
In an era defined by unprecedented global mobility, understanding the patterns and processes of migration has never been more critical for NGOs and associations assisting migrants. The complexities of migration dynamics, influenced by a multitude of socio-economic and political factors, have intrigued scholars, policymakers, and practitioners for decades. With advancements in technology and data analytics, the field of migration studies has undergone a transformative evolution. Predictive tools, empowered by cutting-edge computational techniques and comprehensive data sets, have emerged as indispensable instruments for anticipating, analyzing, and responding to migratory trends.
Within this context, the EUMigraTool (hereinafter, the EMT) has been created as an advanced predictive analytics tool (Footnote 1) whose development represents a pivotal moment in the evolution of migration studies, where technology intersects with social science, leading to unparalleled insights and transformative results. Through a fusion of sophisticated machine learning algorithms, selected open-access data sets, and geospatial analyses, this tool has unlocked new dimensions for NGOs and civil societies to understand the intricate web of migration patterns in the EU. By harnessing the collective knowledge encapsulated in data, the EMT has transcended traditional boundaries, enabling selected NGOs and humanitarian actors to anticipate migration flows with remarkable accuracy.
In this article, we embark on a comprehensive exploration of EMT, delving into its architecture, methodologies, and the outcomes it has yielded. We will reflect on the ethical implications inherent in the use of such predictive tools, emphasizing the importance of responsible data usage for tools that are intended to help vulnerable groups such as migrants.
The overall purpose of this article is to share and discuss the remarkable achievements of the EMT while presenting the potential it holds for the future. Additionally, this article sheds light on the transformative impact of predictive tools on humanitarian actors, where data-driven insights can inform migration policies and humanitarian responses, ultimately contributing to more informed and empathetic decision-making processes.
2. An overview of the EUMigraTool
The EMT is a software platform that integrates all the knowledge created within the “IT tools and methods for managing migration FLOWS” (ITFLOWS) project (Footnote 2). It provides relevant stakeholders with a set of tools for running simulations and predictions on various aspects of migration, ranging from the number of people expected to leave a certain region within selected countries of origin for the European Union (EU) to the potential challenges that arise when migrant populations arrive in EU territories (Stavropoulos et al., 2021). The EMT has reached Technology Readiness Level (TRL) 6 and is at a point where it can be used outside of the testing environment (Footnote 3).
Territorially, the EMT is expected to be used within the European Union. Specific countries of origin and reception countries within the Union were identified at the beginning of the project (September 2020) to conduct the pilots. The selection of the countries of origin used to build the algorithms took into account a wide range of factors, such as the intentions of migrants, the existence of previous modeling work and validation data held by the relevant expert partners of ITFLOWS, and the availability of accurate data about these countries. Based on the main flows in 2020 (and, to some extent, the possibility of continued flows in the following years), the following countries of origin were chosen for the models: (1) Syria, (2) Nigeria, (3) Mali, and (4) Venezuela.
The design of the EMT consists of two major components: (1) the front-end, which is what the user sees, providing a set of intuitive tools to set up the prediction and simulation use cases the user wants to analyze, and (2) the back-end, which is responsible for collecting, storing, and managing all the data, along with performing all the necessary processing to produce the simulations and predictions (Stavropoulos et al., 2021). Regarding the material scope of the EMT, its main functionality consists of predicting displacements of migrants from outside the EU (a) to neighboring countries and (b) to the EU for asylum purposes.
For this study, only function (b) on the predictions of asylum applicants in the EU is examined. This function is found in the tool under the name “Predictions—to the EU—short term forecasts (1 month),” which operates under the so-called Large-Scale Model (LSM). It aims at serving NGOs in the organization of their staff and equipment for an immediate reception and assistance of asylum seekers arriving in their host countries. It is worth highlighting that the model for such predictions is updated weekly, so the closer the search date is to the prediction date (in the following month), the higher the accuracy is. For instance, an NGO may search for a prediction on August 1, 2025, for the number and gender of asylum seekers arriving in its country in September. The NGO would obtain a range number for arrivals, but if the same search criteria are filled out on August 25, 2025, the numbers may have changed slightly, as the variables are updated four times a month.
2. Data gathering, storage, and modeling
The LSM, which provides short-term predictions of asylum seekers’ arrivals, gathers data from GDELT (see details in Section 2.1), Eurostat, and All News data set by using scripts. Data are updated automatically and various scripts have been created to handle the process of downloading data and pushing these data to the chosen data repository, CKAN, which is explained below.
2.1 Repository and data sources
CKAN is an open-source data portal platform that makes data accessible by supplying tools to streamline publishing, sharing, finding, and using data. It is useful to data publishers (national and regional governments, companies, and organizations) wanting to make their data open and available (either privately or publicly). CKAN was chosen over the alternatives fundamentally because it is open source and comes with no financial cost: it can be used without any license fees, and users keep all rights to the data and metadata they enter. In the present case, CKAN serves as the data repository for the large-scale model, and its content is updated automatically by parsing public APIs. In addition, all data added to CKAN is completely anonymized and non-identifying, so as not to raise any legal or ethical risks (Stavropoulos et al., 2021).
The Directorate-General Eurostat provides its European statistics on asylum applications, broken down by nationality, on a monthly basis (Footnote 4). These data show the evolution of irregular arrivals and asylum application trends over the past 10 years (with gender disaggregation, to the extent possible). In addition to revealing visible peaks and drops in the volume of migrants and asylum seekers arriving in the EU from the origin countries under consideration, this makes it possible to identify shifts in the main routes and entry points used by migrants and asylum seekers of particular nationalities at specific points in time. Moreover, the data collected from Eurostat included not only the total number of asylum applicants and their countries of origin but also demographic data such as gender and age. These data were used to train the Multi-Layer Perceptron (MLP) model, which made it possible to derive more accurate estimates of asylum seekers, split by gender and age groups.
The GDELT data are collected via the GDELT Project, an initiative that constructs catalogues of human societal-scale behaviors and beliefs from countries all around the world. It also includes catalogues and data regarding news sources, events across the world, and their context. GDELT includes data from 1979 to the present. Some older data sets are available at yearly and monthly granularity, while the newest data sets are updated every 15 minutes. Data files record events using Conflict and Mediation Event Observations (CAMEO) coding. The database is one of the highest-resolution inventories of the media systems of the non-Western world and operates in near real time. It is described as key to developing technology that studies global society. The GDELT database provides 15-minute updates; real-time translation of news written in 65 different languages; real-time measurement of over 2,300 emotions and themes; relevant imagery, videos, and social embeds; and quotes and event discussion progression.
The GDELT Project offers numerous data sets, but EMT is mainly using the following: (a) GDELT 2.0 Event Database (GDELT Master), which by itself includes the GDELT Global Knowledge Graph (GKG) and the GDELT Mentions CSV data set, (b) the GDELT Global Quotation Graph (GQG), and (c) the GDELT Global Relationship Graph (GRG). Moreover, the GDELT API includes a country parameter that fetches article news from a specific country. The GDELT database includes national news websites from various reliable, well-known and established news sources from all around the world such as BBC, CNN, The New York Times, and others, and includes a wide range of regional and local news outlets as well. This allows for precise selection of articles of interest, based on country of origin.
The GDELT Global Knowledge Graph includes themes for reporting economic indicators, such as price grouping and heating oil prices, as well as infrastructure topics and social issues such as marginalization and burning in effigy. It includes lists of recognized infectious diseases, ethnic groups, and terrorist organizations, and more than 600 global humanitarian and development aid organizations have been added in the 2.0 Database. The GDELT Global Quotation Graph, for its part, compiles quoted statements from news all around the world. It scans each article monitored by GDELT and compiles a list of all quoted statements within, along with sufficient context to allow users in many cases to establish speaker identity. This data set covers 152 languages, with minor limitations in capturing some quotes. Each quote is supplemented with a fragment of text before and after the quotation. The data set is updated every minute but is generated every 15 minutes for download. Finally, the GDELT Global Relationship Graph contains the assertions and relationships made in the global press every day. The data set consists of verb-centred ngrams updated in real time. The ngrams rely on part-of-speech tagging of the articles, which may cause misclassification in some cases. Each verb is accompanied by up to six tokens before and after the verb. The ngrams are only generated around verbs, creating a fixed context around each verb phrase and capturing the statements of action related to the article. Like the GQG data set, GRG is updated every minute but is generated every 15 minutes for download. GDELT thus offers a huge database, and the data are too large to download in full. For reference, according to GDELT’s website, the GKG alone amounts to 2.5 TB for a single year. For this reason, Python programs were written that download only the data that are required, based on the name of the data set and the date.
After downloading from GDELT, data are uploaded to the CKAN repository (Stavropoulos et al., 2021).
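The download programs themselves are not reproduced in this article. As a minimal sketch of how a selection by data set name and date could work, the snippet below builds the list of 15-minute file URLs for one data set and one day. The base URL, the file suffixes, and the function name `gdelt_file_urls` are assumptions made for illustration and should be checked against GDELT's published file list; nothing is downloaded here.

```python
from datetime import datetime, timedelta

# Assumed base URL and file suffixes for GDELT 2.0 15-minute files; verify
# them against GDELT's published master file list before relying on them.
GDELT_V2_BASE = "http://data.gdeltproject.org/gdeltv2"
SUFFIXES = {
    "events": "export.CSV.zip",
    "mentions": "mentions.CSV.zip",
    "gkg": "gkg.csv.zip",
}


def gdelt_file_urls(dataset: str, day: datetime) -> list[str]:
    """Return the 96 fifteen-minute file URLs of one data set for one day."""
    suffix = SUFFIXES[dataset]  # raises KeyError for an unknown data set name
    start = datetime(day.year, day.month, day.day)
    urls = []
    for slot in range(96):  # 24 hours * 4 slots per hour
        stamp = (start + timedelta(minutes=15 * slot)).strftime("%Y%m%d%H%M%S")
        urls.append(f"{GDELT_V2_BASE}/{stamp}.{suffix}")
    return urls


urls = gdelt_file_urls("gkg", datetime(2021, 3, 1))
print(len(urls))
print(urls[0])
print(urls[-1])
```

Building the URL list first keeps the selection logic separate from the network and upload steps, so only the files that are actually needed are requested.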
In conclusion, data from GDELT and Eurostat are downloaded into the EMT’s data repository (and updated accordingly) and fed to the relevant algorithms. The data are filtered to meet the needs of the model, translated, and cleaned of some words before being stored in CKAN, which serves as the data repository. With the help of these data sets, the EMT web app can gather the information it needs and analyze it to produce precise forecasts of asylum seekers. The script runs automatically every week to collect the data sets from the previous weeks. The program is divided into two parts, data collection and data updating: it downloads each data set, updates the current data sets with the newly collected data each week, and pushes them to CKAN.
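The weekly collection step can be illustrated with a short sketch that works out which days still need to be downloaded on a given run. The function names and the idea of tracking already-stored days as a set are hypothetical choices for this example; the updating step (pushing files to CKAN) is deliberately left out.

```python
from datetime import date, timedelta


def previous_week(run_day: date) -> list[date]:
    """Return the seven calendar days that precede the run day, oldest first."""
    return [run_day - timedelta(days=offset) for offset in range(7, 0, -1)]


def days_to_download(run_day: date, already_stored: set[date]) -> list[date]:
    """Collection step: keep only the days not yet present in the repository."""
    return [d for d in previous_week(run_day) if d not in already_stored]


# Hypothetical state: two days of the previous week are already stored.
stored = {date(2021, 3, 1), date(2021, 3, 2)}
todo = days_to_download(date(2021, 3, 8), stored)
print(todo)
```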
2.2. The EMT large-scale model
Today, many models exist that can provide insights into predictive tools for migration. For instance, back in 2006, Jianli et al. (2006) proposed a prediction algorithm for migration paths of mobile agents, which can be applied to the design and application of intrusion detection systems. More recently, Singh et al. (2022) introduced a predictive checkpoint technique using long short-term memory (LSTM) for iterative phase container migration in cloud computing, resulting in reduced migration time and data transmission. Similarly, Motaki et al. (2021) presented a prediction-based model for the live migration of virtual machines in a cloud data center, dynamically identifying the optimal migration algorithm based on prior system diagnosis. Finally, Chahal et al. (2020) discussed the migration of a recommendation system to the cloud using a machine learning workflow, focusing on the performance of the recommendation system model when deployed on different virtual instances.
The LSM of the EMT proposes a different machine-learning model from those mentioned above. As a main difference, it is fed by open-access data and does not include personal data for its predictions. It specifically consists of the following stages (Stavropoulos et al., 2021):
(1) Data Processing: Data coming from the CKAN repository are cleaned in terms of formatting and missing values. Both categorical and numerical imputation methods are applied, and the data of the least value are removed.
(2) Feature Extraction: At this stage, the most essential features for migration flow prediction are extracted using both traditional machine learning algorithms (like support vector machine and linear regression) and state-of-the-art deep learning architectures (e.g., VGG16). These features include several indicators like violence, economic growth, public health care reach, climate anomalies, national sentiment towards migrants, political situation, and so on.
(3) Artificial Intelligence Module: The selected features, chosen for their quality and relevance, are used as input to several machine learning classification algorithms (including both traditional and state-of-the-art approaches) to obtain a realistic estimate of the probability of a nation’s migration inflow/outflow and of its attitude towards migrants. AI outputs are accompanied by comments explaining how the conclusions were drawn. This stage also examines the best way to avoid biased data sets and ensure a realistic estimate.
(4) End-User Requirements Satisfaction: This is the last stage of the model, where the output of the Artificial Intelligence Module is formatted into the information specified by the end-user functional requirements. For example, one of the outputs is an estimate of the number M of migrants coming from a country of origin X to a country of destination Y. Confidence intervals are provided for these estimates.
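Stage (1) mentions categorical and numerical imputation without naming the methods used. As an illustration only, the sketch below assumes median imputation for numeric columns and most-frequent-value imputation for categorical columns; both choices, the function names, and the sample values are assumptions, not the EMT's documented procedure.

```python
from collections import Counter
from statistics import median


def impute_numeric(values):
    """Replace None with the median of the observed numeric values."""
    observed = [v for v in values if v is not None]
    fill = median(observed)
    return [fill if v is None else v for v in values]


def impute_categorical(values):
    """Replace None with the most frequent observed category."""
    observed = [v for v in values if v is not None]
    fill = Counter(observed).most_common(1)[0][0]
    return [fill if v is None else v for v in values]


applications = [120, None, 95, 140]  # hypothetical monthly counts
gender = ["F", "M", None, "M"]       # hypothetical categorical column

filled_applications = impute_numeric(applications)
filled_gender = impute_categorical(gender)
print(filled_applications)
print(filled_gender)
```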
Figures 1 and 2 provide a visual representation of the topic modelling and LSM pipelines. The initial step of the pipeline (data collection) consists of collecting news headlines from GDELT and historical records from Eurostat. Articles that require it pass through a translation stage so that all articles are in English and, to make the topic classification more effective and less biased, source country names and domain brands are removed from the news headlines. The pre-processing step involves the elimination of stop words (Sarica and Luo, 2021), the removal of everything but nouns (singular and plural), verbs, and adjectives, lemmatization (Khyani and Siddhartha, 2021), and tokenization (Webster and Kit, 1992). In the vectorization step, text data are transformed into numeric vectors. Following this, every vector is passed through the Latent Dirichlet Allocation (LDA) topic modeler, and the resulting topic shares (which represent topic distributions) are used as input for a Multi-Layer Perceptron (MLP) (Pinkus, 1999), along with Eurostat’s historical asylum application records.
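A simplified sketch of the pre-processing and vectorization steps is given below, using only the Python standard library. The stop-word list, the list of source and brand terms, and the sample headlines are tiny hypothetical stand-ins; part-of-speech filtering, lemmatization, the LDA topic modeler, and the MLP are not shown, and the count-based vectors are an assumed form of vectorization.

```python
import re
from collections import Counter

# Tiny illustrative lists; a real pipeline would use full stop-word,
# country-name, and outlet-name lists.
STOP_WORDS = {"the", "a", "an", "in", "of", "to", "and", "as", "after", "on"}
REMOVE_TERMS = {"syria", "syrian", "bbc", "cnn"}


def preprocess(headline: str) -> list[str]:
    """Lowercase, tokenize, and drop stop words and source/brand terms."""
    tokens = re.findall(r"[a-z]+", headline.lower())
    return [t for t in tokens if t not in STOP_WORDS and t not in REMOVE_TERMS]


def vectorize(docs: list[list[str]]) -> tuple[list[str], list[list[int]]]:
    """Build a sorted vocabulary and one term-count vector per document."""
    vocab = sorted({t for doc in docs for t in doc})
    vectors = []
    for doc in docs:
        counts = Counter(doc)
        vectors.append([counts[t] for t in vocab])
    return vocab, vectors


headlines = [
    "BBC: Fighting displaces families in northern Syria",
    "Families flee as fighting spreads",
]
docs = [preprocess(h) for h in headlines]
vocab, vectors = vectorize(docs)
print(vocab)
print(vectors)
```

In a full pipeline, vectors of this kind would be the input to the topic modeler, whose topic shares are then combined with the Eurostat records.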
Articles gathered from GDELT require translation and text cleaning before the LDA topic modeling algorithm is applied to extract the topic shares and features. After the topics are extracted, the MLP algorithm combines these topics/features with data from Eurostat to produce the asylum seekers forecast. Documentation has been published (https://emt.itflows.eu/disclaimers/) explaining the bias limitations of the model and how the developers have identified and tackled this issue, alongside the limitations imposed by the data sets and by data availability in general. Finally, as explained in stage (4) above, the model provides interval predictions as confidence intervals with 90% statistical confidence, meaning that the lower and upper bounds of the predictions contain the actual number that Eurostat provides 90% of the time the model produces a prediction for a bilateral case. For more information, see Stavropoulos and Gevrekis (2022).
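The 90% figure can be read as an empirical coverage rate: the share of bilateral cases in which the number reported by Eurostat falls between the predicted lower and upper bounds. The sketch below shows that calculation on invented numbers; the intervals, the actual values, and the function name are hypothetical and are not EMT outputs.

```python
def empirical_coverage(intervals, actuals):
    """Share of cases whose actual value lies inside [lower, upper]."""
    hits = sum(1 for (lo, hi), y in zip(intervals, actuals) if lo <= y <= hi)
    return hits / len(actuals)


# Hypothetical predicted intervals and reported counts for ten cases.
intervals = [(100, 150), (80, 120), (40, 60), (10, 30), (200, 260),
             (90, 110), (55, 75), (300, 380), (20, 45), (130, 170)]
actuals = [132, 95, 58, 12, 270, 101, 60, 310, 33, 150]

coverage = empirical_coverage(intervals, actuals)
print(coverage)
```

In this invented sample, nine of the ten reported values fall inside their intervals, which corresponds to the 90% level described above.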
3. A tool designed and tested by end-users
Since the beginning of the ITFLOWS project, a total of 17 NGOs were involved as part of the Users Board (UB) in the design and validation of the tool. Table 1 below shows the members of the board that were involved in the development of the EMT.
For validation purposes, members of the UB have been presented, over the last 3 years, with a series of use cases in which they could test the capabilities of the EMT and identify its limitations. In several workshops, the UB members were asked how these functionalities would fit into their needs, routines, and activities.
The overall feedback received from the UB has been encouraging, underscoring the positive evolution of the development process of the tool, and included relevant recommendations that could be later used by developers to strengthen the EMT. Questionnaires included sections addressing both the current state of the tool at the moment they were conducted and future considerations, which was useful considering that, at the time, the EMT was a work in progress. When necessary, such recommendations were considered and incorporated.
The questions posed during the workshops were split into thematic groups. There were question groups for each major functionality of the EMT. The questions in these groups were designed and formulated similarly so that they could be compared, but they also contained context-specific elements. Two additional question groups focused on the overall services of the EMT: one on non-functional elements such as the glossary, look-and-feel, and support, and the other on the overall evaluation of the EMT.
To gather critical information and get a grasp of the social benefits and reach of the EMT, the Users Board was also sent online surveys in addition to the workshops. The questionnaire was structured thematically to cover the general perception of the viability of the tool, its two main functionalities, possible misuses, and its impact on the efforts of non-governmental organizations working on reception and integration. In broad terms, end-users perceived the EMT as a useful tool to improve their daily work, specifically by providing support, guidance, and direction when developing strategies and mechanisms to receive migrants and asylum seekers.
NGO representatives explained in the surveys that they expect the EMT to provide support to organizations working on reception and integration, to implement evidence-based measures and to better allocate their resources. In many instances, the material capabilities of certain organizations are limited, which is why this statement was repeated by more than one respondent. The rigorous and detailed information on the arrivals of migrants and asylum seekers could give direct end-users a credible picture of the situation in those individuals’ local contexts and potential migratory routes. Among other things, this would be helpful to foresee and detect possible dangers that those people could be facing throughout their journey. Concurrently, all respondents agreed on the fact that access to such aggregated data should continue to be granted on a case-by-case basis (via a personal identifier), maintaining the humanitarian purpose of the tool and ensuring that the exploitation is fully ethical and in line with the respect of human rights.
Other representatives pointed out that their organizations do not deal with short-term arrivals or, as some put it, “unexpected” migration flows. Nonetheless, they recognized that being able to predict and manage this type of potential arrival could help them better shape their interventions by anticipating emerging protection needs. All the same, there was consensus that the EMT could be a powerful tool for predicting migration flows and could be used to strengthen the humanitarian response to arrivals in reception countries. Respondents expect that, if the tool is used ethically, it will improve the migration response of the Member States of the European Union in general: first, by increasing the efficiency of the asylum process among first responders, without abandoning a humanitarian approach when working with other relevant stakeholders; second, by improving preparedness and infrastructure planning for reception processes and mitigating the risks of limited capacity; and third, by improving the policy-making process and the understanding of the diversity of needs and challenges posed by migration and refugee movements.
Direct end-users also provided feedback on how well the short-term predictions (up to 1 month) of the EMT would work in allowing their organizations to prepare for the arrival of asylum seekers. It was valuable to understand to what extent they would be able to put actions in place using those predictions, as well as the nature of such measures. On the one hand, many of the representatives answered that short-term predictions would be useful in fulfilling their objectives and preparations before the arrival of migrants and asylum seekers. Most of the time, the budget of their programs is tied to specific funding; however, with predictions of arrivals from a specific country, they would be able to ask the fund provider as early as possible for an amendment to the budget (if necessary), or to submit new projects and seek donations to cover the additional needs calculated from those predictions. All the same, having this type of prediction would allow for more informed decisions at the operational level, based on stronger evidence rather than speculation. If such a functionality were to work well, the mobilization of resources would be easier to organize and coordinate in advance.
Respondents maintained that preparing for the arrival of people on the basis of short-term predictions would require a substantial effort to organize and allocate material and human resources, which may not always be available 1 month in advance. However, other immediate actions, such as language interpretation or legal orientation, could be put in place on time, together with engagement with other organizations to join forces and deploy resources more rapidly where they are most needed.
On the other hand, other respondents explained that 1-month predictions would give too little advance notice, limiting their capacity to respond because the margin is too small. For instance, it was stated that it would work better if those predictions extended up to 3 months (Footnote 5). A respondent from an organization acting as a second reception center explained that it relies only to a limited extent on this kind of short-term prediction, but that such predictions could still be useful to better prepare and improve the matching criteria inside reception centers (in relation to nationalities, cultural background, food choices, or religious beliefs) and the functioning of the organization’s interdisciplinary approach to the asylum process.
Others responded that, unfortunately, their organizations are not equipped or structured to work at such short notice. Because they are not emergency-situations organizations (as one participant labeled it), they claimed that they could not respond to short-term predictions. Also, given the sometimes-limited capacities and strength of a few of the organizations, they claimed that 1 month would not be enough to provide an adequate response. However, these same respondents explained that they could design a solid communication campaign for awareness raising and advocacy. Despite not having reception capacities, or working directly only with long-term arrivals, they could use these predictions to include new flows in their advocacy work.
Looking more closely at the prediction functionality of the EMT, end-users were asked to what extent these predictions could allow them to understand the specific needs of arriving asylum seekers, based on gender, age range, country of origin, and number of people. It was stated that it is always difficult to fully deduce the specific needs of arriving migrants and asylum seekers, but that having some information on age, gender, and country of origin would definitely help; based on these information categories, specific items, services, and protection measures could be prepared and, later, deployed. In this sense, the prediction functionality and access to detailed data were deemed very useful for preparing the whole reception and accompaniment process, in terms of intercultural mediation, support to victims of trafficking, or legal and psychological assistance, allowing direct users to offer tailored responses to the asylum seekers that they host.
Then, some respondents offered more detailed answers and narrowed down how they would use this information according to the groups of migrants and asylum seekers that they work with. For instance, some information was provided by an organization working in the field of support of vulnerable groups, mostly unaccompanied minors. It was stated that by having information on the abovementioned indicators, it would be possible to ask for additional resources so that the organization is prepared to provide support and cover the arriving minors’ particular needs.
Other end-users that base their work largely on the specific needs of women and girls explained how having access to tailored data could allow them to provide a finer response for these groups. They underscored that gender is a key element that is rarely considered in emergency situations. In this sense, reducing the “time of the emergency” would allow them to work more thoroughly and effectively on gender issues. It was argued that, for the field of work of reception and integration, this is one of the most valued features of the EMT, as it helps organizations to analyze and foresee the different characteristics of the target group (for example, migrant women). This would help substantially in preparing and designing the actions to be deployed and to support better their needs. In terms of gender, this information could help anticipate and address gender-specific requirements, such as providing safe spaces for women who have travelled alone, addressing issues of gender-based violence or ensuring access to healthcare for reproductive needs.
Furthermore, the UB underscored that, beyond the generic needs of asylum seekers (which can be grouped together using broad parameters), further factors are missing from the EMT, although they are also important in the evaluation work of end-users. One in particular is the question of the sex of the asylum seeker and the special needs of women and girls compared to those of men and boys. Women and girls have certain needs that are often not addressed by generic requirements. Completely ignoring sex-based needs, in terms of medical care or reproductive matters among others, can lead to problematic situations.
The opinion of respondents regarding additional types of data, if any, that could be useful for their respective organizations was also obtained. Some respondents answered that, for now, they could not think of any further data needed for their jobs. Others did suggest some information that they would deem valuable if accessible: (1) first, the number of unaccompanied and separated minors, a particularly vulnerable group for whom special mechanisms and procedures should be put in place before their reception to address their needs; (2) second, family composition, to prepare spaces and resources for specific groups of people who should remain together and who act as emotional support for each other; (3) third, information on the skills and level of education of migrants. As difficult to obtain as this sounds, it was argued that it would help design and promote mobility programs for migrants and asylum seekers, and thus benefit or facilitate their labor inclusion. This would also be a great opportunity to match locations within Member States where there is a shrinking workforce with the potential skills that migrants and asylum seekers bring with them; and (4) fourth, more information on the situation and conditions in the countries of transit along the different migratory routes.
Moreover, respondents also addressed the issue of how the overall EMT design and appearance could be improved to make it more valuable to their organizations. Most end-users shared that the design and appearance are friendly to the user, that it is not complicated to work on it, that it is well designed, and it presents a user-friendly interface that allows reading the results in a visual and graphical way without previous technical knowledge on these types of tools. As a result, the widespread answer was not having additional suggestions to put forward at that moment. Those who mentioned a couple of possible improvements signaled introducing more elements on the visual impact, such as small icons that could reduce even more the time of reading and interpreting the data shown, or more detailed information on the countries of origin and transit.
Finally, respondents were asked whether, besides the prediction and attitudes functionalities, any others should be added to the EMT or a similar tool. Among respondents who felt that additional functionalities could be added, either to this or to another tool of a similar nature, responses varied. For example, one end-user thought that it could be beneficial to access data on the number of referrals for forced returns in different Member States, disaggregated by age, gender, or country of origin. It was argued that this would be particularly useful now that civil society organizations and certain media outlets are making a greater effort than ever to give visibility to violations of the human rights of migrants and asylum seekers at the external and internal borders of the European Union. In addition, another suggestion for gaining a more refined insight into the conditions in which migrants and asylum seekers arrive at reception countries was to consider the effect of transit countries, both on the migratory routes and on the personal experiences of individuals throughout what is often a multi-staged journey with multiple border crossings.
4. Validation process and accuracy levels
Accuracy of the EMT predictions is crucial to attract NGOs and further expand the tool. To date, the tool has reached up to 80% accuracy across its predictions. Although there is still a margin of error in some of the predictions, the algorithms are being trained and validated with a growing amount of historical data each month. In this sense, there is a consensus in the literature that expert judgment combined with prediction and forecasting technologies allows for assessing past errors, facilitating policy debate, and providing long-term perspectives (Sohst and Tjaden, 2020). Higher accuracy helps NGOs understand the actual needs of migrants arriving in the EU and allocate the necessary resources accordingly. For example, if short-term predictions (1 month ahead) indicate that 100–150 asylum seekers will arrive in Italy from Mali, and that 60% of them will be women, NGOs could adapt refugee camps so that all these persons have accommodation and reinforce supplies for women (feminine hygiene and personal care products, nutritional supplies for pregnant women, and so on).
To validate the prediction performance of the Large-Scale Model (LSM), the back-testing method was applied (Arnott et al., 2018). This method is used to validate prediction values in time-series machine learning. In this method, the performance of the model is tested against historical data. The accuracy of the model is determined by comparing predicted values with the actual values. The key performance metric used to compare these values was the Median Relative Error (MdRE). The formula to calculate the MdRE is defined as:
where y_true is the true value of the target variable (Eurostat asylum applicants) and y_pred is the predicted value (predictions of asylum applicants). To reduce the impact of outliers in the performance validation, the median was used instead of the mean. Two time periods were selected to simulate monthly forecasts from March 2018 to September 2019 and from March 2018 to June 2022 for seven origin countries (Afghanistan, Iraq, Morocco, Mali, Nigeria, Syria, and Venezuela) and eight European destination countries (Germany, Greece, Spain, France, Italy, Netherlands, Poland, and Sweden). The machine learning model was trained on data from 6 months before each forecast period (Iliopoulos et al., 2022).
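The back-testing procedure described above can be sketched in a few lines of Python. This is a minimal illustration, not the LSM implementation: it assumes the usual definition of the metric, MdRE = median(|y_true − y_pred| / |y_true|) × 100, skips months where the true value is zero (where the relative error is undefined), and reads the 6-month training window as a rolling window. The function names, the `naive_forecast` placeholder, and the monthly counts are all hypothetical.

```python
from statistics import median


def median_relative_error(y_true, y_pred):
    """Median of |y_true - y_pred| / |y_true|, expressed as a percentage.

    Assumption: pairs with y_true == 0 are skipped, since the relative
    error is undefined for them.
    """
    errors = [abs(t - p) / abs(t) * 100 for t, p in zip(y_true, y_pred) if t != 0]
    if not errors:
        raise ValueError("no non-zero true values to compare against")
    return median(errors)


def backtest(series, forecast, train_window=6):
    """Walk forward through a monthly series.

    For each month, call `forecast` on the preceding `train_window` months
    and compare its output with the value actually observed.
    """
    actual, predicted = [], []
    for i in range(train_window, len(series)):
        history = series[i - train_window:i]
        predicted.append(forecast(history))
        actual.append(series[i])
    return median_relative_error(actual, predicted)


def naive_forecast(history):
    """Hypothetical stand-in for the model: repeat the last observed month."""
    return history[-1]


# Made-up monthly applicant counts, for illustration only.
applicants = [100, 110, 120, 130, 125, 135, 140, 150, 145, 160]
mdre = backtest(applicants, naive_forecast)
print(f"MdRE: {mdre:.1f}%")
```

Because the median is used, a single month with a very large miss does not dominate the score, which matches the stated motivation for preferring it to the mean.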
Figures 3 and 4 above show the model’s overall performance in terms of median relative percentage error. Figure 3 depicts the monthly forecast results from March 2018 to September 2019, while Figure 4 depicts the monthly forecast results from March 2018 to June 2022. The tables use five colour shades:
1) Light green (from 0 to 20): This indicates that the model performs very well.
2) Dark green (from 21 to 30): This indicates that the model performs well.
3) Orange (from 31 to 40): This indicates that the model performs poorly.
4) Red (from 41 to 100): This indicates that the model performs extremely poorly.
5) Grey: This indicates the cases where the model performs either extremely well or extremely badly due to cases of very low traffic with almost zero statistical variability.
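The colour scheme above amounts to a simple lookup from error value to performance band, which can be sketched as follows. The function name and the `low_traffic` flag are hypothetical; the text does not state how low-traffic cases are detected, so the flag is assumed to be supplied by the caller. Treating non-integer values between bands (for example 20.5) as belonging to the next band up, and values above 100 as red, are also assumptions.

```python
def performance_band(error_pct, low_traffic=False):
    """Map a median relative percentage error to the colour bands listed above.

    Assumptions: `low_traffic` is decided elsewhere; fractional values falling
    between two bands go to the higher band; anything above 40 is red.
    """
    if low_traffic:
        return "grey"
    if error_pct < 0:
        raise ValueError("error percentage cannot be negative")
    if error_pct <= 20:
        return "light green"
    if error_pct <= 30:
        return "dark green"
    if error_pct <= 40:
        return "orange"
    return "red"


# Illustrative values only, not results from the paper.
for value in (12, 25, 35, 80):
    print(value, performance_band(value))
```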
Upon examining the results from both tables, it is clear that LSM performs well in bilateral cases with very high traffic. Another important observation is that across different time periods (2018–2019 and 2018–2022), the model performance remains consistent and reliable.
5. Ethical considerations
Before a tool like the EMT is up and running, it is crucial that ethical and societal considerations are applied. A data protection risk assessment was issued in February 2021 (6 months after the start of the project) and several monitoring evaluations were conducted afterwards by internal and external ethical experts. It is important to highlight that the EMT does not use personal or identifiable data at its core. Although identifiable data were used by individual components during the training phase, these data were never passed to the EMT. All developers ensured full anonymization of the data they used, and that no one besides them had access to these data.
Moreover, data fed into the EMT comes from trusted sources. Thus, data quality checking is not needed. The CKAN repository that the EMT uses for data storage has embedded mechanisms to ensure the quality and integrity of the data. Cybersecurity mechanisms have been put in place to ensure the security of the system.
The identification and assessment of ethical risks of the EMT were conducted based on the Ethics Guidelines on Trustworthy Artificial Intelligence of the High-Level Expert Group on Artificial Intelligence of the European Commission (HLEG) (European Commission, 2019), the Assessment List for Trustworthy Artificial Intelligence for self-assessment of the HLEG (European Commission, 2020), and the Ethically Aligned Design guidelines developed by the Institute of Electrical and Electronics Engineers (IEEE, 2019). Following the methodological approach provided by such works, primarily the AI HLEG guidelines on trustworthy AI, a set of ethical principles based on fundamental rights was identified as the backbone of the AI impact assessment, to ensure that AI ethics is embedded in the EMT. In this sense, once the EMT is finalized and ready to use, the Consortium will make sure that both the internal and the external ethical boards of the project remain part of the core staff, to conduct regular audits of compliance with all necessary ethical, human rights, data protection, and AI requirements.
The following sections aim to elaborate on two key ethical issues that were considered in the design and development of the models and the EMT: the implementation of the principle of privacy-by-design and the minimization of the risk of misuse.
5.1. Privacy by design
Although the EMT models do not use personal data, the tool still collects and stores personal information from the searches conducted by registered end users. Therefore, Privacy by Design (PbD) has been implemented by building privacy into the design, operation, and management of the EMT system. In doing so, privacy has been ensured through every phase of the data lifecycle (e.g., collection, use, retention, storage, disposal, or destruction), as this has become crucial to avoiding legal liability and maintaining regulatory compliance. The amount of personal data should be restricted to the minimum possible, ideally close to none (data minimization). In this sense, the EMT has sought to avoid and limit the need to collect and process personal data. Where personal data must be collected, the data subject is adequately informed whenever his/her data are processed (transparency) (Stavropoulos et al., 2021).
Regarding transparency, it is important to highlight that the EMT’s models are developed by research/academic partners. Their design, functionality, and results have been published in scientific journals/conferences and thus are publicly available for scrutiny. Moreover, details regarding the EMT models have also been provided within the EMT webpage to allow the users insight into the modules. The EMT includes explainable features in its results, presented in a comprehensible manner that avoids technical language; explainability is one of the core design requirements of the EMT. Finally, information regarding the EMT’s functionalities, including its limitations and shortcomings, is available in the documentation pages on the website.
According to data protection regulations (e.g., the General Data Protection Regulation [GDPR]), the data subject has the right to control his/her data, including the right to access, review, and/or delete them. Moreover, all personal data are hidden from public view. Anything that should be stored in a database will be encrypted and/or anonymized beforehand. To avoid the risk of privacy abuse, personal data will be stored in separate databases from the rest of the EMT infrastructure. A privacy policy compatible with legal requirements is enforced in the EMT system, and the EMT applies the highest privacy settings by default.
The following is a list of some privacy implementations within the EMT: (1) EMT shall provide a form of consent before collecting any data from the users of the tool; (2) EMT shall summarize the content of data and why it is needed before collection, to avoid the collection of sensitive and unwanted data; (3) EMT shall implement privacy-enhancing techniques that include anti-tracking, encryption of sensitive data (such as emails and passwords for authentication processes), and secure file sharing in order to avoid unwanted exposure of data; (4) EMT shall store data anonymously whenever possible and applicable; and (5) EMT shall store personal and sensitive data in separate databases from the rest of the EMT infrastructure to limit loss or exposure of these data (Stavropoulos et al., 2021).
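As a rough illustration of how items (1), (3), and (5) can fit together, the sketch below keeps identifying data and search logs in two separate stores linked only by a random identifier, and refuses registration without consent. It is not the EMT's implementation: the class names, fields, and in-memory dictionaries are hypothetical stand-ins for real databases. Passwords are stored here as salted PBKDF2 hashes rather than encrypted, a common substitute that the source does not specify, and encryption of the email address at rest is omitted because the Python standard library offers no symmetric cipher.

```python
import hashlib
import hmac
import secrets

ITERATIONS = 100_000  # assumed work factor, chosen for illustration


class PersonalStore:
    """Hypothetical store for identifying data, kept apart from usage data."""

    def __init__(self):
        self._records = {}

    def register(self, email, password, consent):
        # Item (1): no data is collected without explicit consent.
        if not consent:
            raise PermissionError("consent is required before collecting data")
        user_id = secrets.token_hex(16)  # random pseudonymous identifier
        salt = secrets.token_bytes(16)
        digest = hashlib.pbkdf2_hmac("sha256", password.encode(), salt, ITERATIONS)
        self._records[user_id] = {"email": email, "salt": salt, "digest": digest}
        return user_id

    def verify(self, user_id, password):
        record = self._records[user_id]
        candidate = hashlib.pbkdf2_hmac(
            "sha256", password.encode(), record["salt"], ITERATIONS
        )
        return hmac.compare_digest(candidate, record["digest"])

    def erase(self, user_id):
        # Supports the data subject's right to delete his/her data.
        self._records.pop(user_id, None)


class UsageStore:
    """Hypothetical store for search logs; holds only the pseudonymous id."""

    def __init__(self):
        self.searches = []

    def log(self, user_id, origin, destination):
        self.searches.append(
            {"user": user_id, "origin": origin, "destination": destination}
        )


personal, usage = PersonalStore(), UsageStore()
uid = personal.register("user@example.org", "correct horse", consent=True)
usage.log(uid, "Mali", "Italy")
print(usage.searches[0]["origin"], "->", usage.searches[0]["destination"])
```

With this separation, a leak of the usage store alone exposes search patterns but no email addresses or credentials, and erasing a user from the personal store leaves only unlinkable identifiers behind.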
The EMT also incorporates design principles as architectural elements, fully describing their functional specifications and the interactions among them and the environment or the end-user when applicable. Such design principles have been established in line with the Privacy by Design principle (Article 25 GDPR), meeting all its requirements both from a GDPR perspective and from a design and end-user perspective. These include human rights, well-being, data agency, transparency, accountability, and awareness of misuse (Guillén and Teodoro, 2023).
5.2. Minimizing the risk of misuse
An important subject presented to direct end-users was whether the EMT could be misused, and how. In this sense, the ITFLOWS project included an ad hoc and multi-staged monitoring process, from both an internal and an external perspective, with the aim of ensuring the adequate implementation of ethical safeguards, particularly regarding the EMT.
The purpose of the tool is to assist humanitarian actors working in the field of migration, so the project has integrated measures to ensure compliance with this purpose. It is worth highlighting the importance of participation on the technical side of the project for successful monitoring and implementation of measures. The results of these efforts considerably strengthen the data protection compliance of the project and thereby also reduce risks of potential misuse of project outcomes. In the first EMT risk assessment, the measures envisioned in the EMT to ensure legal compliance included, among others: (1) provision of an informed consent form before collecting any data; (2) summary of the content of data and why it is needed before collection in order to avoid the collection of sensitive and unwanted data; (3) implementation of privacy-enhancing techniques such as anti-tracking, encryption of sensitive data, and secure file sharing; (4) anonymizing data whenever possible and applicable; and (5) separation of personal and sensitive data in different databases and separation from the rest of the EMT infrastructure.
5.3. A restrictive approach for the exploitation of the tool
NGOs working in the phases of reception and further integration of migrants arriving in the EU are a direct target audience of the EMT. In reaching them, communicating the tool’s limitations along with its value is important, to make sure that end-users understand how the insights from predictive technologies must work in tandem with other inputs like expert analysis, supervision, and monitoring.
As mentioned above, the EMT was created to serve exclusively humanitarian purposes. Such an aim may limit expansion, as it restricts the feasibility of merging with other existing predictive tools in the field of migration, but at the same time it makes our tool unique in the field. Expansion should only happen if the main target audience for the tool is preserved, namely NGOs in the field of migration. Developers of the tool are aware of the conflicting interests that some of these actors might have in offering access and integration to migrants arriving irregularly in the EU. Some of these migrants will intend to apply for asylum once they arrive in EU territory, but if such access is denied they might have no opportunity at all to regularize their status on arrival. Therefore, any expansion of the EMT within the EU and beyond must preserve these foundations and will only take place if we find a way to increase the number of NGOs interested in using such a tool.
In the last Users Board Workshop on 16 June 2023 (Footnote 6), there was a debate on how to continue the development of the tool after the completion of the EMT project. A few members of the Users Board proposed creating a partnership among the NGOs so that they could continue supporting the EMT. In this sense, we have witnessed an increasing involvement by the Users Board in the design and testing of the tool, which makes us confident that it could certainly be useful to non-profit organizations working with migrants.
It is important to highlight that the EMT is not available to policymakers. Although establishing a humanitarian-purpose limitation for the tool means that access to its predictive functions is restricted to humanitarian NGOs and CSOs, we are aware of the importance that this tool may have for policymakers too. Therefore, although we do not envisage an expansion of the tool to include policymakers, we have put in place measures that increase communication, transparency, and participation by policy audiences. In particular, the open-access EMT reports are intended to better inform policymakers and provide them with useful, concise information. These reports are public (Footnote 7) and issued every 6 months, and they present the main trends and insights from the tool's results. They offer concrete case studies and information about new developments of the tool that are relevant for policymakers. The inspiration behind the EMT reports was drawn from a review of the existing literature on the difficulty of translating research into policy and engaging key decision-makers and stakeholders at all levels of migration governance processes. We have been publishing the reports on our website(s) and sharing them through our social media, dissemination events, and engagement with the project's Policy Working Group.
One of the EMT’s added values (and that of its reports) lies in its comprehensive account of the migration life cycle, which is important in long-term policy provisions to complement short-term decision-making (that is less oriented towards sustainable and effective migration governance). By presenting patterns in origin and transit countries, the EMT reports give policymakers in Europe, at various levels of government and management, information that could encourage long-term projects like capacity-building at origin, rather than focusing exclusively on policies in European host countries.
5.4. Merging options with other predictive tools or projects
One potential exploitation of the tool would be to integrate it into an existing forecasting or predictive platform or project that is oriented toward humanitarian purposes. The tool could thus complement a pre-existing system and be operated by, for example, a larger NGO that has the capacity to maximize the tool’s capabilities and employ it in operations that seek to assist migrants and asylum-seekers arriving in Europe. As the literature suggests, no predictive or forecasting model is the best or most adequate for all situations, and one way to mitigate the inherent possibility of error and uncertainty is by combining different methods (Bijak, 2016; Bijak et al., 2019). Incorporating the EMT into a pre-existing platform could further equip the tool operator with a more complete picture of mixed migration and asylum flows to Europe.
In this framework, some of the tools identified with which the EMT could merge, given the important development that all of them have, are: (1) the Jetson tool, funded and operated by the UN High Commissioner for Refugees; (2) the Early Warning and Preparedness System tool, funded and operated by the European Asylum Support Office; (3) Foresight, currently funded and operated by the Danish Refugee Council; or (4) the Internal Displacement Event Tagging and Clustering Tool, funded and operated by the Internal Displacement Monitoring Centre (Blasi Casagran et al., 2021).
Merging the EMT with another predictive tool can be positive for two main reasons. First, it can bring more accurate and reliable predictions, as a better tool is developed by taking advantage of the strengths of each of them, avoiding or decreasing possible errors. This in turn will bring more complete coverage, since more data, more sources, and more factors (or variables), all of better quality, will be incorporated, thus allowing a more nuanced and comprehensive analysis of the migratory phenomenon and its complexity (Carammia et al., 2022). In short, the union of different predictive IT tools can create a single, streamlined system that is easier to use and manage, and will avoid duplication of effort, thus saving time and resources.
Second, given the improved quality of the results, it will aid the decision-making process. Stakeholders will be able to access a wider range of data and perspectives, which will bring efficiency to migration planning and management. The combination of tools will enable decision-makers to develop policies that are better adapted to the needs of society. Similarly, the way in which data are communicated to all stakeholders, including society, will be improved. The benefit of this will be not only to predict future conflicts that may arise but also to act before tension arises and reduce the stereotypes that are created around the migration phenomenon. Ultimately, it will not only help transparency in management but also the collaboration between different actors.
The combination of tools can be carried out in several ways and will depend on the objectives pursued and the technical capabilities available. The first is by surfacing the data sources used. This is relevant because (1) it can improve the data already collected, since they would go through a double check; (2) it could find flaws or problems in the databases already collected or in the way they have been collected; (3) it allows expanding the sources of analysis from which data are collected (e.g., the Early Warning and Preparedness System tool collects data from press articles or social media posts); and (4) it makes it possible to incorporate new variables or factors for analysis (for example, the EMT is the only predictive tool that allows analyzing the attitudes between nationals and migrants through public opinion; Foresight is the only one that analyses natural disasters).
The second way in which tools could merge is by combining the forecasting models or algorithms used. In this framework, the technological development of AI could advance a more complete and accurate forecasting model that can incorporate not only multiple data sources but also new forms of more accurate and reliable forecasting.
Finally, a third possibility is that different tools, even separately, could be used to guide stakeholders at the same time. If there are no technical possibilities or resources to carry out a technological linkage, we propose that the foresight and the policy analyses be extracted from the results produced by different tools. If stakeholders have access to various predictions, they will be able to develop more precise actions or policies.
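Where several tools publish forecasts for the same origin and destination pair, their outputs can be combined even without any technical linkage between the systems. The sketch below shows one simple way this could be done, a weighted average in which each tool is weighted by the inverse of its past error. This is an illustrative technique from the general forecast-combination literature, not a method described by the authors; the tool names, forecasts, and error values are invented for the example.

```python
def inverse_error_weights(past_errors):
    """Weight each tool by the inverse of its past error (for example, its MdRE)."""
    return {name: 1.0 / error for name, error in past_errors.items()}


def combine_forecasts(forecasts, weights=None):
    """Weighted average of point forecasts; equal weights if none are given."""
    if not forecasts:
        raise ValueError("at least one forecast is required")
    if weights is None:
        weights = {name: 1.0 for name in forecasts}
    total = sum(weights[name] for name in forecasts)
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    return sum(forecasts[name] * weights[name] for name in forecasts) / total


# Hypothetical monthly forecasts for one corridor, and hypothetical past errors (%).
forecasts = {"tool_a": 120, "tool_b": 150}
past_errors = {"tool_a": 10.0, "tool_b": 30.0}

combined = combine_forecasts(forecasts, inverse_error_weights(past_errors))
print(f"combined forecast: {combined:.1f}")
```

The tool with the lower past error pulls the combined figure toward its own forecast, while the other tool still contributes, which is one way of hedging against the error of any single model.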
In any case, linking forecasting tools is not without its challenges given the differences that exist between them: each may have different objectives, and they may rely on very different technological developments or data sources. Moreover, access to both human resources (highly qualified technicians) and economic resources to enable the advancement of these tools, as well as their long-term maintenance, may be the main problem encountered by this type of technological development.
To examine the possibility of merging the EMT with one or more existing predictive systems, a study of the current tools was conducted during the first half of the project. Below are the main tools that were identified:
After an exhaustive analysis of these existing predictive tools in the field of migration, the ITFLOWS Consortium attempted to merge with one of them: Foresight, currently funded and operated by the Danish Refugee Council. Although the institution initially showed interest in such an opportunity, the collaboration did not take place because of a lack of operational capacity to host additional initiatives at HQ level. Therefore, after 3 years of technical development and having reached TRL 6, the continuity of the EMT is now on hold. Members of the Consortium are not planning to merge with any other current predictive tool in the field of migration, since the majority of other existing tools are owned by governments (see Table 2 above) and merging with them would entail opening our tool to other end-users that are not considered within this project.
6. Conclusions
In the rapidly evolving landscape of migration studies, the EUMigraTool (EMT) stands as a groundbreaking achievement, marking the convergence of technology and social science to address the complexities of human migration. Through meticulous development and rigorous testing in real-world environments, the EMT has demonstrated its potential as an invaluable asset for NGOs and humanitarian actors engaged in assisting migrants. As we conclude this exploration, several key insights and reflections emerge.
First, the EMT represents a significant leap forward in predictive analytics, offering a multifaceted approach to understanding migration patterns. Its fusion of cutting-edge machine learning algorithms, geospatial analyses, and open-access data sets has empowered NGOs with the ability to anticipate migration flows with remarkable precision. This predictive capability is not merely a technical achievement; it embodies a new era of data-driven humanitarianism, where informed decision-making can lead to more efficient resource allocation and targeted assistance, ultimately enhancing the lives of those in vulnerable situations.
Second, the journey of developing the EMT has illuminated the ethical dimensions inherent in the use of predictive tools for vulnerable populations, with some attention given to the potential misuse of the tool. The responsible collection, validation, and usage of data are paramount. The EMT underscores the vital importance of robust ethical frameworks and transparent practices when deploying technology to aid humanitarian efforts.
Furthermore, the collaborative efforts between technology experts, social scientists, and NGOs that led to the creation of the EMT underscore the transformative potential of interdisciplinary collaboration. By bridging the gap between technological innovation and humanitarian action, the EMT exemplifies how partnerships between academia, civil society, and the private sector can yield innovative solutions to pressing global challenges.
Looking ahead, the EMT not only represents a milestone in migration studies but also points toward a future where data-driven insights inform compassionate, empathetic, and well-informed policies and interventions. While the current chapter of EMT concludes with its technical development reaching TRL 6, our commitment to enhancing the landscape of migration studies remains steadfast. We are confident that as the landscape evolves, new opportunities and partnerships will emerge, enabling us to resume our mission with renewed vigor. By doing so, we will continue to advance the boundaries of knowledge, enhance the efficacy of humanitarian responses, and, most importantly, uphold the dignity and well-being of migrants across the globe.
Data availability statement
All the data the LSM uses are publicly available from their sources (Eurostat, GDELT). The output of the models and the code are not publicly available due to the organizations’ IPR strategy. Outputs are available to registered users on the EMT website.
Acknowledgments
The authors would like to express their sincere gratitude to A. Guillén, E. Teodoro, Z. Kardkovacs, N. Gkevrekis, and D. Morente for their invaluable contributions to this research. Their expertise, dedication, and collaborative spirit have greatly enriched our study, shaping it into a more comprehensive and insightful work. We are truly thankful for their assistance, which played a significant role in the successful completion of this project. Their unwavering support and thoughtful input have been instrumental, and we acknowledge their efforts with deep appreciation.
Author contribution
Conceptualization: Cristina Blasi Casagran; Georgios Stavropoulos. Methodology: Cristina Blasi Casagran; Georgios Stavropoulos. Data curation: Cristina Blasi Casagran; Georgios Stavropoulos. Data visualisation: Georgios Stavropoulos. Writing original draft: Cristina Blasi Casagran. All authors approved the final submitted draft.
Provenance
This article is part of the Data for Policy 2024 Proceedings and was accepted in Data & Policy on the strength of the Conference’s review process.
Funding statement
This research was supported by a grant from the European Commission, grant agreement No 882986. The funder had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.
Competing interest
None.