25.1 Introduction
Public health data available for research are expanding rapidly with the growth of Big Data sources, shifting the landscape of DOHaD research. These new forms of data offer ample opportunities to advance epidemiological modelling within the DOHaD framework. Big Data is often described by the ‘3 Vs’ of high volume, high velocity, and wide variety, and refers, for example, to the large volumes of Electronic Health Records (EHRs) now stored as many nations move towards the routine electronic recording and centralising of health data. The term also applies to data derived from wearable devices and phone applications, increasingly affordable technologies that allow new kinds of data to be collected, in larger volumes, and almost in real time. Such technologies, along with improved data processing speed and advanced computing capacity, grant access to the lifestyle and health information of millions of individuals who can be followed through the lifespan.
However, within heterogeneous and dynamic socio-demographic contexts and a fast-moving technological landscape, these new forms of data raise a plethora of methodological challenges related to accurately characterising population health trajectories and biological mechanisms. In addition, while the current inferential potential of DOHaD research depends on which variables are collected, at what frequency, and at what time points, it is also closely shaped by the theoretical model(s) chosen for a given study: a framework implicating critical and sensitive windows in development shaped the early DOHaD literature, but other models have since been added, such as the accumulation of risk model, the chain of risk model, and hybrids of these [Reference Johnson, Kuh, Hardy, Burton-Jeangros, Cullati, Sacker and Blane1]. These frameworks shape study designs, data collection practices, and the interpretation of results, and set the scene for how Big Data is likely to be taken up in the field.
In this chapter, we provide an overview of DOHaD modelling methods and consider the emerging place of Big Data in investigating multidimensional research questions in the field. To do so, we discuss various methodological aspects of modelling, such as operationalisation, sampling, population representation, ethics, and the accuracy of tools used to acquire and analyse data. We also discuss the current landscape of artificial intelligence (AI)-derived methods, weighing their utility and the validity of their findings, and their potential, against ‘traditional’ empirical data sources and analytical approaches.
25.2 Current DOHaD Modelling and Methodological Challenges
A myriad of methodological challenges were present in DOHaD research even before Big Data, in particular the issues of validity across time and space, characterising causal links, and identifying sources of bias. Physiological processes are difficult to model because they are integrated into non-static systems: the less quantifiable and less predictable behavioural, lifestyle, environmental, and socio-economic systems that interact with biology. The moderating or mediating effects of culture, health inequalities, medical systems, and health policy on health outcomes must be clarified to assess the generalisability of any given model. Even in gold-standard birth cohort research designs, epidemiological models need to account for the fact that, for example, through societal restructuring and climate change, the properties of exposures can change over time and across generations [Reference Zolitschka, Razum, Breckenkamp and Sauzet2]. As a result, it is difficult to produce predictive DOHaD models and interventions that remain valid and useful across time for a given population. Alternative designs, such as observational studies in humans, cannot, however, capture the complexity of all important causal links.
A further challenge for aetiological and epidemiological models of DOHaD is how to define the sources of individual differences in health outcomes with a robust degree of certainty. Examples include disentangling antenatal and postnatal exposures and their interactions [Reference Lapehn and Paquette3]; accounting for sex- and gender-based differences in biology and behaviour; inter-organ variation in adaptability to maternal ill health (e.g. the placental response to stress) [Reference Bowman, Arany and Wolfgang4, Reference Huynh, Dawson, Roberts and Bentley-Lewis5]; and evaluating disparities in outcomes across diverse groups, given that the bulk of data comes from a small number of middle- to high-income and white-dominated contexts [Reference Brandlistuen, Ystrom, Nulman, Koren and Nordeng6, Reference Abdul-Hussein, Kareem, Tewari, Bergeron, Briollais and Challis7].
Prediction models are also prone to confounding and collider bias [Reference Berkson8], the latter arising when analyses condition on a collider: a variable in a causal pathway that is a shared effect of more than one cause. For DOHaD research, the primary exposures studied in the developmental pathway are nutrition, parental physiological and psychological health, the environment and toxicants, and social and demographic determinants [Reference Abdul-Hussein, Kareem, Tewari, Bergeron, Briollais and Challis7]. Intuitively, many of these exposures can co-occur and may moderate one another. While confounding can often be addressed by taking these variables into account, residual confounding remains a risk when the influencing factors are unknown or unmeasurable. Collider bias, for its part, can lead to counter-intuitive conclusions. Such conclusions are exemplified by the ‘birthweight paradox’, where babies born with low birthweight (LBW) to smoking mothers (exposure) appear to have a lower risk of neonatal mortality (outcome) compared to those born with LBW to non-smoking mothers [Reference Whitcomb, Schisterman, Perkins and Platt9].
Here, tools such as directed acyclic graphs (DAGs), which portray causal relationships graphically, can help a researcher explore a model’s functional assumptions and conceptualise any mechanisms of causality. DAGs help to recognise mediators, moderators, confounders, and colliders [Reference Greenland, Pearl and Robins10] and have helped to illustrate the collider role of LBW in the above paradox [Reference Whitcomb, Schisterman, Perkins and Platt9]; that is, when maternal smoking is absent, other unobserved causes (malnutrition and congenital defects) can lead to LBW and to more severe health problems, and thus to higher infant mortality. Therefore, although the research question appeared ‘simple’ initially, involving few measurable exposures (smoking or not) and outcomes (LBW and mortality), an analysis that fails to incorporate the inter-correlations between variables and the confounding effects conceptualised by theory and DAGs is unlikely to yield reliable causal inference.
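To make the collider mechanism concrete, the short simulation below is a minimal sketch, not an analysis of real data: all variable names, prevalences, and effect sizes are invented for illustration. It shows how stratifying on LBW (the collider) can make maternal smoking appear protective against mortality even though, in the simulated data, smoking is harmful overall.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
n = 200_000

# Two independent causes of low birthweight (LBW)
smoking = rng.binomial(1, 0.2, n)   # exposure of interest
defect = rng.binomial(1, 0.02, n)   # unobserved cause (e.g. congenital defect)

# LBW is a collider: a shared effect of smoking and the unobserved defect
p_lbw = np.clip(0.05 + 0.10 * smoking + 0.60 * defect, 0, 1)
lbw = rng.binomial(1, p_lbw)

# Mortality is driven strongly by the defect and mildly by smoking, not by LBW itself
p_death = np.clip(0.005 + 0.01 * smoking + 0.30 * defect, 0, 1)
death = rng.binomial(1, p_death)

df = pd.DataFrame({"smoking": smoking, "lbw": lbw, "death": death})

def odds_ratio(sub):
    """Crude odds ratio for death by smoking status in a subset of births."""
    tab = pd.crosstab(sub["smoking"], sub["death"])
    return (tab.loc[1, 1] * tab.loc[0, 0]) / (tab.loc[1, 0] * tab.loc[0, 1])

print("OR, all births:     ", round(odds_ratio(df), 2))                # > 1: smoking harmful
print("OR, LBW births only:", round(odds_ratio(df[df.lbw == 1]), 2))   # < 1: spurious 'protection'
```

Within the LBW stratum, smoking ‘explains away’ the more severe unobserved cause, so smokers’ LBW babies have lower simulated mortality, reproducing the paradox described above.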
Taking another example, understanding the association between maternal stress and lower infant cognitive outcomes [Reference Wu, Espinosa, Barnett, Kapse, Quistorff and Lopez11] warrants exploration of the relative contributions of other exposures concomitant to maternal stress, such as under- or overnutrition, infections, and toxicants, and of their interrelationships. Here, modelling methods should integrate observed variables, latent variables (not directly observed but derived from questionnaires or other observed variables, e.g. stress) and their measurement errors, alongside time indicators. Moreover, the mediating mechanisms proposed in the literature, such as epigenetic modulation, the microbiome, metabolism, and (offspring) endogenous immunity, would also have to be incorporated. In practice, a single model that simultaneously incorporates multiple predictive pathways and associations will more accurately capture the causal relationships between the exposure and the outcome of interest [Reference Monk and Fernández12, Reference Sigurdardottir, White, Flynn, Singh, Briley and Rutherford13]. Of translational value, such epidemiological modelling would eventually result in better-targeted interventions.
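As a rough sketch of how such a multi-pathway model might be specified, the example below assumes the third-party Python package semopy (lavaan in R is a common alternative); the variable names, questionnaire items, and file are hypothetical, and the specification is illustrative rather than a recommended model of maternal stress and infant cognition.

```python
import pandas as pd
import semopy

# Model description in lavaan-style syntax:
# a latent stress construct, plus a hypothesised mediating pathway via birthweight
desc = """
# measurement part: 'stress' is latent, inferred from questionnaire items
stress =~ psq_item1 + psq_item2 + psq_item3

# structural part: stress and nutrition act on infant cognition,
# partly through birthweight
birthweight ~ stress + nutrition
cognition   ~ birthweight + stress + nutrition
"""

df = pd.read_csv("cohort_data.csv")  # hypothetical analysis-ready data set
model = semopy.Model(desc)
model.fit(df)                        # maximum-likelihood estimation (package default)
print(model.inspect())               # loadings, path coefficients, standard errors
```

The appeal of this kind of specification is that direct and mediated pathways, and the measurement error of the latent stress variable, are estimated within one model rather than in a series of disconnected regressions.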
25.2.1 Data: What Are We Collecting, What Are We Measuring?
High-quality data that are fit for purpose and meet the criteria of accuracy, validity, completeness, and consistency are a cornerstone of empirical science. Data quality may be affected at the stages of data collection, cleaning, or the numerical transformations often applied to meet statistical assumptions such as normality. Measurement errors, whether systematic or random, are present in all observational studies. While these errors affect the validity and reliability of data and introduce biases, they are rarely acknowledged or accounted for in the epidemiological literature. Makin and de Xivry address common statistical mistakes [Reference Makin and de Xivry14], and Wagenmakers et al. [Reference Wagenmakers, Sarafoglou, Aarts, Albers, Algermissen and Bahník15] present guidelines on how to report statistical analyses transparently based on the four scientific norms of ‘communalism, universalism, disinterestedness and organised skepticism’.
Having alluded above to variable choice and availability, we next discuss the importance of clear terminology and data quality in DOHaD research methods. How we define and measure exposures and outcomes affects inferences, findings, and subsequent interventions and policies. One example is the reliance on clinically defined groupings based on dichotomisation, such as a diabetes diagnosis, or the classification of body mass index (BMI) as obese/overweight/normal-weight/underweight. One obvious risk of classifying body morphology by BMI alone is that it sets aside the field’s knowledge that fat distribution, especially fat within the abdomen (visceral adiposity), is a strong determinant of metabolic and cardiovascular health. Without other markers to corroborate metabolic health risks (blood pressure, cholesterol, visceral fat mass, etc.), some individuals with normal weight, categorised as ‘controls’, may be metabolically unhealthy and ‘at risk’ of adverse physiological phenotypes. In fact, this group represents 35 per cent of normal-weight individuals [Reference Fan, Qiu, Zhao, Yin, Li and Wang16]. This problem extends to gestational diabetes screening in pregnancy, which in the UK is offered to women meeting the BMI > 30 kg/m2 criterion, while those under 30 are assumed to be free of any hyperglycaemia risk during pregnancy. The absence of evidence, however, is not evidence of absence. The consequence of such hidden (latent) subgroups within a ‘control’ or ‘normal-weight’ category is the introduction of bias into the statistical analyses that epidemiological models rely on for inference, leading to inaccurate conclusions.
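The cost of dichotomisation can be illustrated with a small simulation (a sketch only; the relationships and effect sizes are invented). In the simulated data, a metabolic marker tracks visceral fat, which in turn tracks BMI only loosely, so collapsing BMI into an obese/non-obese label discards most of the usable information.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(7)
n = 2_000

# Hypothetical data: visceral fat tracks BMI only loosely,
# and an adverse metabolic marker tracks visceral fat
bmi = rng.normal(26, 4, n)
visceral_fat = 0.5 * bmi + rng.normal(0, 3, n)
metabolic_marker = 0.3 * visceral_fat + rng.normal(0, 2, n)

# Model 1: continuous BMI as the predictor
r_cont = sm.OLS(metabolic_marker, sm.add_constant(bmi)).fit()

# Model 2: BMI dichotomised at the 'obese' threshold of 30 kg/m^2
obese = (bmi >= 30).astype(float)
r_bin = sm.OLS(metabolic_marker, sm.add_constant(obese)).fit()

print("R^2, continuous BMI :", round(r_cont.rsquared, 3))
print("R^2, obese yes/no   :", round(r_bin.rsquared, 3))
# The dichotomised model explains less variance: normal-weight individuals
# with high visceral fat are hidden inside the 'control' category.
```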
Similarly, in psychology, the diagnosis of autism as present/absent is common, although autism spectrum disorder (ASD) is typically conceptualised by experts as a continuum (see also Azevedo et al. in this volume). Such a binary diagnosis of ASD ignores potential distinct mechanisms of importance for the DOHaD of autism subtypes, which could possibly relate to the timing of any ‘disruption’ in brain development and could be informative for mechanistic studies and prognosis [Reference Lai, Kassee, Besney, Bonato, Hull and Mandy17].
Overall, what we suggest above is that DOHaD researchers must be cautious about relying on clinical data alone, such as those retrieved from EHRs, without clarifying the sources of bias, the caveats of present/absent dichotomised diagnoses [Reference Altman and Bland18], and the local clinical guidelines from which they derive. We suggest, however, that classification approaches are available to limit some of these caveats, such as collating multiple variables to produce profiles based on similarities of exposure and/or outcome at one time point (e.g. latent class modelling) or at many time points to uncover trajectories (latent class growth analysis and piecewise modelling). This could mean, for example, retrieving glucose measures sampled throughout pregnancy and establishing the likely glycaemic status rather than relying only on a single gestational diabetes mellitus (GDM) diagnosis. These approaches also help identify profiles of individual responses to interventions and can therefore improve tailored treatment allocation.
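As a rough illustration of profile-based classification, the sketch below assumes scikit-learn and a hypothetical wide-format file of repeated antenatal glucose measures; a Gaussian mixture model stands in here for a dedicated latent class (growth) model, which is a simplification.

```python
import pandas as pd
from sklearn.mixture import GaussianMixture
from sklearn.preprocessing import StandardScaler

# Hypothetical data: one row per pregnancy, fasting glucose at three visits (mmol/L)
df = pd.read_csv("antenatal_glucose.csv")
features = df[["glucose_t1", "glucose_t2", "glucose_t3"]]
X = StandardScaler().fit_transform(features)

# Fit mixtures with 1-5 components and keep the one with the lowest BIC,
# a common (if simplified) way to choose the number of latent classes
models = [GaussianMixture(n_components=k, random_state=0).fit(X) for k in range(1, 6)]
best = min(models, key=lambda m: m.bic(X))

df["glycaemic_profile"] = best.predict(X)
print(df["glycaemic_profile"].value_counts())
# Profiles (e.g. stable-normal, late-rising, persistently high) can then be related
# to offspring outcomes instead of a single GDM yes/no label.
```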
Additionally, missing data, whether by design (e.g. unmeasured exposures/outcomes) or through attrition, negatively impact data quality and challenge causal inference. This can be addressed by powerful analytical tools that recognise data complexity and the impact of missing data [Reference Stavola, De Stavola, Nitsch, dos Santos, McCormack and Hardy19, Reference Johnson20]. Several methods are available to address missing data, including maximum likelihood, multiple imputation, and Bayesian methods. Complete case analysis leads to loss of data and statistical power but is widely used, while more complex but more justifiable methods are often not attempted [Reference Bell, Fiero, Horton and Hsu21]. Assumptions about the properties of the missing data, such as whether they are missing completely at random or not, must be made. With the emergence of Big Data and EHRs, which tend to have a high prevalence of missing information, appropriate techniques for dealing with missing data need to be carefully applied.
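A minimal sketch of the multiple-imputation idea is given below, assuming scikit-learn and a hypothetical cohort extract; variable names are invented, and the pooling step is deliberately simplified (only the point estimates are averaged, whereas Rubin’s rules would also combine within- and between-imputation variances).

```python
import numpy as np
import pandas as pd
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

# Hypothetical cohort extract with gaps from attrition and unmeasured visits
df = pd.read_csv("cohort_extract.csv")
cols = ["maternal_bmi", "birthweight", "gest_age", "glucose_t2"]

estimates = []
for seed in range(5):  # five imputed data sets, in the spirit of multiple imputation
    imputer = IterativeImputer(sample_posterior=True, random_state=seed)
    imputed = pd.DataFrame(imputer.fit_transform(df[cols]), columns=cols)
    # Analysis step: a simple association of interest, re-estimated in each imputed data set
    slope = np.polyfit(imputed["maternal_bmi"], imputed["birthweight"], 1)[0]
    estimates.append(slope)

# Simplified pooling across imputations (point estimate only)
print("Pooled slope:", round(float(np.mean(estimates)), 3))
```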
25.3 Big Data: Challenges and Opportunities
As referenced above, Big Data refers to data that are large in volume, collected at high velocity, and available in a variety of sources, formats, and dimensions, such as birth cohort and longitudinal studies, medical records, or wearable/phone devices. Birth cohort studies, such as the Avon Longitudinal Study of Parents and Children (ALSPAC), have supported the DOHaD hypothesis through in-depth prospective sampling and large multidimensional data collection from human participants. While integral to the DOHaD evidence base, standard cohort studies are costly and may be of limited size. EHRs, a source of Big Data, can be obtained from centralised systems, while large omics data sets (genomics, transcriptomics, metabolomics, etc.) are often sourced from biobanks and can be added to increasingly available personal and external ‘exposome’ data (e.g. lifestyle and environmental). Among the benefits of EHRs are their comprehensiveness and larger samples, which improve the statistical power required to provide accurate estimates of effect size. The availability of such data means that, if taken up in DOHaD research, the scope for such studies would no longer be limited by small sample sizes due to funding and/or the restrictive protocols of conventional longitudinal birth cohorts [Reference Delpierre and Kelly-Irving22]. In practice already, linkage study designs join primary- and secondary-care databases or merge multiple EHR databases and registries, potentially offering new insight into disease pathways. For example, the UK-based CALIBER study drew on EHR sources to investigate the cumulative incidence and period prevalence of diseases over the lifecourse. Results were presented in the form of a chronological map of 308 physical and mental health conditions from four million individuals, from infants to the elderly [Reference Kuan, Denaxas, Gonzalez-Izquierdo, Direk, Bhatti and Husain23]. The role of universal medical coverage and centralised digital health records, such as those of the English National Health Service, in enabling such exploration cannot be overstated.
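The core of a linkage design can be sketched with ordinary data-frame operations. The file names, column names, and deterministic identifier join below are assumptions made for illustration; real EHR linkage typically involves probabilistic matching, pseudonymisation, and governance steps not shown here.

```python
import pandas as pd

# Hypothetical extracts: a birth register and later primary-care records,
# both keyed on a pseudonymised patient identifier
births = pd.read_csv("birth_register.csv")      # patient_id, birth_date, birthweight, gest_age
primary_care = pd.read_csv("primary_care.csv")  # patient_id, event_date, diagnosis_code

# Deterministic linkage on the shared identifier; 'left' keeps every birth record
linked = births.merge(primary_care, on="patient_id", how="left")

# Example derived exposure-outcome pair for a lifecourse question:
# low birthweight followed by a type 2 diabetes code in adulthood
linked["lbw"] = linked["birthweight"] < 2500
linked["t2d"] = linked["diagnosis_code"].str.startswith("E11", na=False)  # ICD-10 E11.*

summary = linked.groupby("patient_id").agg(lbw=("lbw", "max"), t2d=("t2d", "max"))
print(pd.crosstab(summary["lbw"], summary["t2d"]))
```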
More recently, data acquired from real-time biosensors measuring pollution exposure, blood glucose levels, or heart rate from wearable devices have become available. Data on behaviour and social networks can also be retrieved from open social media platforms at high speed. The past three years, especially during the COVID-19 pandemic, have seen a surge of software development intended to meet the need to monitor health markers and well-being remotely. The uptake of telemedicine was enabled, for example, by digital platforms used by clinicians to manage antenatal hyperglycaemia [Reference Jardine, Relph, Magee, von Dadelszen, Morris and Ross-Davie24] and the self-reporting of glucose levels by pregnant women on their phones [Reference Mackillop, Hirst, Bartlett, Birks, Clifton and Farmer25]. These tools could be employed in future DOHaD studies. However, key ethical issues related to privacy, rights, and moral codes of conduct when retrieving these data require careful consideration in this changing research landscape.
25.3.1 Limitations of Current Big Data Sources and Applications
A first potential caveat of relying on Big Data sources is that DOHaD researchers wanting to use Big Data may be obliged to formulate research questions based on data availability or to include data not primarily designed for DOHaD research. These researchers will have less control over data quality because they are further removed from data collection, that is, from the inputting user (a clinician for EHRs/hospitals, or the user of a phone app), and from the decisions made in defining and measuring the variables in these data sets. Overall, sources of error need to be considered when evidence from Big Data is evaluated. Without researchers’ involvement in data collection, it may be impossible to subsequently correct or even identify these errors.
The task of comparing and validating DOHaD models across populations may be further hindered by the heterogeneity in data architectures across national and international sources. Before the term Big Data surfaced, omics-derived data alone (e.g. genomics, transcriptomics, and proteomics) already comprised outputs of millions of data points [Reference Lapehn and Paquette3]. Formulating a cascading model of these omics layers, which follow biologically downstream from one another, is both necessary and extremely complex. Further, linking biologically derived material to clinical data of different formats, based on a variety of measurements including imaging, questionnaires, and diagnoses, requires technologies that facilitate multidimensional integration. Thereafter, powerful methods that support analysis are necessary.
Larger sample sizes improve the power to detect effects, and clearly the whole DOHaD framework requires both large samples and a comprehensive set of exposures and events to be modelled. However, the primary issue is that complex models are more difficult to explain and thus could complicate their practical translation into actionable policies. Users and clinicians equally need to be versed in their use and interpretation.
The use of Big Data also raises issues of data security and representation, particularly for data obtained outside conventional academic institutions, in contexts where systems and resources are not fit for this purpose, such as in low- and middle-income countries (LMICs). Users of both healthcare services and digital platforms, such as social media, may represent distinct groups with possibly little overlap in demography, risks, and healthcare needs [Reference Delpierre and Kelly-Irving22]. It is plausible to assume that LMICs are unlikely to have population-wide health records or access to digital data collection and remote health monitoring from which DOHaD modelling could be done. Here, the validation and reproducibility of DOHaD models are less feasible, and the exclusion of these settings perpetuates their lack of representation.
25.4 Artificial Intelligence: Challenges and Opportunities
When Big Data is considered, the term AI is not far behind. Pattern recognition, similarity profiling, and prediction are tasks for which AI methods such as machine learning (ML) have been developed, and these have potential applications to DOHaD research. It should be noted that ML and conventional statistical methods may be seen as a continuum, since the algorithms behind ML, including linear and logistic regression and several dimensionality reduction techniques, have existed for decades. (For a contrast between ML and conventional statistics, see [Reference Bzdok, Altman and Krzywinski26].) However, the real advantage of AI is that it supports the analysis of large data volumes alongside multidimensionality (i.e. where the number of variables is larger than the number of subjects).
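The continuum can be made concrete with a small sketch (simulated data; package choices are ours for illustration): an unpenalised logistic regression fitted with a ‘statistical’ library and with an ‘ML’ library recovers essentially the same coefficients, because it is the same underlying model.

```python
import numpy as np
import statsmodels.api as sm
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n = 5_000
x = rng.normal(size=(n, 2))                    # two simulated exposures
logit = -1.0 + 0.8 * x[:, 0] - 0.5 * x[:, 1]
y = rng.binomial(1, 1 / (1 + np.exp(-logit)))  # simulated binary outcome

# 'Conventional statistics': maximum-likelihood logistic regression
stat_fit = sm.Logit(y, sm.add_constant(x)).fit(disp=0)

# 'Machine learning': the same model via scikit-learn, with penalisation switched off
# (penalty=None in recent scikit-learn versions; older versions use penalty='none')
ml_fit = LogisticRegression(penalty=None, max_iter=1000).fit(x, y)

print("statsmodels coefficients:", np.round(stat_fit.params[1:], 3))
print("scikit-learn coefficients:", np.round(ml_fit.coef_[0], 3))
# The estimates coincide up to optimiser tolerance.
```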
AI is already being tested and implemented in the clinical domain, including to improve the efficiency of hospital administration. AI is also being used to predict medication side effects and patient outcomes from radiological imaging, and thereby to promote patient-tailored medicine and interventions. In DOHaD, ML approaches would support exploratory designs that identify biological pathways recurring in the mosaic of data, in the form of associations, including from DNA sequences and omics data and from EHRs [Reference Bzdok, Altman and Krzywinski26]. Certain applications require user input from which the ML ‘learns’ sets of rules in previous data to classify new data (supervised learning), while others detect patterns entirely unsupervised. This is similar to the latent modelling techniques mentioned earlier that derive from ‘classical’ statistics and the structural equation modelling framework. Other subfields of AI associated with ML include deep learning, rooted in multi-layered neural networks, which allow computers to identify relations between concepts/features and to characterise these associations from complex to simpler concepts [Reference Richards, Lillicrap, Beaudoin, Bengio, Bogacz and Christensen27]. ML methods to date could also assist in the processing of single modalities or data collection methods, such as magnetic resonance imaging of the brain, heart rate variability in the fetus, and DNA methylation patterns in disease [Reference Rauschert, Raubenheimer, Melton and Huang28], all of which are relevant for DOHaD research.
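A compact contrast between the supervised and unsupervised settings is sketched below with scikit-learn on simulated ‘omics-like’ data; all features, labels, and model choices are invented for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(1)

# Simulated features for 600 individuals and a binary outcome label
X = rng.normal(size=(600, 20))
y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.5, size=600) > 0).astype(int)

# Supervised learning: rules are learnt from labelled examples, then applied to held-out data
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
clf = RandomForestClassifier(random_state=0).fit(X_train, y_train)
print("Held-out accuracy:", round(clf.score(X_test, y_test), 2))

# Unsupervised learning: patterns (clusters) are sought without outcome labels,
# analogous to the latent class approaches described earlier
clusters = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
print("Cluster sizes:", np.bincount(clusters))
```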
Despite the potential of ML for DOHaD described above, very few ML studies have so far transitioned from a single data type and scale to ‘fusing’ several dimensions, or towards the integration of additional outcome measures retrieved from EHRs, medical imaging, and biospecimens. Data harmonisation and the deployment of ML are an ongoing endeavour, but some attempts have been made in relation to cardiovascular medicine (reviewed by [Reference Amal, Safarnejad, Omiye, Ghanzouri, Cabot and Ross29]). The issue to date is that these algorithms exploit two modalities at most (e.g. radiological imaging and free text from clinical reports). (For an in-depth review of the current landscape of AI for multi-modal integration, see [Reference Acosta, Falcone, Rajpurkar and Topol30].)
In the previous section, we discussed the need to characterise DOHaD prediction models accurately. At present, the several competing ML approaches and the rapidly evolving demands of AI have yet to produce a consensus on how to develop or validate a prediction model built with these novel tools. For example, in predicting Type 2 diabetes and cardiovascular disease, Dalakleidi et al. reported the best performance to have been achieved by groups of artificial neural networks, whereas decision trees, the random forest algorithm, and support vector machines were said to provide the best accuracy measures by Zheng et al. [Reference Zheng, Xie, Xu, He, Zhang and You31]. Furthermore, AI methods do not necessarily outperform conventional statistical regression applications and are not free of methodological biases [Reference Collins, Mallett, Omar and Yu32]. Additionally, scholars warn that ML studies (e.g. support vector machines, logistic regression, random forests, gradient-boosted machines, and neural networks) are often demanding on computational resources [Reference Christodoulou, Ma, Collins, Steyerberg, Verbakel and Van Calster33].
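In the absence of consensus, a common pragmatic step is to compare candidate models under the same cross-validation scheme. The sketch below does this with scikit-learn on simulated data standing in for a Type 2 diabetes risk task; the candidate models and settings are illustrative, not a recommendation.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Simulated prediction task (binary outcome, 30 candidate predictors)
X, y = make_classification(n_samples=2_000, n_features=30, n_informative=8, random_state=0)

candidates = {
    "logistic regression": LogisticRegression(max_iter=2000),
    "random forest": RandomForestClassifier(random_state=0),
    "gradient boosting": GradientBoostingClassifier(random_state=0),
}

for name, model in candidates.items():
    # Same folds and metric for every candidate; AUC averaged over 5 folds
    auc = cross_val_score(model, X, y, cv=5, scoring="roc_auc")
    print(f"{name:<20} mean AUC = {auc.mean():.3f} (sd {auc.std():.3f})")

# External validation in an independent population, calibration checks, and reporting
# against TRIPOD/PROBAST-AI items would still be needed before any clinical use.
```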
25.4.1 Issues of Interpretation and Reporting
Concerns are often raised about the scope, complexity, transparency, reproducibility across different scientific teams and populations, and the interpretability of prediction models. While AI operates as a ‘black box’ within deep neural networks and unsupervised learning, the biological plausibility and meaning of the output are generated by researchers. Given that so far only a few published prediction models have found utility in clinical practice, the added value of AI compared with conventional methods remains an open question. It is unclear whether AI can address the questions of causality most pertinent to DOHaD, when DOHaD draws on interpretability and theory and is moving towards the integration of social science and ethnography. (See Richardson, in this volume, for a discussion of how DOHaD is characterised by ‘cryptic causality’.)
Guidelines regarding AI are developing rapidly. For example, following a quality assessment of the conduct and reporting of multi-variable prediction models, a 22-item checklist (TRIPOD) was developed [Reference Collins, Reitsma, Altman and Moons34]. The risk of bias tool for diagnostic and prognostic prediction model studies based on AI (PROBAST-AI) has also emerged to ensure that users have key information about the design, conduct, and analysis, alongside a robust standardised tool for bias evaluation that allows a fair judgement on the utility of these models [Reference Zheng, Xie, Xu, He, Zhang and You31, Reference Sounderajah, Ashrafian, Golub, Shetty, De Fauw and Hooft35, Reference Collins, Dhiman, Andaur Navarro, Ma, Hooft and Reitsma36]. Guidelines should be consulted by authors, reviewers, and editors to ensure reproducibility, reliability, and validity, and hence safe implementation. Again, even before applying AI-derived analytics, the uncertainty of measurements in Big Data and its sources of error must be systematically identified and accounted for where possible, and the data quality validated.
25.6 Ethical Questions about Big Data and AI
The rapid expansion of Big Data and AI raises a range of ethical concerns. First, the question of who audits and protects data including EHRs (which can also be used as testing data by commercial parties) is central to ensuring ethical research in the DOHaD field and is currently insufficiently addressed [Reference Delpierre and Kelly-Irving22, Reference Ruckenstein and Schüll37]. The lack of standardisation of data protection laws between countries adds to this issue.
Additionally, the most powerful AI pipelines are deployed from within the very few corporations that hold the necessary computational and financial resources. The commercialisation of healthcare software and AI tools, and of their findings, within the private sector is another challenge that research institutions must navigate if the potential of such data is to be realised. Such a feat would require more transparency and possibly a move to open sources of data. (For an example of Google’s DeepMind approach to open data, see [Reference Jumper, Evans, Pritzel, Green, Figurnov and Ronneberger38].)
Nevertheless, open and public data collection is also likely to introduce other ethical issues that need to be carefully considered [Reference de Laat39]. Big Data collection and usage may move the position of the individual (the unique data provider) from fulfilling the ‘social vision’ of the healthcare system and science to serving the ‘economic vision’ of the commercial enterprise [Reference Delpierre and Kelly-Irving22]. Participants recruited through academic institutions consciously engage with the scientific community, give consent, and pledge their time voluntarily. This contrasts with the passive and often unwitting involvement in Big Data collection via the data generated by medical records and phone devices. The use of such data without consent raises concerns regarding data ownership, privacy, and the circulation of profits. Data protection in secondary research by academic institutions using public data is enforced by the institutions themselves through university and institutional ethics boards, but enforcing consent and data protection may be less clear-cut when third-party commercial and private bodies are concerned.
Rarely mentioned in the discussion of AI-derived methods are the risks that their implementation introduces or heightens for certain populations. Experts such as Professor Kate Crawford at the AI Now Institute are re-evaluating the societal burden of AI. She argues that AI is not truly ‘artificial’, since earthly resources and labour are required to provide the power and hardware sustaining it. Consequently, AI is also becoming a source of disparity and power imbalance on the ground, within and between populations who compete to mine these resources [Reference Campolo and Crawford40]. Such ramifications would be a real irony for the integration of AI into DOHaD research and the long-term agenda of the scientific community.
25.7 Conclusion
There is strong anticipation that, in the future, DOHaD researchers will benefit from innovative methodological designs. These could build on the best of current biostatistical methods and soon include AI technologies, provided the multidimensionality of data sources and a longitudinal format can be integrated and the outputs shown to be interpretable. Current and novel ‘mega’ projects may push this progress forward. An example is the protocol implemented in the EarlyCause project [Reference Mariani, Borsini, Cecil, Felix, Sebert and Cattaneo41], which will explore the causal mechanisms linking early-life adversity (antenatal and postnatal) to future psycho-cardio-metabolic multi-morbidity. It involves 14 European institutions and includes three complementary and sequential phases that integrate longitudinal population data sets (e.g. ALSPAC and UK BIOBANK), animal studies, and cellular models with analytical tools from structural equation modelling and machine learning. It also aims to offer a web-based platform for data access and information on research standards and best practices to support future study designs and exploration. Such a mix of granular data collection and Big Data sources in open access, along with AI and conventional statistical approaches, holds great potential for DOHaD research.
The DOHaD research community may look to other fields and consider how to train its own data and solution architects in the newest technologies and Big Data usage. Interdisciplinary teamwork will be crucial in ensuring both robust management and use of data and the anticipation of ethical and governance issues [Reference O’Doherty, Shabani, Dove, Bentzen, Borry and Burgess42]. It is crucial to assess whether certain limitations are inevitable or can be remedied to create the necessary, transparent, and reliable evidence base. Collaborations in data collection could be expanded more often to ‘crowdsourcing’ in data analysis and interpretation. Of course, teamwork in complex and dynamic modelling does not come without caveats, such as introducing further heterogeneity in findings and conclusions [Reference Silberzahn and Uhlmann43]. Nevertheless, we reiterate that attention to the operationalisation of exposures and outcomes, reducing bias in data collection and analysis, and the necessity for interpretability should be at the forefront of the DOHaD agenda in the era of Big Data and AI.