Introduction
Depression and anxiety disorders are among the most common mental disorders and are leading contributors to global disease burden (GBD 2017 Disease and Injury Incidence and Prevalence Collaborators, 2018). Rates of depression and anxiety increase dramatically during adolescence, and adolescent onsets portend worse outcomes than onsets in adulthood (Beesdo, Knappe, & Pine, 2009; Fleisher & Katz, 2001; Kessler, Chiu, Demler, & Walters, 2005). However, predicting which individuals will experience depression and anxiety in adolescence remains an extremely difficult task. There is increasing recognition of the immense complexity of psychopathology, necessitating a shift away from simple etiological models and toward a complex dynamic systems perspective that recognizes that mental disorders arise from the interplay of numerous interacting components on multiple levels of analysis (Fried & Robinaugh, 2020). This shift in conceptualization, however, is not yet reflected in the dominant statistical paradigms used to study mental illness (Dwyer, Falkai, & Koutsouleris, 2018).
Traditional methods for examining prediction of disorder onsets (e.g. low dimensional linear and logistic regression) are highly limited in the complexity they can accommodate, both in terms of the number of variables and the types of relationships (e.g. non-linear, multi-way interactions) that can be modeled simultaneously (Dwyer et al., 2018). Specifically, these methods are liable to overfitting as complexity increases; that is, they produce models that increasingly reflect the idiosyncratic characteristics of a particular sample and thus do not generalize well to other samples (Whelan & Garavan, 2014). Machine learning (ML) methods, on the other hand, are uniquely suited to the task of prediction (Yarkoni & Westfall, 2017). ML is an umbrella term that subsumes a range of flexible mathematical techniques that identify patterns in a (training) dataset with the central goal of producing a model that maximizes prediction of new (test) data. Evaluation of models on their ability to accurately predict out-of-sample data places generalizability and replicability at the core of ML (Coutanche & Hallion, 2019).
ML has the potential to make significant contributions to predicting depressive and anxiety disorders by accounting for the manifold relationships between the biological, cognitive, emotional, interpersonal, and environmental factors that give rise to affective psychopathology. Further, the prioritization of generalizability is critical to translating research findings to real-world applications in clinical settings and to improving replication of findings in a field facing a ‘replicability crisis’ (Tackett, Brandes, King, & Markon, 2019). This potential is increasingly being realized, as there are already a number of ML applications to depression and (to a lesser extent) anxiety disorders (Shatte, Hutchinson, & Teague, 2019); however, the existing literature remains limited in numerous ways.
First, many studies have trained their algorithms on low quality outcome measures (e.g. a self-report questionnaire; Andersson, Bathula, Iliadis, Walter, & Skalkidou, 2021; Su, Zhang, He, & Chen, 2021). Additionally, despite the capacity of ML to accommodate a large number of features (i.e. variables), most studies have considered a limited range of potentially relevant factors. For example, a number of studies have exclusively used neuroimaging data (e.g. Sato et al., 2015) or medical records (e.g. Nemesure, Heinz, Huang, & Jacobson, 2021), with few combining several domains of interest (e.g. clinical records, personality measures, cognitive tests, and biological data). Both of these limitations are, at least in part, a consequence of ML requiring large samples, for which thorough diagnostic assessment (e.g. by a trained interviewer) and extensive measurement of relevant risk factors are less feasible.
Most existing ML applications in depression and anxiety have used fully cross-sectional data, seeking to improve detection of current depression or anxiety disorders (Guntuku, Yaden, Kern, Ungar, & Eichstaedt, 2017; Kumar, Garg, & Garg, 2020; Liu, Hankey, Cao, & Chokka, 2021). These studies are helpful for aiding in differential diagnosis, flagging individuals already in the health care system, or identifying struggling individuals through social media posts; however, they are unlikely to generalize to prediction of future depression and anxiety, as the feature sets (i.e. the collection of predictor variables) include factors that co-occur with or are a consequence of the disorder. Building toward prevention requires training models on data that temporally precede disorder occurrence. A handful of studies have used ML methods to predict future depression or anxiety, although often in specific subgroups (e.g. postpartum; Andersson et al., 2021; Zhang, Wang, Hermann, Joly, & Pathak, 2021), over relatively short intervals (e.g. 6 months to 1 year; Bellón et al., 2011; Eichstaedt et al., 2018; King et al., 2008), and, to our knowledge, exclusively in adult samples (Kessler et al., 2016; Rosellini et al., 2020; Wang et al., 2014). Most individuals who will meet criteria for a mental disorder do so by the age of 18, with most first onsets occurring during adolescence (Caspi et al., 2020). Adult samples are therefore similarly limited: they either confound risk factors with the consequences of prior mental illness or, when restricted to first-onset cases, exclude the majority of individuals who will experience a mental illness in their lifetime.
A final and critical gap in ML applications to depression and anxiety disorders thus far is that the appropriate timing of risk assessment (i.e. at what age, how proximal to disorder occurrence) has been virtually unexplored. Determining the optimal timing of risk screening requires longitudinal assessment across more than two waves of data collection. To our knowledge, no prior study has used features assessed across multiple waves of data collection to predict depression or anxiety at a future wave.
Depression onsets peak in mid-adolescence, and while some anxiety disorders onset in childhood, rates of anxiety also increase drastically during adolescence, with many anxiety disorder onsets occurring during this period through early adulthood (e.g. social anxiety, panic, agoraphobia, generalized anxiety, and obsessive-compulsive disorder; Campbell, Brown, & Grisham, 2003; de Lijster et al., 2017; Kessler et al., 2005). Adolescent depression and anxiety are associated with a host of negative outcomes, including increased risk of disorder persistence and recurrence, increased comorbidity, and worse psychosocial functioning later in life (Essau, Lewinsohn, Olaya, & Seeley, 2014; Fleisher & Katz, 2001; McLeod, Horwood, & Fergusson, 2016; Naicker, Galambos, Zeng, Senthilselvan, & Colman, 2013). Therefore, identifying individuals at risk of experiencing depression and anxiety during this developmental period is particularly critical. In the current study, we used ML to prospectively predict cases of depression and anxiety disorders during adolescence in an unselected community sample (N = 374). The feature set included a large and diverse collection of potentially important risk factors spanning psychopathology, temperament/personality, family environment, life stress, interpersonal relationships, neurocognitive, hormonal, and neural functioning, and parental psychopathology and personality, assessed at 3-year intervals across development from ages 3 to 12. The outcomes – diagnoses of depression and anxiety disorder at the age 15 wave – were assessed through a semi-structured diagnostic interview conducted by trained interviewers.
The primary purpose of this study was to leverage ML methods to evaluate the contribution of information from multiple developmental stages across childhood and early adolescence to prediction of depression and anxiety in mid-adolescence. This allowed us to address questions about the timing of risk assessment, such as how early risk assessment can be fruitful and whether longitudinal assessment provides substantially better prediction of risk than a single assessment at a key developmental stage. We additionally sought to explore the upper bounds of prediction that can be achieved when such a large number of highly relevant features spanning multiple domains are considered and to assess the incremental gains in prediction afforded by including such a volume of information (i.e. features) over a standard minimal risk assessment (specifically, recent disorder history and basic demographics).
To accomplish these goals, we compared prediction of disorder status at age 15 from disorder status at age 12 along with demographics, both alone and in combination with extensive risk factor data from individual prior waves and with risk factor data combined across multiple prior waves. To meet the challenge of working with a large feature set spanning multiple waves, we used canonical correlation analysis (CCA), a multi-view dimensionality reduction technique that preserves the longitudinal structure of the data (Witten, Tibshirani, & Hastie, 2009).
Methods
Procedures
Data were from an ongoing study of the development of psychopathology that has followed children and their families at 3-year intervals since the participating child was 3 years old (Klein & Finsaas, 2017). Initial recruitment of families with a 3-year-old child living within a 20-mile radius of Stony Brook, New York was conducted via commercial mailing lists. At each wave, families were invited to the lab to complete a battery of assessments. When lab visits were not feasible, questionnaires and interviews were completed remotely. A parent provided written informed consent at the start of each assessment, and the child provided assent starting at the age 9 wave. The Stony Brook University Institutional Review Board approved all study procedures.
Participants
Families were eligible to participate if the primary caretaker spoke English and was the child's biological parent, and if the child did not have a significant medical disorder or developmental disability. Of the total of 559 participants, 374 were included in the current analyses (data exclusion described below). Included participants were predominantly male (53.5%), White (94.7%), and non-Hispanic (90.9%). Excluded participants did not differ significantly from included participants in demographic profile.
Measures
An online Supplementary Excel file titled ‘List of Features’ contains the full list of features included from each wave. Briefly, the features covered a range of important domains, including clinical features (e.g. diagnoses and dimensional symptom scores of all common mental disorders), temperament and personality (e.g. behavioral inhibition, negative and positive emotionality, effortful control, intolerance of uncertainty, rumination), environmental factors (e.g. stressful life events, bullying, parental criticism and support), biological/neurocognitive factors (e.g. pubertal hormones, morning and evening cortisol levels, resting electroencephalography and event-related potentials in a variety of emotion-relevant tasks, executive functions, attentional and memory biases), and a number of parental factors (e.g. parental psychopathology and personality). Each assessment wave included parent and, starting at age 9, child interviews and questionnaires, saliva samples, and laboratory behavioral and neural measures. The features were not identical across waves. This is typical of developmental research, as the relevance of risk factors and the appropriateness of measurement modalities (e.g. self-report v. parent report) change across development, but it nevertheless leads to some confounding of age with differences in features.
Prior (age 12) and outcome (age 15) depression and anxiety were diagnosed with a semi-structured diagnostic interview, the Kiddie Schedule for Affective Disorders and Schizophrenia-Present and Lifetime version (K-SADS-PL; Axelson, Birmaher, Zelazny, Kaufman, & Gill, 2009). Diagnoses were based on the interval since the previous assessment (e.g. since the age 12 wave at the age 15 wave) and were used for the baseline and outcome variables. Doctoral students in clinical psychology and master's-level clinicians administered the K-SADS first to the parent (about the child) and then to the child (about themselves). Parent and child reports of symptoms were combined into summary ratings, which were used to assign a diagnosis based on either the Diagnostic and Statistical Manual of Mental Disorders 4th (DSM-IV; American Psychiatric Association, 1994) or 5th (DSM-5; American Psychiatric Association, 2013) edition criteria. All cases with a suspected diagnosis were reviewed in a case conference co-led by a child psychiatrist and a clinical psychologist. Diagnosis of depression included major depressive disorder, dysthymic disorder, and depressive disorder-not otherwise specified (NOS; DSM-IV) or other specified depressive disorder (DSM-5); diagnosis of anxiety disorder included specific phobia, social phobia (DSM-IV) or social anxiety (DSM-5), agoraphobia, and panic, generalized anxiety, separation anxiety, obsessive compulsive, post-traumatic stress, and acute stress disorder, and anxiety disorder-NOS (DSM-IV) or other specified anxiety disorder (DSM-5). Interviewers independently rated videotaped interviews to assess inter-rater reliability (kappa = 0.72 and 0.91 for depression and anxiety disorders, respectively).
Data analysis
Preprocessing
The investigators preselected a subset of 429 features from all available data that comprehensively covered the range of constructs assessed in the study while minimizing redundancy (e.g. selecting a scale total score over correlated lower-order subscales). ML methods require complete data, so we excluded cases and features missing ⩾80% of data as well as cases without both outcome variables. Remaining missing values for features were imputed, using the mean for numerical features and the mode for categorical features. Categorical features with more than two levels were transformed into separate dummy-coded variables for each level (final feature set = 517).
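To make these preprocessing steps concrete, the following is a minimal sketch in Python using pandas and scikit-learn. The file name and column handling are hypothetical placeholders rather than the study's actual code; the sketch only illustrates the ⩾80% missingness exclusions, mean/mode imputation, and dummy coding described above (and, for simplicity, dummy codes all categorical features rather than only those with more than two levels).

    import pandas as pd
    from sklearn.impute import SimpleImputer

    # Hypothetical file containing the preselected features for all waves
    features = pd.read_csv("features_all_waves.csv")

    # Exclude cases (rows) and features (columns) missing >= 80% of data
    features = features.loc[features.isna().mean(axis=1) < 0.80,
                            features.isna().mean(axis=0) < 0.80]

    numeric_cols = features.select_dtypes(include="number").columns
    categorical_cols = features.columns.difference(numeric_cols)

    # Impute remaining missing values: mean for numeric, mode for categorical features
    features[numeric_cols] = SimpleImputer(strategy="mean").fit_transform(features[numeric_cols])
    features[categorical_cols] = SimpleImputer(strategy="most_frequent").fit_transform(features[categorical_cols])

    # Dummy code categorical features into separate indicator variables per level
    features = pd.get_dummies(features, columns=list(categorical_cols))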
Machine learning
Preprocessing resulted in a feature set that was still very large relative to the number of observations. To mitigate multicollinearity and reduce the number of supervised model parameters (Hastie, Tibshirani, & Friedman, 2009), we used dimensionality reduction to reduce the features to 10 dimensions per wave while maximally preserving information (i.e. variance). To preserve the longitudinal structure (i.e. that groups of variables came from the same wave), we used a multi-view dimensionality reduction technique, CCA (Witten et al., 2009), to create components (i.e. linear combinations of related features) within a wave that were maximally correlated across waves. This approach allowed us to fit a single model while keeping each wave's low dimensional components (‘views’) separate from those of other waves. We extracted 10 components per wave, as multiples of 10 are conventional and more than 10 components per wave would result in large multi-wave models (>40 features) at risk of overfitting in our relatively small sample.
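As a rough illustration of this step, the sketch below uses scikit-learn's two-view CCA to extract 10 maximally correlated components from two waves of features. This is a simplification: the study used a multi-view formulation (Witten et al., 2009) spanning all four waves, and the arrays here are random placeholders standing in for the actual feature matrices.

    import numpy as np
    from sklearn.cross_decomposition import CCA

    rng = np.random.default_rng(0)
    X_age9 = rng.normal(size=(374, 120))    # placeholder for the age 9 feature matrix
    X_age12 = rng.normal(size=(374, 150))   # placeholder for the age 12 feature matrix

    cca = CCA(n_components=10)              # 10 components per wave, as in the main analysis
    comps_age9, comps_age12 = cca.fit_transform(X_age9, X_age12)
    print(comps_age9.shape, comps_age12.shape)   # (374, 10) (374, 10)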
Classification was performed using L2-penalized logistic regression, a regularization method that shrinks the regression coefficients by imposing a penalty on the maximum likelihood parameter estimates based on the squared magnitude of the coefficients, as is standard in ML to guard against overfitting (James, Witten, Hastie, & Tibshirani, 2013). The L2 penalty, also known as the ‘ridge’, adjusts for collinearity between variables, which is especially beneficial in longitudinal studies in which multiple waves of similar variables covary (Eliot, Ferguson, Reilly, & Foulkes, 2011; Miché et al., 2020). Algorithms were trained on depression and anxiety outcomes using k-fold cross-validation (CV) with 10 folds. Briefly, 10-fold CV partitions all observations into 10 roughly equally sized, mutually exclusive, and randomized subgroups (folds). The algorithm is trained on nine of the folds and the resulting model is used to predict the fold that was left out (i.e. the test set). Thus, the data used to train the algorithm are never contaminated with information from the data on which its accuracy is evaluated. This process is repeated until predictions have been made for all 10 folds (Koul, Becchio, & Cavallo, 2018). Performance was indexed using the area under the receiver operating characteristic curve (AUC). AUC values were computed for each fold and then averaged across folds to produce a more stable estimate of out-of-sample performance.
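A minimal sketch of this classification and cross-validation scheme, on placeholder data, is shown below. Scikit-learn's LogisticRegression applies the L2 (ridge) penalty by default, and cross_val_score returns one AUC per held-out fold, which is then averaged; the regularization strength and other settings here are library defaults rather than the study's exact configuration.

    import numpy as np
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import StratifiedKFold, cross_val_score

    rng = np.random.default_rng(0)
    X = rng.normal(size=(374, 12))        # e.g. 10 CCA components plus prior diagnosis and demographics
    y = rng.binomial(1, 0.15, size=374)   # placeholder age 15 diagnostic status (0/1)

    clf = LogisticRegression(penalty="l2", solver="lbfgs", max_iter=1000)   # L2 ("ridge") penalized
    cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)         # 10-fold CV

    fold_aucs = cross_val_score(clf, X, y, cv=cv, scoring="roc_auc")        # AUC for each held-out fold
    print(fold_aucs.mean())                                                 # averaged out-of-sample AUC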
Ablation analysis
We conducted an ablation analysis to evaluate the relative contributions to prediction of information gathered at different developmental periods and their contribution relative to prior disorder status (i.e. age 12 diagnosis of depression or anxiety disorder, depending on the outcome). Ablation analysis is the process of training algorithms on different configurations of features and then comparing performance metrics across configurations to assess which features contribute to prediction (Fawcett & Hoos, 2016). We compared models containing CCA components from each individual wave (i.e. ages 12, 9, 6, and 3) and in cumulative combinations (i.e. ages 12–3, 12–6, and 12–9), alongside prior disorder status and demographics (A12 Dx + Demos), to a model containing just A12 Dx + Demos. Demographic features included sex, race, and ethnicity. To determine whether these comparisons reflected statistically significant differences in AUCs, we used a permutation test.
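The paper does not detail its permutation scheme, so the following is only a generic sketch of one common approach to testing an AUC difference between two model configurations evaluated on the same observations: the two models' out-of-fold predicted probabilities are randomly swapped within observations to build a null distribution of AUC differences. The function and variable names are hypothetical.

    import numpy as np
    from sklearn.metrics import roc_auc_score

    def auc_permutation_test(y_true, p_model_a, p_model_b, n_perm=10_000, seed=0):
        rng = np.random.default_rng(seed)
        observed = roc_auc_score(y_true, p_model_a) - roc_auc_score(y_true, p_model_b)
        null_diffs = np.empty(n_perm)
        for i in range(n_perm):
            swap = rng.random(len(y_true)) < 0.5           # randomly swap the two models' scores
            a = np.where(swap, p_model_b, p_model_a)
            b = np.where(swap, p_model_a, p_model_b)
            null_diffs[i] = roc_auc_score(y_true, a) - roc_auc_score(y_true, b)
        p_value = np.mean(np.abs(null_diffs) >= abs(observed))   # two-sided p value
        return observed, p_value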
Sensitivity analyses
To ensure that our conclusions were robust to choice of classifier, we tested two additional classification algorithms: random forests and neural networks. Additionally, to demonstrate the advantage L2 penalization affords to prediction, we also fit models using traditional logistic regression as a benchmark of conventional statistical approaches in psychology and psychiatry. Details and results of these analyses are presented in the online Supplementary section S2.
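As an illustration of how such a classifier swap can be run in scikit-learn, the snippet below evaluates random forest, neural network (multilayer perceptron), and unpenalized logistic regression models with the same cross-validation scheme. The hyperparameters and data are placeholders, not the settings reported in online Supplementary section S2.

    import numpy as np
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import StratifiedKFold, cross_val_score
    from sklearn.neural_network import MLPClassifier

    rng = np.random.default_rng(0)
    X = rng.normal(size=(374, 12))        # placeholder feature matrix (e.g. CCA components + A12 Dx + Demos)
    y = rng.binomial(1, 0.15, size=374)   # placeholder age 15 diagnostic status

    classifiers = {
        "random_forest": RandomForestClassifier(n_estimators=500, random_state=0),
        "neural_network": MLPClassifier(hidden_layer_sizes=(32,), max_iter=2000, random_state=0),
        # in scikit-learn >= 1.2, use penalty=None instead of "none"
        "unpenalized_logistic": LogisticRegression(penalty="none", solver="lbfgs", max_iter=1000),
    }

    cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
    for name, clf in classifiers.items():
        aucs = cross_val_score(clf, X, y, cv=cv, scoring="roc_auc")
        print(name, round(aucs.mean(), 3))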
All data analyses were conducted using Python 3.7 with the libraries DLATK v1.1 (Schwartz et al., 2017) and scikit-learn v0.22.2 (Pedregosa et al., 2011).
Results
Depressive disorders
Table 1 displays prediction accuracy results for models predicting age 15 depressive disorders. The top section of Table 1 displays the AUCs for the models excluding A12 Dx + Demos (i.e. CCA components only), and the p values comparing these models to chance. Across models with CCA components from individual waves and combinations of successive waves, all but the models with only age 3 components (AUC = 0.556) and only age 6 components (AUC = 0.608) performed significantly better than chance (AUCs = 0.669–0.751).
Note for Table 1: cells contain area under the receiver operating characteristic curve (AUC) values. A, age; A12 Dx, age 12 depression diagnostic status; Demos, demographics (sex, race, and ethnicity).
The bottom section of Table 1 displays the AUCs for the models combining age 12 depression and demographics with the CCA components (i.e. CCA components + A12 Dx + Demos), and the p values comparing these models to chance and to the comparison model without CCA components (i.e. A12 Dx + Demos alone). All of these models performed better than chance, except for the individual wave model with age 3 components (AUC = 0.599). All models combining components across successive waves performed significantly better than the comparison model (AUCs = 0.739–0.748). The only model including components from an individual wave that performed significantly better than the comparison model was the model with age 12 components (AUC = 0.744).
The comparison model including A12 Dx + Demos produced an AUC of 0.633, which was significantly better than chance (0.500). Without the demographics, age 12 depression status did not predict age 15 depression better than chance (AUC = 0.522).
Anxiety disorders
Table 2 displays prediction accuracy results for all models predicting age 15 anxiety disorders. The top section of Table 2 displays the AUCs for the models excluding A12 Dx + Demos (i.e. CCA components only), and the p values comparing these models to chance. Across models with CCA components from individual waves and combinations of successive waves, all models performed significantly better than chance (AUCs = 0.621–0.788).
Note for Table 2: cells contain area under the receiver operating characteristic curve (AUC) values. A, age; A12 Dx, age 12 anxiety diagnostic status; Demos, demographics (sex, race, and ethnicity).
The bottom section of Table 2 displays the AUCs for the models combining age 12 anxiety and demographics with the CCA components (i.e. CCA components + A12 Dx + Demos), and the p values comparing these models to chance and to the comparison model (i.e. A12 Dx + Demos alone). All models performed better than chance and models combining components across successive waves performed significantly better than the comparison model (AUCs = 0.807–0.812). The only models including components from an individual wave that performed significantly better than the comparison model were the models with age 12 (AUC = 0.810) and age 9 (AUC = 0.805) components.
The comparison model including A12 Dx + Demos produced an AUC of 0.774, which was significantly better than chance. Without the demographics, age 12 anxiety status still predicted age 15 anxiety disorder better than chance (AUC = 0.720).
Sensitivity analyses
Results for the sensitivity analyses using different classification algorithms (i.e. neural networks, random forests, and logistic regression without regularization) are displayed in online Supplementary Tables S1 and S2 for depression and anxiety, respectively. Results of the main analyses are also included in these tables for ease of comparison. The pattern of results was generally consistent with the main findings, with a few exceptions. For example, not all combinations of CCA components from consecutive waves improved prediction of age 15 depression over A12 Dx + Demos using the additional classifiers, whereas all combinations did so for the L2-penalized logistic regression models. Additionally, none of the models using the additional classifiers improved prediction of age 15 anxiety over A12 Dx + Demos, whereas for the L2-penalized logistic regression models, all combinations of components from successive waves, as well as the individual age 9 and age 12 waves, improved prediction. Notably, L2-penalized regression generally produced higher accuracy than the equivalent models in the sensitivity analyses, particularly the logistic regression models without regularization. Among the additional ML classifiers, the highest performing classifier differed by model, though differences were mostly within the margin of error.
Discussion
The current study used ML to predict depression and anxiety disorders in mid-adolescence using information from multiple waves of assessment across childhood and early adolescence in an unselected community sample. Our primary aim was to determine the relative contributions to prediction of information from different, and multiple, developmental stages across childhood and early adolescence. We also sought to explore the upper bounds of prediction that can be achieved when such a large number of highly relevant features spanning multiple domains are considered, and to assess the incremental gains in prediction afforded by including such a volume of information (i.e. features) over knowing prior disorder status and basic demographics.
With regard to the timing of risk assessment, our results comparing model performance to chance suggest that screening for risk of adolescent anxiety can be successful as early as age 3, whereas for depression, screening may only be successful starting at age 9. Accuracy estimates were higher at the more proximal waves, which is unsurprising: greater delay between risk assessment and disorder occurrence increases the odds that new risk and resilience factors come into play, decreasing the influence of vulnerabilities assessed earlier. In addition, youth self-report only becomes feasible at later ages, expanding the range of constructs that can be assessed at the more proximal waves.
It is notable that combining information across waves either only marginally improved prediction or worsened prediction relative to the individual wave models with components from the most proximal wave (age 12). This demonstrates the limitations of the smaller sample sizes highly typical in psychology and psychiatry. Specifically, our multi-wave models contain many times more parameters than the individual wave models, which translates to increased noise and a higher benchmark for detecting generalizable signals, in effect underfitting the data after shrinkage compared to a model with fewer parameters (Hastie et al., 2009). In other words, it is more difficult to separate reliable signal from noise in data with many features. However, larger samples could offset this and enable the detection of smaller effects that may be present in the multi-wave data.
Our sensitivity analyses provide another example of the tradeoff between complexity and generalizability in smaller samples. The models using neural networks and random forests performed equivalently to or slightly worse than models using L2-penalized regression. The former two ML methods capture more complex relationships than the latter, which in practice translates into more model parameters and thus the same limitation (i.e. difficulty separating signal from noise). Much larger samples are needed to leverage the more complex capabilities of ML (Hastie et al., 2009). Notably, however, the L2-penalized logistic regression models produced higher accuracy across the board than their counterparts using logistic regression without regularization, the more traditional statistical approach in psychology and psychiatry. This highlights the benefits of using ML methods, even with the relatively small sample sizes common in psychopathology research.
Often, the goal of ML studies is to develop a prediction algorithm that can be translated to applied settings (e.g. risk screening in a hospital). The current study does not share this goal, as the volume and variety of features could not be practically and economically assessed in any applied setting. Rather, this uniquely comprehensive feature set allows us to estimate the upper bounds of prediction that can be achieved in this idealized risk assessment context. Our highest performing model for predicting depression at age 15, which included all information (i.e. components) from the age 9 and 12 waves, achieved an AUC of 0.751. For anxiety, our highest performing model achieved an AUC of 0.812 and combined information from the age 9 and 12 waves alongside prior disorder status and basic demographics (i.e. sex, race, and ethnicity). For reference, these AUCs approximately correspond to Cohen's d values of 0.96 and 1.26, respectively, which are considered large effects (Rice & Harris, 2005).
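For readers wishing to reproduce this conversion, one common approach (assumed here, following the normal-model relation discussed by Rice & Harris, 2005) maps an AUC to Cohen's d via d = sqrt(2) * Phi^{-1}(AUC), where Phi^{-1} is the inverse standard normal cumulative distribution function:

    from math import sqrt
    from scipy.stats import norm

    def auc_to_cohens_d(auc):
        # d = sqrt(2) * inverse-normal(AUC), assuming equal-variance normal score distributions
        return sqrt(2) * norm.ppf(auc)

    print(round(auc_to_cohens_d(0.751), 2), round(auc_to_cohens_d(0.812), 2))   # roughly 0.96 and 1.25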
These findings are highly consistent with another prospective study using a similarly diverse collection of risk factors to predict depression and anxiety disorders in a mixed-age adult sample over an approximately 3-year follow-up period. Using data from the National Epidemiological Survey on Alcohol and Related Conditions (NESARC), Rosellini et al. (2020) obtained AUCs of 0.775 for depression and 0.780–0.799 for individual anxiety disorders. Achieving substantially better model performance may require more sophisticated ML techniques and use of less traditional types of data (e.g. social media; Guntuku et al., 2017), for which much larger samples can more realistically be obtained.
Using our comparison model (‘A12 Dx + Demos alone’) as a reference point, we were able to evaluate whether gathering additional information beyond a basic risk assessment (demographics and history of disorder) is helpful. Our findings from models combining information across waves suggest that assessing additional risk factors longitudinally across development can improve upon a basic risk assessment for both depression and anxiety in adolescence; however, risk screening at any single timepoint earlier than age 12 for depression, and age 9 for anxiety, may not improve prediction beyond knowing recent disorder history and basic demographics.
It may not be surprising that information from more distal waves did not improve prediction over prior disorder status. Both depression and anxiety demonstrate a moderate degree of homotypic continuity across development (Beesdo et al., 2009; Kessler et al., 2005), so it is likely that the vulnerabilities captured by the additional features confer risk for both the earlier and later instances of the disorder. In a statistical sense, age 12 disorder status captures nearly the same variance that is important for predicting age 15 disorder status, so it is difficult for prediction to improve. Further, as noted previously, most onsets of mental disorders occur during adolescence (Caspi et al., 2020), and rates of depression and many types of anxiety increase substantially during this time (Campbell et al., 2003; de Lijster et al., 2017; Kessler et al., 2005). Although the specific mechanisms producing this developmental pattern are unclear and likely manifold, it can reasonably be assumed that factors specific to the early-to-mid adolescent period (e.g. divergence in brain development, hormonal changes, increased relational and academic stressors), which would only have been captured at the later assessment waves, play an important role in influencing risk. An additional consideration is that we determined age 12 disorder status through a semi-structured interview administered by a trained interviewer. Such a thorough diagnostic assessment is often not practical in applied settings, which may only have the time and resources to administer screening questionnaires. In light of this, it is fairly remarkable that we were able to improve upon prediction from prior disorder status and demographics.
A final noteworthy result is the finding that baseline disorder status alone (i.e. excluding demographics) did not predict age 15 depressive disorders better than chance. We observed a fairly low rate of depressive disorder diagnoses at age 12 (N = 22), which is entirely consistent with depression's typically later age of onset (Kessler et al., 2005). A number of individuals in our sample did not develop depression until age 15, and many more will experience a first onset in the following years. For this reason, we also tested models excluding prior disorder status and compared their performance to chance (i.e. the results in the top sections of Tables 1 and 2). These models represent assessment contexts in which age 12 disorder history is not known or cannot be known (i.e. assessment prior to age 12).
This study possessed several strengths, including an unprecedented number and variety of important risk factors, a multi-wave longitudinal design allowing us to compare risk assessment across critical developmental periods, use of a multi-view dimensionality reduction technique that preserved the longitudinal structure of the data, and a rigorous outcome measure. A few important limitations should also be acknowledged. First, the sample is relatively small by ML standards, limiting our investigation to less sophisticated ML techniques (i.e. L2-penalized logistic regression). Although we used k-fold CV to increase generalizability to new data while maximizing the data available for testing, we did not test our models on truly independent data, so accuracy estimates are likely slightly inflated. Features were not identical across waves, as is typical for developmental studies because few measures and risk factors are appropriate across development (e.g. young children cannot provide reliable self-report, and peer relationships become more important in later childhood/adolescence). Nevertheless, the impact of differences in feature sets across waves cannot be fully teased apart from developmental differences in the relevance of vulnerabilities to risk. Additionally, through CCA we were able to impose an a priori structure based on time (wave of assessment), but, as can be seen in the online Supplementary file displaying the top features of each component, the components are not easily interpretable, combining features from multiple conceptual domains. We did not separate first onsets from recurrent and persisting cases because rates of disorders were relatively low, although typical of a community sample; the optimal timing of risk assessment may differ for first-onset cases. Finally, the sample is relatively geographically and racially homogeneous, limiting generalizability to more diverse populations.
Conclusion
In this study, we leveraged ML to prospectively predict adolescent depression and anxiety from risk assessments conducted across childhood development. Progress in translating research into a reduced burden of mental illness has been stifled by overreliance on statistical approaches that cannot meet the challenge of capturing such complex phenomena. This study demonstrates the potential of ML, which can accommodate a large number and variety of relationships while prioritizing generalizability, to contribute to efforts to reduce suffering from mental health problems.
Supplementary material
The supplementary material for this article can be found at https://doi.org/10.1017/S0033291722003452
Acknowledgements
This work was supported by the National Institute of Mental Health Grant R01 MH069942.
Conflict of interest
We have no known conflict of interest to disclose.