Introduction
Early detection of mental disorders has become a growing field with remarkable progress. Validated techniques for the individualized prediction of transition to diagnosed disorder are becoming increasingly available (Fusar-Poli et al., 2017, 2019; Koutsouleris et al., 2021). In the case of bipolar disorder, early detection plays a special role, since the correct diagnosis using current diagnostic approaches occurs on average 8.7–12.4 years after the appearance of first symptoms (Kessler et al., 2005; Lambert et al., 2013; Merikangas et al., 2011; Pfennig et al., 2011). This delay carries risks of incorrect treatment, such as antidepressant-induced (or unrecognized) mania (Lambert et al., 2013; Pfennig, Bschor, Falkai, & Bauer, 2013).
Aggregation of big data from multiple centers, combined with machine learning, has enabled individualized predictions for diagnostics, prognosis, and therapy response (Dwyer, Falkai, & Koutsouleris, 2018). In the field of early recognition, psychosis risk has received the most attention (Kambeitz-Ilankovic et al., 2015; Koutsouleris et al., 2015, 2016, 2018). Prediction of transition to psychosis in high-risk subjects can be substantially improved using machine learning, achieving up to 85.5% accuracy when combined with clinicians' judgments (Koutsouleris et al., 2021). Disproportionately fewer machine learning studies have focused on the early recognition of bipolar disorder (Claude, Houenou, Duchesnay, & Favre, 2020).
Among neuroimaging data, structural magnetic resonance imaging (MRI) is especially suitable for diagnostic and prognostic analyses using machine learning techniques. Most psychiatric disorders have been associated with brain structural markers or alterations. Recent large-scale multicentric studies of major psychiatric disorders within the ENIGMA consortium showed that, along with schizophrenia, bipolar disorder ranks highest in cortical thinning among major conditions beginning in early- to mid-adulthood (Abé et al., 2020; Ching et al., 2020). Unlike major depression, attention-deficit hyperactivity disorder (ADHD), obsessive-compulsive disorder, or autism, both disorders seem to be associated with similar patterns of large-scale cortical thinning in frontal, temporal, and parietal regions with relatively high effect sizes. From a practical point of view, structural MRI (sMRI) requires relatively short scanning sequences and modest compliance, and displays high test–retest reliability (Hedges et al., 2022). Unlike genetic predisposition, which is a major risk factor for bipolar disorder with transition rates of 4.2–22.4% in first-degree relatives (Hafeman et al., 2017; Kerner, 2014; Post et al., 2018), the use of sMRI in assessing risk for bipolar disorder has rarely been investigated.
Individuals at risk for bipolar disorder have been studied using two major approaches: family cohorts, i.e. first-degree relatives (Hajek et al., 2013), and help-seeking populations (Pfennig et al., 2020). The latter approach allows a broader range of risk factors to be studied, including specific subsyndromal manic or depressive symptoms, mood swings, changes in sleep and circadian rhythm, anxiety, ADHD, specific character traits, stressful life events, or substance use (Faedda et al., 2019; Leopold et al., 2012). For this purpose, and to facilitate risk recognition in help-seeking cohorts, several risk assessment tools have been developed, including the (extended) bipolar-at-risk criteria [BAR(S)] (Bechdolf et al., 2014; Fusar-Poli et al., 2018), the Bipolar Prodrome Symptom Interview and Scale (BPSS-P) (Correll et al., 2014), and the EPIbipolar interview (Leopold et al., 2012). It is a strength of our study that all three instruments are available for our cohort and were investigated as the dependent variable.
Several studies have explored the use of machine learning in classifying diagnosed bipolar disorder (Hajek et al., 2015; Nunes et al., 2020) and individuals with high genetic risk for bipolar disorder (i.e. first-degree relatives). A review by Claude et al. (2020) identified five studies that aimed to classify persons with genetic risk using different modalities, achieving accuracies from 59.7% up to 83.21%. Among those, two studies used regional cortical volumes (Hajek et al., 2015; Lin et al., 2018) and three used functional MRI (Frangou, 2019; Mourão-Miranda et al., 2012; Roberts et al., 2017). To the best of our knowledge, no multicenter machine learning study has yet been conducted to classify risk scores for bipolar disorder that include, but are not limited to, genetic risk. Based on the data from the Early-BipoLife study (Pfennig et al., 2020), we aimed to train a machine learning classifier using 10-fold cross-validation to stratify help-seeking subjects by estimated risk using sMRI. In contrast to single-center studies, we also used the multicenter design to validate the classifier on test data from an ‘unseen’ study site through a leave-one-site-out cross-validation. Our results may provide a proof-of-concept for the utility of sMRI data for individualized risk prediction in subjects seeking help.
Methods
Pre-registration
We pre-registered our analyses at the Open Science Framework (https://osf.io/c4hfn).
Sample
The data were collected within the multicenter Early-BipoLife study (Pfennig et al., 2020; Ritter et al., 2016). Early-BipoLife is a multicenter, naturalistic, prospective-longitudinal observational cohort study of adolescents and young adults (age 15–35) at risk for bipolar disorder. Of the 10 participating German university and teaching hospitals with early detection centers/facilities for bipolar disorder, seven centers (Berlin, Bochum, Frankfurt, Hamburg, Dresden, Marburg, Tübingen) acquired MRI data. For this study, we accessed the baseline clinical and MRI data. For a detailed description of data collection procedures, see Pfennig et al. (2020). Briefly, of the total N = 1229 recruited adolescents and young adults at risk, N = 313 opted to undergo MRI. In order to include all proposed risk factors for bipolar disorder, we recruited the participants via three recruitment pathways: N = 123 were consulting early detection centers/facilities and screened positive for ⩾1 proposed risk factor for bipolar disorder (see online Supplementary note 1), N = 146 were young in- and outpatients with a depressive syndrome, and N = 44 had an established diagnosis of ADHD. In order to include older individuals who might have an unrecognized bipolar disorder (e.g. due to the presence of exclusively depressive episodes, but no full-blown mania or hypomania yet), we extended the age inclusion criterion beyond the typical age of onset based on available studies on time to diagnosis. For more details on inclusion/exclusion criteria, see online Supplementary note 1. The study was approved by the Ethics Committee of the Medical Faculty of the Technische Universität Dresden (No: EK290082014), as well as by local ethics committees at each study site.
We obtained written informed consent from all participants after providing comprehensive information about study aims and procedures. Additionally, parents of adolescents gave their informed consent to their children's participation.
MRI acquisition, preprocessing and quality assessment
We acquired high-resolution structural T1-weighted images using Siemens Magnetom MR scanners at 6 sites (Trio, Skyra, Prisma) and a Philips Achieva scanner at 1 site. We standardized the pulse sequence parameters across all sites to the extent permitted by each platform. For a detailed description of the scanning protocol, including details of the MRI scanners, specific hardware configurations, and pulse sequence parameters, see Vogelbacher et al. (2021).
Prior to preprocessing, we performed the data acquisition and quality assessment according to the BipoLife study protocol (Vogelbacher et al., 2021). Briefly, we analyzed the MRI images using the MRIQC tool (Esteban et al., 2017). Two authors visually inspected the obtained reports of several metrics including a movement plot and a plot of the background noise. In this way, 23 subjects were excluded from further analysis due to strong movement (N = 18), ghosting (N = 1), or fold-over artifacts (N = 4).
We preprocessed the T1-weighted sMRI using FreeSurfer 6.0 software integrated in our processing pipeline NICePype (Müller, Küttner, & Hannig, 2015). We obtained regional cortical thickness and surface area values for 68 cortical brain areas (34 left/34 right) defined by the Desikan–Killiany atlas (Desikan et al., 2006) and 14 subcortical volumes (7 left/7 right) (Fischl et al., 2004).
We performed a standardized quality control of the cortical and subcortical segmentations and parcellations according to the established protocols of the ENIGMA working group (http://enigma.ini.usc.edu/protocols/imaging-protocols). This included a visual inspection of the segmented regions using the internal and external surface methods, as well as statistical outlier detection. The outliers were subjected to further visual inspection. Three subjects did not pass the quality control or displayed major segmentation errors and were discarded.
Risk assessment instruments
We assessed the risk for the development of bipolar disorder using three state-of-the-art assessment instruments: the Bipolar At-Risk (BAR) criteria (Bechdolf et al., 2014) together with the extended BAR criteria (BARS; Fusar-Poli et al., 2018), the Bipolar Prodrome Symptom Interview and Scale (BPSS-P; Correll et al., 2014), and the Early Phase Inventory for bipolar disorders (EPIbipolar; Leopold et al., 2012).
BAR(S) criteria comprise a set of subthreshold clinical and behavioral symptoms as well as genetic risk. A person is assessed as having high risk if one or more risk syndromes are fulfilled: sub-threshold mania, sub-threshold depression, sub-threshold depression with genetic risk, mixed symptoms, or mood swings. BARS criteria showed an adequate prognostic accuracy of conversion to bipolar disorder (conversion rate 18.5% in N = 27 participants) in a longitudinal cohort (Fusar-Poli et al., 2018). BPSS-P and EPIbipolar are semi-structured interviews. BPSS-P was developed based on the DSM-IV criteria for bipolar disorder and major depression and on established rating scales for these conditions. BPSS-P combines all these criteria into a mania symptom index, a depression symptom index, and a general symptom index. It defines two at-risk states: attenuated mania symptom syndrome (AMSS) and genetic mania risk and deterioration syndrome (GMRDS). BPSS-P has good internal consistency, convergent validity, and inter-rater reliability (Correll et al., 2014). EPIbipolar contains elements from BPSS-P and additionally captures risk factors that have been identified through a systematic literature review, such as subsyndromal manic or depressive symptoms, mood swings, changes in sleep and circadian rhythm, anxiety, ADHD, specific character traits, stressful life events, or changing patterns of substance use (Leopold et al., 2012). It defines three risk categories: no-risk, low-risk, and high-risk.
For the purpose of this analysis, we pooled subjects from the low-risk and high-risk groups assessed by EPIbipolar, as these participants, unlike those from the no-risk group, displayed several clinically relevant risk factors or symptoms and are intended for targeted interventions in early recognition services. The term ‘no-risk’ group in EPIbipolar was originally established to describe the lack of need for a specialized clinical intervention in participants with only minor risk factors (Leopold et al., 2012). In research settings, this label might be misleading, as participants in the no-risk group might also display minor risk factors and are not to be confused with healthy controls. Of note, all recruited participants, even those who did not fulfill the criteria of any risk syndrome/group on any of the three risk instruments, displayed at least one known risk factor for bipolar disorder (see online Supplementary note 1). The final binary outcomes were as follows: any symptom syndrome/no symptom syndrome for BPSS-P; any risk group/no risk group for BARS; high-risk + low-risk groups/no-risk group for EPIbipolar (see also Table 1 for demographics). As we discarded subjects with missing data on the corresponding assessment tools, the sample sizes for the three risk assessment tools varied (BARS: N = 264; BPSS-P: N = 276; EPIbipolar: N = 273). For details on the risk assessment tools, see online Supplementary Table S1 and Pfennig et al. (2020). All three instruments/criteria sets were obtained from the respective authors and can be administered after appropriate training. The administration of the complete risk assessment battery takes 2–3 h.
Notes to Table 1: *p ⩽ 0.05; **p ⩽ 0.01; ***p ⩽ 0.001. FDR, first-degree relatives of BD patients. Fisher–Freeman–Halton's exact test (marked ‘a’ in the table) was used for variables with ⩾1 expected cell counts <5.
Machine learning classification
In accordance with a previous study of subjects with diagnosed bipolar disorder by the ENIGMA consortium (Nunes et al., 2020) and to increase reproducibility, we used a linear support vector machine (SVM) classifier with the hyperparameter C = 1 for the primary analysis. We performed independent binary classifications for each risk instrument (BPSS-P, BARS, and EPIbipolar). Using the Scikit-learn 1.0 package for Python 3.8.3 (Pedregosa et al., 2011), we utilized two cross-validation methods: 10-fold and leave-one-site-out (i.e. data from one study center served as the test data, while the training dataset included the data from all other centers). In each fold, we standardized features in the training and testing sets separately by removing the mean and scaling to unit variance using the standard scaler (Scikit-learn 1.0 package, see above). We took the following measures to manage the imbalanced class distribution within the data: (A) we used a stratified cross-validation to ensure that the class ratio in all folds stayed approximately the same, and (B) we used random oversampling of the minority class (Chawla, Bowyer, Hall, & Kegelmeyer, 2002) in the training set, so that the class ratios in each fold were balanced. For the primary analysis, we used the 68 regional cortical thickness values as features and performed both cross-validation methods (10-fold and leave-one-site-out) for each of the three risk instruments, i.e. we trained six models altogether. As the class ratios for all three risk instruments were imbalanced, we used the following two performance measures, which are commonly used for imbalanced classification problems: Cohen's κ (i.e. a measure of agreement between the predicted and true labels, corrected for the agreement expected from a random classifier given the frequency of classes; <0 no agreement, 0–0.20 slight, 0.21–0.40 fair, 0.41–0.60 moderate, 0.61–0.80 substantial, and 0.81–1 almost perfect agreement) (Landis & Koch, 1977), and balanced accuracy [balanced accuracy = (sensitivity + specificity)/2]. Additionally, we report sensitivity and specificity. We do not report other common measures such as accuracy and area under the receiver operating characteristic curve, as these are not suitable for imbalanced data (He & Ma, 2013). As this was a population-based, observational study, the samples at each site were not balanced regarding participants at risk; some were even smaller than the recommended size of N > 20 for a test set (Flint et al., 2021; see also online Supplementary Table S2). For this reason, we report the performance in both the 10-fold and the leave-one-site-out cross-validations.
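The cross-validation scheme described above can be sketched as follows. This is a minimal illustration on synthetic data, not the study's actual code: the helper names (`oversample_minority`, `cross_validate_svm`), the use of `SVC(kernel="linear")` rather than another linear SVM implementation, the order of oversampling and scaling, and all simulated values are assumptions made for the example.

```python
import numpy as np
from sklearn.metrics import balanced_accuracy_score, cohen_kappa_score
from sklearn.model_selection import LeaveOneGroupOut, StratifiedKFold
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC


def oversample_minority(X, y, rng):
    """Randomly duplicate minority-class rows until both classes are equal in size."""
    classes, counts = np.unique(y, return_counts=True)
    minority = classes[np.argmin(counts)]
    deficit = counts.max() - counts.min()
    extra = rng.choice(np.flatnonzero(y == minority), size=deficit, replace=True)
    keep = np.concatenate([np.arange(len(y)), extra])
    return X[keep], y[keep]


def cross_validate_svm(X, y, splitter, groups=None, seed=0):
    """Return per-fold (Cohen's kappa, balanced accuracy) for a linear SVM with C = 1."""
    rng = np.random.default_rng(seed)
    scores = []
    for train, test in splitter.split(X, y, groups):
        # Oversampling is applied to the training data only (assumed order:
        # oversample first, then scale).
        X_tr, y_tr = oversample_minority(X[train], y[train], rng)
        # Training and test sets are standardized separately, as in the text.
        X_tr = StandardScaler().fit_transform(X_tr)
        X_te = StandardScaler().fit_transform(X[test])
        pred = SVC(kernel="linear", C=1.0).fit(X_tr, y_tr).predict(X_te)
        scores.append((cohen_kappa_score(y[test], pred),
                       balanced_accuracy_score(y[test], pred)))
    return np.array(scores)


# Synthetic stand-in for 68 cortical thickness features from 7 sites.
rng = np.random.default_rng(42)
n, p = 276, 68
y = (rng.random(n) < 0.2).astype(int)   # imbalanced outcome (hypothetical ratio)
X = rng.normal(size=(n, p))
X[y == 1, :5] += 0.8                    # hypothetical group effect
sites = rng.integers(0, 7, size=n)      # hypothetical site labels

kfold = cross_validate_svm(
    X, y, StratifiedKFold(n_splits=10, shuffle=True, random_state=0))
loso = cross_validate_svm(X, y, LeaveOneGroupOut(), groups=sites)
print("10-fold mean (kappa, balanced accuracy):", kfold.mean(axis=0))
print("leave-one-site-out mean (kappa, balanced accuracy):", loso.mean(axis=0))
```

In the leave-one-site-out variant each fold's test set is one complete site, so small or single-class sites can produce unstable per-fold scores, which is the limitation noted in the text.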
For risk assessment instruments achieving an above-chance prediction (i.e. Cohen's κ > 0 and lower bound of the confidence interval > 0), we assessed the possible effects of confounds using post-hoc statistical tests comparing the correctly and incorrectly classified subjects. This is a more valid approach to account for possible confounders than regressing out covariates prior to analysis, which would disrupt the train/test separation (Pereira, Mitchell, & Botvinick, 2009). We also report the post-hoc tests for the leave-one-site-out cross-validation by BPSS-P, where the lower bound of the confidence interval slightly crossed the zero boundary. Given the above-mentioned limitations of the leave-one-site-out cross-validation (low sample size of some sites, imbalanced classes), we consider both measures relevant. We accounted for the following confounds: age, sex, medication (yes/no), recruitment pathway (early recognition services/depression/ADHD), smoking status (never smoked/current smoker/past smoker), present cannabis use (no use/<1 per month/~1 per month/2–9 per month/⩾10 per month), lifetime cannabis use (no use/<1 per month/~1 per month/2–9 per month/⩾10 per month), site, and scanner type (for the list of sites and scanner types, see above).
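A post-hoc comparison of correctly and incorrectly classified subjects could look like the sketch below, shown for one continuous confound (age) and one categorical confound (sex). The function name, the simulated data, and the choice to disable the continuity correction are assumptions for illustration; the Fisher–Freeman–Halton exact test used for sparse tables is not covered here.

```python
import numpy as np
from scipy.stats import chi2_contingency, ttest_ind


def compare_correct_vs_incorrect(correct, age, sex):
    """Test whether age and sex differ between correctly and incorrectly classified subjects."""
    correct = np.asarray(correct, dtype=bool)
    age = np.asarray(age, dtype=float)
    sex = np.asarray(sex)

    # Two-sample t-test with pooled variance (df = n - 2).
    t_stat, p_age = ttest_ind(age[correct], age[~correct])

    # 2 x k contingency table: classification outcome by category.
    categories = np.unique(sex)
    table = np.array([[np.sum((correct == flag) & (sex == c)) for c in categories]
                      for flag in (True, False)])
    chi2, p_sex, dof, _ = chi2_contingency(table, correction=False)
    return {"age": (t_stat, p_age), "sex": (chi2, p_sex, dof)}


# Simulated example: 276 subjects, outcome unrelated to either confound.
rng = np.random.default_rng(0)
correct = rng.random(276) < 0.7
age = rng.uniform(15, 35, size=276)
sex = rng.choice(["female", "male"], size=276)
result = compare_correct_vs_incorrect(correct, age, sex)
print(result)
```

A non-significant result for a confound indicates that classification errors are not concentrated in one level of that variable, which is the reasoning behind this kind of check.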
We estimated the magnitude of the contribution of brain regions to the SVM classification using SVM coefficients. Coefficients of a linear classifier can be interpreted as a relative measure of feature importance (Pereira et al., 2009) for the classification process. Note that a highly weighted feature does not necessarily contain much information about the target class (Haufe et al., 2014). We used the freesurfer_statsurf_display library (https://chrisadamsonmcri.github.io/freesurfer_statsurf_display) to visualize the results.
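As an illustration of how coefficient-based importance can be summarized across folds, the following sketch averages the absolute values of the linear SVM coefficients over a stratified 10-fold split. The function name and the synthetic data are hypothetical, and oversampling is omitted for brevity.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC


def mean_abs_coefficients(X, y, n_splits=10, seed=0):
    """Mean over folds of the absolute linear-SVM coefficient per feature."""
    cv = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=seed)
    coefs = []
    for train, _ in cv.split(X, y):
        X_tr = StandardScaler().fit_transform(X[train])
        model = SVC(kernel="linear", C=1.0).fit(X_tr, y[train])
        # coef_ has shape (1, n_features) for a binary linear SVM.
        coefs.append(np.abs(model.coef_.ravel()))
    return np.mean(coefs, axis=0)


# Synthetic data in which only feature 0 separates the two classes.
rng = np.random.default_rng(0)
y = np.repeat([0, 1], 100)
X = rng.normal(size=(200, 10))
X[y == 1, 0] += 2.0
importance = mean_abs_coefficients(X, y)
print(np.round(importance, 3))
```

Because features are standardized before fitting, the coefficients are on a comparable scale, but the caveat of Haufe et al. still applies: a large weight reflects the feature's role in the decision function, not necessarily its information content.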
Secondary analyses
To investigate whether a lower feature/sample size ratio might improve classification performance, we selected 20 features based on the available literature from other relevant large-scale multicenter studies and included these in our pre-registration. We chose 20 features in order to approach the feature-to-participant ratio of the prior SVM study on bipolar disorder by Nunes et al. (2020), which reported having 20 times more participants than features. We selected those features from another large-scale ENIGMA study of bipolar disorder and healthy controls by Hibar et al. (2018), which identified a pattern of significant reductions of cortical thickness in frontal, temporal, and parietal regions in a sample of 6503 participants (patients with bipolar disorder and healthy controls). We selected the 20 features displaying the highest effect sizes in that study (see online Supplementary note 2 for the list of features).
In order to better compare the performance of the SVM in our sample of help-seeking individuals at risk with its performance in patients with established disease reported by Nunes et al. (2020), we also performed the classification using the same feature set of 150 features including 68 regional cortical thickness and 68 surface area values as well as volumes of 14 subcortical features plus the estimated total intracranial volume.
Lastly, we investigated whether hyperparameter optimization using a nested cross-validation would improve the results. In each fold, we divided the train set into train and test subsets once more and ran multiple nested SVM classifications with different SVM regularization parameters C (1 × 10⁻⁵, 1 × 10⁻⁴, 1 × 10⁻³, 1 × 10⁻², 1 × 10⁻¹, 1, 10, and 100) (grid search method). We selected the best possible model according to the achieved balanced accuracy. Finally, we tested the selected model on the unseen test data from the primary loop.
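One way to set up such a nested grid search is sketched below with Scikit-learn's `GridSearchCV`. The C grid is taken from the text; the function name, the inner k-fold scheme (the text describes a further split of the training set without specifying its form), the pipeline-based scaling, and the synthetic data are assumptions, and oversampling is left out to keep the example short.

```python
from collections import Counter

import numpy as np
from sklearn.metrics import balanced_accuracy_score
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

C_GRID = [1e-5, 1e-4, 1e-3, 1e-2, 1e-1, 1, 10, 100]


def nested_cv(X, y, outer_splits=10, inner_splits=5, seed=0):
    """Outer loop estimates performance; inner loop selects C by balanced accuracy."""
    outer = StratifiedKFold(n_splits=outer_splits, shuffle=True, random_state=seed)
    inner = StratifiedKFold(n_splits=inner_splits, shuffle=True, random_state=seed)
    pipeline = make_pipeline(StandardScaler(), SVC(kernel="linear"))
    scores, chosen = [], []
    for train, test in outer.split(X, y):
        search = GridSearchCV(pipeline, {"svc__C": C_GRID},
                              scoring="balanced_accuracy", cv=inner)
        search.fit(X[train], y[train])   # model selection uses training data only
        chosen.append(search.best_params_["svc__C"])
        # The refitted best model is evaluated on the untouched outer test fold.
        scores.append(balanced_accuracy_score(y[test], search.predict(X[test])))
    return np.array(scores), Counter(chosen)


# Small synthetic example (sizes chosen only to keep the run short).
rng = np.random.default_rng(0)
y = np.repeat([0, 1], [80, 40])
X = rng.normal(size=(120, 10))
X[y == 1, :2] += 1.0
scores, chosen = nested_cv(X, y, outer_splits=5, inner_splits=3)
print("balanced accuracy per outer fold:", np.round(scores, 3))
print("selected C per outer fold:", dict(chosen))
```

Counting how often each C is selected across outer folds gives the kind of summary reported for the most frequently chosen parameter.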
Results
Demographics
For detailed demographics, see Table 1. The participants who fulfilled any risk syndrome according to BPSS-P did not differ from those not fulfilling any risk syndrome in any of the measured variables. The participants who fulfilled any risk syndrome according to BARS were more likely to take medication (χ2 = 4.608, p = 0.032), to smoke (χ2 = 6.008, p = 0.05), and suffer from diagnosed depression, but less likely to suffer from ADHD (χ2 = 23.149, p ≤ 0.001) and were more likely to have attended high-school (χ2 = 13.789, p = 0.032) than those not fulfilling any risk syndrome. The participants who fulfilled any risk syndrome according to EPIbipolar were more likely to be female (χ2 = 3.894, p = 0.048), to take medication (χ2 = 6.909, p = 0.009), to smoke (χ2 = 6.036, p = 0.049), and suffer from diagnosed depression, but less likely to suffer from ADHD (χ2 = 23.149, p ≤ 0.001), than those not fulfilling any risk syndrome. The participants removed due to movement during the scan and quality control did not differ from those in the final dataset in the proportion of any of the risk syndromes: BPSS-P (df = 1, χ2 = 0.004, p = 0.949), BARS (df = 1, χ2 = 0.412, p = 0.521), and EPIbipolar (df = 2, χ2 = 1.092, p = 0.579).
Primary analysis
Performance measures of the classification for all three risk instruments (BPSS-P, BARS, and EPIbipolar) using all regional cortical thickness values as features are given in Table 2. Only for BPSS-P did both performance measures reach levels above chance in the 10-fold cross-validation, with the following performance: Cohen's κ 0.235 (95% CI 0.11–0.361), balanced accuracy 63.1% (95% CI 55.9–70.3), sensitivity 48% (95% CI 36–60), and specificity 78.2% (95% CI 72.5–83.9). The correctly and incorrectly classified subjects did not differ in age (df = 274, t = 0.987, p = 0.114), sex (df = 1, χ2 = 0.152, p = 0.698), medication (df = 1, χ2 = 0.068, p = 0.795), recruitment pathway (df = 2, χ2 = 0.673, p = 0.714), first-degree relatives (df = 1, χ2 = 0.334, p = 0.563), smoking status (df = 2, χ2 = 2.254, p = 0.324), lifetime cannabis use (Fisher–Freeman–Halton's exact test p = 0.28), site (Fisher–Freeman–Halton's exact test p = 0.119), or scanner type (Fisher–Freeman–Halton's exact test p = 0.225). There was a significant difference in present cannabis use (Fisher–Freeman–Halton's exact test p = 0.043); however, using residuals of cortical features after regressing out present cannabis use resulted in a comparable performance: Cohen's κ 0.240 (95% CI 0.102–0.379), balanced accuracy 62.6% (95% CI 55.3–69.8).
In the leave-one-site-out cross-validation, the classifier based on BPSS-P achieved Cohen's κ 0.128 (95% CI −0.069 to 0.325), balanced accuracy 56.2% (95% CI 44.6–67.8), sensitivity 33% (95% CI 12.4–53.7), and specificity 79.4% (95% CI 72.3–86.4). The correctly and incorrectly classified subjects did not differ in age (df = 274, t = 0.523, p = 0.601), sex (df = 1, χ2 < 0.001, p = 0.994), medication (df = 1, χ2 = 2.268, p = 0.132), recruitment pathway (df = 2, χ2 = 2.951, p = 0.229), first-degree relatives (df = 1, χ2 = 2.125, p = 0.145), smoking status (df = 2, χ2 = 3.595, p = 0.166), cannabis use present (Fisher–Freeman–Halton's exact test p = 0.281), cannabis use lifetime (Fisher–Freeman–Halton's exact test p = 0.518), site (Fisher–Freeman–Halton's exact test p = 0.905), and scanner type (Fisher–Freeman–Halton's exact test p = 0.694).
See Table 2 for the summary of performance measures.
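As a quick arithmetic check, the reported balanced accuracies follow from the reported sensitivity and specificity via balanced accuracy = (sensitivity + specificity)/2. The short sketch below recomputes them for the BPSS-P primary analysis; the function and dictionary names are illustrative.

```python
def balanced_accuracy(sensitivity, specificity):
    """Balanced accuracy as the mean of sensitivity and specificity (in %)."""
    return (sensitivity + specificity) / 2


# (sensitivity, specificity, reported balanced accuracy) for BPSS-P, in %.
reported = {
    "10-fold": (48.0, 78.2, 63.1),
    "leave-one-site-out": (33.0, 79.4, 56.2),
}

for name, (sens, spec, bacc) in reported.items():
    computed = round(balanced_accuracy(sens, spec), 1)
    print(f"{name}: computed {computed}%, reported {bacc}%")
```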
Secondary analyses
Neither the literature-derived selection of 20 regional cortical thickness features nor the extended feature set including whole-brain regional surface area and volumes of subcortical regions yielded significantly higher accuracies, as the confidence intervals overlapped with those from the primary analysis (see Table 3 for the summary of classification metrics). For BPSS-P, the difference in performance measures between the 10-fold and the leave-one-site-out cross-validation was smaller when using the 20 regional cortical features than when using all regional cortical values, which suggests a non-significant trend toward better model validity with the 20 cortical features.
Hyperparameter optimization
Using hyperparameter optimization, we achieved Cohen's κ 0.212 (95% CI 0.123–0.302), balanced accuracy 62.3% (95% CI 56.7–68.0), sensitivity 48% (95% CI 37.7–59.0), and specificity 76.4% (95% CI 72.4–80.3) in the 10-fold cross-validation and Cohen's κ 0.136 (95% CI −0.075 to 0.346), balanced accuracy 57.1% (95% CI 43.6–70.6), sensitivity 33.7% (95% CI 10.2–57.3), and specificity 80.4% (95% CI 72.4–88.4) in the leave-one-site-out cross-validation. The most frequently selected C parameter was 100 (in 7 out of 10 folds for 10-fold and in 4 out of 7 folds for leave-one-site-out cross-validation).
SVM coefficients
The mean over folds of the absolute values of the SVM coefficients by feature (brain region) for BPSS-P, 10-fold cross-validation, and whole-brain regional cortical thickness features is depicted in Fig. 1. For the values of all coefficients, see online Supplementary Table S3.
Discussion
The linear SVM classifier detected individuals with increased estimated risk for bipolar disorder as defined by the BPSS-P interview with a Cohen's κ of 0.235/0.128 and balanced accuracy of 63.1/56.2% (based on the pooled-sample 10-fold and leave-one-site-out cross-validations, respectively). Precuneus, inferior frontal gyrus, and posterior cingulate cortex ranked among the highest contributing features according to SVM coefficients. SVM could not detect participants with increased risk for bipolar disorder based on EPIbipolar or BARS criteria. Whole-brain cortical thickness yielded the highest accuracy, whereas reducing the features based on literature, or expanding the features by surface area or subcortical volumes, did not change the performance significantly given the wide confidence intervals. However, there might be a trend toward better model validity using fewer cortical thickness features.
Our results suggest that young participants at risk of bipolar disorder according to the BPSS-P display distinct structural brain features that permit better-than-chance classification. Importantly, using both the pooled sample (10-fold cross-validation) and the leave-one-site-out cross-validation, we achieved accuracies comparable to those of the previous multicenter study by Nunes et al. (2020), which differentiated patients with manifest bipolar disorder from healthy controls with balanced accuracies of 65.23% (95% CI 63.47–67.00) and 58.67% (95% CI 56.70–60.63), respectively. Compared to their study, the 95% confidence intervals in our study were considerably wider, which was to be expected given that our sample was more than 10 times smaller (276 v. 3020 participants). Larger sample sizes tend to yield more stable performance (Nieuwenhuis et al., 2012). Post-hoc tests suggested an effect of present cannabis use on the classification using 10-fold cross-validation; however, regressing out present cannabis use did not impair the performance. Moreover, there was no such effect in the leave-one-site-out validation. Other demographic variables did not show an effect on the classification (see also online Supplementary note 3). As such, our findings would be consistent with the notion that differences in brain structure in bipolar disorder are not a result of the disorder but are a pre-morbid risk factor potentially related to genetics. On the other hand, as the age of participants in our sample was higher than the typical age of onset of bipolar disorder, we might have included older participants with a yet undiagnosed bipolar disorder.
Those participants would have possibly displayed more structural differences than participants before the age of onset, which might in turn have led to higher classification accuracies.
Unlike previous attempts to detect participants with genetic risk within family cohorts (Hajek et al., 2015), we estimated the individual risk state using state-of-the-art screening instruments, which better address the clinical realities of early recognition centers. Given the variable estimated transition rates of 4.2–22.4% in persons with known genetic risk (Hafeman et al., 2017; Kerner, 2014; Post et al., 2018), there is a need for more differentiated risk assessment including state markers in order to provide targeted interventions. Moreover, most people seeking early recognition services do not have genetic risk (12.9% or 15 out of 116 recruited via the early recognition pathway). BPSS-P provides a conservative risk assessment, selecting persons displaying an AMSS or a GMRDS. In total, 20.3% of the participants screened positive on one of these syndromes.
Surprisingly, the SVM could not detect participants at risk as estimated using EPIbipolar, although we detected significant differences in cortical thickness between the high-risk and no-risk individuals in our previous study (Mikolas et al., Reference Mikolas, Bröckel, Vogelbacher, Müller, Marxen, Berndt and Pfennig2021). The sample sizes were similar (previous study N = 263); in the present analysis, however, we pooled the individuals in the high-risk and low-risk groups in order to allow for a binary classification. As a result, the no-risk group had only 32 participants, which might have had a negative influence on the learning phase. In a post hoc analysis (see online Supplementary note 4), a three-category classification using all three risk groups did not yield an above-chance classification. However, after removing the low-risk group, we obtained a balanced accuracy of 60.9%/55.5% (10-fold/leave-one-site-out). This suggests that whereas in a hypothesis-driven region-of-interest analysis, EPIbipolar selected participants displaying significantly thinner cortex in the left pars opercularis (Mikolas et al., Reference Mikolas, Bröckel, Vogelbacher, Müller, Marxen, Berndt and Pfennig2021), BPSS-P selected participants displaying widespread structural alterations enabling more accurate, single-subject classification using machine learning. Interestingly, in our above-mentioned previous study (Mikolas et al., Reference Mikolas, Bröckel, Vogelbacher, Müller, Marxen, Berndt and Pfennig2021), the pars opercularis was not significantly thinner in participants scoring positive in BPSS-P; however, the low p value might have suggested a non-significant trend. Additionally, among the participants scoring positive on any risk criterion in both EPIbipolar and BARS, those with depression were more strongly represented than in BPSS-P. As a result, more participants with unipolar depression might have been selected by EPIbipolar and BARS, which might have impeded the classification. Indeed, the cortical thickness differences in major depression seem to be less prominent than in bipolar disorder (Ching et al., Reference Ching, Hibar, Gurholt, Nunes, Thomopoulos and Abé2020; Schmaal et al., Reference Schmaal, Hibar, Sämann, Hall, Baune, Jahanshad and Veltman2017). Finally, unlike in BPSS-P, the participants who fulfilled any risk syndrome according to BARS or EPIbipolar differed from those not fulfilling any risk syndrome in several other variables that might have confounding effects on cortical thickness, such as medication or smoking.
The regions with the highest contribution toward the classification (i.e. with the highest values of SVM coefficients) partially overlapped with those contributing to the classification of patients with manifest bipolar disorder and healthy controls in the study by Nunes et al. (Reference Nunes, Nunes, Schnack, Ching, Agartz, Akudjedu and Hajek2020). Of 33 cortical thickness weights reported by Nunes et al., 69.7% had the same sign as in our study. Notably, the inferior frontal gyrus is a region structurally and functionally associated with the genetic risk for bipolar disorder (Hajek et al., Reference Hajek, Cullis, Novak, Kopecek, Blagdon, Propper and Alda2013; Roberts et al., Reference Roberts, Green, Breakspear, McCormack, Frankland, Wright, Levy, Lenroot, Chan and Mitchell2013, Reference Roberts, Lord, Frankland, Wright, Lau, Levy and Breakspear2017). This suggests a consistent structural pattern of individuals at risk estimated by BPSS-P and patients with manifest disease or genetic risk. However, a direct comparison of feature weights between Nunes et al. and our study should be viewed with caution because of the complex covariance structure within the feature set (Haufe et al., Reference Haufe, Meinecke, Görgen, Dähne, Haynes, Blankertz and Bießmann2014), the difference in the number and type of features, and the limited number of training samples in our study. While multivariate machine learning techniques have the potential to optimize prediction accuracies, univariate between-group comparisons are more straightforward to interpret in terms of relative feature importance, as we have done in our previous work (Mikolas et al., Reference Mikolas, Bröckel, Vogelbacher, Müller, Marxen, Berndt and Pfennig2021).
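The sign comparison of feature weights described above can be illustrated as follows. This is a hedged sketch only: the region names and weight values are hypothetical toy data, and the helper `sign_agreement` is an assumed name rather than a function from either study. The last line merely checks the arithmetic that 23 of 33 matching weights round to the 69.7% quoted above.

```python
def sign(x):
    """Return -1, 0, or 1 according to the sign of x."""
    return (x > 0) - (x < 0)


def sign_agreement(weights_a, weights_b):
    """Share of features present in both models whose weights have the same sign."""
    shared = sorted(set(weights_a) & set(weights_b))
    if not shared:
        raise ValueError("the two models share no features")
    n_same = sum(
        1 for name in shared if sign(weights_a[name]) == sign(weights_b[name])
    )
    return n_same / len(shared), n_same, len(shared)


# Hypothetical toy weights for four regions (NOT values from either study).
ours = {"region_1": 0.42, "region_2": -0.10, "region_3": 0.05, "region_4": -0.31}
theirs = {"region_1": 0.20, "region_2": -0.35, "region_3": -0.02, "region_4": -0.08}

fraction, n_same, n_shared = sign_agreement(ours, theirs)

# Arithmetic check: 23 of 33 weights with matching sign correspond to 69.7%.
reported_share = round(100 * 23 / 33, 1)
```

A measure of this kind only counts matching signs; it says nothing about effect sizes, and, in line with the caveat raised by Haufe et al., matching weight signs in a multivariate model do not by themselves imply matching group differences in the underlying regions.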
The achieved accuracy is not sufficient to suggest sMRI as a single risk assessment method. Even with the best-performing model, 21.8% of the subjects who did not fulfill any risk criterion were classified as positive (type I error), and 51.8% of the subjects at risk were classified as negative (type II error). Neither feature selection approaches nor hyperparameter optimization achieved a more accurate classification. Earlier machine learning neuroimaging studies reported accuracies well beyond the 80% boundary that roughly demarcates clinical utility (Nunes et al., Reference Nunes, Nunes, Schnack, Ching, Agartz, Akudjedu and Hajek2020; Radua & Carvalho, Reference Radua and Carvalho2021). However, many earlier studies did not comply with recently established criteria (Dwyer et al., Reference Dwyer, Falkai and Koutsouleris2018), for example, by using samples that were too small [i.e. N < 130 (Nieuwenhuis et al., Reference Nieuwenhuis, van Haren, Hulshoff Pol, Cahn, Kahn and Schnack2012)] or omitting validation samples (Radua & Carvalho, Reference Radua and Carvalho2021). Studies that used validation samples generally reported lower accuracies.
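For orientation, the two error rates given above can be converted into sensitivity, specificity, and balanced accuracy. The sketch below assumes that both percentages are per-class rates (false positives among subjects without risk, false negatives among subjects at risk), as the wording suggests; the function name is an assumption, and the resulting value is derived arithmetic rather than a figure quoted in this section.

```python
def balanced_accuracy_from_error_rates(false_positive_rate, false_negative_rate):
    """Return (balanced accuracy, sensitivity, specificity) from per-class error rates."""
    specificity = 1.0 - false_positive_rate  # correctly classified subjects without risk
    sensitivity = 1.0 - false_negative_rate  # correctly classified subjects at risk
    return (sensitivity + specificity) / 2.0, sensitivity, specificity


# Error rates of the best-performing model as stated in the text.
bacc, sens, spec = balanced_accuracy_from_error_rates(0.218, 0.518)
```

Under this assumption, the sensitivity is about 48.2%, the specificity about 78.2%, and the balanced accuracy about 63.2%, which remains clearly below the roughly 80% boundary mentioned above.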
Differentiating between healthy, non-help-seeking persons and help-seeking persons with higher risk for bipolar disorder might lead to higher accuracies. For a potential clinical application, however, this might be misleading, as clinicians are required to make predictions for persons who already display symptoms and therefore seek help. Thus, our population-based sample of help-seeking individuals reflects the real clinical setting better than a healthy control group would. The very fact that we chose a conservative approach by comparing only help-seeking individuals and yet were able to obtain a clear above-chance prediction of the score in an established risk instrument with mere structural neuroimaging data demonstrates the potential of sMRI in risk stratification. A major advantage over functional MRI is that structural T1 images are part of any standard clinical exam and would not incur additional costs for specialized scanning protocols.
Overall, our results suggest that in order to achieve clinically meaningful predictions, future approaches using brain imaging should aim at integrating multimodal data, for example clinical data such as body mass index (McWhinney et al., Reference McWhinney, Abé, Alda, Benedetti, Bøen and Mar Bonnin2021) or genetics, rather than focusing on brain structure only. An ‘augmentation’ of the clinical judgments of trained professionals by a machine learning-based algorithm might be a realistic scenario. Koutsouleris et al. (Reference Koutsouleris, Dwyer, Degenhardt, Maj, Urquijo-Castro and Sanfelici2021) showed in individuals with psychosis risk that, in a multimodal application, sMRI might contribute to the overall prediction by several percent. As our study suggests, sMRI, especially cortical thickness, might contribute to the diagnostic performance of such algorithms aimed at estimating the risk for bipolar disorder.
An important limitation that needs to be addressed in future studies is the use of the estimated risk as the outcome. The concept of high risk for bipolar disorder is still in development (Keramatian, Chakrabarty, Saraf, & Yatham, Reference Keramatian, Chakrabarty, Saraf and Yatham2021). Participants scoring positive on those risk criteria might benefit from a more intensive diagnostic and prevention process. However, in order to further individualize the risk prediction, larger, longitudinal studies with a sufficient number of participants who develop a first manic episode should be performed in the future.
The lower occurrence of ADHD in the high-risk group was due to the distribution of risk factors within the three different recruitment pathways. Although ADHD as a risk factor enabled the participants to enter the study through all three recruitment pathways, it was ‘enriched’ in the overall sample due to in- and outpatients entering the study through the ‘ADHD’ recruitment pathway. However, these participants displayed fewer additional risk factors, so that most did not fulfill the criteria for the higher risk groups.
An interesting research objective for future studies would be to include participants with borderline personality disorder, as this disorder might be hard to distinguish clinically from bipolar disorder in its initial or at-risk state. In particular, the question of whether people who transition to different disorders also differ in brain structure would be highly relevant.
In summary, we show that machine learning techniques can detect brain structural alterations in young individuals at risk for bipolar disorder with a performance comparable to previous studies of patients with manifest disease and healthy controls. Whole-brain cortical thickness might be superior to other structural brain features in predicting the risk of developing bipolar disorder. Future studies should aim to improve the performance of predictive models for individuals at risk by using larger cohorts and multimodal data. More sophisticated machine learning or feature extraction methods may also contribute to clinically meaningful predictions. Our own work may contribute to this effort in the future (Böhle, Eitel, Weygandt, & Ritter, Reference Böhle, Eitel, Weygandt and Ritter2019).
Supplementary material
The supplementary material for this article can be found at https://doi.org/10.1017/S0033291723001319
Author contributions
A. P., M. B., P. Ritter, and A. J. designed the study. K. B., C. B., J. M., A. J. F., T. E., A. Rau, T. K., I. F., M. L., G. L., C. M., V. K., K. L., A. B., A. Reif, S. M., T. S., F. B., J. F., G. J., V. F., and C. U. C. participated in the patient recruitment. P. M., M. M., P. Riedel, and C. V. performed the MRI data analyses and statistics. P. M., M. M., P. Riedel, F. H., and A. P. wrote the article. M. M., K. B., C. B., J. M., A. J. F., T. E., A. Rau, T. K., I. F., M. L., G. L., C. M., V. K., K. L., A. B., A. Reif, S. M., T. S., F. B., J. F., G. J., V. F., C. U. C., M. B., and P. Ritter revised it critically for important intellectual content. All of the authors reviewed and approved the manuscript for publication.
Financial support
Early-BipoLife is funded by the Federal Ministry of Education and Research (BMBF, grant numbers: 01EE1404A, 01EE1404E, and 01EE1404F). M. M. was supported by the Deutsche Forschungsgemeinschaft (DFG grant Nos. 178833530 [SFB 940] and 402170461 [TRR 265]).
Conflict of interest
K. Leopold has been a consultant and/or advisor to or has received honoraria from: Janssen/J&J, Lundbeck, Otsuka, Recordati, and ROVI. She has received grant support from Janssen/J&J and Otsuka. All other authors declared no conflict of interest.
Ethical standards
The authors assert that all procedures contributing to this work comply with the ethical standards of the relevant national and institutional committees on human experimentation and with the Helsinki Declaration of 1975, as revised in 2008.