Psychosis
Psychosis is a mental illness characterised by hallucinations, delusions and thought disorder. The median lifetime prevalence of psychosis is around 8 per 1000 of the global population.Reference Moreno-Küstner, Martín and Pastor1 Psychotic disorders, including schizophrenia, are in the top 20 leading causes of disability worldwide.2 People with psychosis have heterogeneous outcomes. More than 40% fail to achieve symptomatic remission.Reference Lally, Ajnakina, Stubbs, Cullinane, Murphy and Gaughran3 At present, clinicians struggle to predict long-term outcome in individuals with first-episode psychosis (FEP).
Prediction modelling
Prediction modelling has the potential to revolutionise medicine by predicting individual patient outcome.Reference Darcy, Louie and Roberts4 Early identification of those with good and poor outcomes would allow for a more personalised approach to care, matching interventions and resources to those most in need. This is the basis of precision medicine. Risk prediction models have been successfully employed clinically in many areas of medicine; for example, the QRISK tool predicts cardiovascular risk in individual patients.Reference Hippisley-Cox, Coupland and Brindle5 However, within psychiatry, precision medicine is not yet established within clinical practice. In FEP, precision medicine could enable rapid stratification and targeted intervention, thereby decreasing patient suffering and limiting treatment-associated risks such as medication side-effects and intrusive monitoring.
Salazar de Pablo et al recently undertook a broad systematic review of individualised prediction models in psychiatry.Reference Salazar de Pablo, Studerus, Vaquerizo-Serrano, Irving, Catalan and Oliver6 They found clear evidence that precision psychiatry has developed into an important area of research, with the greatest number of prediction models focusing on outcomes in psychosis. However, the field is hindered by methodological flaws such as lack of validation. Further, there is a translation gap, with only one study considering implementation into clinical practice. Systematic guidance for the development, validation and presentation of prediction models is available.Reference Steyerberg and Vergouwe7 In addition, the Transparent Reporting of a Multivariable Prediction Model for Individual Prognosis or Diagnosis (TRIPOD) statement sets standards for reporting.Reference Collins, Reitsma, Altman and Moons8 Models that do not adhere to these guidelines result in unreliable predictions, which may cause more harm than good in guiding clinical decisions.Reference Wynants, Van Calster, Collins, Riley, Heinze and Schuit9 Salazar de Pablo et al’s review was impressive in scope, but necessarily limited in detailed analysis of the specific models included.Reference Salazar de Pablo, Studerus, Vaquerizo-Serrano, Irving, Catalan and Oliver6 Systematic reviews focusing on predicting the transition to psychosisReference Studerus, Ramyead and Riecher-Rössler10,Reference Rosen, Betz, Schultze-Lutter, Chisholm, Haidl and Kambeitz-Ilankovic11 and relapse in psychosis have also been published.Reference Sullivan, Northstone, Gadd, Walker, Margelyte and Richards12 In the present review, we focus on FEP, with the aim of systematically reviewing and critically appraising models for the prediction of poor outcomes.
Method
We designed this systematic review in accordance with the Checklist for Critical Appraisal and Data Extraction for Systematic Reviews of Prediction Modelling Studies (CHARMS).Reference Moons, de Groot, Bouwmeester, Vergouwe, Mallett and Altman13 A protocol for this study was published with the International Prospective Register of Systematic Reviews (PROSPERO), under registration number CRD42019156897.
We developed the eligibility criteria under the Population, Index, Comparator, Outcome, Timing and Setting (PICOTS) guidance (see Supplementary Material available at https://doi.org/10.1192/bjp.2021.219). A study was eligible for inclusion if it utilised a prospective design including patients diagnosed with FEP, and developed, updated or validated prognostic prediction models for any possible outcome, in any setting. We excluded non-English language studies, those where the full text was not available, those involving diagnostic prediction models and those where the outcome predicted was ≤3 months from baseline, as we were interested in longer-term prediction.
We searched PubMed, PsycINFO, EMBASE, CINAHL Plus, Web of Science Core Collection and Google Scholar, from inception up to 28 January 2021. In addition, we manually checked references cited in the systematically searched articles. The search terms were based around three themes: ‘Prediction’, ‘Outcome’ and ‘First Episode Psychosis’. The full search strategy is available in the Supplementary Material. Two reviewers (R.L. and L.T.) independently screened the titles and abstracts. Full-text screening was completed by three independent reviewers (R.L., P.K.M. and S.P.L.). Disagreements were resolved by consensus.
Data extraction was conducted independently by two reviewers (R.L. and S.P.L.), following recommendations in the CHARMS checklist.Reference Moons, de Groot, Bouwmeester, Vergouwe, Mallett and Altman13 From all eligible studies, we collected information on study characteristics, methodology and performance. Study characteristics collected included first author name, year, region, whether the study was multicentre, study type, setting, participant description, outcome, outcome timing, predictor categories and number of models presented. Methodology considered sample size, events per variable (EPV), number of events in validation data-set, number of candidate and retained predictors, methods of variable selection, presence and handling of missing data, modelling strategies, shrinkage, validation strategies (see below), whether models were recalibrated, if clinical utility was assessed and whether the full models were presented. Steyerberg and Harrell outline a hierarchy of validation strategies from apparent (which assesses model performance on the data used to develop it and will be severely optimistic) to internal (via cross-validation or bootstrapping), internal–external (e.g. validation across centres in the same study) and external validation (to assess if models generalise to related populations in different settings).Reference Steyerberg and Harrell14 Apparent, internal and internal–external validation use the derivation data-set only, whereas external validation requires the addition of a validation data-set. Performance for the best-performing model per outcome in each article was considered by model validation strategy, including model discrimination (reported as the C-statistic, which is equal to the area under the receiver operating characteristic curve for binary outcomes), calibration, other global performance measures and classification metrics. If not reported, where possible, the balanced accuracy ((sensitivity + specificity) / 2) and the prognostic summary index (positive predictive value + negative predictive value − 1) were calculated.
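To make these derived metrics concrete, the following is a minimal illustrative sketch (in Python, with hypothetical confusion-matrix counts rather than data from any included study) of how balanced accuracy and the prognostic summary index can be recovered from reported classification counts.

```python
# Illustrative sketch only: hypothetical counts, not data from any included study.
def balanced_accuracy(tp, fp, fn, tn):
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    return (sensitivity + specificity) / 2

def prognostic_summary_index(tp, fp, fn, tn):
    ppv = tp / (tp + fp)  # positive predictive value
    npv = tn / (tn + fn)  # negative predictive value
    return ppv + npv - 1

# Hypothetical example: 40 true positives, 20 false positives,
# 10 false negatives and 80 true negatives.
print(balanced_accuracy(40, 20, 10, 80))         # 0.80
print(prognostic_summary_index(40, 20, 10, 80))  # approximately 0.56
```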
Two reviewers (R.L. and S.P.L.) independently assessed the risk of bias in included studies by using the Prediction Model Risk Of Bias Assessment Tool (PROBAST), a risk-of-bias assessment tool designed for systematic reviews of diagnostic or prognostic prediction models.Reference Wolff, Moons, Riley, Whiting, Westwood and Collins15,Reference Moons, Wolff, Riley, Whiting, Westwood and Collins16 We considered all models reported in each article and assigned an overall rating to the article. PROBAST uses a structured approach with signalling questions across four domains: ‘participants’, ‘predictors’, ‘outcome’ and ‘statistical analysis’. Signalling questions are answered ‘yes’, ‘probably yes’, ‘no’, ‘probably no’ or ‘no information’. Answering ‘yes’ indicates a low risk of bias, whereas answering ‘no’ indicates high risk of bias. A domain where all signalling questions are answered as ‘yes’ or ‘probably yes’ indicates low risk of bias. Answering ‘no’ or ‘probably no’ flags the potential for the presence of bias, and reviewers should use their personal judgement to determine whether issues identified have introduced bias. Applicability of included studies to the review question is also considered in PROBAST.
We reported our results according to the Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA) 2020 Statement (see Supplementary Material).Reference Page, Moher, Bossuyt, Boutron, Hoffmann and Mulrow17
Results
Systematic review of the literature yielded 2353 records from database searches and 67 from additional sources. After removal of duplicates, 1543 records were screened. Of these, 82 full texts were reviewed, which resulted in 13 studies meeting criteria for inclusion in our qualitative synthesis (Fig. 1).Reference Ajnakina, Agbedjro, Lally, Forti, Trotta and Mondelli18–Reference Puntis, Whiting, Pappa and Lennox30
Study characteristics are summarised in Table 1. The 13 included studies, comprising a total of 19 different patient cohorts, reported 31 different prediction models. Dates of publication ranged from 2006 to 2021. Twelve studies (92%) recruited participants from Europe, with two studies (15%) also recruiting participants from Israel and one study (8%) from Singapore. Over two-thirds (n = 9) of studies were multicentre. Ten studies (77%) included participants from cohort studies, three studies (23%) included participants from randomised controlled trials and two studies (15%) included participants from case registries. Two studies (15%) included only out-patients, four (31%) included in-patients and out-patients, and the rest did not specify their setting. Cohort sample size ranged from 47 to 1663 patients. The average age of patients ranged from 21 to 28 years, and 49–77% of the cohorts were male. Where specified, the average duration of untreated psychosis ranged from 34 to 106 weeks. Ethnicity was reported in eight studies (62%), with the percentage of Black and minority ethnic patients in the cohorts ranging from 4 to >75%. The definition of FEP was primarily non-affective psychosis in the majority of patient cohorts, with the minority also including affective psychosis, and two cohorts also including drug-induced psychosis. All but one study (92%) considered solely sociodemographic and clinical predictors. A wide range of outcomes were assessed across the 13 included studies, including symptom remission in five studies (38%), global functioning in five studies (38%), vocational functioning in three studies (23%), treatment resistance in two studies (15%), hospital readmission in two studies (15%) and quality of life in one study (8%). All of the outcomes were binary. The follow-up period of included studies ranged from 1 to 10 years.
DUP, duration of untreated psychosis; FEP, first-episode psychosis; EET, employment, education or training; GAF, Global Assessment of Functioning; DAS, Disability Assessment Schedule.
Study prediction-modelling methodologies are outlined in Table 2. Nine (69%) studies pertained solely to model development, with the highest level of validation reported being apparent validity in four of the studies, internal validity in three of the studies and internal–external validity (via leave-one-site-out cross-validation) in two of the studies. The remaining four (31%) studies also included a validation cohort and reported external validity. High dimensionality was common across the study cohorts, with the majority having a very low EPV ratio and up to 258 candidate predictors considered. Some form of variable selection was used in the majority (62%) of studies. The number of events in the external validation cohort ranged from 23 to 173. All of the studies had missing data. Six studies (46%) used complete-case analysis, five (38%) studies used single imputation and the remaining two (15%) studies applied multiple imputation.
EPV, events per variable; LASSO, least absolute shrinkage and selection operator; MLE, maximum likelihood estimation.
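As a sketch of the contrast between the missing-data strategies described above, the example below (synthetic data and hypothetical variable names, not drawn from any included study) shows complete-case analysis alongside a multiple-imputation-style approach using scikit-learn; in practice, models would be fitted to each imputed data-set and the results pooled.

```python
# Illustrative sketch with synthetic data and hypothetical variable names.
import numpy as np
import pandas as pd
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

rng = np.random.default_rng(0)
df = pd.DataFrame({
    "dup_weeks": rng.normal(60, 20, 200),    # duration of untreated psychosis
    "panss_total": rng.normal(75, 15, 200),  # baseline symptom severity
    "age": rng.normal(25, 4, 200),
})
df.loc[rng.choice(200, 40, replace=False), "dup_weeks"] = np.nan  # inject missingness

# Complete-case analysis: simply drop any row with a missing predictor.
complete_cases = df.dropna()

# Multiple imputation: draw several imputed data-sets (sample_posterior=True retains
# between-imputation variability, which single imputation ignores).
imputed_sets = [
    pd.DataFrame(
        IterativeImputer(sample_posterior=True, random_state=m).fit_transform(df),
        columns=df.columns,
    )
    for m in range(5)
]
print(len(complete_cases), [d.shape for d in imputed_sets])
```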
The most common modelling methodology was logistic regression fitted by maximum likelihood estimation, followed by logistic regression with regularisation. Only two studies used machine learning methods, both via support vector machines. Just over half of the studies (54%) did not use any variable shrinkage, and only three (23%) studies recalibrated their models based on validation to improve performance. The full model was presented in seven (54%) studies. Only two (15%) studies assessed clinical utility.
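A minimal sketch of these two modelling strategies is given below (synthetic data; the settings are illustrative and not those used by any included study): an effectively unpenalised maximum-likelihood fit alongside an L1-regularised (LASSO) fit, which shrinks coefficients and sets uninformative predictors to zero.

```python
# Illustrative sketch with synthetic data.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=300, n_features=40, n_informative=5, random_state=1)

# Approximate maximum-likelihood fit: C is set very large so the default penalty
# is effectively switched off; such fits are prone to optimism when EPV is low.
mle = LogisticRegression(C=1e6, max_iter=5000).fit(X, y)

# LASSO fit: the L1 penalty shrinks coefficients and performs variable selection.
lasso = LogisticRegression(penalty="l1", solver="liblinear", C=0.1).fit(X, y)

print("Non-zero coefficients, maximum likelihood:", int(np.sum(mle.coef_ != 0)))
print("Non-zero coefficients, LASSO:             ", int(np.sum(lasso.coef_ != 0)))
```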
Table 3 reports the performance of the best model per study outcome, grouped by method of validation to allow for appropriate comparisons. For the five studies (38%) reporting only apparent validity, two reported a measure of discrimination and only one considered calibration. For the seven (54%) studies reporting internal validation performance, four reported discrimination with a C-statistic ranging from 0.66 to 0.77, and four reported calibration. For the three (23%) studies reporting internal–external validation, only one study considered discrimination with a C-statistic, which ranged from 0.703 to 0.736 across its four models. None of the studies reporting internal–external validation considered any measure of calibration. All four (31%) studies reporting external validation considered model discrimination, with C-statistics ranging from 0.556 to 0.876. However, only two of these studies considered calibration. Table 3 also records any global performance metrics, including the Brier score and McFadden's pseudo-R², both of which incorporate aspects of discrimination and calibration. Various classification metrics were reported across the study models, but it is difficult to make any meaningful comparisons between these alone, without considering the models’ corresponding discrimination and calibration metrics, which were not universally reported.
PPV, positive predictive value; NPV, negative predictive value; PSI, prognostic summary index; EET, employment, education or training; GAF, Global Assessment of Functioning; DAS, Disability Assessment Schedule.
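For illustration, the short sketch below (hypothetical predictions, not values from Table 3) computes the performance measures referred to above: the C-statistic, the Brier score and McFadden's pseudo-R².

```python
# Illustrative sketch with hypothetical predicted probabilities and outcomes.
import numpy as np
from sklearn.metrics import roc_auc_score, brier_score_loss, log_loss

y_true = np.array([0, 0, 1, 0, 1, 1, 0, 1, 0, 1])                      # observed outcome
y_prob = np.array([0.1, 0.3, 0.7, 0.2, 0.6, 0.9, 0.4, 0.5, 0.2, 0.8])  # predicted risks

c_statistic = roc_auc_score(y_true, y_prob)   # equals the area under the ROC curve
brier = brier_score_loss(y_true, y_prob)      # mean squared error of the probabilities

# McFadden's pseudo-R^2: 1 minus the ratio of the model log-likelihood to that of
# a null model that predicts the overall event rate for everyone.
p_null = np.full_like(y_prob, y_true.mean())
mcfadden_r2 = 1 - log_loss(y_true, y_prob) / log_loss(y_true, p_null)

print(f"C-statistic {c_statistic:.3f}, Brier score {brier:.3f}, pseudo-R2 {mcfadden_r2:.3f}")
```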
We applied the PROBAST tool to the 31 different prediction models across the 13 studies in our systematic review, and determined an overall risk-of-bias rating for each study, as summarised in Supplementary Table 1. The majority (85%) of studies had an overall ‘high’ risk of bias. In each of these studies, the risk of bias was rated ‘high’ in the analysis domain, with one study also having a ‘high’ risk of bias in the predictors domain. The main reasons for the ‘high’ risk of bias in the analysis domain were insufficient participant numbers and consequently low EPV, inappropriate methods of variable selection including via univariable analysis, a lack of appropriate validation with only apparent validation, an absence of reported measures of discrimination and calibration, and inappropriate handling of missing data by either complete-case analysis or single imputation. Two studies, Leighton et alReference Leighton, Krishnadas, Upthegrove, Marwaha, Steyerberg and Broome29 and Puntis et al,Reference Puntis, Whiting, Pappa and Lennox30 were rated overall ‘low’ risk of bias. These studies considered symptom remission and psychiatric hospital readmission outcomes, respectively. Both studies externally validated their prediction model and considered its clinical utility. However, neither study considered the implementation of the prediction model into actual clinical practice. When we assessed the 13 included studies according to PROBAST applicability concerns, all of the studies were considered overall ‘low’ concern. This is indicative of the broad scope of our systematic review.
Discussion
Our systematic review identified 13 studies reporting 31 prognostic prediction models for the prediction of a wide range of clinical outcomes. The majority of models were developed via logistic regression. There were several methodological limitations identified, including a lack of appropriate validation, issues with handling missing data and a lack of reporting of calibration and discrimination measures. We identified two studies with models at low risk of bias as assessed with PROBAST, both of which externally validated their models.
Principal findings in context
Our systematic review found no consistent definition of FEP across the different cohorts used for developing and validating prediction models. A lack of an operational definition for FEP within clinical and research settings has previously been identified as a major barrier to progress.Reference Breitborde, Srihari and Woods31 The majority of cohorts in our systematic review included only individuals with non-affective psychosis, with the minority also including affective psychosis. In contrast, early intervention services typically do not make a distinction between affective and non-affective psychosis in those that they accept onto their service.32 As such, there may be issues with generalisability of prediction models developed in cohorts with solely non-affective psychosis to real-world clinical practice.
A wide range of different outcomes were predicted by the FEP models, including symptom remission, global functioning, vocational functioning, treatment resistance, hospital readmission and quality-of-life outcomes. This is reflective of the fact that recovery from FEP is not readily distilled down to a single factor such as symptom remission. Meaningful recovery is represented by a constellation of multidimensional outcomes unique to each individual.Reference Jääskeläinen, Juola, Hirvonen, McGrath, Saha and Isohanni33 We should engage people with lived experience, to ensure that prediction models are welcomed and are predicting outcomes most relevant to the people they are for.
All of the prediction models were developed in populations from high-income countries, and only three studies included participants from countries outside of Europe, an issue not unique to FEP research. Consequently, it is currently unknown how prediction models for FEP would generalise to low-income countries. Prediction models may have considerable benefit in low-income countries, where almost 80% of patients with FEP live, but where mental health support is often scarce.Reference Singh and Javed34 Prediction models could help prioritise the appropriate utilisation of limited healthcare resources.
Only one study considered predictor variables other than clinical or sociodemographic factors. In this study, the additional predictors did not add significant value.Reference de Nijs22 In recent years, substantial progress has been made in elucidating the pathophysiological mechanisms underpinning the development of psychosis. We now recognise important roles for genetic factors, neurodevelopmental factors, dopamine and glutamate.Reference Lieberman and First35 Prediction model performance may be improved by the incorporation of these biologically relevant disease markers as predictor variables. However, the cost–benefit aspect of adding more expensive and less accessible disease markers must be carefully considered, especially if models are to be utilised in settings where resources are more limited.
Machine learning can be operationally defined as ‘models that directly and automatically learn from data’. This is in contrast to regression models, which ‘are based on theory and assumptions, and benefit from human intervention and subject knowledge for model specification’.Reference Christodoulou, Ma, Collins, Steyerberg, Verbakel and Van Calster36 Just two studies used machine learning techniques for their modelling.Reference de Nijs22,Reference Koutsouleris, Kahn, Chekroud, Leucht, Falkai and Wobrock26 The rest of the studies used logistic regression. We were unable to make any comparison between the discrimination and calibration ability of the two studies that used machine learning and the other studies, because these metrics were not provided. However, a recent systematic review found no evidence of superior performance of clinical prediction models that use machine learning methods over logistic regression.Reference Christodoulou, Ma, Collins, Steyerberg, Verbakel and Van Calster36 In any case, the distinction between regression models and machine learning has been viewed to be artificial. Instead, algorithms may exist ‘along a continuum between fully human-guided to fully machine-guided data analysis’.Reference Beam and Kohane37 An alternative comparison may be between linear and non-linear classifiers. Only one study used a non-linear classifier,Reference Koutsouleris, Kahn, Chekroud, Leucht, Falkai and Wobrock26 but again we were unable to gain meaningful insights into its relative performance because appropriate metrics were not provided.
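As an illustration of the kind of like-for-like comparison that the included studies did not report, the sketch below (synthetic data, illustrative settings) contrasts the cross-validated discrimination of a linear logistic regression with that of a non-linear, RBF-kernel support vector machine.

```python
# Illustrative sketch with synthetic data.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = make_classification(n_samples=400, n_features=20, n_informative=6, random_state=2)

models = {
    "Logistic regression (linear)": make_pipeline(StandardScaler(), LogisticRegression(max_iter=2000)),
    "Support vector machine (RBF)": make_pipeline(StandardScaler(), SVC(kernel="rbf")),
}
for name, model in models.items():
    auc = cross_val_score(model, X, y, cv=5, scoring="roc_auc")
    print(f"{name}: mean cross-validated C-statistic = {auc.mean():.3f}")
```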
A principal finding from our systematic review is the presence of methodological limitations across the majority of studies. Steyerberg et al outline four key measures of predictive performance that should be assessed in any prediction-modelling study: two measures of calibration (the model intercept (A) and the calibration slope (B)), discrimination via a concordance statistic (C) and clinical usefulness with decision-curve analysis (D).Reference Steyerberg and Vergouwe7 Model calibration is the level of agreement between the observed outcomes and the predictions. For example, if a model predicts a 5% risk of cancer, then, according to such a prediction, the observed proportion should be five cancers per 100 people. Discrimination is the ability of a model to distinguish between a patient with the outcome and one without.Reference Steyerberg and Vergouwe7 Our review found that only seven studies (54%) reported discrimination and just five (38%) reported any measure of calibration. The remaining studies reported only classification metrics, such as accuracy or balanced accuracy. The problem with solely reporting classification metrics is that they vary both across models and across different probability thresholds for the same model. This renders the comparison between models less meaningful. It is further argued that setting a classification threshold for a probability-generating model is premature. Rather, a clinician may choose to set different probability thresholds for the same prediction model, depending on the situation at hand, to optimise the balance between false positives and false negatives. For example, in the case of a model predicting cancer, a clinician may choose a lower probability threshold to offer a non-invasive screening test and a higher probability threshold to suggest an invasive and potentially harmful biopsy. Further, without any measure of model calibration, we are unable to assess if the model can make unbiased estimates of outcome.Reference Harrell38 The final key step in assessing the performance of a prediction model is to determine its clinical usefulness – that is, can better decisions be made with the model than without? Decision-curve analysis considers the net benefit (the treatment threshold weighted sum of true- minus false-positive classifications) for a prediction model compared with the default strategy of treating all or no patients, across an entire range of treatment thresholds.Reference Vickers, van Calster and Steyerberg39 Only two studies (15%) included in our review considered whether the model was clinically useful. Without proper validation of the prediction models, the reported performances are likely to be overly optimistic. Four studies (31%) reported only apparent validity. Just four studies (31%) reported external validation, which is considered essential before applying a prediction model to clinical practice.Reference Steyerberg and Harrell14
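The sketch below illustrates how each element of this ABCD framework can be computed (simulated predictions and our own helper names, not outputs from any included study): the calibration intercept and slope via a logistic recalibration model, the C-statistic, and net benefit at a few illustrative treatment thresholds.

```python
# Illustrative sketch with simulated predictions.
import numpy as np
import statsmodels.api as sm
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(3)
y_prob = rng.uniform(0.05, 0.95, 500)       # a model's predicted risks
y_true = rng.binomial(1, y_prob)            # outcomes broadly consistent with them
lp = np.log(y_prob / (1 - y_prob))          # linear predictor (logit scale)

# A: calibration-in-the-large -- intercept of a logistic model with lp as an offset.
intercept = sm.GLM(y_true, np.ones((len(lp), 1)),
                   family=sm.families.Binomial(), offset=lp).fit().params[0]
# B: calibration slope -- coefficient of lp in a logistic recalibration model.
slope = sm.GLM(y_true, sm.add_constant(lp), family=sm.families.Binomial()).fit().params[1]
# C: discrimination.
c_statistic = roc_auc_score(y_true, y_prob)

# D: net benefit = TP/n - (FP/n) * pt/(1 - pt), compared with treating everyone.
def net_benefit(y, p, pt):
    treat = p >= pt
    tp = np.sum(treat & (y == 1))
    fp = np.sum(treat & (y == 0))
    return tp / len(y) - fp / len(y) * pt / (1 - pt)

print(f"intercept {intercept:.2f}, slope {slope:.2f}, C-statistic {c_statistic:.2f}")
for pt in (0.1, 0.2, 0.3):
    print(f"threshold {pt}: model {net_benefit(y_true, y_prob, pt):.3f}, "
          f"treat-all {net_benefit(y_true, np.ones_like(y_prob), pt):.3f}, treat-none 0.000")
```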
Altogether, just two studies (15%) had an overall ‘low’ risk of bias according to PROBAST, reflecting these methodological limitations. Neither study considered real-world implementation. To progress with implementation, impact studies are required. These would involve a cluster randomised trial comparing patient outcomes between a group with treatment informed by a clinical prediction model and a control group.Reference Moons, Kengne, Grobbee, Royston, Vergouwe and Altman40 We are not aware of any such study having been carried out within the field of psychiatry. However, Salazar de Pablo et al suggest that PROBAST thresholds for considering a study to be a ‘low’ risk of bias may be too strict.Reference Salazar de Pablo, Studerus, Vaquerizo-Serrano, Irving, Catalan and Oliver6 Indeed, in the field of machine learning, multiple imputation is frequently computationally infeasible, and single imputation may be viewed as sufficient. This is especially true in larger data-sets or in the presence of relatively few missing values.Reference Steyerberg41
Strengths and limitations
Our review had a number of strengths. We provide the first systematic overview of prediction-modelling studies for use in patients with FEP. We offer a detailed critique of the study characteristics, their methodologies and model performance metrics. Further, our review adheres to gold-standard guidance for extracting data from prediction models and for assessing bias, namely the CHARMS checklist and PROBAST.
There were several limitations. Our initial aim was to perform a meta-analysis of any prediction model that was validated across different settings and populations. However, no meta-analysis was possible because no single prediction model was validated more than once. In addition, as a consequence of poor reporting of discrimination and calibration performance across the studies, it was often difficult to make meaningful comparison between the prediction models. Also, the lack of consensus as to the most important outcome measure in FEP, with six different outcomes considered across only 13 included studies, further hindered efforts at drawing meaningful comparisons between the included studies and their respective prediction models. Likewise, if more studies had considered the same outcome measures, this may have afforded the opportunity to validate existing prediction models rather than necessitating the creation of additional new models. All published prediction-modelling studies in FEP reported significant positive findings. It is possible that studies that had negative findings were held back from publication, reflecting the possibility of publication bias. We originally intended to evaluate the overall certainty in the body of evidence by using the Grading of Recommendations Assessment, Development and Evaluation (GRADE) framework.Reference Schünemann, Oxman, Brozek, Glasziou, Jaeschke and Vist42 GRADE was originally designed for reviews of intervention studies, but has not yet been adapted for use in systematic reviews of prediction models. Consequently, in its current form, we did not find GRADE to be a suitable tool for our review and decided not to use it. Future research should consider how to adapt GRADE for use in systematic reviews of prediction models.
Implications for future research
It is clear that there is a growing trend for the development of prediction models in FEP.Reference Salazar de Pablo, Studerus, Vaquerizo-Serrano, Irving, Catalan and Oliver6 FEP is an illness that responds best to an early intervention paradigm.Reference Birchwood, Todd and Jackson43 Prediction models have the potential to optimise the allocation of time-critical interventions, like clozapine for treatment resistance.Reference Farooq, Choudry, Cohen, Naeem and Ayub44 However, several steps are necessary before meaningful implementation into real-world clinical practice. The field must prioritise external validation and replication of existing prediction models in larger sample sizes, to increase the EPV. This is best accomplished by an emphasis on data-sharing and open collaboration. Prediction studies should include FEP cohorts from low-income countries, where there is considerable potential for benefit by helping to prioritise limited resources to those most in need. Harmonisation of data collection across the field, both in terms of predictors and outcomes measured, would facilitate validation efforts. There should be a greater consideration of biologically relevant and cognitive predictors based on our growing understanding of disease mechanisms, which could optimise prediction model performance. Finally, our review highlights considerable methodological pitfalls in much of the current literature. Future prediction-modelling studies should focus on methodological rigour with adherence to accepted best-practice guidance.Reference Wynants, Van Calster, Collins, Riley, Heinze and Schuit9,Reference Steyerberg and Harrell14,Reference Harrell38 Our goal in psychiatry should be to develop an innovative approach to care by using prediction models. Application of these approaches into clinical practice would enable rapid and targeted intervention, thereby limiting treatment-associated risks and reducing patient suffering.
Supplementary material
Supplementary material is available online at https://doi.org/10.1192/bjp.2021.219.
Data availability
Data are available from the corresponding author, S.P.L., upon reasonable request.
Author contributions
P.K.M. and R.L. formulated the research question and designed the study. R.L., S.P.L., L.T. and P.K.M. collected the data. R.L., S.P.L. and P.K.M. analysed the data and drafted the manuscript. L.T., G.V.G., S.J.W., S.-J.H.F., F.D. and J.C. critically evaluated and revised the manuscript.
Funding
R.L. is funded by the Institute for Mental Health Priestley Scholarship, University of Birmingham. S.P.L. is funded by a clinical academic fellowship from the Chief Scientist Office, Scotland (CAF/19/04). S.J.W. is funded by the Medical Research Council, UK (grant MR/K013599).
Declaration of interest
G.V.G. has received support from Horizon 2020 E-Infrastructures (H2020-EINFRA), the National Institute for Health Research (NIHR) Birmingham Experimental Cancer Medicine Centre (ECMC), NIHR Birmingham Surgical Reconstruction Microbiology Research Centre (SRMRC), the NIHR Birmingham Biomedical Research Centre, and the Medical Research Council Health Data Research United Kingdom (MRC HDR UK), an initiative funded by UK Research and Innovation, Department of Health and Social Care (England), the devolved administrations and leading medical research charities. J.C. has received grants from the Wellcome Trust and Sackler Trust, and honorariums from Johnson & Johnson. P.K.M. has received honorariums from Sunovion and Sage, and is a Director of Noux Technologies Limited. All other authors declare no competing interests.