Background
For the purposes of this editorial, we define machine learning as a field in which statistical models automatically improve through computer algorithms that respond to repeated experience with data. We distinguish this approach from more traditional approaches in which a static method is applied once to derive an outcome (for example a t-test), while recognising that the two fields share methods and are interconnected.
Computational power
The first point is to acknowledge why we are interested in new statistical approaches. Psychiatric research predominantly aims to guide decisions based on group averages for a majority of individuals (e.g. from a clinical trial) or provide insight regarding associations (e.g. from cohort studies). These approaches are essential, but there is an increasing recognition that obtaining a specific diagnostic-, prognostic- or treatment-response likelihood for an individual patient will help clinicians to make personalised decisions regarding care.
The machine-learning paradigm has achieved such predictions for individual examples or cases in other fields (such as speech, text or image recognition) by using statistics to solve practical problems with computers. This shift in culture, together with large advances in computational power, cast old statistical techniques in a new light and opened the door to advanced methods that are now seamlessly integrated into daily life (such as those employed by Google, Amazon, Netflix or Apple). As such, we are interested in this new statistical paradigm because we hope that such a pragmatic approach will fast-track personalised psychiatric treatment by providing additional tools to clinicians, clients and their families.1
Pattern detection
The second point relates to limitations within the existing psychiatric research culture. Traditional psychiatric methods restrict statistical choices and rely on assumptions to facilitate inferences to a population beyond the sample. Researchers design studies with such restrictions in mind, analyse data and ultimately make decisions that influence guidelines, inform our understanding of illness and identify new therapies. In general, most such statistical models are either not designed to make predictions for individuals or, where they have been, they have not been accurate enough to yield a clinically translatable prediction that is in current use.1
The machine-learning field partly grew from the idea that to facilitate prediction at the level of a single observation (for example an individual) we need to permit more statistical freedom, relax assumptions and entertain exploratory approaches that allow computers to learn from often multilayered and multidimensional data (for example from the clinical, brain or genetic sources seen in this issue). The power of this freedom to find new predictive patterns in multidimensional data is largely why machine learning has replaced traditional statistical and computer-programming approaches in multiple corporate and scientific domains.2
Overfitting risk
A danger of more statistical freedom, however, is that it comes with an increased risk of finding results that are accurate only in a single sample and cannot be applied more widely in other contexts. This is known as ‘overfitting’, whereby idiosyncratic attributes of a sample (such as random noise) are modelled instead of patterns that generalise to new cases and contexts. Thus, the third main point is that this overfitting risk is thought to be heightened in machine-learning contexts and needs to be kept in mind at the current time. However, it is also important to recognise that the machine-learning field has popularised and extended statistical methods that test and optimise the ability of algorithms to generalise to new cases, samples, sites, countries or continents.
At an initial level, most methods that assess generalisability rely on data-resampling schemes that simulate the application of algorithms to new data in order to obtain accuracy estimates;1 the commonest is cross-validation, in which a subsample of individuals is set aside, algorithms learn patterns in the remaining sample, the resulting models are applied to the held-out subsample to estimate their accuracy, and the process is repeated. In addition to these simulations, many articles in this special issue use forms of ‘external validation’, where the statistical algorithms are tested in completely new data-sets – for example from different studies or geographic locations. Such techniques are not unique to machine-learning contexts, but they are more important in this field because of the risk of overfitting.
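For readers who would like to see how such a resampling scheme looks in practice, the short sketch below illustrates k-fold cross-validation using the open-source scikit-learn library on synthetic data; the model, predictors and fold count are illustrative assumptions rather than the methods of any paper in this issue.

# Minimal sketch of k-fold cross-validation on synthetic data (illustrative only).
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic 'clinical' data: 200 individuals, 20 candidate predictors.
X, y = make_classification(n_samples=200, n_features=20, random_state=0)

model = LogisticRegression(max_iter=1000)
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

# Each fold is held out in turn: the model learns on the remaining folds and is
# scored on the held-out individuals, simulating application to new cases.
scores = cross_val_score(model, X, y, cv=cv, scoring='balanced_accuracy')
print(f"Per-fold balanced accuracy: {np.round(scores, 2)}")
print(f"Mean cross-validated estimate: {scores.mean():.2f}")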
Representativeness of samples
A related fourth point concerns the representativeness of the sample, which determines the scope of generalisability claims and the potential sources of bias. Representativeness can first be assessed by applying clinical knowledge to judge the degree to which results from the sample can support the conclusions of the study. For example, when making strong translational claims it is important for samples to be representative of real-world clinical environments rather than highly controlled scientific designs or methods.
Questions regarding bias can also be derived from clinical experience and relate to such factors as site, study, country, demographic or clinical differences. Assessing whether biases have been addressed is important for translational claims and can be tested with innovative resampling schemes (such as leave-group-out cross-validation1) in addition to the gold-standard use of diverse external validation samples. Without assessments of bias, the statistical models may not perform accurately for individuals according to such factors as race, ethnicity or gender; in other fields, machine-learning recommendations have been shown to be less accurate for these groups because the algorithms have predominantly learned decision rules from dominant majority groups. The integration of clinical knowledge into the design of machine-learning tools is thus especially important to increase the representativeness of samples and to consider potential biases.
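As an illustration of how such a resampling scheme can probe site-related bias, the sketch below shows leave-one-group-out cross-validation with scikit-learn, holding out each hypothetical recruitment site in turn; the data, grouping variable and model are assumptions for demonstration only.

# Minimal sketch of leave-group-out cross-validation (illustrative only):
# each iteration trains on two 'sites' and tests on the unseen third site.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import LeaveOneGroupOut, cross_val_score

X, y = make_classification(n_samples=300, n_features=20, random_state=1)
# Assumed grouping variable: three recruitment sites of 100 individuals each.
sites = np.repeat([0, 1, 2], 100)

model = LogisticRegression(max_iter=1000)
logo = LeaveOneGroupOut()

scores = cross_val_score(model, X, y, groups=sites, cv=logo,
                         scoring='balanced_accuracy')
for site, score in zip(np.unique(sites), scores):
    print(f"Held-out site {site}: balanced accuracy {score:.2f}")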
Real-world utility and implementation
The final point from a clinical perspective is to consider the real-world clinical utility and implementation of machine-learning tools, areas where engagement with the wider research community is especially important. The usefulness of a statistical prediction is only as good as its ability to improve care to a degree that justifies the cost (and risk) of its implementation. Such questions can first be addressed by considering the potential of a tool to improve the status quo of clinical routines related to diagnoses, prognoses and treatment selection by assessing common quantitative metrics used in predictive contexts (such as accuracy, positive predictive value or area under the curve; see the Appendix). Increased confidence in the potential clinical utility can also be generated with additional assessments; for example, comparing machine-learning predictions with those made by clinicians in the same study, using net-benefit analyses to quantify the balance between the benefit (for example accurately predicting an illness) and potential harms (for example unnecessary testing), or using decision-curve and calibration analyses.3
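To connect these metric names to their calculations, the sketch below computes accuracy, positive predictive value, sensitivity, specificity and area under the curve with scikit-learn on a small set of invented predictions; the labels and probabilities are illustrative assumptions, not study results.

# Minimal sketch of common predictive metrics on invented predictions (illustrative only).
from sklearn.metrics import (accuracy_score, precision_score,
                             roc_auc_score, confusion_matrix)

# Hypothetical ground truth (1 = illness present) and model-predicted probabilities.
y_true = [1, 0, 1, 1, 0, 0, 1, 0, 1, 0]
y_prob = [0.9, 0.2, 0.7, 0.6, 0.4, 0.1, 0.8, 0.3, 0.4, 0.2]
y_pred = [1 if p >= 0.5 else 0 for p in y_prob]  # threshold the probabilities

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print(f"Accuracy: {accuracy_score(y_true, y_pred):.2f}")
# Positive predictive value = true positives / all positive predictions.
print(f"Positive predictive value: {precision_score(y_true, y_pred):.2f}")
print(f"Sensitivity: {tp / (tp + fn):.2f}, Specificity: {tn / (tn + fp):.2f}")
# Area under the ROC curve uses the predicted probabilities, not the thresholded labels.
print(f"AUC: {roc_auc_score(y_true, y_prob):.2f}")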
Even if a tool is deemed sufficiently generalisable, its biases are known, and it outperforms existing clinical tools and offers a clinical benefit, the final component of assessment is whether it could be practically implemented. Recent work in general medical fields has highlighted the unexpected difficulties of implementing highly promising tools in hospital settings,4 which emphasises the need for ongoing input from clinical teams on how the most promising tools might actually work in real life. Towards these translational ends, some studies now provide web- or app-based platforms to test the capacity to deploy machine-learning algorithms (such as www.proniapredictors.eu). Additionally, making algorithms openly available is increasingly important for enhancing transparency through open-science principles, which are critical across the clinical sciences to facilitate understanding, replication and collaboration.
Conclusions
Taken together, these five points to consider when reading a machine-learning paper are intended to provide important context for the papers in the following special issue and to engage a clinical audience. Moving forward, this clinical engagement will be critical for the field to progress, and we hope that the special issue will encourage further dialogue towards a clinical future that includes the ability to tailor treatment approaches to individuals in real time based on machine-learning models. To further facilitate such a dialogue we have provided a glossary of terms (Appendix) that can be used as a reference, as well as a supplementary figure to aid understanding of analytic pipelines (see Supplementary Materials available at https://doi.org/10.1192/bjp.2022.29). We also invite interested readers to engage with other review papers in psychiatry.1,5
Supplementary material
To view supplementary material for this article, please visit http://dx.doi.org/10.1192/bjp.2022.29
Author contribution
D.D. and R.K. wrote the article and provided the Table and Figure.
Funding
R.K.'s research is funded by Research and Development, National Health Service Greater Glasgow and Clyde, the Chief Scientist Office, Scotland, and the Medical Research Council, UK. D.D.'s research is funded by a National Alliance for Research on Schizophrenia & Depression Young Investigator Grant (no. 30196).
Declaration of interest
R.K. and D.D. do not have any conflicts of interest pertaining to this article.