When comparing the effectiveness of competing healthcare programmes across disease areas, there is a growing interest in estimating health outcomes on a generic metric, such as quality-adjusted life years (QALYs). To enable QALY calculations, a preference-based health-related quality of life instrument, also referred to as a health state utility (HSU) instrument,Reference Brazier, Ratcliffe and Salamon1 is essential. Such HSU instruments consist of a descriptive system and a predetermined value set that reflects the preferences of the general population and assigns a value – or utility – to each possible health state in the descriptive system.
In clinical trials, however, we find condition-specific instruments to be more commonly applied than generic instruments. This is because clinicians have an affinity to the gold standard instruments within their speciality, but also because condition-specific instruments tend to identify disease-specific changes in health that might not be identified by a generic descriptive system. In cases where condition-specific data have been collected and decision makers want effectiveness to be expressed on a generic metric, there is a need for a mapping algorithm to convert condition-specific data to HSU.Reference Brazier, Ratcliffe and Salamon1, Reference Dakin2 Such mapping algorithms are commonly developed by distributing both measures of interest to the same respondents, and applying statistical methods to predict utilities from scores on a source instrument.
Health outcome measures
Depression is a common mental disorder and one of the main causes of disability worldwide.3 It can last for long periods or re-occur, impairing work or school performance and the ability to cope with daily life. Although a wide range of mental health outcome measures are suitable to measure its effect, they do not produce utilities. The Depression Anxiety Stress Scales (DASS-21)Reference Lovibond and Lovibond4 and Kessler Psychological Distress Scale (K-10)Reference Kessler, Barker, Colpe, Epstein, Gfroerer and Hiripi5 are two of the most widely used mental health-specific instruments, assessing core symptoms of depression, anxiety and stress.Reference Mihalopoulos, Chen, Iezzi, Khan and Richardson6
The most widely used HSU instrument is the EQ-5D. A recent review supported its dominant position by revealing that 70% of cost–utility studies had applied the EQ-5D.Reference Wisloff, Hagen, Hamidi, Movik, Klemp and Olsen7 One reason for its widespread use is that it has been recommended by the National Institute for Health and Care Excellence (NICE) in the UK.8 Studies generating mapping algorithms for producing EQ-5D utilities are increasing in number, especially after NICE endorsed mapping if the direct measure of EQ-5D utility is unavailable.Reference Dakin2
This paper has three aims. First, we aim to replace the existing mapping algorithms for DASS-21 and K-10 that were recently published in the British Journal of Psychiatry.Reference Mihalopoulos, Chen, Iezzi, Khan and Richardson6 The paper by Mihalopoulos et al was based on an interim EQ-5D-5L value set,Reference van Hout, Janssen, Feng, Kohlmann, Busschbach and Golicki9 which was developed based on the value set for the three-level version.Reference Dolan10 Most recently, eight country-specific value sets have been published for the EQ-5D-5L instrument, including four Western countries (England, the Netherlands, Spain and Canada), three Asian countries (China, Japan and Korea) and one South American (Uruguay).Reference Augustovski, Rey-Ares, Irazola, Garay, Gianneo and Fernandez11–Reference Shiroiwa, Ikeda, Noto, Igarashi, Fukuda and Saito18 The previously published mapping algorithm is already becoming obsolete in the literature after the publication of the directly elicited EQ-5D-5L official value sets.
Second, we aim to investigate if mapping algorithms for the two mental health instruments differ across countries, depending on country-specific health state preferences. Because health state preferences differ across countries,Reference Zhao, Li, Liu, Zhang and Chen19 their EQ-5D-5L value sets differ accordingly. Hence, there is a need to develop country-specific mapping algorithms.
Third, we aim to make important methodological contributions. Although the paper by Mihalopoulos et al applied two different mapping models (ordinary least squares regression (OLS) and generalised linear models (GLM)),Reference Lovibond and Lovibond4 this paper further investigates the relative merit of six regression models. Best practice for reporting mapping studies is followed, based on the Mapping Preference-based Measures Reporting Standards statement.Reference Petrou, Rivero-Arias, Dakin, Longworth, Oppe and Froud20
Method
Sample
Data were obtained from the Multi-Instrument Comparison study, which is based on an online survey administered in six countries (Australia, Canada, Germany, Norway, UK and USA) by a global panel company, CINT Australia Pty Ltd.Reference Richardson, Iezzi and Maxwell21 The current paper is based on respondents who were diagnosed with depression (n = 917). The depression group were asked to describe their condition on both the DASS-21 and the K-10, as well as the EQ-5D-5L. For further details on respondent description, see Richardson et al Reference Richardson, Iezzi and Maxwell21 and Mihalopoulos et al. Reference Mihalopoulos, Chen, Iezzi, Khan and Richardson6
Instruments
DASS-21
The DASS-21 comprises 21 items, each with a four-point severity scale indicating how much the statement applies to the respondent (did not apply to me; applied to some degree; applied a considerable degree; applied very much or most of the time).Reference Lovibond and Lovibond4 It comprises three seven-item subscales that measure core symptoms of depression, anxiety and stress. The items of each subscale are summed into a scale score ranging from 0 to 42, where lower values indicate fewer problems.
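The subscale scoring described above can be sketched as follows. This is a minimal illustration, not the authors' code: it assumes that items are coded 0 to 3 and that the seven-item sum (0 to 21) is doubled, which is the usual DASS-21 convention and is what the 0 to 42 range implies. The function name is hypothetical.

```python
def dass21_subscale_score(item_responses):
    """Score one DASS-21 subscale from its seven item responses.

    Assumption: each response is coded 0 (did not apply to me) to
    3 (applied very much or most of the time).
    """
    if len(item_responses) != 7:
        raise ValueError("a DASS-21 subscale has exactly 7 items")
    if any(r not in (0, 1, 2, 3) for r in item_responses):
        raise ValueError("each response must be an integer from 0 to 3")
    # Assumption: the 0-21 sum is doubled to give the 0-42 range quoted in the text.
    return 2 * sum(item_responses)
```

The function is called once per subscale (depression, anxiety, stress) with the seven responses belonging to that subscale.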
K-10
The K-10 measures psychological distress and comprises 10 items asking about anxiety and depressive symptoms experienced in the past 4 weeks.Reference Kessler, Barker, Colpe, Epstein, Gfroerer and Hiripi5 Each item has five response levels (none of the time; a little of the time; some of the time; most of the time; all of the time). Items are summed into a scale score of 10–50, where lower values indicate fewer problems.
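As a minimal sketch of this scoring rule, the total can be computed as below. The 1 to 5 coding of the response levels is an assumption inferred from the stated 10–50 range, and the function name is hypothetical.

```python
def k10_total(item_responses):
    """Sum the ten K-10 item responses into a total score (10-50).

    Assumption: responses are coded 1 (none of the time) to
    5 (all of the time), which yields the 10-50 range in the text.
    """
    if len(item_responses) != 10:
        raise ValueError("the K-10 has exactly 10 items")
    if any(r not in (1, 2, 3, 4, 5) for r in item_responses):
        raise ValueError("each response must be an integer from 1 to 5")
    return sum(item_responses)
```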
EQ-5D-5L
The EQ-5D consists of five items/dimensions: mobility, self-care, usual activities, pain/discomfort and anxiety/depression. The five-level version (EQ-5D-5L) is based on the original three-level version (EQ-5D-3L) by inserting two more response levels to each dimension to reduce potential ceiling effects and improve reliability and sensitivity.Reference Herdman, Gudex, Lloyd, Janssen, Kind and Parkin22 The five response levels are no problem, slight problem, moderate problem, severe problem and unable to/extreme problem. The instrument produces 3125 ($5^5$) health states. The utility scores were calculated by applying eight country-specific value sets: England, the Netherlands, Spain, Canada, China, Japan, Korea and Uruguay.Reference Augustovski, Rey-Ares, Irazola, Garay, Gianneo and Fernandez11–Reference Shiroiwa, Ikeda, Noto, Igarashi, Fukuda and Saito18
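The count of health states follows directly from five dimensions with five levels each. The short sketch below enumerates them; coding levels as the digits 1 to 5 and writing a state as a five-digit string (for example 11111 for no problems on any dimension) is an assumed convention, not something stated in this text.

```python
from itertools import product

DIMENSIONS = ("mobility", "self-care", "usual activities",
              "pain/discomfort", "anxiety/depression")
LEVELS = (1, 2, 3, 4, 5)  # assumed coding: 1 = no problem ... 5 = unable to/extreme problem

def all_health_states():
    """Return every combination of levels across the five dimensions as a digit string."""
    return ["".join(str(level) for level in state)
            for state in product(LEVELS, repeat=len(DIMENSIONS))]

states = all_health_states()
print(len(states), states[0], states[-1])  # 3125 11111 55555
```

A country-specific value set then assigns one utility to each of these states; the value sets themselves are not reproduced here.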
Statistical analysis
Descriptive
Spearman's rank correlation (rs) and exploratory factor analyses (EFA) were used to assess the degree of conceptual overlap between the source instruments (DASS-21 and K-10) and the target instrument (EQ-5D-5L). EFA with principal axis factoring was used, which has been recommended as the preferred method of factor extraction.Reference Russell23 An eigenvalue >1 and the scree test were used as selection criteria to extract underlying constructs.Reference Russell23 Further, as the extracted factors are usually correlated,Reference Antony, Bieling, Cox, Enns and Swinson24 a promax rotation was applied.Reference Fabrigar, Wegener, MacCallum and Strahan25 Correlations between the extracted factors were also observed (see supplementary Table 2a and b).
A direct mapping technique was applied by regressing EQ-5D-5L utility index onto the source instrument, either the DASS-21 subscale scores or K-10 total score. Six alternative models were estimated and compared (as described below). For every regression model, a forward stepwise selection method was used for variable selection (P < 0.05). To make mapping equations applicable to all data-sets, only age and gender were considered as covariates. Interaction and squared terms were only considered if the original variable was significant. Indirect mapping (i.e. response mapping) is not suitable in this case because of the limited overlap between the two depression scales and the EQ-5D-5L. This issue is demonstrated in the EFA results. In indirect mapping, responses to each of the five dimensions of the EQ-5D-5L will be predicted in the first step before further applying the country-specific value sets. With limited overlap across dimensions in two instruments (i.e. mainly mental health dimension in EQ-5D-5L), the prediction error for four physical health-related dimensions of EQ-5D-5L will be large.
Regression models
OLS is the most commonly used regression model in mapping studies,Reference Brazier, Yang, Tsuchiya and Rowen26 and requires data to be normally distributed with constant variance. Unlike OLS, the GLM allows for a skewed (i.e. non-normal) distribution of the dependent variable. The gamma family with a log-link function fitted these data well for the GLM. Because the gamma family and log link are defined for non-negative values, EQ-5D-5L disutility (where disutility is equal to 1 – EQ-5D-5L utility) was used. Beta binomial regression allows the dependent variable to be skewed and is capable of modelling bounded dependent variables restricted between 0 and 1, which is often the case with utility instruments. As this parametric model is not defined at the boundary values, the outcome values should be restricted to a 0–1 range, excluding 0 and 1. This can be achieved by the linear transformation [Y(N−1) + 0.5]/N following earlier literature,Reference Smithson and Verkuilen27 where N refers to sample size, and Y is the dependent variable. For an application of the beta binomial regression model, see Khan et al.Reference Khan, Morris, Pashayan, Matata, Bashir and Maguirre28 A similar, semi-parametric approach for modelling bounded data defined on the [0, 1] scale is the fractional regression model (FRM). It was developed to address the modelling of empirical bounded dependent variables, such as proportions and percentages, that exhibit piling-up at one of the two corners.Reference Papke and Wooldridge29 In the FRM, EQ-5D-5L scores are linearly transformed onto a 0–1 scale by subtracting the minimum score from EQ-5D-5L and then dividing by the range. For both beta binomial and FRM, the logit link function fitted these data well and is applied here. The logit transformation used in the prediction of EQ-5D-5L utility is given as:
$$Y = \frac{\exp(X\beta)}{1 + \exp(X\beta)}$$
where X is a vector of predictors (i.e. the DASS-21 depression and anxiety subscale scores or the K-10 overall score, together with age), and β is a vector of estimated coefficients.
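The three transformations described above can be written out as a short sketch. The function names are hypothetical and the code only illustrates the formulas in the text; it is not the estimation code used in the study, which was run in Stata.

```python
import math

def squeeze_unit_interval(y, n):
    """Transformation [y(n - 1) + 0.5] / n from the text.

    Moves values of exactly 0 or 1 strictly inside (0, 1) so that a
    beta-type model is defined; n is the sample size.
    """
    return (y * (n - 1) + 0.5) / n

def min_max_scale(utility, u_min, u_max):
    """Rescale a utility onto 0-1 by subtracting the minimum and dividing by the range."""
    return (utility - u_min) / (u_max - u_min)

def inverse_logit(eta):
    """Logistic function exp(eta) / (1 + exp(eta)), written to avoid overflow."""
    if eta >= 0:
        return 1.0 / (1.0 + math.exp(-eta))
    e = math.exp(eta)
    return e / (1.0 + e)
```

Here `eta` plays the role of the linear predictor (the product of the predictors and their coefficients), so `inverse_logit(eta)` returns the model's prediction on the 0–1 scale.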
MM-estimation is a robust regression estimation approach that is appropriate when the residual distribution is non-normal or some outliers affect the model.Reference Susanti, Sri Sulistijowai, Pratiwi and Liana30 MM-estimation estimates the regression parameter by S-estimation, which minimises the scale of the residual from M-estimation, and then proceeds with M-estimation. The S in S-estimation stands for the scale of the residual, the M in M-estimation stands for maximum likelihood type and the MM in MM-estimation stands for minimising M-estimation.Reference Susanti, Sri Sulistijowai, Pratiwi and Liana30 It aims to obtain estimates that have a high breakdown value and are more efficient. The breakdown value is a common measure of the proportion of outliers that can be addressed before these observations affect the model.Reference Ayinde, Lukman and Arowolo31 The censored least absolute deviations (CLAD) model is more appropriate for outcome variables censored at one or both end-points.Reference Powell32 The CLAD model is a semi-parametric estimator that is robust to distributional assumptions and heteroscedasticity because it uses median values rather than means among similar groups, as medians are likely to be less affected by censoring.
Model performance
In line with previous research,Reference Brazier, Yang, Tsuchiya and Rowen26 the predictive performance of each model described above was assessed by mean absolute error (MAE) and root mean square error (RMSE). Both were computed for the full sample (where lower values indicate better fit). The MAE is defined as the average of the absolute differences between observed and predicted EQ-5D-5L. The RMSE is the square root of the average of the squared differences between observed and predicted EQ-5D-5L. Both MAE and RMSE were adjusted for the degrees of freedom, as the number of independent variables may differ across models.
It has been shown that the wider the scale length of the EQ-5D-5L, the larger the error.Reference Versteegh, Leunis, Luime, Boggild, Uyl-de Groot and Stolk33 Therefore, adjusting for scale differences would allow reasonable comparison between data-sets or models with different scales. Although there are no standard ways of normalisation in the literature, we normalise both MAE and RMSE to the range (defined as the difference between the maximum and the minimum values) of the measured data. Such normalised RMSE (NRMSE) and normalised MAE (NMAE) are non-dimensional and enable us to compare data-sets and models with different units or scales. Lastly, the performance of each model was also assessed by the square of the correlation coefficient between the observed and predicted values adjusted for the number of predictors in the model (adjusted r²).Reference Sullivan and Ghushchyan34 In addition, binned scatter plots between the observed and predicted EQ-5D-5L utilities were reported to visualise the predictive performance of each model.
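The error measures defined above can be sketched in a few lines. This is an illustrative implementation of the plain definitions only: the degrees-of-freedom adjustment mentioned in the text is not specified in enough detail to reproduce, so it is deliberately left out, and the function names are hypothetical.

```python
import math

def mae(observed, predicted):
    """Mean absolute error: average absolute difference between observed and predicted."""
    return sum(abs(o - p) for o, p in zip(observed, predicted)) / len(observed)

def rmse(observed, predicted):
    """Root mean square error: square root of the average squared difference."""
    return math.sqrt(
        sum((o - p) ** 2 for o, p in zip(observed, predicted)) / len(observed)
    )

def normalise_by_range(error, observed):
    """Divide an error by the range (max - min) of the observed data."""
    return error / (max(observed) - min(observed))

# Made-up numbers, purely to show the calls.
observed = [-0.2, 0.4, 1.0]
predicted = [0.0, 0.4, 0.7]
nmae = normalise_by_range(mae(observed, predicted), observed)
nrmse = normalise_by_range(rmse(observed, predicted), observed)
```

Because both measures are divided by the observed range, value sets with a wider utility range (for example one with a more negative minimum) are not penalised simply for their scale.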
To investigate the generalisability of the preferred mapping algorithms, cross-validation was performed by splitting the existing data into two: estimation and validation samples via random selection procedures. In this study, the total sample was randomly divided into two equal groups to evaluate the model fit in out-of-sample data. The model was fitted on the estimation sample, and the resulting parameters from the fitted model were then used to predict the EQ-5D-5L on the validation sample. This procedure was repeated by reversing the validation and estimation sample. The average RMSE, MAE and r² for both iterations were calculated for comparison of the models' predictive performance. Lastly, the best-fitting model was estimated with the full sample (N = 917). All statistical analyses were conducted with Stata version 14.2 (StataCorp, College Station, Texas, USA), except the EFA, which was carried out in SPSS version 24 (IBM Corp, Armonk, New York, USA).
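The split-half procedure described above can be sketched as follows. To keep the example self-contained, a one-predictor ordinary least squares line stands in for whichever regression model is being validated; that substitution, the function names and the fixed random seed are all assumptions made for illustration.

```python
import math
import random

def fit_line(x, y):
    """Closed-form OLS for a single predictor; returns (intercept, slope)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope

def rmse(observed, predicted):
    return math.sqrt(
        sum((o - p) ** 2 for o, p in zip(observed, predicted)) / len(observed)
    )

def two_fold_rmse(x, y, seed=0):
    """Randomly split the data in half, fit on one half, predict the other,
    swap the roles of the halves, and average the two out-of-sample RMSEs."""
    indices = list(range(len(x)))
    random.Random(seed).shuffle(indices)
    half = len(indices) // 2
    folds = (indices[:half], indices[half:])
    scores = []
    for estimation, validation in (folds, folds[::-1]):
        intercept, slope = fit_line([x[i] for i in estimation],
                                    [y[i] for i in estimation])
        predicted = [intercept + slope * x[i] for i in validation]
        scores.append(rmse([y[i] for i in validation], predicted))
    return sum(scores) / 2
```

The same loop structure applies to MAE and r²; only the scoring function changes.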
Ethical approval
Data for this study were obtained from the Multi-Instrument Comparison project, which was approved by the Monash University Human Research Ethics Committee (numbers CF11/1758-2011000974 and CF11/3192-2011001748).
Results
Sample characteristics are presented in Table 1. The estimated EQ-5D-5L utility scores varied both in the mean score and the range, depending on the choice of country-specific value sets. In the depression sample, the mean EQ-5D-5L utility ranged from 0.59 (Dutch value set) to 0.83 (Uruguayan value set). The minimum utility score ranged from −0.41 in the Dutch value set to 0.12 in the Korean and Uruguayan value sets. Spearman's rank correlations are presented in supplementary Table 1, available at https://doi.org/10.1192/bjo.2018.21. Among the EQ-5D-5L dimensions, the anxiety/depression dimension produced the highest correlation with the source instruments (rs ≥ 0.50), whereas the mobility dimension produced the lowest (rs ≤ 0.25). The three DASS-21 subscales were highly correlated with each other (rs = 0.63–0.73).
The EFA was appropriate, as indicated by a Kaiser–Meyer–Olkin measure of sampling adequacy of >0.90 and a highly significant Bartlett's test of sphericity. The pattern matrices for the EFA, showing factor loadings of at least 0.30, are reported in Table 2a and b. The EFA of the DASS-21 and EQ-5D-5L items produced four underlying factors (depression, anxiety, stress and physical functioning), explaining 60% of the variance. The extracted factors replicate the original factor structure of the DASS-21 subscales (depression, anxiety and stress), except for item 2, ‘I was aware of dryness of my mouth’, which was originally part of the anxiety subscale. This item produced only weak loadings on three factors: physical (0.288), stress (0.197) and anxiety (0.161). The result revealed conceptual overlap between the anxiety/depression dimension of EQ-5D-5L and the extracted DASS-21 depression factor. The remaining four EQ-5D-5L dimensions loaded mainly on the fourth factor (i.e. physical functioning).
Note. Loadings below 0.30 not shown, except for item two of the Depression Anxiety Stress Scales (DASS-21), where the highest loading is reported in brackets. Rotation method: promax with Kaiser normalisation.
Note. Loadings below 0.30 are not shown. Rotation method: promax with Kaiser normalisation.
K-10, Kessler Psychological Distress Scale.
Considering the result with K-10 items, three factors were extracted: depression, anxiety and physical functioning (Table 2b). Again, only the EQ-5D-5L anxiety/depression dimension loaded on the extracted K-10 depression factor. No K-10 item loaded mainly on the last factor (physical), which was formed by the first four dimensions of EQ-5D-5L. The structure matrix presented in supplementary Table 2a and b (which shows the correlation of each item with the extracted factors) revealed results similar to the Spearman's correlation coefficients (supplementary Table 1).
Table 3 presents model performance based on the English value set. Fractional logistic regression performed best when we consider adjusted r² and NRMSE for both DASS-21 and K-10. In terms of NMAE, CLAD and MM-estimation performed best for DASS-21, and MM-estimation performed best for K-10. A similar result was revealed by cross-validation. This result was also supported by the scatter plot (supplementary Fig. 1).
Note. The best results are in bold type.
adj. r², square of correlation coefficient between predicted and observed EQ-5D-5L, penalised for number of predictors; CLAD, censored least absolute deviation; DASS-21, Depression Anxiety Stress Scales; FRM, fractional regression model; GLM, generalised linear model; K-10, Kessler Psychological Distress Scale; NMAE, normalised mean absolute error; NRMSE, normalised root mean square error; OLS, ordinary least squares regression.
Model performance based on the other country-specific value sets is presented in supplementary Table 3a and b. Except for the Japanese value set, FRM was preferred in terms of adjusted r² and NRMSE, whereas MM-estimation or CLAD was preferred with NMAE. For the Japanese value set, beta binomial regression was the preferred model when NMAE and NRMSE were considered, whereas FRM was preferred in terms of adjusted r².
Table 4 presents regression results when the English value set was applied. Based on the criteria described above, best-fitting regression results for the other country-specific value sets are presented in supplementary Table 4. When DASS-21 was the source instrument, the depression and anxiety subscales and age were significant (P < 0.05) predictors in all models. When K-10 was the source instrument, the K-10 total scale and age were significant (P < 0.05) predictors.
Note. Robust standard errors are shown in parentheses.
DASS-21, Depression Anxiety Stress Scales; K-10, Kessler Psychological Distress Scale.
a. Based on the English value set.
b. All coefficients significant at P < 0.001.
Unlike the linear regression model, the beta binomial and FRM estimations produce non-linear relationships between the predictors and the target EQ-5D-5L utilities. The beta binomial and FRM coefficients are not directly interpretable. In this study, we are not interested in interpretation of the raw coefficients but rather in the prediction of EQ-5D-5L utilities. The following example shows how to use the results reported in Table 4 to calculate the predicted EQ-5D-5L utility from the K-10, using the logit transformation. Assuming the mean values of age and the K-10 score (i.e. 42 and 29.2, respectively), the predicted EQ-5D-5L utility can be calculated as Y = exp(3.52220−0.01382×42−0.06476×29.2)/(1 + exp(3.52220−0.01382×42−0.06476×29.2)) = 0.741.
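The worked example above can be reproduced with a few lines of code. The three coefficients are those quoted in the calculation (English value set, K-10 as the source instrument); the constant and function names are hypothetical, and the sketch applies the logit transformation only, exactly as in the example.

```python
import math

# Coefficients as quoted in the worked example (English value set, K-10 model).
INTERCEPT = 3.52220
COEF_AGE = -0.01382
COEF_K10 = -0.06476

def predict_utility_from_k10(age, k10_score):
    """Predicted EQ-5D-5L utility from age and the K-10 total via the logit transformation."""
    linear_predictor = INTERCEPT + COEF_AGE * age + COEF_K10 * k10_score
    return math.exp(linear_predictor) / (1.0 + math.exp(linear_predictor))

print(round(predict_utility_from_k10(42, 29.2), 3))  # 0.741
```

Because both coefficients are negative, the predicted utility falls as either age or the K-10 score rises, which can be checked by varying the inputs.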
Discussion
Given the increasing use of the EQ-5D instrument in healthcare decision-making, there is a need for updated mapping of disease-specific instruments onto the recently developed preference-based value sets for the new 5L version of the EQ-5D. This study aimed at developing mapping algorithms from two widely used depression rating scales, the DASS-21 and the K-10, onto eight official country-specific EQ-5D-5L value sets. Further, we assessed the merits of six different regression models.
Based on the comparison of these regression models, the results showed that the FRM was generally the best-performing model in predicting the EQ-5D-5L utility index. The only exception was for the Japanese value set, where the beta binomial regression model was preferred. The relative performance of different regression models was the same when either DASS-21 or K-10 was the source instrument.
In general, beta binomial regression produced the second-best adjusted r² estimate in all cases, whereas the MM-estimation or CLAD overall produced the lowest MAE. Censoring is not a problem in our sample, where <2% report full health on EQ-5D-5L. The novelty of the FRM and the beta binomial model is that they are more appropriate for data that is bounded (as is the case for EQ-5D) and that they account for the non-linearity in the data. Further, FRM does not make any distributional assumption about an underlying structure used to obtain the dependent variable.Reference Papke and Wooldridge29 Note that both mean and median regressions were assessed in our study. The main concern when assessing mapping results is the accuracy of the predictions. Thus, the use of mean or median regressions was a means to an end; that is, to obtain better prediction of individual utilities, which is important for cost-effectiveness analyses.
Previously, one study has published mapping equations from DASS-21 and K-10 onto EQ-5D-5L with the same data-set.Reference Mihalopoulos, Chen, Iezzi, Khan and Richardson6 However, our study provides important contributions. First, the previous study only considered OLS and GLM, whereas we have compared six different regression models suited to features of the sample data, such as non-normality and heterogeneity of variance. Second, the previous study applied an interim value set that is already becoming obsolete after the publication of country-specific value sets that are based on directly elicited EQ-5D-5L preferences. Thus, as expected, the preferred model and the performance of these preferred models in terms of goodness-of-fit were quite different. For instance, the preferred model for the new English value set produced r², MAE and RMSE values of 0.342, 0.111 and 0.150, respectively, for DASS-21 compared with 0.332, 0.155 and 0.206 in the previous study.Reference Lovibond and Lovibond4 Similarly, the preferred model for the K-10 produced an r², MAE and RMSE of 0.337, 0.110 and 0.151, respectively, compared with 0.361, 0.150 and 0.201 in the previous study, indicating lower prediction errors in our study. These differences in goodness-of-fit may, in part, be because of differences in the scale of the target instrument and the regression method applied. Third, we have shown that mapping functions will differ across countries depending on cross-cultural diversity in the preferences on which EQ-5D-5L value sets are based. In addition, different covariates have been used in the two studies. The previous study included country dummies and gender, whereas our study has considered respondents' age and gender alone.
A recent review of mapping studies found that the goodness-of-fit measured by r² ranges from 0.17 to 0.71, with most studies reporting an r² between 0.4 and 0.5.Reference Brazier, Yang, Tsuchiya and Rowen26 A study by Lindkvist and FeldmanReference Lindkvist and Feldman35 assessed mapping a mental health-specific outcome measure (12-item General Health Questionnaire) onto EQ-5D-3L with the UK and Swedish value sets. They reported an r² and RMSE of 0.18 and 0.20 for the UK value set, and 0.24 and 0.07 for the Swedish value set, respectively, when the 12-item General Health Questionnaire alone was used as a predictor. Another study by Brazier et al Reference Brazier, Connell, Papaioannou, Mukuria, Mulhern and Peasgood36 mapped the Hospital Anxiety and Depression Scale onto EQ-5D-3L in two different samples. They reported an r² of 0.24 and RMSE of 0.227 in the first sample, and an r² of 0.19 and RMSE of 0.188 in the second sample. The mapping algorithm produced in our study showed better performance, although the studies differ in terms of methodological approach and predictor variables used.
Mapping algorithms generally suffer from overprediction of utility values for respondents in poor health and underprediction for respondents in better health.Reference Brazier, Yang, Tsuchiya and Rowen26 This was also the case in our study (see supplementary Fig. 1). A possible reason for this may, in part, be a lack of conceptual overlap between the source instruments and EQ-5D-5L. For instance, as revealed by the EFA, only the anxiety/depression dimension of the EQ-5D-5L has been mainly loaded onto one of the same factors that the disease-specific outcomes were designed to measure. Another plausible reason would be the strong decrements of preference weights of the EQ-5D-5L at a severe health state, i.e. when moving from level 3 to level 4.Reference Olsen, Lamu and Cairns37 This study has explored the mapping algorithms for different value sets of EQ-5D-5L against depression scales. Because different EQ-5D-5L value sets produce different utility scores, especially at the lower end, the country-specific mapping algorithm should be a better option to reflect the preference from a particular country. Furthermore, this is the first study to assess the predictive accuracy of different EQ-5D-5L value sets with the DASS-21 and K-10 instrument. Considering the multinational nature of the patient population used, our algorithms may have wider generalisability. However, as generalisability is a major issue for mapping studies, it should be tested how these models perform in different patient populations.
This study has some limitations. First, it is based on respondents who volunteered to participate, something that might lead to self-selection bias. Second, as the EFA results indicated, the conceptual overlap between the source and target instruments is limited. However, if the generic instrument covers important dimensions of the source instrument, it is feasible to conduct mapping studies.Reference Brazier, Ratcliffe and Salamon1 Although the physical dimensions of EQ-5D-5L are less correlated with DASS-21 and K-10, results from the EFA revealed conceptual overlap with the depression scales. Furthermore, studies have shown that EQ-5D reflects the effect of common mental health conditions such as mild to moderate depression,Reference Lovibond and Lovibond4, Reference Brazier38 suggesting that mapping depression scales onto EQ-5D is plausible.
In conclusion, this study has developed a set of mapping algorithms to predict EQ-5D-5L utility values from the DASS-21 or the K-10. Thus, in the absence of generic health-related quality of life data, the preferred mapping model can adequately convert disease-specific scores onto a generic outcome metric such as QALYs, which facilitates economic evaluations of mental health interventions.
Funding
The Research Council of Norway (grant number 221452) funded the preparation of this manuscript. The Australian National Health and Medical Research Council (grant number 1006334) funded data collection, except for the Norwegian arms, which was funded by the University of Tromsø. The publication charges for this article have been funded by a grant from the publication fund at the University of Tromsø. No parties involved in this study have any commercial interest.
Supplementary material
Supplementary material is available online at https://doi.org/10.1192/bjo.2018.21