Improving healthcare cost prediction for chronic disease through covariate clustering and subgroup analysis methods

Zhengxiao Li; Yifan Huang; Yang Cao

doi:10.1017/asb.2025.13


Published online by Cambridge University Press: 15 April 2025

Zhengxiao Li: School of Insurance and Economics, University of International Business and Economics, Beijing, China
Yifan Huang*: School of Insurance and Economics, University of International Business and Economics, Beijing, China
Yang Cao: School of Insurance and Economics, University of International Business and Economics, Beijing, China
*Corresponding author: Yifan Huang; Email: [email protected]


Abstract

Predicting healthcare costs for chronic diseases is challenging for actuaries, as these costs depend not only on traditional risk factors but also on patients’ self-perception and treatment behaviors. To address this complexity and the unobserved heterogeneity in cost data, we propose a dual-structured learning statistical framework that integrates covariate clustering into a finite mixture of generalized linear models. The framework handles high-dimensional, sparse, and highly correlated covariates while capturing their effects on specific subgroups. It is realized by imposing a penalty on the prior similarities among covariates, and the resulting complex optimization problem is solved with an expectation-maximization-alternating direction method of multipliers (EM-ADMM) algorithm, which combines EM with ADMM. We validate the stability and effectiveness of the framework through simulation and empirical studies. The results show that the framework can leverage shared information among high-dimensional covariates to improve fitting and prediction accuracy, and that covariate clustering can also uncover the covariates’ network relationships, providing valuable insights into diabetic patients’ self-perception data.
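To make the mixture component of the abstract concrete, the following is a minimal sketch of plain EM for a finite mixture of Gaussian linear regressions, which is the identity-link special case of a mixture of generalized linear models. It is not the authors' EM-ADMM algorithm: the covariate-clustering penalty and the ADMM step are omitted, and the function name `em_mixture_linreg`, the Gaussian error assumption, and the synthetic data are all assumptions made for illustration only.

```python
import numpy as np


def em_mixture_linreg(X, y, n_components=2, n_iter=100, seed=0):
    """Plain EM for a finite mixture of Gaussian linear regressions.

    Illustrative only: no penalty term and no ADMM step are included.
    """
    rng = np.random.default_rng(seed)
    n, p = X.shape
    # Start from random soft assignments; each row sums to one.
    resp = rng.dirichlet(np.ones(n_components), size=n)
    beta = np.zeros((n_components, p))
    sigma2 = np.ones(n_components)
    weights = np.full(n_components, 1.0 / n_components)

    for _ in range(n_iter):
        # M-step: weighted least squares and weighted variance per component.
        for k in range(n_components):
            w = resp[:, k]
            Xw = X * w[:, None]
            # A tiny ridge term keeps the linear system solvable.
            beta[k] = np.linalg.solve(Xw.T @ X + 1e-8 * np.eye(p), Xw.T @ y)
            resid = y - X @ beta[k]
            sigma2[k] = max((w * resid**2).sum() / max(w.sum(), 1e-12), 1e-8)
        weights = resp.mean(axis=0)

        # E-step: posterior subgroup probabilities, computed in log space.
        log_dens = np.empty((n, n_components))
        for k in range(n_components):
            resid = y - X @ beta[k]
            log_dens[:, k] = (
                np.log(np.maximum(weights[k], 1e-300))
                - 0.5 * np.log(2.0 * np.pi * sigma2[k])
                - 0.5 * resid**2 / sigma2[k]
            )
        log_dens -= log_dens.max(axis=1, keepdims=True)
        resp = np.exp(log_dens)
        resp /= resp.sum(axis=1, keepdims=True)

    return weights, beta, sigma2, resp


# Synthetic example: two latent subgroups with different regression lines.
rng = np.random.default_rng(1)
n = 400
X = np.column_stack([np.ones(n), rng.normal(size=n)])
group = rng.integers(0, 2, size=n)
true_beta = np.array([[0.0, 2.0], [5.0, -2.0]])
y = (X * true_beta[group]).sum(axis=1) + 0.3 * rng.normal(size=n)

weights, beta, sigma2, resp = em_mixture_linreg(X, y)
```

The rows of `resp` are the posterior subgroup membership probabilities. In a penalized variant, the weighted least squares update in the M-step would be replaced by a penalized problem, which is the kind of subproblem an ADMM solver can handle.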

Keywords

Chronic diseases; cost prediction; covariate clustering; subgroup analysis; complex optimization problems

Type: Research Article
Information: ASTIN Bulletin: The Journal of the IAA, First View, pp. 1–21

DOI: https://doi.org/10.1017/asb.2025.13
Copyright: © The Author(s), 2025. Published by Cambridge University Press on behalf of The International Actuarial Association



Li et al. supplementary material 1 (File, 461 Bytes)
Li et al. supplementary material 2 (File, 1.7 MB)
Li et al. supplementary material 3 (File, 10.8 KB)
