Improving healthcare cost prediction for chronic disease through covariate clustering and subgroup analysis methods

Published online by Cambridge University Press: 15 April 2025

Zhengxiao Li
Affiliation:
School of Insurance and Economics, University of International Business and Economics, Beijing, China
Yifan Huang*
Affiliation:
School of Insurance and Economics, University of International Business and Economics, Beijing, China
Yang Cao
Affiliation:
School of Insurance and Economics, University of International Business and Economics, Beijing, China
*Corresponding author: Yifan Huang; Email: [email protected]

Abstract

Predicting healthcare costs for chronic diseases is challenging for actuaries because these costs depend not only on traditional risk factors but also on patients’ self-perception and treatment behaviors. To address this complexity and the unobserved heterogeneity in cost data, we propose a dual-structured learning statistical framework that integrates covariate clustering into a finite mixture of generalized linear models. The framework handles high-dimensional, sparse, and highly correlated covariates while capturing their effects on specific subgroups. It is realized by imposing a penalty on the prior similarities among covariates, and we further propose an expectation-maximization-alternating direction method of multipliers (EM-ADMM) algorithm, which combines EM with ADMM, to solve the resulting complex optimization problem. We validate the stability and effectiveness of the framework through simulation and empirical studies. The results show that the framework can leverage shared information among high-dimensional covariates to improve fitting and prediction accuracy, and that covariate clustering can also uncover the covariates’ network relationships, providing valuable insights into diabetic patients’ self-perception data.
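
The idea of alternating between subgroup assignment and penalized coefficient estimation can be illustrated with a deliberately simplified sketch. It is not the authors' EM-ADMM algorithm: the generalized linear model components are replaced by Gaussian linear regressions, and the penalty is replaced by a quadratic graph-Laplacian penalty that admits a closed-form update, so no ADMM step is needed. The function name `fit_penalized_mixture`, the similarity matrix `W`, and the tuning parameter `lam` are all hypothetical names introduced here for illustration.

```python
import numpy as np

def fit_penalized_mixture(X, y, W, n_components=2, lam=1.0, n_iter=100, seed=0):
    """EM for a Gaussian mixture of linear regressions (illustrative only).

    Assumption: W is a symmetric, nonnegative prior-similarity matrix among
    covariates. Each component's coefficients are shrunk toward each other
    along W through the quadratic penalty lam * beta' L beta, where L is the
    graph Laplacian of W. This stands in for the paper's penalty and ADMM step.
    """
    rng = np.random.default_rng(seed)
    n, p = X.shape
    L = np.diag(W.sum(axis=1)) - W                 # graph Laplacian of W
    resp = rng.dirichlet(np.ones(n_components), size=n)  # random soft labels
    beta = np.zeros((n_components, p))
    sigma2 = np.ones(n_components)
    pi = np.full(n_components, 1.0 / n_components)

    for _ in range(n_iter):
        # M-step: weighted, Laplacian-penalized least squares per component.
        for k in range(n_components):
            r = resp[:, k]
            A = X.T @ (r[:, None] * X) + lam * L + 1e-8 * np.eye(p)
            beta[k] = np.linalg.solve(A, X.T @ (r * y))
            resid = y - X @ beta[k]
            sigma2[k] = max((r @ resid**2) / max(r.sum(), 1e-12), 1e-8)
            pi[k] = max(r.mean(), 1e-12)

        # E-step: posterior subgroup probabilities, computed in log space.
        resid = y[:, None] - X @ beta.T            # shape (n, n_components)
        log_dens = -0.5 * (np.log(2 * np.pi * sigma2) + resid**2 / sigma2)
        log_dens += np.log(pi)
        log_dens -= log_dens.max(axis=1, keepdims=True)
        resp = np.exp(log_dens)
        resp /= resp.sum(axis=1, keepdims=True)

    return beta, sigma2, pi, resp
```

With a large `lam`, covariates linked in `W` receive nearly equal coefficients within each subgroup, which is a crude analogue of covariate clustering; a fusion-type penalty solved by ADMM, as the abstract describes, would instead be able to merge coefficients exactly.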

Type
Research Article
Copyright
© The Author(s), 2025. Published by Cambridge University Press on behalf of The International Actuarial Association

Supplementary material

Li et al. supplementary material 1 (File, 461 Bytes)
Li et al. supplementary material 2 (File, 1.7 MB)
Li et al. supplementary material 3 (File, 10.8 KB)