Introduction
Genital tract infections (GTIs) are infectious diseases that often go undetected as epidemics and constitute a huge public health concern [Reference Choe, Lee and Yang1]. They are mostly ignored, misdiagnosed, or unreported, leading to incorrect treatment and undetected transmission [Reference Flores-Mireles, Walker and Caparon2–Reference Tomas, Getman and Donskey5]. Among pregnant women, GTIs can cause spontaneous abortions and, where the foetus survives, the risk of congenital diseases adversely affects the quality of life [Reference Liu, Zeng and Yang4].
The WHO estimated that globally, more than one million sexually transmitted infections (STIs) affecting the genitals occur daily, the majority of which are asymptomatic. In 2016, more than 490 million people worldwide had genital herpes [6]. In sub-Saharan Africa (SSA), approximately 6.9% of self-reported STIs exist among young women aged 15–24 years, of which Ghanaian women accounted for 0.3% [Reference Dadzie, Agbaglo and Okyere7]. For sexually active men, the average prevalence of self-reported STIs in SSA was found to be 3.8%, with Ghana accounting for 5.7%. The highest prevalence was found amongst sexually active men aged 15–24 years [Reference Seidu, Ahinkorah and Dadzie8].
It is important to enhance the capacity of frontline healthcare providers to detect and manage GTIs effectively. This underscores the need to strengthen capacity beginning at the community level to ensure that most infections that otherwise go undetected are managed to curb transmission. This is because genital infections are treatable, provided they are diagnosed early [Reference Ratnaprabha, Thimmaiah and Johnson9, Reference Surya, Shivasakthimani and Muthathal10]. In low-resource nations, where there are numerous barriers to accessing medical treatment, infections spread quickly and widely, compounding, amongst other things, the adverse outcomes of reduced fecundability and sterility associated with genital infections in men and women [Reference Gamberini, Juliana and de Brouwer11, Reference Moragianni, Dryllis and Andromidas12]. At the community level, resource constraints in terms of diagnostic capacity and attendant costs justify the need for a robust model to predict GTIs. To accurately estimate the individual risk of GTIs at the community level, this study uses a machine learning (ML) technique, i.e., the least absolute shrinkage and selection operator (LASSO) penalized cross-validation regression approach [Reference Thivalapill, Jasumback and Perry17], to generate a prediction model. The model can be used to identify the most significant predictors and offer insights into contextual factors that contribute to the occurrence of self-reported GTIs (sGTIs). We therefore apply the LASSO regression model to find characteristics that predict sGTIs, with the aim of increasing early detection and intervention.
A merit of ML compared with traditional statistical methods for predictive modelling is that the nonlinear nature of many real-world phenomena is better captured by ML techniques such as deep learning, offering superior predictive power [Reference Tang, Kurths and Lin13, Reference Kavzoglu and Teke14]. In addition, LASSO can simultaneously perform variable selection and regularization (penalization or shrinkage) to constrain or shrink the regression coefficients [Reference Gunn, Rezvan and Fernández15], thereby simplifying models and improving predictive performance.
LASSO regression has emerged as a powerful tool in predicting sexually transmitted infections (STIs), as evidenced by studies conducted in both the United States and Africa. In the United States, Comulada et al. [Reference Comulada, Rotheram-Borus and Arnold16] utilized LASSO regression to identify predictors of STIs, leveraging its capacity for variable selection and regularization. Similarly, in Africa, Thivalapill et al. [Reference Thivalapill, Jasumback and Perry17] developed a predictive tool for STIs in Eswatini, employing LASSO regression as a central component of their methodology. Through the application of LASSO, they tackled issues of multicollinearity and overfitting while discerning significant predictors of STIs tailored to the specific demographic and environmental characteristics of Eswatini. These studies underscore the versatility and relevance of LASSO regression in public health research, particularly in identifying critical factors associated with STI transmission. By leveraging the predictive capabilities of LASSO, researchers can enhance their understanding of STI dynamics and inform targeted interventions and policies aimed at mitigating the burden of STIs globally.
Methods
Description and study design
We utilized five rounds of data from the Ghana demographic health survey (GDHS) conducted from 1993 to 2014. GDHS was a nationally stratified survey conducted across the country employing a multi-stage cluster sampling design. The surveys were supported by the United Nations Population Fund (UNFPA), the United States Agency for International Development (USAID), the World Bank, and other development partners. These demographic health surveys are aimed at providing information on fertility, family planning, infant and child mortality, maternal and child health, and nutrition. The purpose of GDHS is to inform policy decisions, planning, monitoring, and evaluating programmes related to health in general and reproductive health across the country [18].
GDHS employs a cross-sectional study design to a nationally representative sample, using two-stage sampling criteria. The first stage involved selecting clusters consisting of enumeration areas (EAs) independently within the then ten administrative regions in Ghana. This was done considering the rural-urban differential characteristics. The second stage involved systematic sampling from a list of households in all the selected EAs. The number of households enlisted in each EA makes up the EA size. A household was selected, and all men and women aged 15–49 who met the inclusion criteria were enumerated for the study. Details of the Demographic Health Survey (DHS) sampling design can be found elsewhere [Reference Fisher and Way19].
Study participants
This current study merged data on men and women in their reproductive age (i.e. those aged 15–49 years) from 2003 to 2014.
Outcome measures
The main outcome was self-reported genital tract infections (sGTIs). The GDHS assessed sGTIs by asking participants who had ever engaged in sexual intercourse whether they had experienced an STI or symptoms of an STI (a foul-smelling, abnormal discharge from the vagina or penis, or a genital sore or ulcer) within the 12 months preceding the survey. Item response theory (IRT) [Reference Baker20] was used to compute the outcome variable. IRT is a statistical framework used to understand how individuals’ responses to test items relate to their underlying latent traits, such as abilities or attitudes. At its core, IRT views each test item as a statistical model with its own parameters, including difficulty, discrimination, and guessing. These parameters describe the item’s characteristics and how it interacts with individuals’ trait levels. By modelling the relationship between item responses and latent traits, IRT enables more precise measurement of individuals’ abilities (here, an underlying trait related to the items to which respondents give yes or no responses), allowing for fairer comparisons across different tests and populations [Reference Yang and Kao21]. Three items were used to generate the sGTI composite variable (coded as 0 and 1): abnormal genital discharge, genital sore or ulcer, and any sexually transmitted disease. Each item was coded as 0 “No” and 1 “Yes”. The one-parameter logistic model for the probability of sGTI among the participants based on the three items was defined as $ p\left({X}_{ni}\right)=\frac{\exp \left({\theta}_n-{b}_i\right)}{\left[1+\exp \left({\theta}_n-{b}_i\right)\right]} $, where $ p\left({X}_{ni}\right) $ is the probability of individual n reporting sGTI on item i, $ {\theta}_n $ represents the ability of individual n to report GTIs at the community level, and $ {b}_i $ represents the difficulty of reporting GTIs for item i.
After fitting the IRT model, predicted probabilities were generated. A predicted value ≥0.5 was classified as having sGTI (coded as 1) and any other value as not having sGTI (coded as 0). Supplementary Figure 1 presents the item characteristic curves for the individual sGTI items, which are simultaneous to each other.
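The one-parameter logistic formula and the 0.5 cutoff described above can be illustrated with a minimal Python sketch. The item difficulties and the ability value are hypothetical numbers chosen only for illustration; they are not estimates from the GDHS data, and the function names are assumptions introduced here rather than the authors' implementation (the study itself used Stata).

```python
import math


def rasch_probability(theta: float, b: float) -> float:
    """One-parameter logistic model: exp(theta - b) / (1 + exp(theta - b))."""
    return 1.0 / (1.0 + math.exp(-(theta - b)))


def classify_sgti(predicted: float, cutoff: float = 0.5) -> int:
    """Code a predicted probability as 1 (sGTI) if it is >= cutoff, else 0."""
    return 1 if predicted >= cutoff else 0


# Hypothetical item difficulties (b_i) for the three items; not study estimates.
item_difficulties = {
    "abnormal_discharge": 1.0,
    "genital_sore_or_ulcer": 1.5,
    "any_sti": 2.0,
}

theta = 1.5  # hypothetical ability (latent trait) of one respondent

probs = {item: rasch_probability(theta, b) for item, b in item_difficulties.items()}
codes = {item: classify_sgti(p) for item, p in probs.items()}

for item in item_difficulties:
    print(f"{item}: p = {probs[item]:.3f}, coded = {codes[item]}")
```

When the ability equals the item difficulty the probability is exactly 0.5, which is why the cutoff sits at the point where a "Yes" response becomes at least as likely as a "No".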
Data analysis
Two approaches to data analysis were employed: descriptive and inferential. For descriptive analysis, categorical independent variables were described using frequencies and weighted percentages, while continuous variables were summarized using weighted means ± standard deviation and median (interquartile range). Weighted analysis was adopted because the study design of the DHS allows for adjusting for the sampling weight, sampling unit, and strata. A forest plot was used to present the prevalence of sGTIs by GDHS year of study (i.e. 2003, 2008, and 2014), using the DerSimonian–Laird random-effect meta-analysis [Reference DerSimonian and Laird22] to assess differences in sGTIs between the years. This was employed to assess heterogeneity in the prevalence of sGTIs across the demographic health survey years. The DerSimonian and Laird model is a widely recognized method in meta-analysis that incorporates random effects to address variability among studies. This model is particularly useful when dealing with heterogeneous studies where differences in results are anticipated [Reference DerSimonian and Laird22, Reference Bender, Friede and Koch23]. Unlike fixed-effect models that assume a single common effect size, the DerSimonian and Laird model accounts for varying effect sizes across studies by allowing for random effects. This flexibility helps in producing a more accurate and generalizable summary of the effect [Reference DerSimonian and Laird22]. This method employs inverse-variance weighting and considers both within-country and between-country variability. By accounting for the inherent variability and heterogeneity in our data, this model offers a more thorough and reliable evaluation of temporal sGTI differences. The test of non-linear simultaneous equality of proportion was adopted to assess the differences in sGTI by socio-demographic categories using the Rao–Scott Wald χ2 test.
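As a rough illustration of the inverse-variance weighting and between-survey variance described above, the following Python sketch implements the standard DerSimonian–Laird estimator. The three prevalence estimates and standard errors are hypothetical placeholders, not the GDHS figures, and the function name is an assumption introduced for this example.

```python
import math


def dersimonian_laird(estimates, std_errors):
    """Pool estimates with DerSimonian-Laird random effects.

    Returns (pooled estimate, its standard error, tau^2, Cochran's Q).
    """
    k = len(estimates)
    # Fixed-effect (inverse-variance) weights and pooled estimate.
    w = [1.0 / se ** 2 for se in std_errors]
    sum_w = sum(w)
    fixed = sum(wi * yi for wi, yi in zip(w, estimates)) / sum_w
    # Cochran's Q measures heterogeneity around the fixed-effect estimate.
    q = sum(wi * (yi - fixed) ** 2 for wi, yi in zip(w, estimates))
    c = sum_w - sum(wi ** 2 for wi in w) / sum_w
    # Method-of-moments between-study variance, truncated at zero.
    tau2 = max(0.0, (q - (k - 1)) / c) if c > 0 else 0.0
    # Random-effects weights add tau^2 to each within-study variance.
    w_star = [1.0 / (se ** 2 + tau2) for se in std_errors]
    pooled = sum(wi * yi for wi, yi in zip(w_star, estimates)) / sum(w_star)
    se_pooled = math.sqrt(1.0 / sum(w_star))
    return pooled, se_pooled, tau2, q


# Hypothetical survey-year prevalences (proportions) and standard errors.
prevalence = [0.05, 0.10, 0.18]
std_error = [0.003, 0.004, 0.006]

pooled, se_pooled, tau2, q = dersimonian_laird(prevalence, std_error)
lower, upper = pooled - 1.96 * se_pooled, pooled + 1.96 * se_pooled
print(f"pooled = {pooled:.3f} (95% CI {lower:.3f} to {upper:.3f}), tau^2 = {tau2:.5f}")
```

When the survey-year estimates differ by much more than their standard errors, tau^2 dominates the weights, so the pooled value approaches a simple average and its confidence interval becomes wide.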
For inferential analysis, we adopted LASSO penalized regression to predict sGTI and select its best predictors. The DHS sampling weight and clusters were controlled for during estimation. LASSO is an extension of ordinary least squares (OLS) regression that adds a penalty to the OLS residual sum of squares. Shrinkage occurs when values are pulled towards a central point; in LASSO, the coefficient estimates are shrunk towards zero. As a result, it is well suited to models with high levels of multicollinearity, where predictors are highly correlated with one another [Reference Chintalapudi, Angeloni and Battineni24]. LASSO selects a subset of predictors by shrinking the coefficients of the least dominant variables to zero, thereby excluding them from the model.
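The way a LASSO penalty shrinks coefficients and sets small ones exactly to zero can be sketched with the soft-thresholding operator. This is a simplified illustration that assumes a single standardized (or orthonormal) predictor setting, in which the LASSO solution is the soft-thresholded OLS coefficient; it is not the penalized logistic estimation used in the study, and the coefficient values, predictor names, and penalty are hypothetical.

```python
def soft_threshold(z: float, lam: float) -> float:
    """Shrink z towards zero by lam; values within [-lam, lam] become 0."""
    if z > lam:
        return z - lam
    if z < -lam:
        return z + lam
    return 0.0


# Hypothetical unpenalized coefficients for three illustrative predictors.
ols_coefficients = {"predictor_a": 0.80, "predictor_b": -0.30, "predictor_c": 0.05}
lam = 0.10  # hypothetical penalty (lambda)

lasso_coefficients = {
    name: soft_threshold(beta, lam) for name, beta in ols_coefficients.items()
}
selected = [name for name, beta in lasso_coefficients.items() if beta != 0.0]

for name, beta in lasso_coefficients.items():
    print(f"{name}: {ols_coefficients[name]:+.2f} -> {beta:+.2f}")
print("retained predictors:", selected)
```

A larger lambda widens the band of coefficients that are set to zero, which is why the amount of shrinkage has to be tuned, for example by cross-validation.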
The tenfold cross-validation was adopted to determine the amount of coefficient shrinkage. The cross-validation process divides the available data into multiple folds, using one of these folds as a validation set, and training the model on the remaining folds. This process is repeated multiple times and the results from each validation step are averaged to produce a more robust effect size estimate of the model [Reference Sharma25]. The performance of the model was assessed using the cross-validation discriminant analysis considering the area under the curve (cvAUC). The value of the AUC indicates the ability of the model to differentiate between individuals with sGTI and otherwise [Reference Steyerberg, Vickers and Cook26].
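The fold-wise averaging and the AUC can be illustrated with a short, dependency-free Python sketch. It assumes the predicted scores are already out-of-fold predictions (each produced by a model trained on the other folds) and only shows how folds are formed and how per-fold AUCs are averaged; the labels, scores, and function names are hypothetical.

```python
def auc(labels, scores):
    """AUC as the share of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as half a win.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative case")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def kfold_indices(n, k=10):
    """Split indices 0..n-1 into k interleaved, non-overlapping folds."""
    return [list(range(i, n, k)) for i in range(k)]


def cross_validated_auc(labels, scores, k=10):
    """Average the AUC computed separately on each validation fold."""
    fold_aucs = []
    for fold in kfold_indices(len(labels), k):
        fold_aucs.append(auc([labels[i] for i in fold], [scores[i] for i in fold]))
    return sum(fold_aucs) / len(fold_aucs)


# Hypothetical outcomes (1 = sGTI) and out-of-fold predicted probabilities.
labels = [0, 1, 0, 1, 1, 0, 1, 0]
scores = [0.10, 0.90, 0.40, 0.80, 0.35, 0.20, 0.80, 0.30]

print("overall AUC:", auc(labels, scores))
print("2-fold cvAUC:", cross_validated_auc(labels, scores, k=2))
```

The pairwise definition used in `auc` is the same ranking interpretation given for the AUC: the probability that a randomly chosen person with the outcome receives a higher predicted score than a randomly chosen person without it.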
The bootstrapped bias-corrected 95% confidence intervals [Reference Grün and Miljkovic27] for the AUC were also generated as a sensitivity analysis. The calibration belt plot was additionally adopted to assess the agreement between the model-predicted probabilities and the observed rates of sGTI. Calibration belt plots examine the deviation between expected and observed probabilities (miscalibration) at given confidence levels [Reference Fenlon, O’Grady and Doherty28].
Model building and checking
Factors for model building were considered following an a priori review of the literature, which identified 11 potential factors associated with STIs. These factors included: wealth quintile [Reference Anguzu, Flynn and Musaazi29], number of household members [Reference Curtis, Field and Clifton30], region [Reference Anguzu, Flynn and Musaazi29], place of residence [Reference Anguzu, Flynn and Musaazi29, Reference Seidu, Agbaglo and Dadzie31], sanitation [Reference Ademas, Adane and Sisay32], sex [Reference Lim, Wong and Cook33], age group [Reference Birhane, Simegn and Bayih34, Reference Dadzie, Agbaglo and Okyere35], educational level [Reference Birhane, Simegn and Bayih34], sexual initiation [Reference Seidu, Agbaglo and Dadzie31, Reference Dadzie, Agbaglo and Okyere35, Reference Shrestha, Karki and Copenhaver36], currently working [Reference Dadzie, Agbaglo and Okyere35], and staying with partner [Reference Birhane, Simegn and Bayih34]. These predictors are depicted in the conceptual framework in Supplementary Figure 2. Three models were estimated, as presented in Table 1.
GDHS = Ghana demographic and health survey
The best-fitted model was selected following the assessment of the AUC. Additionally, the Akaike information criterion (AIC), given by $ \mathrm{AIC}=-2\ln (L)+2p $, and the Bayesian information criterion (BIC), given by $ \mathrm{BIC}=-2\ln (L)+p\ln (n) $, were adopted to assess the best model, where L is the maximized likelihood, p is the number of estimated parameters, and n is the sample size. The model with the smallest AIC and BIC was considered the best fit. We implemented a sensitivity analysis to examine the potential influence of GDHS stratification (enumeration areas) on both the discriminant and calibration belt plot analyses. This involved bootstrap resampling with 1000 replicates for the optimal model. The probability of an individual self-reporting sGTI in Ghana among men and women aged 15–49 years equals the inverse of a logistic regression equation (model), given as
$ p=\frac{1}{1+\exp \left(-\left({\beta}_0+{\sum}_j{\beta}_j{X}_j\right)\right)} $
where β denotes the coefficient estimates for all the parameters from the LASSO penalized regression analysis and $ {X}_j $ denotes the predictors. All analyses were conducted using Stata 17 (StataCorp. 2017, Stata Statistical Software, College Station, TX, StataCorp LLC.).
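The model-selection criteria and the conversion of a linear predictor into a probability can be sketched in Python as follows. The sketch uses the standard textbook definitions AIC = -2 ln(L) + 2p and BIC = -2 ln(L) + p ln(n) together with the inverse-logit transform; the log-likelihoods, sample size, intercept, coefficients, and predictor names are hypothetical and are not the fitted values from this study.

```python
import math


def aic(log_likelihood: float, p: int) -> float:
    """Akaike information criterion: -2 ln(L) + 2p."""
    return -2.0 * log_likelihood + 2.0 * p


def bic(log_likelihood: float, p: int, n: int) -> float:
    """Bayesian information criterion (standard form): -2 ln(L) + p ln(n)."""
    return -2.0 * log_likelihood + p * math.log(n)


def inverse_logit(linear_predictor: float) -> float:
    """Numerically stable 1 / (1 + exp(-x))."""
    if linear_predictor >= 0:
        return 1.0 / (1.0 + math.exp(-linear_predictor))
    e = math.exp(linear_predictor)
    return e / (1.0 + e)


def predicted_probability(intercept, coefficients, values):
    """Probability from an intercept plus a sum of coefficient * value terms."""
    lp = intercept + sum(coefficients[name] * values[name] for name in coefficients)
    return inverse_logit(lp)


# Hypothetical candidate models: (log-likelihood, number of parameters).
candidates = {"model_a": (-1050.0, 11), "model_b": (-1020.0, 13)}
n_obs = 5000  # hypothetical sample size
best = min(candidates, key=lambda m: aic(*candidates[m]))
print("smallest AIC:", best)
print("BIC of best:", round(bic(*candidates[best], n_obs), 2))

# Hypothetical penalized coefficients and one respondent's indicator values.
coefs = {"male": 0.5, "urban": 0.3}
person = {"male": 1, "urban": 0}
print("predicted risk:", round(predicted_probability(-2.0, coefs, person), 4))
```

Because BIC multiplies the parameter count by ln(n), it penalizes extra predictors more heavily than AIC whenever n exceeds about 7, so the two criteria can disagree on large samples.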
Ethical requirements
Permission to use the secondary data was requested from the DHS Monitoring and Evaluation to Assess and Use Results of Demographic and Health Surveys (MEASURE DHS) Department. Access to the data can be requested from DHS at https://dhsprogram.com/data/dataset_admin/login_main.cfm
Results
The analysis involved 32,973 men and women aged 15–49 years with a mean age ± standard deviation of 29.1 ± 9.7 years. The majority were females (60.7%), with approximately equal numbers of rural and urban residents. Most of the participants had secondary level education (58.1%) and the majority were currently working (Table 2).
Generally, the prevalence of sGTIs within the period (2003–2014) was 11.2% (95% CI = 4.5–17.8) and it ranged from 5.4% (95% CI = 4.8–5.86) in 2003 to 17.5% (95% CI = 16.4–18.7) in 2014 (Figure 1). The proportions of sGTIs differed significantly across the categories of all predictors (p-value<0.05) (Supplementary Table 1).
Model building
In Model 1, the mean AUC from the tenfold cross-validation discriminant analysis was approximately 70.79% (95% CI = 69.8–71.6). Upon adjusting for an interaction effect in Model 2, the AUC increased slightly to 70.81% (95% CI = 69.81–71.59). Further adjustment for the cohort effect in Model 3 significantly increased the AUC to 73.50% (95% CI = 72.50–74.26). This means that there is approximately a 74% probability that Model 3 will rank a randomly chosen person with sGTIs at the community level in Ghana (aged 15–49 years) higher than a randomly chosen person without sGTIs. This suggests that Model 3 demonstrates the highest ability to discriminate between individuals with and without sGTIs at the community level in Ghana. This conclusion is supported by the fact that Model 3 has the smallest AIC (21223.99) and BIC (21240.79) values compared to Models 1 and 2 (Table 3 and Figure 2).
cv = cross validation; AUC = area under the curve; CI = confidence interval; AIC = Akaike information criterion; BIC = Bayesian information criterion. 95% CI estimates are bootstrap bias-corrected from the LASSO tenfold cross-validation.
The AUC result from the sensitivity analysis was statistically not different from the main model result. The bootstrap resampling adjusting for clustering supported this (Supplementary Figure 1).
Calibration
The calibration belt plot analysis showed that the predicted probabilities of sGTIs from the LASSO penalized regression model were not statistically different from the observed sGTI rates across all the probabilities (p-value>0.05), indicating that all the models performed well, with no evidence of miscalibration under the LASSO tenfold cross-validation (Figure 2).
The best predictive model
The best model penalty prediction from LASSO regression retained all variables as predictors. By default, LASSO regression fits 76 models using different values of lambda. The best model (model 76) had the smallest cross-validation mean prediction error with a mean deviance of 0.6524. The overall mean lambda was approximately 0.01 (Table 4).
All variables were selected by the LASSO prediction model. Penalized regression coefficients were derived after a penalty was applied which reduces overfitting of the data during model development. Using the regression coefficients from the predictors. X indicates category used for reference.
The best model equation is presented in the Supplementary Material.
The probability of an individual self-reporting sGTI in Ghana among men and women aged 15–49 years ranged from 0.51% to 53.33%. Figure 3 illustrates that higher probabilities are strongly associated with the occurrence of sGTIs, suggesting that as the predicted probability increases, the likelihood of experiencing sGTIs also rises. This demonstrates that the model effectively captures the relationship between higher probability scores and the presence of sGTIs.
Discussion
This study generated a prediction model for sGTIs at the community level in Ghana. We derived this model using a machine learning technique involving objective socio-demographic and behavioral/environmental indicators. The model performed well in predicting an individual at the community level with sGTI. The AUC from the discriminant analysis was over 70%, indicating an acceptable ability of the prediction model to discriminate true sGTIs at the community level. The calibration belt also showed good overall performance with no miscalibration, indicating that the predicted values were not statistically different from the observed values. We believe our analysis offers a promising diagnostic tool for screening individual risks of sGTIs in the community.
The overall prevalence of sGTIs during the study period was 11.2%, and this prevalence ranged from 5.4% in 2003 to 17.5% in 2014, showing fluctuations over time. Similarly in sub-Saharan Africa, Thivalapill et al., identified a prevalence of 10.12% among adolescents and young adults in Eswatini [Reference Thivalapill, Jasumback and Perry17] and approximately 13.5% in Liberia [Reference Seidu, Ahinkorah and Dadzie37]. This variation might be due to differences in awareness or knowledge about sexually transmitted diseases and differences in measuring the outcome (self-reporting versus diagnosis or screening). Differences in access to health care services, socio-economic/cultural systems that influence health-seeking behavior, and demographic characteristics as well as in sample size, could also account for the variation observed [Reference Birhane, Simegn and Bayih34].
The high prevalence of sGTIs and the increasing rate at the community level indicate an urgent need for early screening and detection. Screening high-risk individuals at the community level using clinical history will be economically viable for a developing country like Ghana. Self-reporting, if encouraged, can result in more testing and diagnosis. Diagnosing cases helps to prevent infected individuals from remaining a reservoir of infection within the community. This study has identified 12 key predictors of sGTIs: wealth quintile, sex, interaction term (sex and wealth), number of household members, region, place of residence, sanitation, educational level, sexual initiation, currently working, staying with partner, and cohort effect. Thivalapill et al. also incorporated 11 potential predictors in their predictive model for STIs among persons living with HIV [Reference Thivalapill, Jasumback and Perry17]. Notably, our study also considered similar predictors, including age group and sex, which were among the six predictors included by Thivalapill et al. The parallels between our findings and those of Thivalapill et al. show that age group and sex are consistent predictors across different populations and contexts, enhancing the external validity of our results. The additional variables integrated into the final model have been independently identified as associated with STIs by scholars in different research settings, underscoring their robustness and relevance in understanding and addressing STI transmission dynamics.
This prediction tool is a remarkable instrument for screening and detecting sGTIs at the community level in Ghana to promote early management. The model AUC was highly acceptable indicating over 70% ability to discriminate true sGTIs at the community level. The resulting individual risk score derived from the model may allow healthcare providers and other stakeholders to identify individuals with the highest predicted risk for early intervention. A self-reporting model can be applied in a community survey by trained healthcare workers at the community level to identify and refer cases requiring medical attention. It can be developed into an algorithm that public health nurses and community health workers can apply during home visits to improve linkage to diagnosis and treatment, and as a complement to current approaches in managing sexually transmitted infections and genital tract infections, even at the lowest level of the health care system. Secondary prevention of GTIs through early diagnosis and treatment is both a health and a development issue. The predictive model has the potential to contribute towards Sustainable Development Goal 3, target 3.3 which seeks to combat communicable diseases and end epidemics such as HIV/AIDS (which is facilitated by GTIs) by 2030.
Limitations and strengths of the study
The key limitation is the cross-sectional nature of the study. The design does not allow for establishing causality. In this regard, the findings of this study should be interpreted with caution. Even though the AUC and calibration bootstrapping were acceptable, there is a need for external validation of the tool using current, similar data in Ghana, given that the latest version of the GDHS at the time this study was conducted was from 2014. In addition, the outcome considered was self-reported, which may not reflect true infection status. This is because self-reported morbidity alone cannot serve as an indicator to measure the burden of any disease at the community level [Reference Balamurugan and Bendigeri38]. The self-reported measurements were not validated against records or clinical examination. Participants may or may not report the condition, leading to social desirability bias and under- or over-estimation. Recall bias may also occur because of the period specified for recall.
The study has some merits in that it provides useful information that can be explored for validation in field situations, and this is the subject of our next paper. In a follow-up paper, we will apply the predictive model to the next GDHS data to test the model’s reliability. Again, we hope to conduct a nationally stratified survey on genital tract infection prevalence in Ghana by adopting clinical examination to test the model fidelity in field situations. If validated, it presents a useful opportunity to improve the diagnosis and treatment of GTIs at the community level. Healthcare facilities at the community level are not equipped with point-of-care diagnostics for field use based on the level of competence at this level of healthcare. Where point-of-care diagnostic is available, it is often used in independent surveys through donor funding. The predictive model can enhance linkage to higher-level facilities where they can be managed.
Conclusion
Generally, the model performance was good and acceptable. In the absence of clinical measurement, this prediction model can be used to identify individuals aged 15–49 years with sGTIs at the community level in Ghana. Potential indicators including poverty, urban place of residence, male sex, and lower education were highly associated with sGTIs. Using these indicators, a risk score was derived for each individual at the community level, predicting the risk of sGTIs. Healthcare workers and other stakeholders can use this tool for screening and early detection at the community level to complement current approaches in managing sexually transmitted and genital tract infections, even at the lowest level of the healthcare system. We therefore propose field testing and external validation as the next step before adoption.
Supplementary material
The supplementary material for this article can be found at http://doi.org/10.1017/S0950268824001444.
Acknowledgments
The authors are grateful to DHS for providing the data for this research work.
Author contribution
JT and MYN conceptualized the study. JT sought approval for access to GDHS data. JT undertook the statistical analysis. MYN, JT, SMS, EAU, and AEY drafted the initial manuscript and provided intellectual content revisions. All authors read and approved the final review manuscript.
Data availability
The data that support the findings of this study are available from DHS but restrictions apply to the availability of these data, which were used under license for the current study, and so are not publicly available. Data are however available from the authors upon reasonable request and with permission of DHS from https://dhsprogram.com/data/dataset_admin/login_main.cfm
Funding statement
No funding support.
Competing interest
No competing interest.
Ethical considerations and consent to participate
The GDHS protocol was reviewed and approved by the Ghana Health Service Ethical Review Committee and the ICF Institutional Review Board (IRB). The ICF IRB ensures that the survey complies with the U.S. Department of Health and Human Services regulations for the protection of human subjects (45 CFR 46). Individual women’s written consent was obtained during the data collection process for all participants. Privacy and confidentiality were strictly adhered to.
Consent for publication
Not applicable.