Introduction
It has been hypothesized that many common diseases result from interactions among genetic, clinical, and environmental factors [Reference Hunter1–Reference Ramos and Olden4]. Identifying predictors of patients’ disease status may therefore necessitate modeling interactions among predictors [Reference Carlborg and Haley5, Reference Kotti, Bickeboller and Clerget-Darpoux6]. Clinical studies also often evaluate patient outcomes over time, requiring methods that can account for within-subject correlation. Generalized linear mixed models (GLMMs) and generalized estimating equations (GEEs) address the correlation between repeated measures collected on a patient within a linear model framework [Reference Stroup7]. GLMMs incorporate a subject-specific random effect to account for correlation between outcome measures on the same patient, while GEEs account for this correlation in the residual variance matrix. The random subject effect in the GLMM allows for subject-specific inference, and fixed effects are interpreted conditional on the subject-specific effect, while GEEs yield a marginal model focused on the average population response given the covariates. Although both GLMMs and GEEs can model interactions, interactions must be specified a priori, and sufficient sample size must be available to include both interactions and main effects [Reference Harrell8]. In exploratory studies, interactions are generally not hypothesized a priori, and failure to examine interactions may result in failure to identify predictors associated with the outcome. In such cases, variable selection techniques can be employed to identify the main effects and interactions most strongly associated with the outcome.
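To make the conditional-versus-marginal distinction concrete, the following is a minimal sketch in R contrasting the two approaches using the lme4 and geepack packages; the data frame dat, outcome y, covariates x1 and x2, and subject identifier id are hypothetical placeholders.

```r
## Hypothetical data frame `dat` with binary outcome y, covariates x1 and x2,
## and subject identifier id (repeated rows per subject).
library(lme4)     # GLMM with a random subject intercept
library(geepack)  # GEE with a working correlation structure

## Subject-specific (conditional) model: random intercept per subject
glmm_fit <- glmer(y ~ x1 + x2 + (1 | id), data = dat, family = binomial)

## Population-average (marginal) model: within-subject correlation handled
## through the working correlation structure of the residuals
gee_fit <- geeglm(y ~ x1 + x2, id = id, data = dat,
                  family = binomial, corstr = "exchangeable")
```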
Methods for selecting variables in regression models fall into two categories: traditional stepwise variable selection methods and sparse model solutions. Traditional variable selection approaches include forward, backward, and stepwise algorithms that select variables for inclusion in a model based on statistical significance or measures of model fit. However, such approaches yield unstable parameter estimates, particularly in data with correlated predictors or outcomes [Reference Storlie, Bondell, Reich and Zhang9, Reference Breiman10].
Sparse model solutions simultaneously evaluate all potential covariates and shrink regression coefficient estimates toward zero. Penalized regression techniques penalize coefficients the further their estimated value is from zero by including a penalty term in the likelihood. Lasso regression, a popular penalized regression approach, includes the sum of the absolute values of the regression coefficients in the likelihood (referred to as the L1 penalty), which shrinks many coefficients to zero, providing a means of variable selection [Reference Liu, Huang and Zhang11]. Schelldorfer et al. [Reference Schelldorfer, Bühlmann and van de Geer12] extended lasso to the GLMM setting, a method referred to as glmmLasso. Boosting, originally proposed to improve classification procedures by combining estimates based on iteratively reweighted residuals [Reference Freund and Schapire13, Reference Friedman14], can also be used for variable selection by applying a component-wise approach [Reference Bühlmann and Yu15]. Tutz and Groll [Reference Groll and Tutz16] extended this component-wise boosting approach to the GLMM setting, a method referred to as GMMBoost. The GEE approach offers an alternative for modeling correlated binary outcomes that estimates population-average responses using a marginal approach rather than modeling the within-subject covariance structure. Similar to GMMBoost, Wolfson [Reference Wolfson17] developed a boosted estimating equation approach, EEBoost, for variable selection based on minimizing the L1-constrained projected likelihood ratio. Brown et al. [Reference Brown, Miller and Wolfson18] further extended EEBoost to the GEE setting, a method referred to as GEEBoost. The glmmLasso, GMMBoost, and GEEBoost methods provide a means of variable selection to achieve a sparse model in data with a large number of predictors and repeated measures. Of note, all three methods yield biased estimates of the regression parameters [Reference Bühlmann and Yu15, Reference Tibshirani19, Reference Breiman20]. To address this bias, one could use a two-stage approach in which predictors are selected in the first stage using a sparse modeling approach, followed by refitting a GLMM (or GEE) using the predictors selected in the first stage. However, such an approach has not been evaluated using sparse model solutions in data with repeatedly measured binary outcomes.
The goals of this paper are to provide an overview of variable selection methods for data with repeatedly measured binary outcomes and to evaluate the performance of each approach for recovering interactions and main effects associated with an outcome, in order to provide guidance for choosing among these methods in application. We present a simulation evaluating the relative performance of stepwise selection, glmmLasso, GMMBoost, and GEEBoost for correct identification of nonzero regression parameters, model parsimony, and bias in the regression parameters. Additionally, we examine the ability of a two-stage approach to reduce bias in regression estimates from the three penalized and boosting methods. We apply each method to two clinical data sets and explore differences in the models returned by each method. The first data set examines a set of protein biomarkers for predicting treatment response in lupus nephritis. The second explores clinical and patient characteristics associated with risk of hypothermia over time in hip or knee arthroplasty patients. We conclude by summarizing the advantages and disadvantages of each variable selection method and provide recommendations about the most appropriate choice of method for variable selection in practice.
Variable Selection Algorithms for Repeatedly Measured Binary Outcomes
There are several methods proposed in the literature for identifying the best subset of predictors in a GLMM or GEE setting for a repeatedly measured binary response. In this section, we describe these methods in greater detail and discuss the advantages and disadvantages of each method.
Forward Variable Selection Algorithm
Stepwise algorithms iteratively evaluate predictors for inclusion in a regression model using forward, backward, or stepwise elimination. At each iteration, a variable is added to or removed from the model based on a model performance metric such as an F-statistic, AIC, or BIC. The selection process ends when addition or removal of a variable no longer improves model fit. In traditional GLMMs, interactions are not typically included unless the corresponding main effects are already in the model; thus, consideration must be given to when interactions are evaluated for entry in a stepwise approach. Neerchael et al. [Reference Neerchael, Morel and Huang21] propose a stepwise selection algorithm in the GLMM setting that selects variables based on F-statistics and associated p-values. Their algorithm also examines predictor interactions for inclusion in or removal from the model once main effects have entered the model. While traditional stepwise selection is commonly used in practice, there are well-documented issues with this approach [Reference Harrell8, Reference Austin and Tu22–Reference James and McCulloch27]. Specifically, variable subset selection is unstable in that small changes in the data can result in very different predictor subsets being selected [Reference Austin and Tu22, Reference Derksen and Keselman26, Reference James and McCulloch27], particularly when the number of candidate predictors is large.
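For illustration, a simplified forward selection loop for a random-intercept logistic GLMM is sketched below. It adds at each step the candidate main effect that most improves BIC; the candidate names are hypothetical, and the sketch uses a BIC criterion rather than the F-statistic rule of Neerchael et al. and omits the interaction-entry rules described above.

```r
library(lme4)

## Hypothetical candidate main effects and data frame `dat` with outcome y and subject id
candidates <- c("x1", "x2", "x3", "x4", "x5")
selected   <- character(0)
best_bic   <- BIC(glmer(y ~ 1 + (1 | id), data = dat, family = binomial))

repeat {
  remaining <- setdiff(candidates, selected)
  if (length(remaining) == 0) break
  ## BIC of the model obtained by adding each remaining candidate
  bics <- sapply(remaining, function(v) {
    f <- reformulate(c(selected, v, "(1 | id)"), response = "y")
    BIC(glmer(f, data = dat, family = binomial))
  })
  if (min(bics) >= best_bic) break          # stop: no candidate improves the fit
  selected <- c(selected, remaining[which.min(bics)])
  best_bic <- min(bics)
}
selected   # variables retained by the forward search
```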
Penalized Regression for GLMM
Penalized techniques introduce a penalty term in the likelihood that penalizes regression coefficients the further they are from zero. Penalized regression methods estimate regression parameters by minimizing a penalized objective function of the form
$M\left( \theta \right) = - l\left( {\theta |x} \right) + \lambda P\left( \theta \right)$, (1)
where $M\left( \theta \right)$ is a general objective function, $l\left( {\theta |x} \right)$ is the log-likelihood, $P\left( \theta \right)$ is a penalty function, and $\lambda$ is a regularization parameter determining how much weight the penalty term carries. Common choices of $P\left( \theta \right)$ in Eq. (1) include the L1 penalty, $P\left( \theta \right) = \sum\nolimits_{j = 1}^p {\left| {{\beta _j}} \right|}$, referred to as lasso regression, and the L2 penalty, $P\left( \theta \right) = \sum\nolimits_{j = 1}^p {\beta _j^2}$, referred to as ridge regression. Unlike the L2 penalty, applying the L1 penalty yields a solution in which some of the regression coefficients shrink to exactly zero, resulting in simultaneous model fitting and variable selection.
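As a minimal illustration of Eq. (1), the following base R function computes the L1-penalized objective for an ordinary logistic regression (no random effects); the design matrix X, response y, and penalty weight lambda are hypothetical inputs.

```r
## Minimal illustration of the penalized objective in Eq. (1) for an ordinary
## logistic regression with an L1 (lasso) penalty. Assumes X is a numeric
## design matrix whose first column is the intercept and y is a 0/1 vector.
penalized_objective <- function(beta, X, y, lambda) {
  eta     <- as.vector(X %*% beta)               # linear predictor
  loglik  <- sum(y * eta - log(1 + exp(eta)))    # Bernoulli log-likelihood
  penalty <- sum(abs(beta[-1]))                  # L1 penalty; intercept not penalized
  -loglik + lambda * penalty                     # M(theta) = -l(theta | x) + lambda * P(theta)
}
```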
Schelldorfer et al. [Reference Schelldorfer, Bühlmann and van de Geer12] extend lasso regression to the GLMM setting. The glmmLasso approach uses the log-likelihood function of a GLMM for $l\left( {\theta |x} \right)$ and applies the L1 penalty with regularization parameter λ to the fixed-effects vector $\beta$. Thus, $M\left( \theta \right)$ becomes the L1-penalized objective function, and the glmmLasso estimator is identified by minimizing it, $\hat \theta = {\rm{argmin}}_\theta M\left( \theta \right)$. Prior to fitting a model, the regularization parameter λ is selected by evaluating a range of possible λ values and choosing the value that yields the minimum AIC or BIC.
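A hedged sketch of this λ tuning step using the glmmLasso R package is shown below; the model formula, λ grid, and data frame dat are hypothetical placeholders, and the fitted object's bic component is assumed to be used for selection (the package also reports an AIC).

```r
library(glmmLasso)

lambdas <- seq(100, 5, by = -5)   # hypothetical grid, from strong to weak shrinkage
fits <- lapply(lambdas, function(lam) {
  try(glmmLasso(fix = y ~ x1 + x2 + x5 + x1:x2 + x2:x5,  # fixed effects and interactions
                rnd = list(id = ~1),                     # random subject intercept
                data = dat, lambda = lam,
                family = binomial(link = "logit")),
      silent = TRUE)
})
ok   <- !sapply(fits, inherits, what = "try-error")      # drop fits that failed to converge
bics <- sapply(fits[ok], function(f) f$bic)
best <- fits[ok][[which.min(bics)]]
best$coefficients                  # nonzero coefficients define the selected model
```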
Boosting for Variable Selection in the GLMM and GEE Settings
Boosting was originally proposed to improve the prediction performance of weak classifiers [Reference Freund and Schapire13], and general boosting algorithms generate a set of models by iteratively fitting models based on a selected performance metric of the model from the previous iteration. Component-wise boosting iteratively estimates regression parameters in a linear model by maximizing the score equation based on parameter estimates from the prior iteration and then updating only the jth regression parameter that yields the maximum score equation [Reference Bühlmann and Hothorn28, Reference Hofner, Mayr, Robinzonov and Schmid29]. This approach can be used for model selection by choosing the optimal model from among the models generated at each iteration based on a measure of model fit. Tutz and Groll [Reference Groll and Tutz16] propose use of a component-wise boosting algorithm, GMMBoost, for variable selection in the GLMM setting. GMMBoost estimates the Fisher matrix and the penalized score functions for the p predictors at the lth iteration based on the intercept, the jth regression coefficient (j = 1, 2, …, p), and random effects from the (l−1)th iteration. The predictor j that yields the smallest AIC or BIC is updated by adding the product of the jth Fisher matrix and jth score equation to the estimates of the intercept, jth regression parameter, and random effects from the (l−1)th iteration. All other parameter estimates are set to their values from the (l−1)th iteration.
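The component-wise idea is easiest to see in a stripped-down setting. The sketch below is not the GMMBoost algorithm itself (it omits the random effects, the Fisher-matrix weighting, and the AIC/BIC stopping rule); it simply conveys the core step of updating only the single best-performing coefficient at each iteration, assuming a centered design matrix X and binary response y.

```r
## Simplified component-wise boosting for an ordinary logistic model.
## At each iteration, only the coefficient with the largest score (gradient of
## the log-likelihood) is updated by a small step `nu`.
componentwise_boost <- function(X, y, iters = 200, nu = 0.1) {
  p    <- ncol(X)
  beta <- rep(0, p)
  path <- matrix(0, nrow = iters, ncol = p)
  for (l in seq_len(iters)) {
    mu    <- 1 / (1 + exp(-as.vector(X %*% beta)))  # fitted probabilities
    score <- as.vector(t(X) %*% (y - mu))           # score for each of the p coefficients
    j     <- which.max(abs(score))                  # best-performing component
    beta[j] <- beta[j] + nu * sign(score[j])        # update only the jth coefficient
    path[l, ] <- beta
  }
  path   # the "path" of models; a fit criterion (e.g. AIC) selects the final one
}
```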
The estimating equation approach has also recently been incorporated within a boosting framework for variable selection, a method referred to as EEBoost [Reference Wolfson17], and has been further extended to repeatedly measured binary outcomes [Reference Brown, Miller and Wolfson18]. Unlike the GLMM setting, there is no closed-form solution for the score equation in an estimating equation setting. Therefore, EEBoost and GEEBoost replace the score equation at each iteration with a set of estimating equations. Wolfson [Reference Wolfson17] showed that the proposed EEBoost algorithm, and by extension the GEEBoost algorithm, closely approximates the path generated by minimizing the L1-constrained projected artificial log-likelihood ratio, but without requiring a constrained optimization algorithm. The optimization for both EEBoost and GEEBoost is based on functional gradient descent. GEEBoost can be used for variable selection by choosing the model from the entire path of solutions that yields the smallest QIC, although alternative criteria could also be used.
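For the QIC selection step itself, a hedged sketch using geepack follows; it assumes geepack's QIC() function for geeglm fits, and the candidate formulas are hypothetical stand-ins for the models visited along the boosting path rather than output of GEEBoost.

```r
library(geepack)

## Hypothetical candidate models; in GEEBoost these would be the models visited
## along the boosting path rather than hand-specified formulas.
candidates <- list(y ~ x1, y ~ x1 + x2, y ~ x1 + x2 + x1:x2)
fits <- lapply(candidates, function(f)
  geeglm(f, id = id, data = dat, family = binomial, corstr = "exchangeable"))
qics <- sapply(fits, function(f) QIC(f)["QIC"])
fits[[which.min(qics)]]   # model along the path with the smallest QIC
```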
Two-Stage Approach for Bias Correction of Parameter Estimates
Penalized and boosted regression models are known to yield biased regression parameter estimates [Reference Bühlmann and Yu15, Reference Tibshirani19, Reference Breiman20]. We propose a two-stage approach to correct the bias in the parameter estimates from each method. Similar two-stage approaches using lasso to find interactions have been described, though their focus is on model stability, whereas our goal is bias reduction [Reference Wu, Devlin, Ringquist, Trucco and Roeder30]. In the first stage, a glmmLasso, GMMBoost, or GEEBoost model is fit to the data including all main effects and interactions. In the second stage, a traditional GLMM is fit using all coefficients with nonzero values in the first stage. Heredity conditions are generally used in linear models to maintain hierarchy between main effects and interactions; glmmLasso, GMMBoost, and GEEBoost do not impose hierarchical restrictions. Under strong hierarchy, all main effects corresponding to interaction terms are included in the model [Reference Hamada and Wu31–Reference Hao and Zhang34]. However, there are cases where predictors are associated with an outcome only through their interaction, for example, certain genetic epistasis models [Reference Hallgrmsdttir and Yuster35]. Thus, we considered two approaches for the two-stage GLMM: (1) impose the heredity constraint, fitting a GLMM with the nonzero terms from the penalized model plus all main effects that are zero in the penalized model but appear in at least one nonzero interaction, and (2) ignore the heredity constraint, fitting a GLMM with only the nonzero terms, omitting main effects if they are zero.
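A sketch of the second stage is given below; it assumes stage1_coefs is a named vector of fixed-effect estimates from the first-stage fit (interaction terms named in the form "x1:x2") and dat is the analysis data set with subject identifier id, both hypothetical. Both heredity variants are shown.

```r
library(lme4)

## `stage1_coefs`: named vector of stage-1 fixed-effect estimates (hypothetical).
nonzero <- setdiff(names(stage1_coefs)[stage1_coefs != 0], "(Intercept)")

## Variant (1), heredity imposed: add main effects appearing in any selected interaction
parents <- unique(unlist(strsplit(grep(":", nonzero, value = TRUE), ":")))
terms1  <- union(nonzero, parents)

## Variant (2), heredity ignored: use the nonzero terms as they are
terms2  <- nonzero

## Second-stage GLMM refit (variant 2 shown); estimates are no longer shrunken
f2 <- reformulate(c(terms2, "(1 | id)"), response = "y")
stage2_fit <- glmer(f2, data = dat, family = binomial)
summary(stage2_fit)
```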
Methods: Simulation Study
A simulation study was conducted to evaluate (1) the ability of each method to correctly recover predictors and interactions associated with a repeatedly measured binary outcome, (2) the reduction in bias in regression parameter estimates using the two-stage approach, (3) model parsimony, and (4) computational efficiency (average time to fit models). Simulation parameters, including variable distributions, covariate effect sizes, number of observations per subject, sample size, the model used to generate the response, and the model fit using each method, are defined in Table 1. Data included 14 predictors, X1–X14, yielding 91 possible two-way interactions, a random subject effect, and a repeatedly measured binary response. The correlation for time-varying covariates, Σ, was compound symmetric with ρ = 0.25. Potential predictors and interactions generated in the data included a mix of categorical and continuous variables as well as fixed-time and time-varying covariates to mimic what might be observed in a clinical data set. The binary response Y was generated from a Bernoulli distribution in which the probability depended on the variable X5, the interaction X1X2, and the interaction X2X5. Dependence of Y on the time-varying covariates X1 and X2 created a correlated response within subject. Effect sizes for the nonzero parameters in the true model were selected to represent weak, moderate, and strong effects, yielding odds ratios of approximately 1.2, 2.0, and 4.5, respectively, for a one-unit increase in X5, X1X2, or X2X5.
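As an illustration of this data-generating scheme, a simplified sketch follows. It reproduces only the structural features described above (a random subject intercept, compound-symmetric time-varying covariates, and a response driven by X5, X1X2, and X2X5), not the full set of 14 predictors or the exact distributions in Table 1; the random-effect standard deviation is an assumed placeholder.

```r
library(MASS)   # for mvrnorm

simulate_one <- function(n = 200, beta = 0.7, n_obs = 2, rho = 0.25, sd_b = 1) {
  Sigma <- matrix(rho, n_obs, n_obs); diag(Sigma) <- 1     # compound-symmetric correlation
  x1 <- mvrnorm(n, mu = rep(0, n_obs), Sigma = Sigma)      # time-varying covariate
  x2 <- mvrnorm(n, mu = rep(0, n_obs), Sigma = Sigma)      # time-varying covariate
  x5 <- rnorm(n)                                           # fixed-time covariate
  b  <- rnorm(n, sd = sd_b)                                # random subject effect
  dat <- data.frame(
    id   = rep(seq_len(n), each = n_obs),
    time = rep(seq_len(n_obs), times = n),
    x1   = as.vector(t(x1)),
    x2   = as.vector(t(x2)),
    x5   = rep(x5, each = n_obs)
  )
  ## logit of P(Y = 1) depends on X5, X1*X2, X2*X5, and the subject effect
  eta <- with(dat, beta * x5 + beta * x1 * x2 + beta * x2 * x5) + rep(b, each = n_obs)
  dat$y <- rbinom(nrow(dat), size = 1, prob = plogis(eta))
  dat
}
```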
Five hundred data sets were generated for each combination of sample size and effect size for data with two observations per subject. Similarly, 500 data sets with 10 observations per subject were generated for each sample size at an effect size of 0.2. Forward stepwise GLMM, glmmLasso, GMMBoost, and GEEBoost models were fit to each simulated data set in R v 3.2.5 (Team RDC) using the lme4, glmmLasso, GMMBoost, and threeboost packages [Reference Bates, Mchler, Bolker and Walker36–Reference Wolfson and Miller39]. The glmmLasso, GMMBoost, and GEEBoost models considered all main effects and interactions in the first stage. The forward stepwise approach [Reference Neerchael, Morel and Huang21] considered interactions for inclusion in or removal from the model once main effects had entered the model. Default settings were used to fit the GMMBoost and GEEBoost models. GMMBoost and GEEBoost yield a series of models representing the “path” by which variables enter the model, and the best model was selected based on AIC for GMMBoost and QIC for GEEBoost. GlmmLasso requires tuning of the shrinkage parameter λ and initial estimates of the fixed and random effects prior to model fitting; the shrinkage parameter and initial estimates were selected by choosing the λ that yielded the smallest AIC. Performance of each method was evaluated based on (1) the proportion of times the method correctly identified the parameters in the true underlying model and the false discovery rate (FDR) for null predictors, defined as the number of null predictors selected divided by the total number of predictors selected in the model, (2) the bias, $\hat \beta - \beta$, for the three predictors in the true model and the average bias for null predictors before and after using the two-stage approach, (3) model parsimony, defined as the number of nonzero parameters in the final model, and (4) the average computation time for model fitting. If one or more of the predictors and/or predictor interactions in the true model used to generate the response Y were identified, the two-stage approach was applied to evaluate how well it reduced the bias in the parameter estimates. All simulations were conducted in 64-bit R v 3.2.5 using a Quad Opteron 2.8 GHz processor with 96 GB RAM and 3 TB of HDD storage.
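The performance metrics in (1) and (2) reduce to simple counts and differences; a brief sketch with hypothetical inputs is given below.

```r
## Hypothetical inputs: `selected` and `true_terms` are character vectors of term
## names; `est` and `truth` are named vectors of estimated and true coefficients.
fdr <- function(selected, true_terms) {
  if (length(selected) == 0) return(0)
  sum(!selected %in% true_terms) / length(selected)   # null terms selected / all selected
}
bias <- function(est, truth) est - truth              # per-parameter bias, hat(beta) - beta

## Example: one true interaction missed, one null term selected
fdr(c("x5", "x1:x2", "x3"), c("x5", "x1:x2", "x2:x5"))   # = 1/3
```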
Results
Predictor and Interaction Identification
Fig. 1 shows the proportion of times each method identifies the true predictors associated with the response Y and the FDR for null predictors identified in data with two observations per subject for the three effect sizes and four sample sizes. Identification of the correct predictors and predictor interactions improves with increasing sample size and effect size for all methods, with the exception of the stepwise selection method. The proportion of times the true predictors are correctly identified also increases for all methods as the number of observations per subject increases (Supplementary Fig. S1).
The stepwise algorithm correctly identifies the main effect for X5 significantly more often than the penalized regression approaches at the smallest effect size (β = 0.2) for n ≥ 200. However, the stepwise approach identifies the interaction terms in fewer than 20% of models across all sample sizes, effect sizes, and numbers of observations per subject. When sample size or effect size is small, the glmmLasso approach recovers the correct predictors and interactions more frequently than GMMBoost or GEEBoost for n ≤ 300. However, for larger effect and sample sizes (β ≥ 0.7 and n ≥ 200), glmmLasso, GMMBoost, and GEEBoost exhibit similar ability to recover the true predictors and interactions. For n ≥ 200 at moderate and large effect sizes, glmmLasso, GMMBoost, and GEEBoost all recover the three predictors and predictor interactions in ≥85% of models. The proportion of times null predictors were selected was similar within each method across effect sizes. Stepwise regression and GMMBoost were less likely to select a null predictor than glmmLasso and GEEBoost. GlmmLasso was significantly more likely than all other methods to select a null predictor at n = 100 for β ≤ 0.7.
Among the three penalized approaches, GMMBoost tended to yield more parsimonious models than glmmLasso and GEEBoost for all sample and effect sizes (Table S1). GlmmLasso had the largest average number of correct predictors across simulations, though GEEBoost was similar at larger sample and effect sizes. When sample and effect sizes were large, GMMBoost often identified exactly the correct model (i.e. the selected model included only X5, X1X2, and X2X5). In contrast, glmmLasso and GEEBoost tended to include more predictors in the model, particularly for small sample sizes, which likely explains why these methods recovered the true predictors more frequently than GMMBoost. The FDR ranged from 65% to 81% for glmmLasso and from 68% to 88% for GEEBoost (Fig. 1) and was larger for 10 observations/subject (Supplementary Fig. S1). In contrast, the FDR for GMMBoost decreased with increasing sample and effect size, reaching FDR = 0 for large n and β. Stepwise selection had a smaller FDR than glmmLasso and GEEBoost because it included few predictors.
Bias and MSE of Regression Parameter Estimates
Fig. 2 shows box plots of the bias of the regression parameter estimates for X5, X1X2, and X2X5 and the average bias across all null predictors for each approach in simulations with two observations/subject. Bias for the true predictors was generally negative, indicating estimates smaller than the true β. Bias decreased with increasing sample size and increased with increasing effect size across methods in data with 2 or 10 observations/subject (Fig. 2; Supplementary Fig. S2 and Table S2). Stepwise selection was an exception, as its MSE and bias remained similar across sample sizes (Fig. 2 and Supplementary Table S2). Among the three penalized approaches, glmmLasso yielded the smallest average bias and MSE for the true predictors, followed by GEEBoost and GMMBoost (Fig. 2). The only exception for glmmLasso occurred at n = 100, which may result from over-fitting, as the average model size for glmmLasso at n = 100 was ~40 predictors. The stepwise approach had larger bias and MSE for X1X2 and X2X5 but similar bias to glmmLasso and GEEBoost for X5. The average bias for the 102 null predictors was small in a majority of the simulation scenarios (XNull, Fig. 2), though glmmLasso exhibited much larger MSE and average bias for null predictors at n = 100 (Supplementary Table S2). Results were similar for 10 observations/subject and effect size β = 0.2, though the observed bias was less variable (Supplementary Fig. S3, row 1).
Two-Stage Approach for Bias Correction
We also evaluated the ability of the two-stage approach to reduce bias in parameter estimates obtained from the penalized selection approaches. We considered two models: (1) a first model that imposed the heredity constraint and incorporated all nonzero terms from the penalized model plus all main effects that were zero in the penalized model but appeared in one or more nonzero interaction terms, and (2) a second model in which heredity was ignored, fitting a GLMM with all nonzero terms and omitting main effects if they were zero in the penalized models. Both approaches performed similarly in terms of bias reduction, and thus only results for the second model are shown. The change in bias for each simulation run from stage 1 to stage 2 and the bias in stage 2 for estimates from glmmLasso, GMMBoost, and GEEBoost in data with two observations/subject are presented in Fig. 3 and Supplementary Fig. S2, respectively.
The bias in stage 2 was generally smaller than in stage 1 for all effect and sample sizes, though, as in stage 1, estimates were still predominantly biased downward (Fig. 3 and Supplementary Fig. S2). Fig. 3 demonstrates that the two-stage approach generally reduced the bias in the parameter estimates, particularly at larger effect and sample sizes. The methods had difficulty identifying the true predictors for β = 0.2; thus, a majority of models had no change in bias in stage 2. In cases where the true predictors were selected for β = 0.2, the bias was smaller in stage 2 than in stage 1 for all three approaches for n ≥ 300. MSE was similar in the first- and second-stage models for glmmLasso and smaller in the second stage for GMMBoost and GEEBoost for β = 0.7 and 1.5 (Supplementary Table S1). GlmmLasso exhibited larger bias and MSE in the second-stage models at n = 100 for β = 0.2, likely due to the large number of spurious predictors in those models. Similar trends in bias were noted for GMMBoost and GEEBoost in data with 10 observations/subject, though glmmLasso did not show decreased bias for X5 (Supplementary Fig. S3).
Computation Time
Computation time was affected by both the number of repeated measures per subject and the sample size. For all methods, models fit to data with two measures/subject required less computation time than models fit to data with 10 measures. Of the three sparse methods, the GEEBoost algorithm had the shortest computation time, followed by glmmLasso and then GMMBoost. For data with two measures/subject, computation times for glmmLasso and GMMBoost were similar. However, when the number of repeated measures per subject increased to 10, GEEBoost had greatly improved computational efficiency relative to glmmLasso and GMMBoost. Average computation times for each method are presented in Supplementary Table S3.
Results: Applications in Real Data
We applied all four variable selection approaches in data from two clinical studies, treatment response in lupus nephritis and hypothermia in total joint arthroplasty patients, to examine differences between the methods in real data applications. The two-stage approach was applied for the glmmLasso, GMMBoost, and GEEBoost models.
Treatment Response in Lupus Nephritis
The goal of this study was to evaluate clinical and biological markers of treatment response over time in patients with lupus nephritis. Clinical and demographic information collected on 140 patients with biopsy-proven lupus nephritis (LN) included age, gender, presence of anti-double-stranded DNA antibodies (dsDNA), serum C3 complement levels (C3C), serum C4 complement levels (C4C), urine protein to creatinine ratio (UrPrCr), and estimated glomerular filtration rate (EGFR). Urine samples were collected at two times during the study for all subjects and were analyzed for 15 novel urinary biomarkers. The final data include 22 demographic, clinical, or biomarker variables, yielding 22 main effects and 190 two-way interactions. Patient response to treatment was determined using the criterion defined by Wofsy et al. [Reference Wofsy, Hillson and Diamond40] and was evaluated at 3 and 12 months after the initiation of treatment. For a full description of the study, see Ref. [Reference Wolf, Spainhour, Arthur, Janech, Petri and Oates41].
Predictors and interactions selected by each method for treatment response in LN are presented in Table 2. The variables identified using the stepwise algorithm differ from those identified by the lasso and boosting methods. The stepwise approach selected predominantly main effects and one interaction, whereas glmmLasso and GMMBoost selected predominantly interactions rather than main effects. The number of predictors and interactions identified by glmmLasso and GMMBoost was similar (6 vs. 5), and three variables were common to both methods. The regression estimates for the second-stage model are comparable between glmmLasso and GMMBoost (Table 2). GEEBoost yielded the sparsest model, including only one predictor interaction, although this interaction was also identified by glmmLasso. These results differ slightly from the simulations, in which GMMBoost tended to give the most parsimonious model.
***p < 0.001; **p < 0.01; *p < 0.05; ′p < 0.1.
Hypothermia Over Time in Total Joint Arthroplasty
This was an observational study examining patient, clinical, and procedural variables associated with the occurrence of hypothermia in patients undergoing total hip or total knee arthroplasty (THA and TKA). Participants’ temperatures were evaluated at eight timepoints: (1) leaving holding; (2) operating room arrival; (3) after anesthetic induction; (4) upper-body warmer initiation; (5) incision; (6) every 30 min after incision; (7) leaving the operating room; and (8) arrival to the PACU. Hypothermia was defined as a temperature < 36.0°C. The data included 102 participants with 13 demographic, clinical, or procedural variables and 66 possible two-way interactions. Full study details are provided in Ref. [Reference Matos, McSwain, Wolf, Doty and Wilson42].
Similar to the lupus nephritis study, the model selected using forward selection was quite different from the models selected using the penalized regression and boosting approaches. The forward selection model included only main effects, while the other approaches included a mix of main effects and interactions. The glmmLasso model included more predictors than GMMBoost and GEEBoost (18 vs. 4 and 6, respectively; Table 3). Although glmmLasso included the largest number of predictors, three of the four predictors identified by GMMBoost and four of the six predictors selected by GEEBoost were also selected by glmmLasso. The glmmLasso, GMMBoost, and GEEBoost models had three terms in common: the interactions between BMI and gender, between procedure type and anesthesia type, and between procedure type and phenylephrine dose. Regression estimates for these terms were of similar magnitude across methods, although only the interaction between procedure type and phenylephrine dose was statistically significant in all three models, while gender by BMI was significant only in the GMMBoost and GEEBoost models, and procedure type by anesthesia type was not significant in any model.
***p < 0.001; **p < 0.01; *p < 0.05; ′p < 0.1.
Discussion
Identifying predictors and predictor interactions associated with repeatedly measured binary disease outcomes is important as many common diseases are hypothesized to result from interactions between patient, clinical, and environmental factors. Statistical methods for variable selection, including penalized and boosted regression approaches, have recently been developed for repeatedly measured binary outcomes. This study compared the ability of four variable selection techniques, forward stepwise selection, glmmLasso, GMMBoost, and GEEBoost, to correctly recover predictors and interactions associated with a repeatedly measured binary outcome to provide guidance for choosing an appropriate method in application. Additionally, we evaluated a two-stage approach to reduce bias for regression parameter estimates selected by glmmLasso, GMMBoost, or GEEBoost. Our simulations corroborate previous findings regarding the poor performance of the forward stepwise approach [Reference Storlie, Bondell, Reich and Zhang9, Reference Breiman10] and showed the effectiveness of boosting and penalized regression techniques while demonstrating differences between the approaches.
The boosting and penalized regression approaches demonstrated superior recovery of the correct predictors relative to the forward stepwise selection approach. Among the penalized and boosting regression approaches, glmmLasso recovered the correct model terms more frequently than the other approaches, particularly at smaller effect and sample sizes. However, glmmLasso models often included many spurious terms yielding a high FDR. GEEBoost performed similarly to glmmLasso for predictor recovery and generated slightly more parsimonious models though FDR was also large. GMMBoost had greater difficulty recovering the correct terms; however, it returned the most parsimonious model, often selecting the exact model used to generate the underlying response. The models selected in the two applications were consistent with the simulations in that glmmLasso selected the largest number of predictors followed by GEEBoost and then GMMBoost. In terms of computation, GEEBoost had the fastest computation time followed by glmmLasso and GMMBoost. Of note, GMMBoost had much greater computation time relative to the other two methods at larger sample sizes, so much so that we were unable to conduct simulations for GMMBoost with sample size n = 500 and 10 observations per subject.
The penalized and boosting regression approaches almost always yielded regression parameter estimates biased toward zero. The only exceptions occurred for glmmLasso at n = 100, where this approach had difficulty selecting a reduced subset of predictors. In all simulations, using a two-stage approach effectively reduced the bias for all three methods regardless of sample sizes, effect sizes, or observations/subject.
The simulations provide guidance for choosing an appropriate method to identify predictor interactions without a priori knowledge in data with a repeatedly measured binary outcome. Table 4 provides suggestions for choosing among glmmLasso, GMMBoost, and GEEBoost based on study goals and computing efficiency. The glmmLasso approach recovered the true regression parameters most often but produced the least parsimonious models; thus, if the goal is to select candidate predictors and interactions with assurance that relevant parameters will be selected, glmmLasso is most appropriate given a sufficiently large sample size (n ≥ 200). GMMBoost recovered the correct predictors and interactions with similar frequency to glmmLasso and GEEBoost at large effect and sample sizes and identified few spurious predictors; thus, GMMBoost is most appropriate if parsimony is important. However, GMMBoost is the least computationally efficient and may prove difficult in real applications. At smaller sample sizes, GEEBoost had similar performance to glmmLasso and tended to select a more parsimonious model, thus offering a good compromise between computation time and model parsimony. Finally, the two-stage approach effectively reduced the bias for all three methods and thus should be applied regardless of the selected method.
There are extensions that could be considered to enhance the applicability of each method. The elastic net uses both L1 and L2 constraints to yield a sparse solution while facilitating grouping of collinear variables [Reference Zou and Hastie43]. Additionally, several penalized regression approaches for data with independent observations impose heredity constraints to maintain hierarchy between main effects and interactions [Reference Bien, Taylor and Tibshirani33, Reference Yuan and Lin44, Reference Zhao, Rocha and Yu45]. These approaches improve model error rates and yield more parsimonious models relative to the traditional lasso. Given the large FDR observed for glmmLasso and GEEBoost, extending these approaches to allow for grouping or hierarchy would likely yield more parsimonious models and reduce the FDR.
Conclusion
In this paper, we describe four variable selection approaches for data with repeatedly measured binary outcomes, with focus on identifying predictor interactions without an a priori hypothesis. Our results demonstrate a strong advantage of penalized and boosted approaches over the traditional stepwise approach in these types of data and provide guidance for choosing the most appropriate method in real applications. Furthermore, the two-stage approach effectively reduces the bias and MSE for all methods and should be applied regardless of choice of method.
Supplementary material
To view supplementary material for this article, please visit https://doi.org/10.1017/cts.2020.556.
Acknowledgments
This project was supported in part by the South Carolina Clinical & Translational Research Institute (NIH/NCATS UL1TR001450), the Delaware-CTR ACCEL program (NIH/NIGMS U54 GM104941), the MCRC for Rheumatic Diseases in African Americans (NIH/NIAMS P60 AR062755), and the CCCR Improving Minority Health in Rheumatic Diseases (NIH/NIAMS P30 AR072582). The contents are solely the responsibility of the authors and do not necessarily represent the official views of the NIH, NCATS, NIGMS, or NIAMS.
Disclosures
The authors have no conflicts of interest to declare.