1 Introduction
Political scientists typically use Grambsch and Therneau’s (1994; Therneau and Grambsch 2000) Schoenfeld residual-based test to assess the Cox duration model’s proportional hazards (PH) assumption. This assumption states that a covariate $x$’s effect is multiplicative on the baseline hazard, $h_0(t)$. One way proportionality can hold is if $x$’s effect is unconditional on $t$, a subject’s time at risk of experiencing some event. If $x$’s effect is conditional on $t$, it is no longer proportional: its effect is “time-varying.” Failing to account for a covariate’s time-varying effect (TVE) produces inefficient estimates, at best, and bias in all the covariates’ point estimates, at worst (Box-Steffensmeier and Zorn 2001; Keele 2008, 6). Detecting PH violations is therefore a priority for political scientists, given our general interest in explanation and, thus, in accurate estimates of covariates’ effects. R’s survival::cox.zph, Stata’s estat phtest, and Python’s lifelines.check_assumptions all currently use Grambsch and Therneau’s Schoenfeld-based test (hereafter, “PH test”).
Like any specification-related test, the PH test’s ability to correctly diagnose PH violations depends on several factors: the TVE’s magnitude, misspecified covariate functional forms, omitted covariates, covariate measurement error, the number of failures, and sample size (Therneau and Grambsch 2000, sec. 6.6); covariate measurement level (Austin 2018); unmodeled heterogeneity (Balan and Putter 2019); the choice of g(t), the function of t on which the covariate’s effect is presumed to be conditioned (Park and Hendry 2015); the nature of the PH violation; and the percentage of right-censored (RC) observations (Ng’andu 1997). Each of these affects the PH test’s statistical size or power, impacting the frequency with which we obtain false positives (size) or true positives (power) and thereby the test’s performance.
New factors affecting the PH test’s performance have recently come to light. Metzger (2023c) shows that how the PH test is calculated also impacts the test’s performance. Traditionally, Stata, Python, and R (< survival 3.0-10) all compute the PH test using an approximation, which makes certain simplifying assumptions to expedite computation (Metzger 2023c, Appx. A). By contrast, R (≥ survival 3.0-10) now computes the PH test in full, using the actual calculation (AC), without any simplifying assumptions.Footnote 1 Metzger’s (2023c) simulations suggest surprising performance differences between the approximated and actual calculations, with the latter outperforming the former. However, Metzger examines a limited number of scenarios to address her main concern, model misspecification via incorrect covariate functional forms among uncorrelated covariates, and leaves more extensive investigations of the calculations’ performance differences to future work.
This article uses Monte Carlo simulations to investigate more thoroughly whether the PH test’s approximated and actual calculations perform similarly in general. My simulations show that they do not, but in unexpected ways. Congruent with Metzger (2023c), I find that the AC generally outperforms the approximated calculation when the covariates are uncorrelated, regardless of the amount of right censoring (RC), the way in which RC is induced, the sample size, the PH violator’s time-varying-to-main-effect ratio, or the non-PH-violating covariate’s magnitude or dispersion. In these instances, the AC is well sized and well powered, whereas the approximation is also well sized but can be underpowered.
However, in a surprising turn of events, the approximation outperforms the AC considerably when the covariates are correlated, even moderately so ($|\mathrm{Corr}(x_1,x_2)| = 0.35$). The AC continues to be well powered, but produces an increasingly large number of false positives as the correlation’s absolute value increases, sometimes as high as 100% of a simulation run’s draws. By contrast, the approximation’s behavior effectively remains the same as in the no-correlation scenario: well sized or very near it, but sometimes underpowered. These findings have weighty implications because they point to a complex set of trade-offs we were previously unaware of: using an appropriately sized test (the approximation, for the scenarios I check here) while knowing the approximation can also produce many false positives in misspecified models (Metzger 2023c), among other potential complications. False positives would lead researchers to include PH-violation corrections, likely in the form of a time interaction. Including unnecessary interaction terms results in inefficiency, which can threaten our ability to make accurate inferences (Supplementary Appendix E).
My findings are also weighty because political science applications frequently satisfy the conditions under which the AC is likely to return false positives. I identified all articles using a Cox duration model in eight political science journals across 3.5 years and examined the correlations between identified PH violators and non-violators.Footnote 2 Nearly 87% of the articles have a moderate correlation for at least one violator–non-violator pairing, with an average of 5.15 such pairings per article. By contrast, only ~14% of these articles have easily identifiable features that might, in theory, prove problematic for the approximation (fn. 1). To further underscore my findings’ implications for political scientists, I also reanalyze a recently published study using the Cox model (Agerberg and Kreft 2020) to show that we reach different conclusions about the authors’ main covariate of interest depending on which PH calculation we use.
I begin by walking through the differences between the PH test’s approximated and actual calculations, to provide some sense of why their applied behavior may differ. Next, I describe my simulations’ setup. Third, I discuss my simulation results that show the approximation is appropriately sized in far more scenarios than the AC. Fourth, I move to the illustrative application and the different covariate effect estimates the two calculations imply. I conclude with a summary and discuss my findings’ implications for practitioners.
2 The PH Test Calculation
2.1 Overview
Why might the two calculations perform differently? In short, the approximation makes several simplifying assumptions when calculating one of the formula’s pieces.Footnote 3
Grambsch and Therneau’s PH test amounts to a score test (Therneau and Grambsch 2000, 132), also known as a Rao efficient score test or a Lagrange multiplier (LM) test. Score tests take the form:

$$T = U{\mathcal{I}}^{-1}{U}^{\mathrm{T}},$$

where U is the score vector, as a row, and $\mathcal{I}$ is the information matrix. In a Cox model context, a covariate’s entry in the score vector is equal to the sum of its Schoenfeld residuals, making U particularly easy to compute (Therneau and Grambsch 2000, 40, 85). The score test for whether covariate $j$ is a PH violator amounts to adding an extra term for $x_j \ast g(t)$ to the original list of covariates (Therneau 2021), where g(t) is the function of time upon which $x_j$’s effect is potentially conditioned. Usual choices for g(t) include t and ln(t), but others are possible (and encouraged, in some cases: see Park and Hendry 2015).
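To make the Schoenfeld connection concrete, here is a minimal pure-Python sketch of unscaled Schoenfeld residuals and the score-vector entry they sum to. The data and the $\beta$ value are toy illustrations, not the survival package’s implementation:

```python
import math

def schoenfeld_residuals(times, events, x, beta):
    """Unscaled Schoenfeld residuals for a one-covariate Cox model.

    At each event time, the residual is the failing subject's covariate
    value minus the risk-weighted mean of x over the risk set."""
    resids = []
    for k, (t_k, d_k) in enumerate(zip(times, events)):
        if not d_k:
            continue  # only failures contribute residuals
        # risk set: everyone still at risk just before t_k
        at_risk = [i for i, t in enumerate(times) if t >= t_k]
        weights = [math.exp(beta * x[i]) for i in at_risk]
        xbar = sum(w * x[i] for w, i in zip(weights, at_risk)) / sum(weights)
        resids.append(x[k] - xbar)
    return resids

# toy data: four subjects, three failures, one censored observation
times  = [2.0, 3.0, 5.0, 8.0]
events = [1, 1, 0, 1]
x      = [0.5, -1.2, 0.3, 2.0]
s = schoenfeld_residuals(times, events, x, beta=0.1)
U = sum(s)  # the covariate's entry in the score vector
```

At $\beta = 0$ the weights collapse to 1, so each residual is just the failing subject’s x minus the plain mean of x over the risk set, which makes the logic easy to verify by hand.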
To specifically assess whether $x_j$ is a PH violator using the full score test, the expanded U vector, ${U}_j^{\mathrm{E}}$, has dimensions $1\times \left(J+1\right)$, where J is the number of covariates in the original model. The $\left(J+1\right)$th element contains the score value for the additional $x_j \ast g(t)$ term, calculated by multiplying $x_j$’s Schoenfeld residuals from the original Cox model by g(t), then summing that product. By similar logic, the expanded $\mathcal{I}$ matrix for testing whether $x_j$ is a PH violator (${\mathcal{I}}_j^{\mathrm{E}}$) has dimensions $\left(J+1\right)\times \left(J+1\right)$. It is a subset of the full expanded information matrix (${\mathcal{I}}^{\mathrm{E}}$), which is equal to (Therneau 2021, lines 23–33):

$${\mathcal{I}}^{\mathrm{E}} = \left[\begin{array}{cc}\sum \widehat{V}\left({t}_k\right) & \sum \widehat{V}\left({t}_k\right)g\left({t}_k\right)\\ {}\sum \widehat{V}\left({t}_k\right)g\left({t}_k\right) & \sum \widehat{V}\left({t}_k\right){g}^2\left({t}_k\right)\end{array}\right] = \left[\begin{array}{cc}{\mathcal{I}}_1 & {\mathcal{I}}_2\\ {}{\mathcal{I}}_2^{\mathrm{T}} & {\mathcal{I}}_3\end{array}\right],$$

where $k$ indexes the event times ($0<{t}_1<\dots <{t}_k<\dots <{t}_K$) and $\widehat{V}\left({t}_k\right)$ is the $J \times J$ variance–covariance matrix at time $t_k$ from the original Cox model. We obtain ${\mathcal{I}}_j^{\mathrm{E}}$ by extracting the rows and columns with indices $1{:}J$ and $j+J$ from ${\mathcal{I}}^{\mathrm{E}}$. This amounts to all of ${\mathcal{I}}_1$ and the row/column corresponding to $x_j$ in the matrix’s expanded portion.Footnote 4
2.2 Implementation Differences
In a basic Cox model with no strata,Footnote 5 the biggest difference between the two calculations originates from ${\mathcal{I}}^{\mathrm{E}}$. The approximated calculation makes a key simplifying assumption about $\widehat{V}\left({t}_k\right)$: it assumes that $\widehat{V}\left({t}_k\right)$’s value is constant across t (Therneau and Grambsch 2000, 133–134). The approximation also uses the average of $\widehat{V}\left({t}_k\right)$ across all the observed failures (d), $\overline{V} = {d}^{-1}\sum \widehat{V}\left({t}_k\right) = {d}^{-1}{\mathcal{I}}_1$, in lieu of $\sum \widehat{V}\left({t}_k\right)$, because $\widehat{V}\left({t}_k\right)$ “may be unstable, particularly near the end of follow-up when the number of subjects in the risk set is not much larger than [$\widehat{V}\left({t}_k\right)$’s] number of rows” (Therneau and Grambsch 2000, 133–134).
As a consequence of these simplifying assumptions:

1. ${\mathcal{I}}^{\mathrm{E}}$’s upper-left block diagonal (${\mathcal{I}}_1$) is always equal to $\overline{V} = \sum \widehat{V}\left({t}_k\right)/d$ for the approximation, after the $\overline{V}$ substitution. By contrast, it equals $\sum \widehat{V}\left({t}_k\right)$ for the AC.

2. ${\mathcal{I}}^{\mathrm{E}}$’s block off-diagonals (${\mathcal{I}}_2$) are forced to equal 0 for the approximation. For the AC, they would be nonzero ($= \sum \widehat{V}\left({t}_k\right)g\left({t}_k\right)$).

3. ${\mathcal{I}}^{\mathrm{E}}$’s lower-right block diagonal (${\mathcal{I}}_3$) is equal to $\overline{V}\sum {g}^2\left({t}_k\right)\equiv \sum \widehat{V}\left({t}_k\right){d}^{-1}\sum {g}^2\left({t}_k\right)$ for the approximation (Therneau 2021, lines 38–41), after the $\overline{V}$ substitution. By contrast, ${\mathcal{I}}_3$ would equal $\sum \widehat{V}\left({t}_k\right){g}^2\left({t}_k\right)$ for the AC.

Supplementary Appendix A provides ${\mathcal{I}}^{\mathrm{E}}$ for both calculations in the two-covariate case, to illustrate.
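The three numbered differences can be made concrete with a small sketch. The per-event-time variances and g values below are hypothetical numbers, scalar because the toy model has a single original covariate; the point is only how ${\mathcal{I}}^{\mathrm{E}}$’s blocks differ across the two calculations:

```python
# Toy per-event-time variances Vhat(t_k) and g(t_k) values (hypothetical)
V = [0.25, 0.22, 0.10]   # Vhat(t_k), scalar because J = 1
g = [0.69, 1.10, 1.61]   # g(t_k), e.g., ln(t_k)
d = len(V)               # number of failures

# Actual calculation: g(t_k) stays inside the V-weighted sums
I1_act = sum(V)                               # upper-left block
I2_act = sum(v * gk for v, gk in zip(V, g))   # off-diagonal block
I3_act = sum(v * gk**2 for v, gk in zip(V, g))

# Approximation: Vbar replaces Vhat(t_k); off-diagonals forced to zero
Vbar = sum(V) / d
I1_app = sum(V)                               # = d * Vbar, matches the AC
I2_app = 0.0                                  # simplifying assumption
I3_app = Vbar * sum(gk**2 for gk in g)        # Vbar pulled out of the sum
```

When $\widehat{V}\left({t}_k\right)$ really is constant across event times, I3_app and I3_act coincide; the divergence grows with how much $\widehat{V}\left({t}_k\right)$ varies over t, which is exactly the instability the approximation was designed to sidestep.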
Consider the difference between the test statistic’s two calculations for covariate $x_j$ in a model with two covariates (J = 2).Footnote 6 For the approximation, it is equal to (Therneau and Grambsch 2000, 134):

$${T}_j^{approx} = \frac{{\left[\sum_k \left(g\left({t}_k\right)-\overline{g(t)}\right){s}_{j,k}^{\ast}\right]}^2}{d\,{\widehat{V}}_{{\widehat{\beta}}_j}\sum_k {\left(g\left({t}_k\right)-\overline{g(t)}\right)}^2},$$

where ${s}_{j,k}^{\ast}$ are the scaled Schoenfeld residualsFootnote 7 for $x_j$ at time k and ${\widehat{V}}_{{\widehat{\beta}}_j}$ is ${\widehat{\beta}}_j$’s estimated variance from the original Cox model.Footnote 8
If we rewrite the approximation’s formula using unscaled Schoenfelds, to make it analogous to the AC’s formula:
where ${s}_{j,k}$ is the unscaled Schoenfeld residual for covariate j at time k and $\neg j$ refers to the other covariate in our two-covariate specification.
By contrast, the AC for xj when J = 2 will equal:
where the various $\widehat{V}$s and $\widehat{\mathrm{Cov}}$s refer to specific elements of $\widehat{V}\left({t}_k\right)$, the time-specific variance–covariance matrix, and $\left|{\mathcal{I}}_j^{\mathrm{E}}\right|$ is ${\mathcal{I}}_j^{\mathrm{E}}$’s determinant.Footnote 9 $\left|{\mathcal{I}}_j^{\mathrm{E}}\right|$ has J + 1 terms; when J = 2, it equals (before demeaning $g\left({t}_k\right)$ [fn. 8]):
2.3 Implications
Equations (3) and (4) diverge in two major places. Both manifest in the AC (Equation (4)):
1. The additional, non-Schoenfeld term in the numerator (shaded light gray);

2. A substantially more complex denominator. The AC’s denominator is one consequence of ${\mathcal{I}}_2\ne 0$, as Supplementary Appendix B explains. Additionally, g(t) only appears inside the k-summations involving $\widehat{V}\left({t}_k\right)$ for the AC’s denominator, which stems from ${\mathcal{I}}_3 \ne \sum \widehat{V}\left({t}_k\right){d}^{-1}\sum {g}^2\left({t}_k\right)$.
${T}_j$ is distributed asymptotically ${\chi}^{2}$ when the PH assumption holds (Therneau and Grambsch Reference Therneau and Grambsch2000, 132), meaning ${T}_j$ ’s numerator and denominator will be identically signed.
Understanding when each calculation is likely to be appropriately sized (few false positives) and appropriately powered (many true positives) amounts to understanding what makes Tj larger. A higher Tj translates to a lower p-value, and thus a higher chance of concluding a covariate violates PH, holding Tj ’s degrees of freedom constant. The key comparison is the numerator’s size relative to the denominator. Specifically, we need a sense of (1) when the numerator will become larger relative to the denominator and/or (2) when the denominator will become smaller, relative to the numerator.
However, the numerator’s and denominator’s values are not independent within either calculation. Moreover, the numerator and denominator do not share just one or two constituent quantities, but several, often in multiple places (and sometimes transformed), making even basic but meaningful comparative statics practically impossible within a given calculation, let alone across calculations. This interconnectivity is one reason I use Monte Carlo simulations to assess how each calculation performs.
The additional term in ${T}_j^{act}$’s numerator hints at one factor that may make the calculations perform differently: the correlation among covariates. $\widehat{\mathrm{Cov}}\left({t}_k,{x}_j,{x}_{\neg j}\right)$ appears in the AC for J = 2, both in the numerator’s non-Schoenfeld term (Equation (4), light gray shading) and in all three terms of the denominator.Footnote 10 $\widehat{\mathrm{Cov}}\left({t}_k,{x}_j,{x}_{\neg j}\right)$ is equal to (Therneau and Grambsch 2000, 40):

$$\widehat{\mathrm{Cov}}\left({t}_k,{x}_j,{x}_{\neg j}\right) = \left[\frac{\sum_{r\in R\left({t}_k\right)}\exp \left({XB}_r\right){x}_{j,r}{x}_{\neg j,r}}{\sum_{r\in R\left({t}_k\right)}\exp \left({XB}_r\right)}\right]-\left[\frac{\sum_{r\in R\left({t}_k\right)}\exp \left({XB}_r\right){x}_{j,r}}{\sum_{r\in R\left({t}_k\right)}\exp \left({XB}_r\right)}\cdot \frac{\sum_{r\in R\left({t}_k\right)}\exp \left({XB}_r\right){x}_{\neg j,r}}{\sum_{r\in R\left({t}_k\right)}\exp \left({XB}_r\right)}\right],$$

where $r\in R\left({t}_k\right)$ represents “observations at risk at ${t}_k^{-}$” and XB is the at-risk observation’s linear combination. Correlated covariates would impact ${x}_j{x}_{\neg j}$’s value, which eventually appears in both bracketed terms. Generally speaking, as $\left|\operatorname{Corr}\left({x}_j,{x}_{\neg j}\right)\right|$ increases, so does $\left|{x}_j{x}_{\neg j}\right|$, thereby increasing $\left|\widehat{\operatorname{Cov}}\left({t}_k,{x}_j,{x}_{\neg j}\right)\right|$’s value.
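The quantity just described can be sketched directly: a risk-weighted covariance of two covariates over one risk set, with weights $\exp(XB)$. The risk-set values below are hypothetical, and the actual calculation lives inside the survival package; this is only a pure-Python illustration of the formula’s structure:

```python
import math

def riskset_cov(xj, xnj, xb):
    """Risk-weighted covariance of two covariates over one risk set,
    with weights exp(XB): E_w[xj * xnj] - E_w[xj] * E_w[xnj], where
    E_w is the mean under the risk-score weights."""
    w = [math.exp(v) for v in xb]
    W = sum(w)
    m_j   = sum(wi * a for wi, a in zip(w, xj)) / W
    m_nj  = sum(wi * b for wi, b in zip(w, xnj)) / W
    m_jnj = sum(wi * a * b for wi, a, b in zip(w, xj, xnj)) / W
    return m_jnj - m_j * m_nj

# hypothetical risk set of four at-risk observations
xj  = [0.1, 0.9, -0.4, 1.3]   # continuous covariate
xnj = [0, 1, 0, 1]            # binary covariate
xb  = [0.2, 0.5, -0.1, 0.7]   # each observation's linear combination
c = riskset_cov(xj, xnj, xb)
```

Setting all XB values to zero reduces this to an ordinary (unweighted) covariance over the risk set, and a constant covariate returns a covariance of zero, which is a quick sanity check on the structure.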
More broadly, each formula provides guidance as to which features of the data-generating process (DGP) might be useful to vary across the different simulation scenarios. Consider the pieces that appear in either equation:
• $\widehat{V}\left({t}_k\right)$. In the AC, the individual elements of $\widehat{V}\left({t}_k\right)$ appear in both the numerator and the denominator (e.g., $\widehat{\operatorname{Cov}}\left({t}_k,{x}_j,{x}_{\neg j}\right)$, as previously discussed for the correlation among covariates). In the approximation, $\widehat{V}\left({t}_k\right)$ appears only indirectly via $\widehat{V}\left(\widehat{\beta}\right)$, the model’s estimated variance–covariance matrix, as $\widehat{V}\left(\widehat{\beta}\right) = {\mathcal{I}}^{-1}$ and $\mathcal{I}=\sum \widehat{V}\left({t}_k\right)$. Portions of $\widehat{V}\left(\widehat{\beta}\right)$ appear in the approximation’s numerator, as part of the scaled Schoenfeld calculation (${\widehat{V}}_{{\widehat{\beta}}_j}$, ${\widehat{\mathrm{Cov}}}_{{\widehat{\beta}}_j,{\widehat{\beta}}_{\neg j}}$), and in its denominator (${\widehat{V}}_{{\widehat{\beta}}_j}$).

• ${\sum}_{r\in R\left({t}_k\right)}\exp (XB)\theta$, where $\theta$ is a generic placeholder for a weight,Footnote 11 appears in multiple places in both calculations: namely, within the formula for $\widehat{V}\left({t}_k\right)$’s individual elements and within the unscaled Schoenfeld formula. $\exp (XB)$ is an at-risk observation’s risk score in $t_k$, meaning its (potentially weighted) sum speaks to the total amount of weighted “risk-ness” in the dataset at $t_k$.Footnote 12 The riskset’s general size at each $t_k$, then, is relevant.

• $\exp (XB)$ also suggests that the covariates’ values, along with their respective slope estimates, are relevant. Additionally, the covariates are sometimes involved in the weights (see fn. 11), producing another way in which their values are relevant.

• t, the duration. It appears demeaned in both calculations, $g\left({t}_k\right)-\overline{g(t)}$ (see fn. 8). The demeaning makes clear that t’s dispersion is relevant.

• Only observations experiencing a failure are involved in the final steps of the $\widehat{V}\left({t}_k\right)$ and Schoenfeld formulas, implying the number of failures (d) is relevant.
3 Simulation Setup
I use the simsurv package in R to generate my simulated continuous-time durations (Brilleman et al. 2021).Footnote 13 All the simulations use a Weibull hazard function with no strata, a baseline scale parameter of 0.15, and two covariates: (1) a continuous, non-PH-violating covariate ($x_1 \sim \mathcal{N}$) and (2) a binary, PH-violating covariate ($x_2 \sim \mathrm{Bern}(0.5)$). $x_2$’s TVE is conditional on ln(t). Making the PH violator a binary covariate gives us a best-case scenario, because others’ simulations suggest that the Schoenfeld-based PH test performs worse for continuous covariates than for binary ones (Park and Hendry 2015).
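To give a rough sense of what such a DGP involves, the sketch below draws durations by inverting the cumulative hazard. It exploits the fact that a ln(t)-conditional TVE folds into the Weibull shape: $\exp\left({b}_3{x}_2\ln t\right) = {t}^{{b}_3{x}_2}$. The scale of 0.15 and the covariate structure follow this setup, but the slope values are illustrative stand-ins, not the full set of simulated scenarios, and simsurv (not this code) generates the article’s data:

```python
import math, random

def draw_duration(x1, x2, u, lam=0.15, p=1.0, b1=0.001, b2=0.0, b3=1.0):
    """Invert the cumulative hazard of a Weibull model whose x2 effect
    is conditional on ln(t):
        h(t) = lam * p * t**(p - 1) * exp(b1*x1 + b2*x2 + b3*x2*ln(t)).
    Because exp(b3*x2*ln t) = t**(b3*x2), the model is still Weibull
    with shape p + b3*x2, so H(t) inverts in closed form.
    (Slope values are hypothetical illustrations.)"""
    shape = p + b3 * x2
    scale = lam * p * math.exp(b1 * x1 + b2 * x2) / shape
    return (-math.log(u) / scale) ** (1.0 / shape)

rng = random.Random(7)
draws = [draw_duration(rng.gauss(0, 1),     # x1 ~ N(0, 1)
                       rng.random() < 0.5,  # x2 ~ Bern(0.5)
                       rng.random())        # uniform draw for inversion
         for _ in range(1000)]
```

The closed form follows from $H(t) = \lambda p\exp \left({b}_1{x}_1+{b}_2{x}_2\right){t}^{p+{b}_3{x}_2}/\left(p+{b}_3{x}_2\right)$ and setting $H(t) = -\ln(u)$ for $u \sim \mathrm{Unif}(0,1)$.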
I design my simulations to address whether there are performance differences between the approximated and actual PH test calculations in a correctly specified base model, where $x_1$ and $x_2$ are the only covariates.Footnote 14 I vary a number of other characteristics that can impact the PH test’s performance, per Section 1’s discussion. Some of the characteristics’ specific values are motivated by existing duration-model simulations. In total, I run 3,600 different scenarios, derived from all permutations of the characteristics I list in Supplementary Appendix C.Footnote 15 The results section’s discussion focuses primarily on five of these characteristics:
• Three Weibull shape parameter (p) values {0.75, 1, 1.25}, producing scenarios with decreasing, flat, and increasing baseline hazards, respectively. p = 1 matches Keele (2010) and Metzger (2023c). Varying p impacts t’s dispersion by affecting how quickly subjects fail. Higher shape values reduce t’s dispersion, all else equal.

• Two sample sizes {100, 1,000}. The first matches Keele (2010) and Metzger (2023c). I run n = 1,000 to check whether the n = 100 behavior persists when the PH test’s asymptotic properties are likely in effect.

• Five levels of correlation between the two covariates {−0.65, −0.35, 0, 0.35, 0.65}. I use the BinNor package to induce these correlations (Demirtas, Amatya, and Doganay 2014).Footnote 16 I run both positive and negative correlations to verify that the behavior we observe is independent of the correlation’s sign, as the formulas suggest. The results are indeed roughly symmetric for the scenarios I run here. Therefore, I only report the positive correlation results in text, but the supplemental viewing app (see fn. 15) has the graphs for both.

• Two RC patterns. In one pattern, I randomly select rc% of subjects and shorten their observed duration by (an arbitrarily selected) 2%. In the second, I censor the top rc% of subjects such that their recorded durations equal the (100 − rc)th percentile. The first (“random RC”) corresponds to a situation where subjects become at risk at different calendar times, whereas the second (“top rc%”) corresponds to a situation where all subjects become at risk at the same calendar time, but data collection ends before all subjects fail. For two otherwise identical scenarios (including d’s value), the top rc% pattern gives me another way to affect t’s dispersion without impacting other quantities in either formula, because t’s highest observed value is restricted to its (100 − rc)th percentile.

• Three RC percentages (rc%) {0%, 25%, 50%}. The 25% matches Keele (2010), Metzger (2023c), and Park and Hendry’s (2015) moderate censoring scenario, and is near Ng’andu’s (1997) 30% scenario. The 50% matches Park and Hendry’s (2015) heavy censoring scenario and is near Ng’andu’s (1997) 60% scenario. Manipulating rc% allows me to vary d across otherwise comparable scenarios.
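The two RC patterns above can be sketched in a few lines. The toy durations are hypothetical; the 2% shortening matches the setup described, while the percentile rule is a crude order-statistic version for illustration:

```python
import random

def random_rc(durations, rc_frac, seed=0, shrink=0.02):
    """'Random RC' pattern: censor a randomly chosen rc% of subjects,
    shortening each one's observed duration by an (arbitrary) 2%.
    Returns (observed_time, event_indicator) pairs."""
    rng = random.Random(seed)
    n = len(durations)
    censored = set(rng.sample(range(n), int(rc_frac * n)))
    return [(t * (1 - shrink), 0) if i in censored else (t, 1)
            for i, t in enumerate(durations)]

def top_rc(durations, rc_frac):
    """'Top rc%' pattern: censor the longest rc% of durations at the
    (100 - rc)th percentile, as when data collection ends before all
    subjects fail. Uses a crude order-statistic percentile."""
    n = len(durations)
    cut = sorted(durations)[int((1 - rc_frac) * n) - 1]
    return [(cut, 0) if t > cut else (t, 1) for t in durations]

durations = [float(i) for i in range(1, 101)]  # toy durations 1..100
```

With these toy durations, top_rc(durations, 0.25) caps every recorded duration at the 75th percentile, which is exactly the restriction of t’s dispersion the text describes.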
As Supplementary Appendix C discusses, I also vary the pattern of $x_2$’s effect (specifically, the ratio of $x_2$’s TVE to its main effect), the recorded duration’s type, $x_1$’s mean, and $x_1$’s dispersion.
For each of these 3,600 scenarios, I estimate a correctly specified base model to determine whether PH violations exist, as discussed previously. I then apply the two PH test calculations and record each calculation’s p-values for every covariate. I report the PH tests’ p-values for g(t) = ln(t) from both calculations, to match the DGP’s true g(t).Footnote 17, Footnote 18
In the ideal, I would run 10,000 simulation draws for each of the 3,600 scenarios because of my interest in p-values for size/power calculations (Cameron and Trivedi 2009, 139–140). However, the estimation burden would be prohibitive. Additionally, while I am interested in seeing how each calculation performs against our usual size/power benchmarks, my primary interest is comparing how the calculations perform relative to one another. Having fewer than 10,000 draws should affect both calculations equally, provided any imprecision is unaffected by the calculations’ performance differences (i.e., the simulations might give an imprecise estimate of statistical size, but both calculations would have the same amount of imprecision). Nonetheless, I compromise by running 2,000 simulations per scenario.
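The imprecision trade-off can be quantified: a rejection at a given α is a Bernoulli event across independent draws, so the Monte Carlo standard error of an estimated rejection rate follows directly. This is a back-of-the-envelope sketch, not a claim about the article’s exact error:

```python
import math

def mc_se(p, n_draws):
    """Monte Carlo standard error of an estimated rejection rate:
    each draw's rejection is a Bernoulli(p) event, so the SE of the
    sample proportion is sqrt(p * (1 - p) / n_draws)."""
    return math.sqrt(p * (1 - p) / n_draws)

# Precision for a test whose true size is 0.05
se_2000  = mc_se(0.05, 2000)    # the article's compromise
se_10000 = mc_se(0.05, 10000)   # the conventional ideal
```

At a true size of 0.05, 2,000 draws give a standard error of roughly 0.005 versus roughly 0.002 at 10,000 draws, so the coarser run mainly widens the band around each rejection percentage rather than biasing the comparison between calculations.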
4 Simulation Results
The key quantity of interest is the rejection percentage (${\hat{r}}_p$): the percentage of p-values < 0.05 from the PH test for each calculation–covariate pairing within a scenario.Footnote 19 For $x_1$, the non-PH violator, this value should be 5% or lower, corresponding to a false positive rate of α = 0.05. For PH-violating $x_2$, 80% or more of its PH test p-values should be less than 0.05, with 80% representing our general rule of thumb for a respectably powered test.Footnote 20 Our first priority typically is evaluating whether a statistical test’s calculated size matches our selected nominal size, α. Our second priority becomes choosing the best-powered test, ideally among those with the appropriate statistical size (Morris, White, and Crowther 2019, 2088), a caveat that will be relevant later.
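The quantity just defined is straightforward to compute; a minimal sketch with hypothetical p-values (the 0.05 and 80% thresholds are the ones just described):

```python
def rejection_pct(p_values, alpha=0.05):
    """Share of simulation draws whose PH-test p-value falls below alpha,
    expressed as a percentage: the rejection percentage r-hat_p."""
    return 100.0 * sum(p < alpha for p in p_values) / len(p_values)

# hypothetical p-values across five draws for each covariate
non_violator_p = [0.30, 0.02, 0.55, 0.81, 0.12]   # want <= 5% rejected (size)
violator_p     = [0.001, 0.04, 0.02, 0.20, 0.03]  # want >= 80% rejected (power)
```

Here the non-violator’s 20% rejection rate would signal a size problem, while the violator’s 80% rate just meets the power rule of thumb.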
I report ${\hat{r}}_p$ along the horizontal axis of individual scatterplots grouped into 3 × 3 sets, where each set contains 45 scenarios’ worth of results. The set’s rows represent different $\mathrm{Corr}(x_1,x_2)$ values, and its columns represent different shape parameter values. Each scatterplot within a set, then, represents a unique $\mathrm{Corr}(x_1,x_2)$–shape combination among a set of scenarios that share the same true linear combination, sample size, recorded duration type, and values for $x_1$’s mean and dispersion. I split each scatterplot into halves, reporting the results from random RC on the left and top rc% RC on the right; the halves’ dividing line represents 0% of a scenario’s p-values < 0.05 (${\hat{r}}_p = 0\%$), and the scatterplot’s side edges represent ${\hat{r}}_p = 100\%$. I use short, solid vertical lines within the plot area to indicate whether a particular covariate’s ${\hat{r}}_p$ should be low (non-PH violators ⇒ size; closer to the halves’ dividing line) or high (PH violators ⇒ power; closer to the scatterplot’s edges). Within each half, I report the three censoring percentages using different colored symbols, with darker grays representing more censoring.Footnote 21
I report one of the scatterplot sets in text (Figure 1) to concretize the discussion of correlated covariates’ effect, as it exemplifies the main patterns from the results (fn. 15). I then discuss those patterns more broadly.
4.1 Specific Scenario Walkthrough
Figure 1 shows the simulation results for ${x}_1\sim \mathcal{N}\left(0,1\right)$ where $XB = 0.001{x}_1+1{x}_2\ln (t)$, n = 100, and the estimated model uses the true continuous-time duration. In general, if the two tests perform identically, the circles (approximation) and triangles (AC) should sit atop one another for every estimate–RC pattern–rc% triplet in all scatterplots. Already, Figure 1 makes clear that this is not the case.
I start by comparing my current results with those from previous work, to ground my findings’ eventual, larger implications. Figure 1’s top row, second column most closely corresponds to Metzger’s (2023c) simulations. This scatterplot, $\mathrm{Corr}(x_1,x_2) = 0$, p = 1, with top 25% RC (scatterplot’s right half, medium gray points), is analogous to her Section 3.3’s “correct base specification” results.Footnote 22 My top 25% RC results match Metzger (2023c): both calculations are appropriately sized or close to it (for $x_1$: 6.5% [approx.] vs. 5.5% [actual]), and both calculations are well powered (for $x_2$: 90.2% [approx.] vs. 90.6% [actual]). The calculations’ similar size and power percentages also mirror Metzger’s (2023c) Section 3.3.
The story changes in important ways once $\mathrm{Corr}(x_1,x_2) \ne 0$ (moving down Figure 1’s columns). Figure 1 shows that the AC performs progressively worse as $\mathrm{Corr}(x_1,x_2)$ grows, evident in how the triangles representing non-PH violator $x_1$’s false positive rate move away from each scatterplot’s ${\hat{r}}_p = 0\%$ dividing line. The AC returns an increasingly large number of false positives for $x_1$ that far surpass our usual 5% threshold, nearing or surpassing 50% in some instances. This means we become more likely to conclude, incorrectly, that a non-PH-violating covariate violates PH as it becomes increasingly correlated with a true PH violator. Despite the AC’s exceptionally poor performance for non-violating covariates, it continues to be powered as well as or better than the approximation for PH violators, regardless of $|\mathrm{Corr}(x_1,x_2)|$’s value. These patterns suggest that the AC rejects the null too aggressively: behavior that works in its favor for PH violators but becomes a serious liability for non-PH violators.
By contrast, correlated covariates only marginally affect the approximated calculation. The approximation has no size issues across $|\mathrm{Corr}(x_1,x_2)|$ values; it stays at or near our 5% false positive threshold, unlike the AC. However, it does tend to become underpowered as $|\mathrm{Corr}(x_1,x_2)|$ increases, meaning we are more likely to miss PH violators as the violator becomes increasingly correlated with a non-PH violator. While this behavior is not ideal, it suggests that practitioners should be more mindful of their covariates’ correlations, to contextualize any null results from the approximation.
Finally, Figure 1 shows these general patterns for both calculations persist across panels. More specifically, the patterns are similar when the baseline hazard is not flat (within the scatterplot set’s rows), for different censoring percentages (within a scatterplot’s half), and for different RC types (across a scatterplot’s halves, for the same rc%).
4.2 Broader Correlation-Related Patterns: Descriptive
The AC’s behavior is the more surprising of the two findings, but just as surprising is that Figure 1’s patterns are not unusual: they are representative of the AC’s behavior in nearly all 1,800 scenarios where n = 100. There are 360 unique combinations of the Weibull’s shape parameter (p), $x_2$’s TVE-to-main-effect ratio, recorded duration type, RC pattern, RC percentage, $x_1$’s mean, and $x_1$’s dispersion for n = 100. Of these 360, the AC’s false positive rate for $|\mathrm{Corr}(x_1,x_2)| \ne 0$ is worse than in the comparable $\mathrm{Corr}(x_1,x_2) = 0$ scenario in 359 (99.7%; Table 1’s left half, second column). For the lone discrepant combination,Footnote 23 three of the four nonzero correlations perform worse than $\mathrm{Corr}(x_1,x_2) = 0$. Put differently: for the AC, of the 1,440 n = 100 scenarios in which $\mathrm{Corr}(x_1,x_2) \ne 0$, 1,439 (99.9%) have a higher false positive rate than the comparable $\mathrm{Corr}(x_1,x_2) = 0$ scenario. Coupled with the number of characteristics I vary in my simulations, this 99.9% suggests that the AC’s high false positive rate cannot be a byproduct of p, the PH violator’s TVE-to-main-effect ratio, the way in which the duration is recorded, the RC pattern or percentage, or $x_1$’s magnitude or dispersion.
Note: “Better” = lower FP% ($x_1$). Difference in FP% = (FP% for Corr = 0) − (FP% for this row’s correlation) within comparable scenarios; negative values indicate that Corr = 0 performs better by that many percentage points. Number of unique combinations: 360.
Other AC-related patterns from Figure 1 manifest across the other scenarios as well. In particular, the AC’s false positive rate grows progressively worse in magnitude as $|\mathrm{Corr}(x_1,x_2)|$ increases across all 360 combinations (Table 1’s right half, second column). On average, the AC’s false positive rate for $\mathrm{Corr}(x_1,x_2) = 0$ is ~9 percentage points lower than for $|\mathrm{Corr}(x_1,x_2)| = 0.35$ and ~33.6 percentage points lower than for $|\mathrm{Corr}(x_1,x_2)| = 0.65$.
The AC’s most troubling evidence comes from Figure 1’s equivalent for n = 1,000 (Figure 2). With such a large n, both calculations should perform well because their asymptotic properties are likely active. For $\mathrm{Corr}(x_1,x_2) = 0$, this is indeed the case: both calculations have 0% false positives for $x_1$ (size) and 100% true positives for $x_2$ (power), regardless of p, the RC pattern, or the RC percentage (Figure 2’s first row). However, like Figure 1’s results, the AC’s behavior changes for the worse when $\mathrm{Corr}(x_1,x_2) \ne 0$. It continues to have a 100% true positive rate (Figure 2’s last two rows, $x_2$ triangles), but also has up to a 100% false positive rate, and none of its $\mathrm{Corr}(x_1,x_2) \ne 0$ false positive rates drop below 50% (Figure 2’s last two rows, $x_1$ triangles). Also like Figure 1, the approximation shows no such behavior for $\mathrm{Corr}(x_1,x_2) \ne 0$.
These patterns for the AC appear across the other n = 1,000 $\mathrm{Corr}(x_1,x_2) \ne 0$ scenarios, of which there are 1,440. $\mathrm{Corr}(x_1,x_2) = 0$ outperforms the comparable $\mathrm{Corr}(x_1,x_2) \ne 0$ scenario in all 1,440. Figure 2’s 100% false positive rate also recurs with some regularity for the AC (330 of 1,440 scenarios [22.9%]); in all 330, $|\mathrm{Corr}(x_1,x_2)| = 0.65$. In the remaining 1,110 scenarios, the AC’s lowest false positive rate is 22.6%. The AC’s behavior is so troubling because properly sized tests are typically our first priority in traditional hypothesis testing, as Section 4’s opening paragraph discusses. These results indicate that the AC is far from properly sized, whereas the approximation has no such issues. Taken overall, my simulation results for both sample sizes suggest that we should avoid using the AC in situations mimicking the scenarios I examine here, at minimum, if not also more broadly, provided we temporarily bracket other issues that may arise from using the approximation, a theme I return to in my closing remarks.
5 Illustrative Application
The simulations show that the AC is particularly prone to over-detecting violations: when true PH violations do exist but the violator(s) are even moderately correlated with non-violators, it returns many false positives. Political scientists typically correct for PH violations using an interaction term between the offending covariate and g(t). Including an unnecessary interaction term is relatively less perilous than excluding a necessary one. For any model type, unnecessary interactions produce less efficient estimates.Footnote 24 This inefficiency can take a particular toll in the presence of many such unnecessary interaction terms, which would occur in a Cox model context when a PH test flags many potential PH violations.
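In R’s survival package, for instance, this correction can be sketched with coxph’s tt mechanism. The data frame dat and the variables time, status, x1, and x2 below are hypothetical stand-ins, not objects from any analysis in this article; suppose a PH test flagged x2 as the violator and g(t) = t:

```r
library(survival)

## Hedged sketch: interact the flagged covariate x2 with g(t) = t
## via coxph's tt() facility to model its time-varying effect.
fit.tve <- coxph(Surv(time, status) ~ x1 + x2 + tt(x2),
                 data = dat,
                 tt   = function(x, t, ...) x * t)  # g(t) = t
summary(fit.tve)
```

With many flagged covariates, each would receive its own tt() term, which is precisely where the inefficiency from unnecessary interactions accumulates.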
Using the AC to diagnose PH violations for Agerberg and Kreft (2020; hereafter “A&K”) illustrates the potential perils of the AC’s high false positive rate and its ramifications for inference. A&K’s study assesses whether a country having experienced high levels of sexual violence (SV) during a civil conflict (“high SV conflicts” [HSVC]) hastens the country’s adoption of a gender quota for its national legislature, relative to non-HSVC countries.Footnote 25 They find support for their hypotheses, including the one of interest here: HSVC countries adopt gender quotas more quickly than countries experiencing no civil conflict. In their supplemental materials, the authors check for PH violations using the approximation, with g(t) = t. Two of their control variables violate at the 0.05 level (Table 2’s “Approx.” column), but correcting for the violations does not affect A&K’s main findings.
Note: * = key independent variable. PH test g(t) = t, p-value threshold = 0.05 (A&K, Online Appendix B).
However, a different story emerges if I use the ACFootnote 26 to diagnose PH violations.Footnote 27 The AC detects six violations in A&K’s model—three times as many as the approximation. Importantly, A&K’s key independent variable, HSVC, is now a PH violator according to the AC, implying that the effect of high sexual violence during civil conflict is not constant across time. Furthermore, examining HSVC’s effect (Gandrud Reference Gandrud2015) from a fully corrected modelFootnote 28 shows that HSVC’s hazard ratio (HR) is statistically significant for only t ∈ [5,15] (Figure 3’s solid line).
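For readers wishing to reproduce this kind of check, the test itself is a one-line call in R’s survival package; fit below is a hypothetical, previously estimated coxph object. Which calculation the call runs depends on the installed version of the package:

```r
library(survival)

## Hedged sketch: 'fit' is a hypothetical coxph object.
## transform = "identity" corresponds to g(t) = t, the choice A&K use.
## survival releases before 3.0 implement the approximation;
## 3.0 and later implement the actual calculation (AC).
cox.zph(fit, transform = "identity")
```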
The t restriction matters because 93% of the countries in A&K’s sample become at risk in the same calendar year, meaning HSVC now affects whether countries adopt a legislative gender quota only for a small subset of years in the past (1995–2004) for nearly the whole sample. This conclusion differs from A&K’s original findings, which suggested (1) a country having experienced HSVC always increased its chances of adopting a gender quota, relative to countries with no civil conflict, regardless of how long since the country could have first adopted a quota, and (2) this relative increase was of a lesser magnitude, as evidenced by the vertical distance between HSVC’s estimated HR from the PH-corrected model (Figure 3’s solid line) and A&K’s original estimated HR (Figure 3, long-dashed horizontal line).
We do not know whether HSVC is a true violator because the data’s true DGP is unknown. However, three pieces of evidence suggest, albeit not conclusively, that HSVC may be a false positive. First, there is a moderate correlation between HSVC and one of the control variables, “Conflict Intensity: High” (Corr = 0.516), which both the approximation and the AC flag as a violator (Table 2). We know the AC is particularly prone to returning false positives in this situation. Second, HSVC’s scaled Schoenfeld plotFootnote 29 shows no unambiguous trend, as we would expect to see for a PH violator. Finally, a series of martingale residual plots shows no clear non-linear trends,Footnote 30 ruling out model misspecification from incorrect functional forms, which was Keele’s (2010) and Metzger’s (2023c) area of focus.
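The second and third pieces of evidence correspond to standard graphical diagnostics, which can be sketched in R as follows (fit, dat, and x1 are hypothetical stand-ins, not A&K’s actual objects):

```r
library(survival)

## Scaled Schoenfeld residual plot for the first covariate in the
## hypothetical coxph object 'fit'; a true PH violator should show a
## clear trend in the smoothed residuals across g(t).
zph <- cox.zph(fit, transform = "identity")
plot(zph[1])

## Martingale residuals against a hypothetical continuous covariate
## 'x1'; a clear non-linear trend in the lowess curve would suggest an
## incorrect functional form, i.e., base-model misspecification.
mart <- residuals(fit, type = "martingale")
plot(dat$x1, mart)
lines(lowess(dat$x1, mart))
```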
6 Conclusion
For Grambsch and Therneau’s (1994) test for PH violations, does the way it is calculated affect the test’s performance? My Monte Carlo simulations show that the answer is a resounding yes. More importantly, I show that the performance differences are non-trivial. The AC has a high false positive rate in situations where a PH violator is correlated with a non-PH violator, even for correlations as moderate as 0.35. The approximation does not suffer from the same issue, giving it a crucial advantage over the AC, given the importance we place on correctly sized statistical tests in traditional hypothesis testing. From Supplementary Appendix G’s meta-analysis, we know moderate correlations are the norm among political science applications, underscoring the potential danger of the AC’s behavior.
The biggest takeaway from these findings is that practitioners are currently stuck between a rock and a hard place. Both calculations perform adequately when covariates are uncorrelated with one another, but that condition is rarely true in social science applications. Purely on the basis of my simulation results, then, we should favor the approximation.
However, other factors preclude such an easy conclusion. One is a limitation common to any Monte Carlo study: the behavior I find for the approximation is limited in scope to the scenarios I investigated. For other scenarios varying different sets of characteristics, the approximation may run into performance issues similar to the AC’s. While this may well be true, the AC running into such serious performance issues for relatively simple, straightforward DGPs, while the approximation does not, is concerning and sufficiently notable in its own right. These results also point to a number of related questions worth investigating. As one example, we might ask how the two calculations perform in a model with more than two covariates, and how the correlation patterns among those covariates matter. The answers would be particularly relevant for applied practitioners.
A second factor is Therneau’s main motivation for shifting survival::cox.zph from the approximated to actual calculation. His concern was the approximation’s simplifying assumption being violated, which is particularly likely in the presence of strata (see fns. 1 and 5). In light of my results, though, violating the approximation’s assumption may be the lesser of two evils, if the choice is between that or the AC’s exceptionally poor performance for non-PH violators. Future research would need to investigate whether the trade-off would be worthwhile, and if so, under what conditions.
Finally, model misspecification is also a relevant factor. All the models I estimate here use the correct base specification, with no omitted covariates or misspecified covariate functional forms. However, we know model misspecification can affect the PH test’s performance, in theory (Keele 2010; Therneau and Grambsch 2000). Metzger (2023c) examines how both calculations perform in practice with uncorrelated covariates, both in the presence and absence of model misspecification. She finds that the approximation can have a high false positive rate for some misspecified base models, going as high as 78.3% in one of her sets of supplemental results.Footnote 31 Because the approximation can suffer from the same performance issues as the AC, we cannot fully leverage my simulation results regarding the approximation’s low false positive rate: the approximation returning evidence of a PH violation does not necessarily mean a violation exists unless practitioners can rule out model misspecification, which is a potentially necessary, but likely insufficient, condition.
What might practitioners do in the meantime? The stopgap answer depends on the estimated Cox model’s complexity, after addressing any model misspecification issues. If the Cox model has no strata and no strata-specific covariate effects, using the approximation is likely the safer bet. If the model has strata, but no strata-specific effects, practitioners can again use the approximation, but only after making the adjustments discussed in fn. 5. In the presence of both strata and strata-specific effects, there is no strong ex ante reason to suspect fn. 5’s adjustments would not work, but this situation has traditionally received less study. Future research could probe more deeply to ensure the adjustments hold, especially as competing risks models can fall into this last category.
Social scientists’ interest in a covariate’s substantive effect makes it paramount to obtain accurate estimates of that effect. Any covariate violating the Cox model’s PH assumption threatens that goal, if the violation is not corrected. I have shown here that successfully detecting PH violations is more fraught than we previously realized when using Grambsch and Therneau’s full, actual calculation to test for these violations, rather than an approximation of it. I have suggested some short-term, stopgap solutions, but more research needs to be done to develop more nuanced recommendations and longer-term solutions for practitioners.
Supplementary Material
To view supplementary material for this article, please visit http://doi.org/10.1017/pan.2023.34.
Data Availability Statement
All analyses were run using Stata 17.0 MP6 or R 4.1.0. This article’s replication code has been published through Code Ocean and can be viewed interactively at https://doi.org/10.24433/CO.0072887.v1 (Metzger 2023a). A preservation copy of the same code and data can also be accessed via Dataverse at https://doi.org/10.7910/DVN/D56UWV (Metzger 2023b).
Acknowledgments
I thank Frederick Boehmke and Benjamin Jones for feedback on earlier drafts, as well as Harvey Palmer and Terry Therneau for general comments. Many of the simulations were run on the High Performance Computing Center at Michigan State University’s Institute for Cyber-Enabled Research. Any mistakes are mine alone.
Funding Statement
This work received no specific grant from any funding agency, commercial, or not-for-profit sectors.
Competing Interest
The author declares no competing interests exist.