1 Introduction
Researchers in many applied fields encounter data structures with observations that are grouped or clustered in one or more ways, for example students nested in classrooms and/or schools, and perhaps measured repeatedly over time. Such observations are referred to variously as grouped, clustered, hierarchical, multilevel, or repeated measures, and include panel, longitudinal, or time-series cross-sectional data. In one very common context—and the primary example considered here—investigators hope to estimate the effect of some “treatment,” or target covariate, that varies at a lower level while accounting for confounding that is hoped to be fixed at the higher level. For example, in country–year data, some countries may adopt a treatment in some years, and we seek to account at least for country- (group-) level confounders, in the hopes that remaining, time-varying confounding is absent or less problematic.
Multilevel data structures, however, pose complications for estimation and inference by violating the independence of observations assumed under classical inference. Although this issue is long-standing and ubiquitous, methodological practices for dealing with this non-independence have not converged across disciplinary traditions. One tradition, often referred to as the “fixed effects” approach (FE), advises investigators to account for group-level confounders by introducing group-level, freely varying, intercepts to their models. The nonindependence of observations then complicates only variance estimation, so investigators are instructed to choose a variance estimator that accommodates the forms of intragroup dependency assumed to exist. One popular choice, closely examined here, is the “cluster-robust standard error” (White 1984).
A different tradition holds that multilevel data must be analyzed with “multilevel models” (MLMs). These models look similar to FE models, but include terms that are estimated as “random effects,” meaning that the coefficients are assumed to have a distribution rather than to be fixed in truth. In a review of 10 textbooks and 4 well-cited pedagogical articlesFootnote 1 as well as 109 empirical articles employing MLMs in top education, political science, and sociology journals,Footnote 2 we find four common reasons given to employ MLM rather than FE: (1) MLMs correctly estimate standard errors in grouped data; (2) MLMs estimate coefficients more efficiently and produce more accurate predictions than do the analogous FE models; (3) MLMs allow intercepts and slopes to vary by group, like analogous FE models; but (4) unlike those FE models, MLMs can estimate coefficients for group-level variables (and their interactions with lower-level variables) while allowing varying intercepts (and slopes). By contrast, particularly for fields closely aligned with econometrics, both long-standing (e.g., Hausman 1978) and more recent work (e.g., Clark and Linzer 2015) emphasize bias concerns with MLMs. In the common setting where one particular variable is the treatment, MLM estimates can have lower variance than do FE estimates for the coefficient of interest, but are biased when group-level confounding is present, compelling users to employ FE.Footnote 3
The choice between these approaches remains a matter of debate and disciplinary tradition, and is sometimes justified based on erroneous claims as to these models’ properties. In this paper, we seek to demystify their differences, the problems associated with them, and ultimately important equivalences between them. First, to illuminate a key difference between the approaches, we show that random effect estimates in MLMs are precisely equivalent to FE estimates that have been shrunken through a regularization process, that is, by penalizing larger coefficients. This explains the principal concern with random effects: bias that emerges because these variables are not “allowed” to adjust for confounding as intended. Because regularization reduces over-fitting, this also demystifies why MLM’s out-of-sample outcome predictions are typically more accurate. Second, this bias can be eliminated in many cases by a long-standing adjustment from Mundlak (1978). For models with group-level intercepts added as random effects (“random intercept” models), which we focus on here, adding the group-level averages of all included variables as regressors (or a variety of equivalent procedures such as group-wise centering) relieves the bias otherwise induced by this regularization. We also present a more general solution for more complex models. Finally, we consider variance estimation. Contrary to claims we document, MLMs do not automatically ensure appropriate standard error estimates for grouped data. Rather, the standard MLM approach relies on strict assumptions and can have poor coverage in even simple circumstances.
We are not the first to draw many of these conclusions. However, our review of empirical papers employing MLMs shows widespread failure to either appreciate the concerns they raise or employ suggested solutions. Bringing these concerns together, we describe a “bias-corrected MLM” that employs cluster-robust standard errors, which make no assumption about within-group dependency and assume no between-group dependence. Remarkably, once these adjustments are made, the resulting MLM produces coefficient and standard error estimates identical to those of the analogous FE model with cluster-robust standard errors. We regard this as the most important conclusion, since for investigators interested only in the target coefficient and its standard error, it resolves any debate as to which approach is more appropriate. In most cases, we thus recommend either of these unbiased approaches over uncorrected MLM, due to the dangers of heightened bias and root mean square error when MLM’s strict assumptions are violated. The remaining differences between these approaches are that MLM has better (out-of-sample) accuracy for the outcome predictions, and allows the user to estimate coefficients that would otherwise be dropped by FE (e.g., group-level covariates), although interpreting these coefficients can be problematic.
2 Background
2.1 Notation
We first set notation. To aid the reader, Appendix A.1 describes (i) symbols used in our notation (Table 1) and (ii) abbreviations we use relating to models (Table 2).
Let $g=1, \dots , G$ index the group. We index vectors belonging to group g with the subscript g, for example, the outcome for all units in group g is given by the vector $Y_g$ . The ith unit in group g is then indexed by $g[i]$ , for example, unit i in group g has outcome $Y_{g[i]}$ . This notation reminds the reader that unit i is contained in group g. Each group g has size $n_g$ with $N = \sum _{g=1}^{G} n_g$ . Let $X_{g[i]}$ be the p-dimensional vector of covariates, including an intercept term, with an associated coefficient vector $\beta $ . One element of $X_{g[i]}$ in particular will be regarded as a treatment in settings described here. $X_g$ is the matrix of $X_{g[i]}$ for group g, and X is the matrix of $X_{g[i]}$ for the entire sample.
Similarly, $Z_{g[i]}$ is a d-dimensional vector of covariates, often containing a subset of the covariates in $X_{g[i]}$ , and possibly an intercept which will later function as an indicator of membership to group g. For each group g, the $Z_{g[i]}$ have an associated coefficient vector $\gamma _g$ . $Z_g$ is the matrix of $Z_{g[i]}$ for group g, Z is a block diagonal matrix of the $Z_g$ , and $\gamma $ stacks the set of $\gamma _g$ into a matrix.
As noted, $Y_{g[i]}$ is the outcome of interest, and $\epsilon _{g[i]}$ is its associated residual/error term. Y and $\epsilon $ are $N \times 1$ vectors containing $Y_{g[i]}$ and $\epsilon _{g[i]}$ for the entire sample.
2.2 Fixed effect and multilevel models
Varying intercepts: Group fixed effects and random intercept models
We focus mainly on uses of FE and MLM that allow each group in the data a different intercept but no other group-varying coefficients.Footnote 4 Suppose data are generated according to the simple model, written in matrix form,

$$ Y = X \beta + Z \gamma + \epsilon, \tag{1} $$
where Z contains only indicators for membership to each group g, that is, each $Z_g$ is a vector of ones, meaning that only intercepts may differ between groups. While useful to write in this form because it shows the role of Z, the model is often expressed as

$$ Y_{g[i]} = X_{g[i]}^{\top} \beta + \gamma_g + \epsilon_{g[i]}, \tag{2} $$
where $\gamma _g$ are group-specific deviations from the overall intercept in $\beta $ . Although a model can be fitted regardless of the correlation between the residuals and $X_{g[i]}$ , investigators are usually interested in understanding the effect of a treatment variable (in $X_{g[i]}$ ) on $Y_{g[i]}$ , which requires the “conditional independence assumption,” ${\mathbb{E}}(\epsilon _{g[i]} \ | \ X, Z)=0$ . This is an identification assumption relating to the form of confounding, although it is also subject to model specification. We describe the types of permitted confounding that are used to justify this assumption in Section 2.3.
The distinction between FE and MLM for the varying intercepts model in Equation (2) relates to the assumptions they employ during estimation. Both FE and MLM regard $\beta $ as “fixed,” that is, nonrandom with no distributional assumption. However, FE and MLM diverge in their handling of $\gamma _g$ . In FE, the $\gamma _g$ are regarded as fixed, like $\beta $ . FE thus estimates $\beta $ and $\gamma $ by including indicator variables for group membership (in Z) as additional regressors in an OLS of Y on X, or by equivalent demeaning/partialing-out procedures. We refer to this as “group fixed effects” (Group-FE) here. In practice, one group indicator and its corresponding $\gamma _g$ must be dropped, or the intercept term must be dropped from $X_{g[i]}$ .
MLM, by contrast, treats the $\gamma _g$ as “random effects,” meaning that each $\gamma _g$ is estimated under the assumption that it may have a distribution, that is, has nonzero variance rather than being a fixed quantity. This has implications for estimating both $\gamma $ and $\beta $ , as well as for constructing variance estimates, detailed in Section 2.4.Footnote 5 Specifically, we define the “random intercept” (RI) model,

$$ Y_{g[i]} = X_{g[i]}^{\top} \beta + \gamma_g + \epsilon_{g[i]}, \qquad \gamma_g \ | \ X, Z \overset{iid}{\sim} \mathcal{N}(0, \omega^2), $$

where we also assume ${{\rm cov}}(\gamma_g, \epsilon_{g'[i]} \ | \ X, Z) = 0$ for all $g,g'$ , and i. Additionally, the conditional independence assumption is elaborated to require a multivariate-normal distribution for the residual vectors in each group, $\epsilon _g \ | \ X, Z \overset {ind}{\sim } \mathcal {N}(0, \Sigma _g)$ where $\Sigma _g \in \mathbb{R}^{n_g \times n_g}$ is group g’s error covariance matrix. All $\gamma _g$ can be kept and estimated by this approach, together with an intercept in $\beta $ . It is also commonly assumed that the $\epsilon _{g[i]}$ are spherical (i.e., $\Sigma _g = \sigma ^2 I_{n_g}$ ), although we avoid that restriction unless otherwise noted.
General fixed effect and multilevel models
Although we focus mainly on Group-FE and RI models here, the claims made below pertain to FE and MLM in their more general form,

$$ Y_{g[i]} = X_{g[i]}^{\top} \beta + Z_{g[i]}^{\top} \gamma_g + \epsilon_{g[i]}, \tag{3} $$
where the $\gamma _g$ represent group-specific coefficients for the variables in $Z_g$ , which may now include variables other than group indicators as above. As before, FE estimates both $\beta $ and $\gamma $ by OLS regression of Y on X and Z. For identifiability, the covariates that $X_{g[i]}$ and $Z_{g[i]}$ share (or that result in perfect collinearity) are either dropped from $Z_{g[i]}$ for one g, or dropped from $X_{g[i]}$ .
When instead fit with random effects in MLM, the coefficients $\gamma _g$ are estimated with additional distributional assumptions, specifically

$$ \gamma_g \ | \ X, Z \overset{iid}{\sim} \mathcal{N}(0, \Omega), $$

where $\Omega \in \mathbb{R}^{d \times d}$ and we assume ${{\rm cov}}(\gamma_g, \epsilon_{g'[i]} \ | \ X, Z) = 0$ for all $g,g'$ , and i.Footnote 6 In addition, we define $ {{\rm var}}(\epsilon \ | \ X,Z) = \Sigma $ to be the ordered block diagonal matrix of $\Sigma _g$ , and $\Omega _{\text {block}} = {{\rm var}}(\gamma \ | \ X,Z)$ to be the block diagonal matrix of $\Omega$ , that is, $\Omega _{\text {block}} = I_G \otimes \Omega \in \mathbb{R}^{Gd \times Gd}$ .
2.3 Identification: From nonparametric conditions to specification requirements
Let us assume there is no within-group confounding. In longitudinal settings, this is to say there are no unobserved time-varying confounders. Such an assumption ensures non-parametric identifiability, that is, the causal effect of the treatment can be identified conditionally on group and then averaged together as desired across groups.Footnote 7
For such nonparametric identification to be sufficient for unbiased estimation, however, would require the ability to condition on group non-parametrically (i.e., estimate the relationship between treatment and outcome within each group separately). In most settings the investigator is unable to do this, and so turns to modeling assumptions that instead “account for” group in a specific model. Because of this model dependence, identification of treatment effects then requires additional assumptions related to the specification of those models.
The traditional motive for Group-FE is an assumption that the only source of confounding is group-level confounding that takes the linear form of Equation (2): a constant, additive shift by group (i.e., the $\gamma _g$ ). The conditional independence assumption needed is ${\mathbb{E}}(\epsilon _{g[i]} \ | \ X, Z)=0$ (e.g., Greene 2003). The Group-FE approach is powerful precisely because including Z fully purges the group-level intercepts, $\gamma _g$ , from this residual, thus removing confounding. So long as other (within-group) forms of confounding do not exist, this unbiasedly estimates $\beta $ .
The central concern with RI is that even under the conditional independence assumption, random effect estimation fails to remove this confounding. As we explain in Section 3.1, this is because RI does not fully account for the $\gamma _g$ , biasing the estimate of $\beta $ as well. Avoiding this bias requires additionally that the “uncorrelated random effects” assumption be true: that the $\gamma _g$ are uncorrelated with $X_{g[i]}$ , so that failure to account for $\gamma _g$ does not bias estimates of $\beta $ .Footnote 8 ${{\hspace {-0.0001pt}}}^{,}$ Footnote 9 This is problematic, since $\gamma _g$ is most important to account for when it is correlated with the treatment variable of interest in $X_{g[i]}$ and thus acts as a confounder. While this assumption, also referred to simply as the “random effects assumption” (Bell and Jones 2015; Kim and Steiner 2019), is well-known in principle, it remains widely neglected in practice (see Section 3.1) despite attractive solutions (see Section 3.2).
2.4 Parameter estimation in MLM
We begin by briefly reviewing how MLM parameters are estimated.Footnote 10 Estimation proceeds in three steps: (1) estimate $\Omega $ and $\Sigma $ , (2) estimate $\beta $ using the estimate of $(\Omega , \Sigma )$ , and (3) estimate $\gamma $ using the estimate of $(\beta , \Omega , \Sigma )$ . The first two steps require the distribution of Y given X and Z. Because, in MLM, the $Z_{g[i]}^{\top } \gamma _g$ are mean-zero given X and Z, the $(Z_{g[i]}^{\top } \gamma _g + \epsilon _{g[i]})$ can be treated as combined mean-zero error terms. This allows one to formulate MLM on the sample-level as $Y = X \beta + \epsilon ^{*}$ where $\epsilon ^{*} = Z \gamma + \epsilon $ . Then, because $\gamma $ and $\epsilon $ are both normally distributed given X and Z, we have

$$ Y \ | \ X, Z \sim \mathcal{N}\left(X\beta, \ V\right), \qquad V = Z \Omega_{\text{block}} Z^{\top} + \Sigma. $$
The likelihood is then

$$ \mathcal{L}(\beta, \Omega, \Sigma \ | \ Y, X, Z) = (2\pi)^{-N/2} \, |V|^{-1/2} \exp\left\{ -\tfrac{1}{2} (Y - X\beta)^{\top} V^{-1} (Y - X\beta) \right\}. \tag{5} $$
Given a choice of $(\Omega , \Sigma )$ , which determines V, maximizing the likelihood for $\beta $ would yield

$$ \hat{\beta} = (X^{\top} V^{-1} X)^{-1} X^{\top} V^{-1} Y. \tag{6} $$
However, estimates of $\Omega $ and $\Sigma $ do not enjoy similarly simple closed solutions. Instead, they are typically found iteratively through either unrestricted maximum likelihood estimation or restricted maximum likelihood estimation, both using Equation (5). $\beta $ can then be estimated by plugging estimates of $\Sigma $ and $\Omega $ into Equation (6). Finally, $\gamma $ is estimated by maximizing its posterior probability given Y, X, Z, and the estimated $(\beta , \Omega , \Sigma )$ , that is,

$$ \hat{\gamma}_{\text{MLM}} = \underset{\gamma}{\arg\max} \ p\left(\gamma \ | \ Y, X, Z, \hat{\beta}_{\text{MLM}}, \hat{\Omega}_{\text{MLM}}, \hat{\Sigma}_{\text{MLM}}\right) = \hat{\Omega}_{\text{block}} Z^{\top} \hat{V}_{\text{MLM}}^{-1} (Y - X \hat{\beta}_{\text{MLM}}), $$

where $\hat{V}_{\text{MLM}} = Z \hat{\Omega}_{\text{block}} Z^{\top} + \hat{\Sigma}_{\text{MLM}}$ .
In this article, we focus solely on estimation and inference for $\beta $ and $\gamma $ . Because MLM’s estimates of these parameters are functions of $\Omega $ and $\Sigma $ , our results hold regardless of the choice between unrestricted or restricted maximum likelihood in estimating $\Omega $ and $\Sigma $ . For this reason, we refer to arbitrary MLM estimates of the parameters as $\hat {\Omega }_{\text {MLM}}$ , $\hat {\Sigma }_{\text {MLM}}$ , $\hat {\beta }_{\text {MLM}}$ , and $\hat {\gamma }_{\text {MLM}}$ .
Finally, treating $\hat {V}_{\text {MLM}}$ as fixed, the conditional variance of $\hat {\beta }_{\text {MLM}}$ is simply

$$ {{\rm var}}(\hat{\beta}_{\text{MLM}} \ | \ X, Z) = (X^{\top} \hat{V}_{\text{MLM}}^{-1} X)^{-1} X^{\top} \hat{V}_{\text{MLM}}^{-1} \, {{\rm var}}(Y \ | \ X, Z) \, \hat{V}_{\text{MLM}}^{-1} X (X^{\top} \hat{V}_{\text{MLM}}^{-1} X)^{-1}, $$
which is commonly simplified to $(X^{\top } \hat {V}_{\text {MLM}}^{-1} X)^{-1}$ as the standard model-based MLM variance estimator, by assuming that $\hat {V}_{\text {MLM}}$ is a consistent estimator of ${{\rm var}}(Y \ | \ X, Z)$ .
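The three estimation steps above can be sketched numerically. The following is a minimal illustration in Python with numpy (an assumption of ours; the paper supplies no code), taking the variance components $(\omega^2, \sigma^2)$ of a random intercept model as known rather than estimated by (restricted) maximum likelihood, so that the GLS step and the posterior-mode step can be written in closed form:

```python
import numpy as np

rng = np.random.default_rng(0)
G, n_g, sigma2, omega2 = 30, 8, 1.0, 2.0
N = G * n_g

# Design: overall intercept plus one unit-level covariate.
X = np.column_stack([np.ones(N), rng.normal(size=N)])
Z = np.kron(np.eye(G), np.ones((n_g, 1)))          # group-membership indicators
gamma = rng.normal(scale=np.sqrt(omega2), size=G)  # random intercepts
beta = np.array([0.5, 2.0])
Y = X @ beta + Z @ gamma + rng.normal(scale=np.sqrt(sigma2), size=N)

# V = var(Y | X, Z) = omega^2 Z Z' + sigma^2 I, here treated as known.
V = omega2 * Z @ Z.T + sigma2 * np.eye(N)
Vinv = np.linalg.inv(V)

# Step (2): GLS estimate of beta, as in Equation (6), with its
# model-based variance (X' V^{-1} X)^{-1}.
beta_hat = np.linalg.solve(X.T @ Vinv @ X, X.T @ Vinv @ Y)
var_beta = np.linalg.inv(X.T @ Vinv @ X)

# Step (3): posterior-mode (BLUP) estimate of the random intercepts.
gamma_hat = omega2 * Z.T @ Vinv @ (Y - X @ beta_hat)
```

In practice the first step, estimating $(\Omega, \Sigma)$ , is the iterative one; the remaining two are the linear-algebra steps shown here.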
3 Analytical insights
We now analyze three features of MLM that aid in understanding how these methods compare.
3.1 Random effects as regularization and the “incomplete conditioning” problem
The first analytical piece shows an equivalence between the random effects employed in MLM and regularization, that is, a penalized fitting approach. We begin by introducing the “regularized fixed effects” (regFE) class of models. Suppose we are interested in optimal out-of-sample generalization. Then, instead of estimating Equation (2) by minimizing (in-sample) squared error, we minimize the squared error plus a penalty proportional to the sum of squared $\gamma _g$ values, known as $L_2$ or Tikhonov regularization:

$$ (\hat{\beta}_{\text{regFE}}, \hat{\gamma}_{\text{regFE}}) = \underset{\beta, \gamma}{\arg\min} \ \sum_{g=1}^{G} \sum_{i=1}^{n_g} \left(Y_{g[i]} - X_{g[i]}^{\top}\beta - \gamma_g\right)^2 + \lambda \sum_{g=1}^{G} \gamma_g^2, \tag{9} $$
where $\lambda>0$ determines the extent of regularization on $\gamma _g$ —larger values shrink each $\gamma _g$ toward 0, while smaller values yield estimated intercepts closer to Group-FE’s estimates—and may be chosen by various means such as some form of cross-validation.Footnote 11 The key result is that for a particular choice of $\lambda $ , such a regularized model is equivalent to the RI model.
Theorem 3.1 Equivalence of RI and regFE
Let $(\hat {\omega }^2_{\text {RI}}$ , $\hat {\Sigma }_{\text {RI}}$ , $\hat {\beta }_{\text {RI}}$ , $\hat {\gamma }_{\text {RI}})$ be estimates from the RI model. If (i) $\Sigma = \sigma ^2 I_N$ and (ii) $\lambda = \hat {\sigma }^2_{\text {RI}}/\hat {\omega }^2_{\text {RI}}$ in Equation (9), then $(\hat {\beta }_{\text {regFE}}, \hat {\gamma }_{\text {regFE}}) = (\hat {\beta }_{\text {RI}}, \hat {\gamma }_{\text {RI}})$ .
We omit the proof here, as Theorem 3.2 below offers a generalization. That MLM provides “shrinkage” or “partial-pooling” estimates is very well-known (e.g., Steenbergen and Jones 2002; Fitzmaurice, Laird, and Ware 2004; Luke 2004; Gelman 2006), and in recent texts MLM has even been referred to as employing a “regularizing prior” (e.g., McElreath 2018). The equivalence between regularization and holding a prior on the regularized coefficients is also well known.Footnote 12 That said, because we did not find any explicit formalization of the equivalence of estimation in MLM to regularization with a specific choice of $\lambda $ in any textbooks or articles we reviewed, we fill this gap.
We demonstrate with a simple simulation example, drawing 1,000 datasets from the following data generating process (DGP) with $G=25$ and $n_g = 15$ each time:
where $\lambda $ for regFE is chosen by cross-validation and $(\hat {\sigma }^2_{\text {RI}}, \hat {\omega }_{\text {RI}}^2)$ for RI by restricted maximum likelihood.Footnote 13 We see in Figure 1 that the estimates of $\beta $ and $\gamma $ are nearly identical. They would be numerically equal if, instead of setting $\lambda $ in regFE by cross-validation, we chose $\lambda = \hat {\sigma }^2_{\text {RI}} / \hat {\omega }^2_{\text {RI}}$ as per Theorem 3.1. However, despite the different selection procedures, we also see from Figure 1 that the $\lambda $ from RI and regFE are quite similar. This is because while $\lambda = \hat {\sigma }^2_{\text {RI}}/\hat {\omega }^2_{\text {RI}}$ in RI is not made specifically with predictive accuracy in mind, the choice is a sensible one from a prediction point of view. To see this, note first that the portion of $Y_{g[i]}$ that is not explained by $X_{g[i]}$ is the sum of $\gamma _g$ and $\epsilon _{g[i]}$ , which have variances of $\omega ^2$ and $\sigma ^2$ , respectively. When, for example, the $\gamma _g$ have high variance in relation to that of $\epsilon _{g[i]}$ , $\lambda = \sigma ^2/\omega ^2$ will be small, which is appropriate since $\gamma _g$ will be helpful in predicting $Y_{g[i]}$ . The converse is true when the $\gamma _g$ have low variance relative to $\epsilon _{g[i]}$ , leading to a higher effective $\lambda $ and desirable shrinkage on $\gamma _g$ to avoid over-fitting.
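Theorem 3.1 can also be checked numerically once the variance components are held fixed, sidestepping the REML step. The sketch below (Python/numpy, our own invented design, not the DGP above) computes the RI estimates via GLS and the posterior mode, then the regFE estimates by solving the penalized least-squares problem with $\lambda = \sigma^2/\omega^2$ , and confirms they coincide:

```python
import numpy as np

rng = np.random.default_rng(1)
G, n_g, sigma2, omega2 = 25, 15, 1.0, 0.5
N = G * n_g

X = np.column_stack([np.ones(N), rng.normal(size=N)])
Z = np.kron(np.eye(G), np.ones((n_g, 1)))   # group indicators
Y = (X @ np.array([1.0, 2.0])
     + Z @ rng.normal(scale=np.sqrt(omega2), size=G)
     + rng.normal(scale=np.sqrt(sigma2), size=N))

# RI side: GLS for beta, posterior mode for gamma, (sigma2, omega2) fixed.
V = omega2 * Z @ Z.T + sigma2 * np.eye(N)
Vinv = np.linalg.inv(V)
beta_ri = np.linalg.solve(X.T @ Vinv @ X, X.T @ Vinv @ Y)
gamma_ri = omega2 * Z.T @ Vinv @ (Y - X @ beta_ri)

# regFE side: minimize ||Y - X beta - Z gamma||^2 + lam * ||gamma||^2
# with lam = sigma2 / omega2, penalizing only the gammas.
lam = sigma2 / omega2
XZ = np.hstack([X, Z])
P = np.zeros((XZ.shape[1], XZ.shape[1]))
P[X.shape[1]:, X.shape[1]:] = lam * np.eye(G)
theta = np.linalg.solve(XZ.T @ XZ + P, XZ.T @ Y)
beta_reg, gamma_reg = theta[:2], theta[2:]
```

The two solution paths agree to numerical precision, as Theorem 3.1 predicts; the penalized normal equations here are Henderson’s mixed-model equations.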
More generally, consider estimating Equation (3) with the regularized regression:

$$ (\hat{\beta}_{\text{regFE}}, \hat{\gamma}_{\text{regFE}}) = \underset{\beta, \gamma}{\arg\min} \ \sum_{g=1}^{G} \sum_{i=1}^{n_g} \left(Y_{g[i]} - X_{g[i]}^{\top}\beta - Z_{g[i]}^{\top}\gamma_g\right)^2 + \sum_{g=1}^{G} \gamma_g^{\top} \Lambda \gamma_g. \tag{11} $$
Again, $\Lambda $ may be obtained by various means, such as cross-validation.Footnote 14 The equivalence with MLM is then given in Theorem 3.2.
Theorem 3.2 General equivalence of MLM and regFE
Let $(\hat {\Omega }_{\text {MLM}}$ , $\hat {\Sigma }_{\text {MLM}}$ , $\hat {\beta }_{\text {MLM}}$ , $\hat {\gamma }_{\text {MLM}})$ be the estimates from MLM. If (i) $\Sigma = \sigma ^2 I_N$ and (ii) $\Lambda = \hat {\sigma }^2_{\text {MLM}} \hat {\Omega }_{\text {MLM}}^{-1}$ in Equation (11), then $(\hat {\beta }_{\text {regFE}}, \hat {\gamma }_{\text {regFE}}) = (\hat {\beta }_{\text {MLM}}, \hat {\gamma }_{\text {MLM}})$ .
The proof of Theorem 3.2 is given in Appendix A.3, and it covers Theorem 3.1 as the special case with $Z_{g[i]} = [1]$ . This equivalence is useful in comprehending several of MLM’s most central advantages and limitations. First, the equivalence to regularization immediately explains why MLM can produce more accurate (out-of-sample) outcome predictions than does FE: regularization prevents the over-fitting that FE may allow, particularly in the case of small groups.Footnote 15
Further, MLM is sometimes understood to “wisely” adapt the level of shrinkage based on group size, but the comparison to regularization shows that such adaptation is not as sophisticated as it may appear. When MLM or regFE encounters a group that appears to have a very high or low mean relative to others, choosing an extreme $\gamma _g$ would decrease the sum of squared errors for that group, $\sum _{i=1}^{n_g} (Y_{g[i]} - X^{\top }_{g[i]} \beta - Z_{g[i]}^{\top } \gamma _g)^2$ . For a relatively small group, the savings in squared loss would be smaller relative to the additional “cost” paid through the regularization penalty, $ \gamma _g^{\top } \Lambda \gamma _g$ , and so $\gamma _g$ will be left relatively close to zero. By comparison, when MLM or regFE encounters a larger group, the savings in terms of squared loss would justify the added regularization penalty, so the choice of $\gamma _g$ minimizing the penalized sum of squared errors will be a more extreme one. Although this behavior appears to intelligently balance prior knowledge with allowing the data to speak, it is reproduced simply through regularization.
Another feature of MLM that can be understood through the regularization view is its ability to estimate coefficients for group-level variables even while including group-specific intercepts and slopes that would have prevented model identification under OLS. This occurs for the same reason that one can include more coefficients than observations in a “ridge regression.” Consider attempting to find $(\hat {\beta }, \hat {\gamma })$ purely by OLS,

$$ (\hat{\beta}, \hat{\gamma}) = \underset{\beta, \gamma}{\arg\min} \ \sum_{g=1}^{G} \sum_{i=1}^{n_g} \left(Y_{g[i]} - X_{g[i]}^{\top}\beta - Z_{g[i]}^{\top}\gamma_g\right)^2. $$
This has no unique solution if $X_{g[i]}$ contains a group-level variable or its interaction with a variable in $Z_{g[i]}$ , as $[X \ Z]^{\top } [X \ Z]$ is then singular. However, by introducing the regularization penalty into the minimization problem, regFE essentially adds to $[X \ Z]^{\top } [X \ Z]$ a positive semi-definite matrix that allows the sum of the two matrices to be invertible.
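This can be seen in a few lines of numpy (an invented toy design of ours): with a group-level covariate $U_g$ in X alongside the group indicators, $[X \ Z]^{\top}[X \ Z]$ is rank-deficient, but adding the penalty matrix restores invertibility and hence a unique solution.

```python
import numpy as np

rng = np.random.default_rng(2)
G, n_g = 10, 4
N = G * n_g

Z = np.kron(np.eye(G), np.ones((n_g, 1)))   # group indicators
U = np.repeat(rng.normal(size=G), n_g)      # a group-level covariate
X = np.column_stack([np.ones(N), U])        # intercept + group-level variable
XZ = np.hstack([X, Z])

# [X Z]'[X Z] is singular: both the intercept and U are exact linear
# combinations of the group indicator columns.
M = XZ.T @ XZ

# The ridge penalty on gamma adds a positive semi-definite block,
# making the penalized cross-product matrix invertible.
P = np.zeros_like(M)
P[2:, 2:] = 1.0 * np.eye(G)   # penalize only the G group coefficients
```

Here `np.linalg.matrix_rank(M)` is G = 10 (not the full 12), while `M + P` has full rank, so the penalized problem in Equation (11) is solvable even though the OLS problem is not.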
Finally, this equivalence illuminates the main concern with MLM: bias when the random effects are correlated with $X_{g[i]}$ . Because the group-specific intercepts in a RI model are regularized, they do not achieve the values that would “fully absorb” group-specific confounding, leaving components unexplained that can instead be captured by biasing the coefficients on $X_{g[i]}$ . In Equation (2), where we hope to condition on group to absorb group-level confounders, random effects thus offer only “incomplete conditioning,” not fully accounting for the unobserved group-level variables’ influence on the outcome. To illustrate, consider the following DGP:
where $X_{g[i]} \in \mathbb{R}$ is an observed observation-level variable and the $W^{(\ell )}_g$ are unobserved group-level covariates, with $W^{(1)}_g$ being a confounder. While Group-FE will unbiasedly estimate $\beta _1$ , a simple OLS regression of Y on X would produce a biased estimate of $\beta _1$ , having failed to account for $W^{(1)}_g$ . RI may seem more appealing than OLS because investigators might hope that the $\gamma _g$ will capture the group-level confounding. However, the shrinkage of $\gamma _g$ due to treating them as random effects yields estimated intercepts closer to zero than those obtained by Group-FE and required to account for $W^{(1)}_g$ . This leaves some of $W^{(1)}_g$ unabsorbed, allowing it to continue biasing $\hat {\beta }_1$ as in the OLS model. We call this “incomplete conditioning” because the intended analytical strategy required to estimate $\beta $ unbiasedly would condition on group, but the use of RI fails to achieve this.
Figure 2 illustrates this. Define bias and root mean square error (RMSE) as,

$$ \text{Bias} = \frac{1}{M} \sum_{m=1}^{M} \left(\hat{\beta}^{(m)} - \beta\right), \qquad \text{RMSE} = \sqrt{\frac{1}{M} \sum_{m=1}^{M} \left(\hat{\beta}^{(m)} - \beta\right)^2}, $$
where m indexes the number of iterations from 1 to M and $\hat {\beta }^{(m)}$ is the estimate from the mth iteration. Bias is large, as expected, for OLS as it fails to account for group-level confounding at all. RI makes almost no improvement when groups are small ( $n_g=5$ ), and only a partial improvement when the groups are quite large ( $n_g=50$ ). Note that unbiasedness of RI would require the absence of correlated random effects, that is no correlation between $X_{g[i]}$ and $W_g^{(1)}$ , which is to say an absence of group-level confounding. Had this been the case, OLS would also suffice, and would differ from RI only in its efficiency and how variance is estimated. By contrast, in the presence of such correlation, Group-FE effectively eliminates confounding bias at both group sizes. In terms of efficiency, while RI has the expected slight decrease in variance, its RMSE remains about twice that of Group-FE at either group size due to these biases. That RI’s average bias falls as $n_g$ rises may seem to be a cause for hope when one has a large enough dataset, and Lockwood and McCaffrey (2007) describe the conditions under which this type of bias tends to 0 as $n_g \rightarrow \infty $ generally. However, in practice, there is no knowing if $n_g$ is large enough to ensure negligible bias in a given case, as this depends on the correlation between the covariates and the random effects. We also note that with group-level covariates, increasing $n_g$ may not alleviate bias (see Appendix A.4).
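The incomplete-conditioning pattern is easy to reproduce in a simplified simulation. The sketch below (Python/numpy) is not the DGP behind Figure 2: it uses an invented design with a single unobserved group-level confounder, and it fixes the RI shrinkage at the true variance ratio via the regularization equivalence of Theorem 3.1 rather than estimating it by REML. The qualitative ordering—Group-FE roughly unbiased, OLS badly biased, RI in between—still emerges.

```python
import numpy as np

rng = np.random.default_rng(3)
G, n_g, M, beta1 = 40, 5, 300, 1.0
N = G * n_g
Z = np.kron(np.eye(G), np.ones((n_g, 1)))
lam = 1.0 / 4.0   # sigma^2/omega^2 at the true values: var(2 W_g) = 4, var(eps) = 1
bias = {"ols": [], "ri": [], "fe": []}

for _ in range(M):
    W = rng.normal(size=G)                       # unobserved group-level confounder
    X = np.repeat(W, n_g) + rng.normal(size=N)   # treatment, correlated with W
    Y = beta1 * X + 2.0 * np.repeat(W, n_g) + rng.normal(size=N)

    # Pooled OLS: ignores the group-level confounder entirely.
    b_ols = np.linalg.lstsq(np.column_stack([np.ones(N), X]), Y, rcond=None)[0][1]

    # Group-FE: OLS with a full set of group indicators (no overall intercept).
    b_fe = np.linalg.lstsq(np.column_stack([X, Z]), Y, rcond=None)[0][0]

    # RI via its regularized-FE equivalent: ridge penalty lam on the
    # group intercepts, shrinking them toward zero.
    XZ = np.column_stack([np.ones(N), X, Z])
    P = np.zeros((2 + G, 2 + G))
    P[2:, 2:] = lam * np.eye(G)
    b_ri = np.linalg.solve(XZ.T @ XZ + P, XZ.T @ Y)[1]

    bias["ols"].append(b_ols - beta1)
    bias["fe"].append(b_fe - beta1)
    bias["ri"].append(b_ri - beta1)

avg_bias = {k: float(np.mean(v)) for k, v in bias.items()}
```

Because the shrunken intercepts absorb most but not all of the group-level confounding, RI’s average bias here lands between Group-FE’s (near zero) and OLS’s (large).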
Comparison to practice
For each of the three analyses in Sections 3.1, 3.2 and 3.3, we briefly contrast what is already known of these claims to what we find in practice. In this case, both textbooks and pedagogical articles discuss the correlated random effects assumption extensively, albeit not usually in terms of regularization or incomplete conditioning. Yet, this most central of concerns regarding MLM is demonstrably neglected in empirical practice. Among the MLM-based studies we reviewed where such bias would be at issue, only 1 of 24 education articles, 13 of 39 political science articles, and 10 of 19 sociology articles addressed the issue.
3.2 Bias-corrected MLM
The second analytical piece is that while MLM’s bias problem with correlated random effects appears to be dire, there is a simple solution dating back to Mundlak (1978). With this fix, MLM unbiasedly estimates coefficients for variables below the level of grouping or clustering, like FE does, while retaining the ability to include group-level variables in the model and the superior predictive accuracy that arises from regularization/random effects. For RI, the fix requires adding to the model the group-level means of $X_{g[i]}$ (including any interactions or other nonlinear terms).
As a stepping stone, consider a similar approach that could be applied to OLS as a substitute for FE. We consider again DGP 1 from Section 3.1, but suppose we estimate the following model by OLS,

$$ Y_{g[i]} = \beta_0 + \beta_1 X_{g[i]} + \alpha \bar{X}_g + \epsilon_{g[i]}, $$
where $\bar {X}_g =\frac {1}{n_g} \sum _{i=1}^{n_g} X_{g[i]}$ . Informally, adding $\bar {X}_g$ “soaks up” any contribution that $W_g^{(1)}$ could have made to $Y_{g[i]}$ that could be correlated with $X_{g[i]}$ , protecting the estimate of $\beta _1$ in the same way that FE would, and leaving an unbiased estimate of $\beta _1$ under the conditional independence assumption. Proof is given in Appendix A.5. This bias-curing effect of including $\bar {X}_g$ in OLS can similarly be applied to the RI model, and alleviates RI’s bias problem in estimating $\beta _1$ when $\Sigma = \sigma ^2 I_N$ . We omit the proof of this result, as we prove a more general claim below. In fact, an OLS model including $\bar {X}_g$ , the Group-FE model, and the RI model including $\bar {X}_g$ all produce exactly the same point estimates. Despite this equivalence, the RI model retains MLM’s greater (out-of-sample) predictive accuracy for the outcome compared to Group-FE—a benefit of the regularization imposed on $\gamma _g$ , as illustrated in Figure 3.
One can easily generalize this alteration to other models with varying intercepts but with multiple covariates or treatments: simply add to the RI model the group-level means of all included variables, $\bar {X}_g = \frac {1}{n_g} \sum _{i=1}^{n_g} X_{g[i]}$ , including any interactions or nonlinear transformations. If spherical observation-level errors are assumed, the coefficient estimates from this model for lower-level variables are exactly equal to those obtained by Group-FE. Historically, the inclusion of $\bar {X}_g$ was proposed by Mundlak (1978) and extended by Chamberlain (1979, 1982), and is known in some econometrics-informed traditions as the “correlated random effects” (CRE) approach (Wooldridge 2010; Schunck 2013). Although originally motivated by imposing the assumption ${\mathbb{E}} (\gamma _g \ | \ X_{g}, Z_{g}) = \bar {X}_g^{\top } \alpha $ , we show unbiasedness without this assumption by showing its equivalence to “partialing out” of the group-level effects, which is in turn equivalent to Group-FE. A closely related approach is the hybrid model of Allison (2009), which group-demeans $X_{g[i]}$ in addition to including $\bar {X}_g$ . This model is an isomorphic variation on models that only include $\bar {X}_g$ without demeaning, producing the same coefficient of interest on X but altering the way in which coefficients are combined to interpret between-group differences (see Schunck 2013).
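The exact equality between the Group-FE coefficient and the coefficient from OLS with the Mundlak/CRE group means can be verified directly. A sketch under an invented DGP with one unobserved group-level confounder (Python/numpy; the equality is algebraic, via the Frisch–Waugh–Lovell theorem, so it holds for any dataset):

```python
import numpy as np

rng = np.random.default_rng(4)
G, n_g = 20, 6
N = G * n_g
group = np.repeat(np.arange(G), n_g)

W = rng.normal(size=G)              # unobserved group-level confounder
X = W[group] + rng.normal(size=N)   # treatment, correlated with W
Y = 1.0 * X + 2.0 * W[group] + rng.normal(size=N)

Z = np.zeros((N, G))
Z[np.arange(N), group] = 1.0        # group indicators
Xbar = (Z @ (Z.T @ X)) / n_g        # group mean of X, repeated per unit

# Group-FE: OLS of Y on X plus the full set of group indicators.
b_fe = np.linalg.lstsq(np.column_stack([X, Z]), Y, rcond=None)[0][0]

# Mundlak / CRE: OLS of Y on an intercept, X, and the group means of X.
b_cre = np.linalg.lstsq(np.column_stack([np.ones(N), X, Xbar]),
                        Y, rcond=None)[0][1]
```

The two coefficients on X agree to machine precision, and both are protected from the group-level confounding that would bias pooled OLS without `Xbar`.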
We now turn to the more general debiasing approach for cases allowing multiple random coefficients and not just group intercepts. We refer to this as “bias-corrected MLM” (bcMLM), and use this label throughout the remainder of the paper to cover the special case of bias-corrected RI as well.Footnote 16 We first take the projections of the “fixed effect variables” ( $X_g$ ), excluding the intercept, onto the random effect variables ( $Z_{g}$ ) within each group, $\tilde {X}_{g[i]} = Z_{g[i]}^{\top } (Z_g^{\top } Z_g)^{-1} Z_g^{\top } X_{g}$ .Footnote 17 These projections are then added to the regression as “fixed effect variables,”

$$ Y_{g[i]} = X_{g[i]}^{\top} \beta + \tilde{X}_{g[i]} \alpha + Z_{g[i]}^{\top} \gamma_g + \epsilon_{g[i]}, $$
where $\beta$ and $\alpha$ are assumed fixed, and we continue to assume that $\epsilon_g \ | \ X, Z \overset{ind}{\sim} \mathcal{N}(0, \Sigma_g)$ and that $\gamma_g$ is independent of $\epsilon_{g'[i]}$ for all $g, g'$, and $i$. When $\Sigma = \sigma^2 I_N$, and provided it is not overidentified (described below), bcMLM produces estimates for $\beta$ that are unbiased under the conditional independence assumption (Appendix A.8) and identical to FE estimates (Appendix A.9).
We make two remarks on this result. First, it provides unbiasedness only when spherical observation-level errors are assumed.Footnote 18 While the choice of variance-covariance matrix for the errors has no impact on point estimates under FE, the point estimates from MLM are sensitive to this choice. Second, the RI model including $\bar {X}_g$ noted above is a special case of bcMLM: if $Z_g = \vec {\mathbb {1}}_{n_g}$ , then $\tilde {X}_{g[i]} = [1] ( \vec {\mathbb {1}}_{n_g}^{\ \top } \vec {\mathbb {1}}_{n_g})^{-1} \vec {\mathbb {1}}_{n_g}^{\ \top } X_{g} = \bar {X}_g^{\top } $ .
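As an illustrative sketch (our construction, with a hypothetical helper name), the projection $\tilde{X}_{g[i]}$ can be computed group by group, and in the random-intercept special case it reduces to the group means, as the second remark states:

```python
import numpy as np

rng = np.random.default_rng(1)

def bc_projection(X, Z, g):
    """Group-wise projection of the fixed-effect columns X onto the
    random-effect columns Z: X_tilde = Z (Z'Z)^{-1} Z' X within each group."""
    Xt = np.empty_like(X, dtype=float)
    for k in np.unique(g):
        idx = g == k
        Zg = Z[idx]
        Xt[idx] = Zg @ np.linalg.lstsq(Zg, X[idx], rcond=None)[0]
    return Xt

# Special case: random intercept only (Z_g a column of ones) reduces to
# the group means of X, recovering the Mundlak correction.
g = np.repeat(np.arange(5), 4)
X = rng.normal(size=(20, 1))
Z = np.ones((20, 1))
Xt = bc_projection(X, Z, g)
xbar = np.array([X[g == k, 0].mean() for k in g])
assert np.allclose(Xt[:, 0], xbar)
```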
Limitations and the per-cluster regression
One limitation of bcMLM is that it can fail to eliminate potential biases for coefficients of certain variables because including $\tilde{X}_{g[i]}$ results in an overidentified model. A simple example would be an RI model that includes a group-level covariate, $U_g$, in $X_{g[i]}$. The proposed alteration suggests including $\bar{U}_g = \frac{1}{n_g} \sum_{i=1}^{n_g} U_g$, but this is simply the same $U_g$ already included (see Appendix A.6 for an example). Therefore, unless $U_g$ is independent of the random effects and any included lower-level variables (e.g., if $U_g$ were randomly assigned as a group-level treatment), the estimated coefficients for $U_g$ may be biased.Footnote 19
Thankfully, as long as $U_g$ is uncorrelated with the random intercepts, its coefficient can be unbiasedly estimated if desired by adding a “per-cluster regression” step as proposed by Bates et al. (2014). For example, suppose we have one observation-level variable $X_{g[i]}$ and one group-level variable $U_g$, with coefficients $\beta_1$ and $\beta_2$, respectively. First, using Group-FE or bcMLM (here, an RI model including $\bar{X}_{g}$), unbiasedly estimate $\hat{\beta}_1$. Then remove the estimated marginal effect of X from Y, forming $Y_{g[i]}^{\perp} = Y_{g[i]} - \hat{\beta}_1 X_{g[i]}$. The per-cluster step is then to regress the G group-level means of $Y_{g[i]}^{\perp}$ on $U_g$ and an intercept term by OLS. We provide an example of this process in Appendix A.12.
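The two steps can be sketched as follows (a simulation of our own construction, assuming, as required, that $U_g$ is independent of the random intercepts while $X_{g[i]}$ is not):

```python
import numpy as np

rng = np.random.default_rng(2)

# Simulated data: beta1 = 2 (observation level), beta2 = 1.5 (group level);
# random intercepts gamma are correlated with x but independent of U.
G, n = 30, 8
g = np.repeat(np.arange(G), n)
U = rng.normal(size=G)
gamma = rng.normal(size=G)
x = gamma[g] + rng.normal(size=G * n)
y = 2.0 * x + 1.5 * U[g] + gamma[g] + rng.normal(size=G * n)

# Step 1: unbiased beta1 via the within (Group-FE) estimator.
xbar = np.array([x[g == k].mean() for k in g])
ybar = np.array([y[g == k].mean() for k in g])
b1 = (x - xbar) @ (y - ybar) / ((x - xbar) @ (x - xbar))

# Step 2: strip the estimated effect of x, then regress the G group means
# of the residualized outcome on U and an intercept by OLS.
y_perp = y - b1 * x
ybar_perp = np.array([y_perp[g == k].mean() for k in range(G)])
D = np.column_stack([np.ones(G), U])
b2 = np.linalg.lstsq(D, ybar_perp, rcond=None)[0][1]
```

In this simulation `b1` and `b2` recover $\beta_1 = 2$ and $\beta_2 = 1.5$ up to sampling error, whereas an uncorrected RI fit would be biased for $\beta_1$.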
Another important example of an overidentification limitation to bcMLM occurs when the slope for $X_{g[i]}^{(\ell)}$ is allowed to vary, that is, when $X_{g[i]}^{(\ell)}$ is included in $Z_{g[i]}$. Because bcMLM includes as extra covariates the predictions of $X_{g[i]}^{(\ell)}$ using $Z_{g[i]}$, and $Z_{g[i]}$ predicts $X_{g[i]}^{(\ell)}$ perfectly, including this “prediction” simply includes $X_{g[i]}^{(\ell)}$ in the model twice. One copy of $X_{g[i]}^{(\ell)}$ would be dropped from the model, and the “prediction” of $X_{g[i]}^{(\ell)}$ cannot soak up the bias from any potentially correlated random effects when estimating $\beta_{\ell}$. This is also true of coefficients for any cross-level interactions, $X_{g[i]}^{(\ell)} U_{g}$. Here again the per-cluster regression provides an option for users who are interested in those coefficients, as demonstrated in Appendix A.13.Footnote 20
Comparison to practice
The main takeaway from this analysis is that bcMLM removes the bias of MLM due to correlation of random effects with the treatment, and in so doing produces coefficient estimates identical to FE (when $\Sigma = \sigma^2 I_N$ is assumed), while retaining MLM’s superior predictive accuracy for the outcome and the ability to model group-level variables, which may or may not be of interest to investigators depending on their task. Yet, among articles we reviewed where bias due to correlated random effects would be at issue, none of the 24 education articles, 1 of 39 political science articles, and 2 of 19 sociology articles employed bcMLM or any other suitable debiasing approach for MLM.
3.3 Variance estimation
The third and final analytical piece is that the “default” MLM standard errors are based on narrow assumptions that are often inappropriate, but this too can be remedied. A commonly cited motive for using MLM is the claim that it ensures appropriate standard errors for multilevel data (e.g., Luke 2004; Morgan and Kelly 2017; Beazer and Blake 2018; Campbell and Ronfeldt 2018; Rueda 2018; Clements et al. 2019; Pardos-Prado and Xena 2019). This is only true when the sole source of heteroskedasticity or dependency between residuals arises from the random effects included by $Z_{g[i]}$ and their contributions to $Y_{g[i]}$. That a default standard error exists for MLM, and is the only choice in some software, does not imply it is always an appropriate choice.
Recall from Section 2.4 that the random effects contribution, $Z_{g[i]}^{\top} \gamma_g$, and the idiosyncratic error, $\epsilon_{g[i]}$, can be thought of as a single mean-zero error term $\epsilon^{*}_{g[i]} = Z_{g[i]}^{\top} \gamma_g + \epsilon_{g[i]}$ in the model $Y_{g[i]} = X^{\top}_{g[i]} \beta + \epsilon^{*}_{g[i]}$. The dependency between two observations’ outcomes within the same group, conditional on X and Z, is then
$${\text{cov}}(Y_{g[i]}, Y_{g[i']} \ | \ X, Z) = {\mathbb{E}}(\epsilon^{*}_{g[i]} \epsilon^{*}_{g[i']} \ | \ X, Z) = Z_{g[i]}^{\top} \Omega Z_{g[i']} + {\text{cov}}(\epsilon_{g[i]}, \epsilon_{g[i']} \ | \ X, Z).$$
MLM can be understood as a framework for structuring this covariance, by specifying which variables enter the models as random effects (i.e., $Z_{g[i]}$) and parameterizing $\Omega$ and $\Sigma$. In the RI model, with $Z_{g[i]} = [1]$, the combined error, $\epsilon^*_{g[i]}$, is $\gamma_g + \epsilon_{g[i]}$. Under the conditional independence assumption and spherical errors, $\epsilon_{g[i]} \ | \ X, Z \overset{iid}{\sim} N(0, \sigma^2)$, the dependence structure is
$${\text{cov}}(\epsilon^{*}_{g[i]}, \epsilon^{*}_{g[i']} \ | \ X, Z) = \omega^2 + \mathbb{1}_{\{i=i'\}} \sigma^2,$$
where $\omega^2$ is the variance of the random intercepts.
In other words, $Y_{g[i]}$ is modeled as linear in $X_{g[i]}$ with error, but instead of independent observations, there is constant covariance between observations in the same group, and this covariance does not differ by group. This yields a compound symmetric covariance matrix for each group’s error vector, $\epsilon_g^{*} = Y_g - X_g \beta$. If $Z_{g[i]} = [1 \ X^{(1)}_{g[i]}]^{\top}$, then $\epsilon^*_{g[i]} = \gamma_{0g} + \gamma_{1g} X^{(1)}_{g[i]} + \epsilon_{g[i]}$. Maintaining that $\epsilon_{g[i]} \ | \ X, Z \overset{iid}{\sim} N(0, \sigma^2)$ and that (conditionally on X and Z) $\gamma_{0g}$ and $\gamma_{1g}$ are drawn from $N(0, \omega_0^2)$ and $N(0, \omega_1^2)$ with covariance $\omega_{01}$ yields intragroup covariances of
$${\text{cov}}(\epsilon^{*}_{g[i]}, \epsilon^{*}_{g[i']} \ | \ X, Z) = \omega_0^2 + \omega_{01}\left(X^{(1)}_{g[i]} + X^{(1)}_{g[i']}\right) + \omega_1^2 X^{(1)}_{g[i]} X^{(1)}_{g[i']} + \mathbb{1}_{\{i=i'\}} \sigma^2.$$
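These implied covariances can be checked numerically. The sketch below (our construction) builds the group covariance $V_g = Z_g \Omega Z_g^{\top} + \sigma^2 I_{n_g}$ for a random-intercept-and-slope model and verifies that each entry equals $\omega_0^2 + \omega_{01}(x_i + x_{i'}) + \omega_1^2 x_i x_{i'} + \mathbb{1}_{\{i=i'\}}\sigma^2$:

```python
import numpy as np

def implied_group_cov(x, Omega, sigma2):
    """Marginal covariance of one group's combined errors for a
    random-intercept-and-slope model: V_g = Z_g Omega Z_g' + sigma^2 I,
    with Z_g = [1, x] and Omega the 2x2 covariance of (gamma_0g, gamma_1g)."""
    Z = np.column_stack([np.ones_like(x), x])
    return Z @ Omega @ Z.T + sigma2 * np.eye(len(x))

x = np.array([0.5, -1.0, 2.0])
w0sq, w1sq, w01, sigma2 = 1.0, 0.5, 0.3, 0.8
Omega = np.array([[w0sq, w01], [w01, w1sq]])
V = implied_group_cov(x, Omega, sigma2)

# Entrywise check: cov = w0^2 + w01 (x_i + x_j) + w1^2 x_i x_j + 1{i=j} sigma^2.
for i in range(3):
    for j in range(3):
        expect = (w0sq + w01 * (x[i] + x[j]) + w1sq * x[i] * x[j]
                  + (sigma2 if i == j else 0.0))
        assert np.isclose(V[i, j], expect)
```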
With the addition of more random effect variables, the variance structure becomes more complex. This complexity should not be equated with generality—the variance is still assumed to be a highly prescribed function of the data. To illustrate the potential for variance estimates with poor coverage, consider the following longitudinal DGP, where g indexes the person and $t=1, \dots , T$ indexes the time-point of the observation:
Here, there is an observation-level variable $X_{g[t]}$ that is auto-correlated; a group-level variable $U_g$ ; and a random intercept $W_g$ . The observation-level error terms are auto-correlated as well with (heteroskedastic) variances that depend on $U_g$ . The correct dependence structure would be
Figure 4 shows coverage rates of nominal 95% confidence intervals for $\beta_1$ and $\beta_2$ across draws from DGP 2, using the “default” RI variance with $\Sigma = \sigma^2 I_N$. Typically we would focus interest on $\beta_1$, motivated by an assumption of no within-group confounding. However, we also show results for $\beta_2$ because group-level variables were often of interest in the empirical works we reviewed. The RI standard errors are consistently too small for both $\beta_1$ and $\beta_2$ across all sample sizes. The undercoverage for $\beta_1$ worsens as the total number of time-periods (T) increases. Coverage improves for $\beta_2$ as T increases, but remains unsatisfactory even at $T = 50$. Similar undercoverage of the “default” MLM standard errors for RI has been noted by Bell, Fairbrother, and Jones (2019); Heisig, Schaeffer, and Giesecke (2017); and Jacqmin-Gadda et al. (2007).
Relaxing assumptions for MLM variance estimation
If the user has strong reason to believe errors (conditionally on the random effect contributions) are spherical, then the default MLM standard errors just described would be appropriate. However, such justifications are rarely offered. Fortunately, more flexible approaches can be employed in the MLM framework, relaxing this assumption. For example, with longitudinal data, it is possible to assume an AR(1) error structure for the $\epsilon _{g[t]}$ , in which ${\text {cor}}(\epsilon _{g[t]}, \epsilon _{g[t + k]}) = \rho ^{k}$ for $\rho \in (-1, 1)$ . Or, one may allow ${{\rm var}}(\epsilon _{g[i]} \ | \ X, Z) = \sigma _g^2$ , where $\sigma ^2_g$ can differ by group, to accommodate heteroskedasticity by group. One may also allow a common unstructured $\Sigma _g$ across g, which makes no assumptions on the intragroup covariance and assigns a separate parameter to each ${\text{cov}}(\epsilon _{g[i]}, \epsilon _{g[i']})$ .Footnote 21
Using such specialized error structures when estimating standard errors may be advisable when researchers can defend the corresponding assumptions. On the other hand, users often cannot claim to know the correct variance structure from theory alone. Fortunately, cluster-robust standard errors (CRSEs), popular in conjunction with FE, provide a useful low-assumption alternative: the user need only assume independence across groups, while the within-group covariance is estimated freely, albeit at the cost of requiring more data.Footnote 22
Note that both MLM and CRSEs operate as though we assume zero covariance between the residuals of units in different groups, sharing the assumption
$${\mathbb{E}}(\epsilon^{*}_{g[i]} \epsilon^{*}_{g'[i']} \ | \ X, Z) = 0 \quad \text{for all } g \neq g'.$$
What differs is only how ${\mathbb{E}}(\epsilon^{*}_{g[i]} \epsilon^{*}_{g[i']} \ | \ X, Z)$ is determined. In MLM, this covariance is parametrized as described above (e.g., ${\mathbb{E}}[\epsilon^{*}_{g[i]} \epsilon^{*}_{g[i']} \ | \ X, Z] = \omega^2 + \mathbb{1}_{\{i=i'\}} \sigma^2$ in an RI model), and estimated by plugging in the parameter estimates. By contrast, CRSEs are remarkable for the lack of structure they impose on these within-group covariances. For example, after estimating $\hat{\beta}$ with an OLS of Y on X, CRSEs would simply construct empirical covariance estimates
$$\widehat{{\mathbb{E}}}(\epsilon^{*}_{g[i]} \epsilon^{*}_{g[i']} \ | \ X, Z) = c \, \hat{e}_{g[i]} \hat{e}_{g[i']},$$
where $\hat {e}_{g[i]} = Y_{g[i]} - X_{g[i]}^{\top } \hat {\beta }$ and c is a scalar finite sample correction. In other words, while MLMs make a strict assumption on the within-group covariances, CRSEs impose among the weakest possible assumptions by simply employing an empirical estimate based on fitted residuals. Appendix A.14 provides a more detailed discussion of the CRSE structure.
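A minimal implementation of the resulting sandwich estimator (our sketch; the finite-sample correction $c$ is simply set to 1 here, whereas the paper discusses its choice in Appendix A.15) is:

```python
import numpy as np

rng = np.random.default_rng(3)

def crse(X, y, g, c=1.0):
    """OLS point estimates with cluster-robust (sandwich) standard errors.
    The finite-sample correction c is left as a user choice (1 here)."""
    beta = np.linalg.lstsq(X, y, rcond=None)[0]
    e = y - X @ beta
    bread = np.linalg.inv(X.T @ X)
    meat = np.zeros((X.shape[1], X.shape[1]))
    for k in np.unique(g):
        s = X[g == k].T @ e[g == k]   # X_g' e_g
        meat += np.outer(s, s)        # X_g' e_g e_g' X_g
    V = c * bread @ meat @ bread
    return beta, np.sqrt(np.diag(V))

# Clustered data: a shared group shock induces within-group dependence.
G, n = 40, 10
g = np.repeat(np.arange(G), n)
x = rng.normal(size=G * n)
y = 1.0 + 2.0 * x + rng.normal(size=G)[g] + rng.normal(size=G * n)
X = np.column_stack([np.ones(G * n), x])
beta, se = crse(X, y, g)
```

The “meat” term accumulates the empirical within-cluster covariance products $\hat{e}_{g[i]}\hat{e}_{g[i']}$ without imposing any parametric structure on them.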
Fortunately, nothing prevents MLM users from employing CRSE assumptions during variance estimation (see also Cameron and Miller 2015), estimating variance according to
$$\widehat{{\text{var}}}(\hat{\beta}_{\text{MLM}}) = c \left( \sum_{g} X_g^{\top} \hat{V}_g^{-1} X_g \right)^{-1} \left( \sum_{g} X_g^{\top} \hat{V}_g^{-1} \hat{e}_g \hat{e}_g^{\top} \hat{V}_g^{-1} X_g \right) \left( \sum_{g} X_g^{\top} \hat{V}_g^{-1} X_g \right)^{-1},$$
where $\hat {e}_{g} = Y_g - X_g \hat {\beta }_{\text {MLM}}$ and V is defined in Equation (4). We discuss the choice of c in Appendix A.15.
This leads to a surprising and useful equivalence. Mirroring the equivalence between estimates of $\beta$ from bcMLM with $\Sigma = \sigma^2 I_N$ and FE, the CRSEs for $\beta$ from both models are also equal if both use the same c (as we recommend in Appendix A.15). This fact, proven in Appendix A.16, may be surprising because the predicted values $\hat{Y}_{g[i]}$ differ between the models, and thus the overall error variance differs. This equivalence avoids debate over which method is appropriate for estimating the coefficient and standard error on the covariate of interest, since the answers will be the same.
Figure 5 illustrates the performance of CRSEs with MLM in DGP 2, in which RI with $\Sigma = \sigma^2 I_N$ misspecifies the dependence structure. Confidence intervals for $\beta_1$ using CRSEs fix the undercoverage seen in Figure 4 under the conventional standard errors. Coverage remains poor for $\beta_2$ with $G = 15$, with some undercoverage remaining at $G = 50$.
Guidance on CRSEs
We offer several notes regarding the appropriate use of CRSEs. First, CRSEs assume that observations in different groups have zero covariance in their residuals. Investigators must keep this in mind when choosing the level at which clustering is performed.Footnote 23 Treating units that actually belong to different groups as if they were in the same group reduces the number of clusters but does not violate the CRSE assumption. By contrast, if units that actually have dependent residuals are labeled as if they belong to separate groups, the CRSE assumption of no between-group dependence will be violated and the results will be unreliable. In some cases, there may not be any choice of grouping that makes this assumption defensible, in which case CRSEs would not be defensible either.
Second, as in any modeling problem, there is a tradeoff between relaxing assumptions and requiring more data. Thus, while CRSEs can be a substantial improvement over the default RI model-based standard errors, they come at the cost of demanding more data: the convergence of CRSEs depends on the number of groups. Cameron and Miller (2015) suggest that 20 to 50 clusters may be needed to ensure stable estimates, though this number depends on many features of the data, and no guideline can be expected to be universally sufficient. Alternatively, when investigators can defend the stricter assumptions of a less flexible covariance structure, such as an AR(1) or a simpler heteroskedastic model, doing so may pay off. To this end, when the number of groups is smaller, the model-based MLM standard errors may be preferable if one can justify that the assumed dependence structure is plausible.
Comparison to practice
Most textbooks we reviewed describe alternative estimators for the variance of MLM, such as auto-regressive models.Footnote 24 The empirical works using MLM that we reviewed showed little attention to this issue: among articles that employed RI models to estimate a coefficient of interest, all 24 articles in education, 29 of 39 in political science, and 14 of 21 in sociology used the default MLM standard errors ($\Sigma = \sigma^2 I_N$) without discussion or justification. We surmise there are several reasons for this. The first is the misconception, documented above, that MLM automatically produces correct standard errors for any multilevel data structure without further consideration. Second, and compounding the first, software widely used for MLM estimation does not always allow alternative variance structures beyond $\Sigma = \sigma^2 I$ or a nonconstant $\Omega$.Footnote 25 Finally, investigators may reasonably worry that correctly specifying covariance structures using theory or prior knowledge is not feasible, and decide instead to accept default choices. This makes the CRSE approach a particularly attractive option, at least when groups can be defined such that between-group residual dependence is arguably ruled out.
4 Conclusions
Different methodological traditions have responded to the challenges posed by grouped data with divergent solutions: FE with modified standard errors, or MLMs with random effects. To demystify their properties, we show that (i) random effects invoked in MLMs can be understood as regularized FE, explaining MLM’s improved predictive power, ability to include group-level variables, and bias problem; (ii) this bias can be addressed; and (iii) the “default” standard errors under MLM do not necessarily address all concerns with intragroup dependency in multilevel data.
We thus recommend estimating coefficients with either FE or bcMLM, with the assumption of spherical errors. In both cases, CRSEs offer a flexible approach to variance estimation, particularly if an argument can be made for independence across clusters. Fortunately, these two approaches produce identical point estimates and standard errors for the coefficients they share. Hence, for those willing to make these adjustments and focused on inference regarding a treatment coefficient, the question of whether FE or MLM is “more appropriate” is irrelevant. Both approaches sacrifice the potential efficiency gain that an uncorrected MLM would offer had its strict assumptions been true.Footnote 26 We consider this a small and acceptable price to pay to avoid the risk of bias and higher RMSE that occurs under uncorrected MLM in the presence of correlated random effects. Accordingly, we suggest these unbiased approaches rather than any approach that seeks to mix estimators (e.g., Cheng, Liao, and Shi 2019) or to choose between FE (or bcMLM) and uncorrected MLM based on some criterion. For example, we do not advocate for choosing uncorrected MLM when the number of observations per group is above some threshold: one cannot know how many observations will be enough for the bias (and RMSE) to become acceptably small, and any potential efficiency or accuracy gain of MLM relative to FE is diminishing in group size anyway. We similarly do not advocate for a statistical testing approach such as Hausman (1978): if one is concerned that one of those estimates (from uncorrected random effects) is incorrect, then knowing whether the observed difference in the estimates is statistically significant or not is of little relevance.
Although our advice regarding CRSEs is less common, our debiasing recommendations echo Raudenbush (2009), Bell and Jones (2015), and Bell, Fairbrother, and Jones (2019). It is also consistent with the “correlated random effects” approaches in econometrics, which employ the Mundlak (1978) solution, noted in texts including Wooldridge (2010) and Greene (2012). Nevertheless, such advice has gone largely unheeded in political science and education, and to some degree in sociology.Footnote 27
Finally, while FE and bcMLM produce identical results for the coefficients they share, the approaches differ in that (i) bcMLM has improved predictive (out-of-sample) accuracy for the outcome, and (ii) bcMLM retains the ability to include group-level covariates and cross-level interactions in the model. Whether users want predictive accuracy from the same model in which they estimate an unbiased effect of a key covariate is a question of research goals, not addressed here. We also emphasize that the estimated coefficients for group-level covariates or cross-level interactions in bcMLM may be difficult to interpret. In addition to the usual identification concerns, the bias-correction step in bcMLM applies only to coefficients it shares with FE, and may even induce bias in those absent from FE. For users interested in these coefficients, bcMLM or FE regression can be followed by the appropriate per-cluster regression step.
Acknowledgements
We thank Michael Bates, Neal Beck, Jennie Brand, Kevin Esterling, Jeff Gill, Jeff Lewis, Jingyi Jessica Li, Ian Lundberg, Michael Seltzer, Brandon Stewart, attendees of the 2019 Southern California Political Methodology meeting, and anonymous reviewers for their valuable comments and suggestions.
Data Availability Statement
Replication materials are available at Hazlett and Wainstein (2020).
Supplementary Material
For supplementary material accompanying this paper, please visit https://doi.org/10.1017/pan.2020.41.