Interrogating the validity of cumulative indices of environmental and genetic risk for negative developmental outcomes

Keith F. Widaman

doi:10.1017/S0954579421001097

Published online by Cambridge University Press: 13 December 2021

Affiliation: University of California, Riverside, CA 92521, USA
Corresponding author: Keith F. Widaman, email: [email protected]

Abstract

Indices of cumulative risk (CR) have long been used in developmental research to encode the number of risk factors a child or adolescent experiences that may impede optimal developmental outcomes. Initial contributions concentrated on indices of cumulative environmental risk; more recently, indices of cumulative genetic risk have been employed. In this article, regression analytic methods are proposed for interrogating strongly the validity of risk indices by testing optimality of compositing weights, enabling more informative modeling of effects of CR indices. Reanalyses of data from two studies are reported. One study involved 10 environmental risk factors predicting Verbal IQ in 215 four-year-old children. The second study included an index of genetic CR in a G×E interaction investigation of 281 target participants assessed at age 15 years and then again at age 31 years for observed hostility during videotaped interactions with close family relations. Principles to guide evaluation of results of statistical modeling are presented, and implications of results for research and theory are discussed. The ultimate goals of this paper are to develop stronger tests of conjectures involving CR indices and to promote methods for improving replicability of results across studies.

Keywords

environmental risk; G×E interaction; genetic risk; regression analysis; risk indices

Type: Regular Article
Information: Development and Psychopathology, Volume 35, Issue 3, August 2023, pp. 1171–1187

DOI: https://doi.org/10.1017/S0954579421001097
Creative Commons: This is an Open Access article, distributed under the terms of the Creative Commons Attribution licence (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted re-use, distribution, and reproduction in any medium, provided the original work is properly cited.
Copyright: © The Author(s), 2021. Published by Cambridge University Press

The notion of risks to optimal developmental outcomes has been a topic of considerable interest in the field of developmental psychopathology for over half a century, if not longer. Head Start programs were begun in the 1960s in an attempt to remediate environmental shortcomings for children from disadvantaged backgrounds, under the assumption that various environmental factors that presage poorer educational outcomes might be mitigated through the development of early childhood programs of instruction and enrichment (Love, Chazan-Cohen, & Raikes, 2007; Beatty & Zigler, 2012). Several years earlier, the Collaborative Perinatal Project (CPP) was initiated in 1958 under the assumption that a continuum of reproductive casualty may account for many poor birth and early child outcomes, and over 50,000 women and their offspring were eventually recruited into the CPP (Broman, Nichols, & Kennedy, 1975; Broman, 1987). The primary focus of the CPP was risk factors for outcomes such as cerebral palsy, intellectual disability, and low general intellectual functioning. The host of risk factors invoked in reproductive casualty involved many presumptive harmful factors spanning the prenatal (e.g., poor prenatal nutrition), perinatal (e.g., anoxia, improper use of forceps), and postnatal periods (e.g., poor infant nutrition, low SES).

Extending the scope of risk factors, Sameroff and Chandler (1975) coined the term continuum of caretaking casualty, hypothesizing that a large number of environmental risk factors are hazards to optimal child development. The theoretical model developed by Sameroff and Chandler involved transactions among the child, the parents, and the environment within which the family was living. Sameroff and Chandler emphasized that, although many reproductive risks certainly should be considered, no adequate accounting of intellectual or other developmental deficits has stemmed directly only from reproductive risks. In the vast majority of cases, however, an array of caretaking risks appears to be required for an adequate representation of the course of impaired development during infancy, childhood, and adolescence. These caretaking risk factors characterize the environment within which the child is developing, leading to measures often identified as indices of environmental cumulative risk (CR).

More recently, research on gene and gene × environment (G×E) interaction effects on behavior has increased exponentially as a large number of single nucleotide polymorphisms (SNPs) can be obtained from the human genome. Each SNP can be scored in a discrete fashion, indicating the number of particular alleles present for that SNP. Many studies utilizing SNPs as the basis for G×E testing have conducted analyses on only one or another of select target SNPs. However, building on successful use of environmental risk indices, some researchers have promoted the use of summative scores across multiple SNPs, which yield scores on an index of genetic CR (e.g., Belsky & Beaver, 2011). More complicated analyses use results from genome-wide association studies (GWASs) to supply weights when forming a genome-wide index of genetic CR (Belsky & Harden, 2019).

The idea of formulating a risk index was based on observations by clinically oriented researchers. For example, Rutter (1981) noted that children seem relatively resilient to experiencing one or a small number of risk factors. But, as the number of risk factors to which a child is exposed increases, the likelihood of negative developmental outcomes increases. In one of the earliest studies using a risk index for caretaking casualty, Sameroff, Seifer, Barocas, Zax, and Greenspan (1987) identified 10 risk factors for low intelligence, scored each risk factor in dichotomous fashion (0 = low risk, 1 = high risk), and summed the risk factors into a cumulative index that could range from 0 to 10, representing the number of risks a child faced. They then contrasted alternate regression models, finding that the use of all 10 risk factors allowed stronger prediction of 4-year-old children’s Verbal IQ than any single predictor. Given the success of the Sameroff et al. research, the use of risk indices has burgeoned over the past three decades.

A key issue in use of a CR index is the optimal weighting of the components of the index. CR indices have been constructed in many ways across studies. Evans, Li, and Whipple (2013) provided a comprehensive review of theory, methodological concerns, and research findings using CR indices. The current paper builds on the work of Sameroff et al. (1987), Evans et al. (2013), and others to consider in more detail several issues of importance in the construction and evaluation of CR indices. Specifically, statistical methods are described for interrogating, or testing severely, the validity of CR indices as typically formed. Mayo (2018) recently advocated severe testing of theoretical conjectures, arguing that severe testing can uncover problems that deserve attention, problems that might be masked with less severe testing. Furthermore, if conjectures are tested severely and survive these tests, firmer inductive support for the conjectures accrues. In a related tone, Rodgers (2019) argued that data might profitably be conceptualized as valuable capital and that an investigator spends some capital when estimating each parameter in a statistical model. In effect, one invests capital in the process of estimating each parameter, and a researcher should be vigilant to assess the return on investment in terms of the quality of the resulting model, including its estimates and their associated SEs.

The current manuscript first presents issues involved in forming a CR index from multiple indicators, beginning with the scaling of indicators of risk. Then, methods are described for testing severely or interrogating a risk index, which involves crucial tests of how CR index composites are optimally formed. Following this, two empirical examples are described, one interrogating an index of environmental CR, and the second an index of genetic CR. Discussion revolves around recommendations for stronger testing of theoretical conjectures. If CR indices are tested more severely and informatively and pass these tests, this may also promote successful replication efforts across studies, addressing current pressing questions about the replicability of research in many areas of psychology.

Forming an index of cumulative risk

Indices of CR are formed from multiple indicators, which can be derived in multiple forms and from multiple domains. All aspects of the formation of CR indices and their use in analytic models should be scrutinized to ensure that researchers benefit most from their use.

Discrete versus continuous indicators

One of the first matters confronting researchers using CR indicators is the scaling of the indicators included in the index. In perhaps the first study to use a CR index, Sameroff et al. (1987) employed discrete indicators, each scored in dichotomous fashion. Later researchers have attempted to use indicators scaled in more continuous or quantitative fashion. Each method has strengths, and each has weaknesses, all worthy of critical appraisal.

Discrete or categorical indicators

If Shakespeare had been a practicing scientist, he might have observed that “Some variables are born discrete, some achieve discreteness, and some have discreteness thrust upon ’em” (cf. Staunton, 1860/1983, p. 1003; Twelfth Night, Act II, Scene 5). A variable might be considered “born discrete” if its conception, measurement, and essential nature were discrete or categorical in form. One instance of such a variable is treatment assignment in an experiment. In a simple experiment, participants are assigned randomly to a treatment or a control condition, and assignment to condition is recorded as a discrete score (e.g., 0 = control, 1 = treatment), with numeric assignment indicating group membership, not “more or less” of anything. Gene SNPs, mentioned earlier, are also candidate variables that can be considered to be “born discrete.” For example, the 5HTTLPR SNP has alleles characterized as either short (s) or long (l). Because a person inherits one of these alleles from the mother and one from the father, an individual can be characterized as ss, sl, or ll. Because the s allele has been found to confer more environmental susceptibility, the 5HTTLPR SNP is often scored discretely as 0, 1, or 2, indicating the number of s alleles at that particular location on the genome.
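The allele-count scoring just described can be sketched in a few lines of Python. This is a minimal illustration and not the coding used in any particular study: the two-letter genotype strings and the function name `count_risk_alleles` are assumptions introduced here.

```python
def count_risk_alleles(genotype, risk_allele="s"):
    """Return the number of copies (0, 1, or 2) of the risk allele in a genotype."""
    if len(genotype) != 2:
        raise ValueError("a genotype is expected to list exactly two alleles")
    return genotype.count(risk_allele)

# Hypothetical genotypes for three individuals at a single locus
genotypes = ["ss", "sl", "ll"]
scores = [count_risk_alleles(g) for g in genotypes]
```

Scored this way, the variable counts s alleles, so a one-unit increase means one additional s allele regardless of which genotype is the starting point.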

Ethnic group status, a component of many CR indices, is another example of a variable that might be considered “born discrete” (or naturally discrete), at least at first glance. Families in the Sameroff et al. (1987) study were identified as being of White, Black, or Puerto Rican ethnicity. Participants from the latter two ethnic groups were identified as at higher risk relative to White participants, given discrimination and segregation that often accompany being a member of a disadvantaged or underrepresented group. Hence, a dichotomous score of 0 = White, 1 = non-White, was used as one component of the CR index. Whether a discrete classification into mutually exclusive ethnic groups is currently optimal or will be optimal in the future, given the mixed ethnicity of many individuals, is beyond the scope of the present paper to consider in detail, but will gain in importance in the future.

As for variables that “achieve discreteness,” father absence from the home may be one example. Suppose a researcher desired an index of father involvement in a child’s life, with lower levels of paternal involvement typically associated with increased risk for negative child outcomes. If father involvement were measured as, for example, hours per day the father interacts with the child, the researcher could easily leave the variable scored in quantitative form. However, presence (= 0) or absence (= 1) of the father or father figure in the home may be simpler to measure and is an obvious, if imprecise, proxy for low levels of father involvement with the child and, thus, greater risk of negative developmental outcomes.

Turning to variables that have discreteness imposed on them, several variables used by Sameroff et al. (1987) are of this form. For example, mothers in the Sameroff et al. study completed three anxiety scales. Scores on these scales were standardized and summed, and the high-risk group was identified as the 25% of mothers with the highest anxiety scores. Clearly, a mother with a high level of anxiety almost certainly provides a less optimal caregiving environment for her child than a mother with a low level of anxiety. Scoring the variable as 0 (= low risk) vs. 1 (= high risk) does result in a variable for which a higher score conveys higher risk, although whether dichotomizing the quantitative form of the variable is advisable deserves critical appraisal.

Continuous or quantitative indicators

Continuous indicators are ones that can take on a relatively large number of values with no “holes” in the number line. Height is one example. Height is typically measured discretely in a practical sense (e.g., to the nearest quarter inch), even though – in theory – all possible values between any two heights are admissible. Quantitative indicators are variables for which a higher score indicates more (or less) of a characteristic. If a scale has 10 items each answered on a 1-to-5 scale, the resulting sum score is, strictly, not continuous, as only integer values are allowed. Still, such a variable is a quantitative variable, with a higher score indicating more of what is assessed than a lower score.

Psychometric experts, including MacCallum, Zhang, Preacher, and Rucker (2002), Maxwell and Delaney (1993), and Preacher, Rucker, MacCallum, and Nicewander (2005), have long decried dichotomizing continuous or quantitative scores. In general, it is unwise to dichotomize a quantitative indicator of risk, rather than leaving it in its quantitative form. Given limited space and the fact that risk indices analyzed below are discrete indicators, discussion of how to deal with quantitative indicators is beyond the scope of this paper.

Problems of sample specificity

Currently, replicability of results across studies – more directly, lack of replicability of results across studies – is an issue of immense importance (Ioannidis, 2005; Simmons, Nelson, & Simonsohn, 2011). Efforts therefore should be made to ensure that comparable measurement operations and decisions are made across studies so that researchers can determine clearly whether research findings have been replicated. Evans et al. (2013) noted that standardizing scores to M = 0, SD = 1 is sample-specific and poses challenges when making comparisons across studies. This approach does place multiple variables in a given sample on the same scale, so they can be more reasonably composited into a risk index. But, if a particular sample were recruited from a high-risk subpopulation, a standardized mean of zero on X for that sample may have little comparability to a standardized mean of zero on X for a more representative sample from the population. Thus, converting to standardized scores within samples may destroy the ability to make informed comparisons across samples or studies.

The same criticism applies to the dichotomizing of scores, particularly given concerns about whether dichotomizing can be recommended. Sameroff et al. (1987) dichotomized several risk dimensions so the 25% of the sample with the most “risky” scores were given a score of 1, and the remainder of the sample was given a score of 0. But, the top 25% of a sample from a very high-risk subpopulation may represent a very different level of risk than the top 25% of a representative sample from the population. If researchers want to replicate results across studies, comparable measurement operations must be implemented so that risk in one sample can be compared informatively to risk in other samples.
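The sample specificity of a within-sample "top 25%" cut can be made concrete with a small simulation. The sketch below is illustrative only: the two normal distributions, their means and SDs, and the helper names `top_quartile_cut` and `dichotomize` are assumptions introduced here, not values or procedures taken from any study.

```python
import random
import statistics

def top_quartile_cut(scores):
    """Return the sample 75th percentile, used here to flag the 'riskiest' 25%."""
    return statistics.quantiles(scores, n=4)[2]

def dichotomize(scores, cut):
    """Score 1 (high risk) for values above the cut and 0 (low risk) otherwise."""
    return [1 if s > cut else 0 for s in scores]

rng = random.Random(0)
# Hypothetical risk scores: a representative sample and a high-risk sample
representative = [rng.gauss(50, 10) for _ in range(1000)]
high_risk = [rng.gauss(60, 10) for _ in range(1000)]

cut_rep = top_quartile_cut(representative)
cut_high = top_quartile_cut(high_risk)

# Within-sample cuts flag about 25% of each sample by construction
share_rep = sum(dichotomize(representative, cut_rep)) / len(representative)
share_high = sum(dichotomize(high_risk, cut_high)) / len(high_risk)

# Applying the representative-sample cut to the high-risk sample flags far more cases
share_high_common = sum(dichotomize(high_risk, cut_rep)) / len(high_risk)
```

Under these assumptions a score of 1 means something different in the two samples: the within-sample cut in the high-risk sample sits at a higher raw score, and a common cut would classify a much larger share of that sample as high risk.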

Interrogating or testing severely a risk index

Testing hypotheses versus interrogating a model

The standard application of hypothesis testing in psychology, using a null hypothesis statistical test (NHST) approach, is to state a null hypothesis, H₀, and a mutually exclusive and exhaustive alternative hypothesis, H_A. For example, the null hypothesis might propose that all population regression coefficients in a regression equation are simultaneously zero, and the alternative hypothesis would be that at least one population regression coefficient is not zero. Data are collected, and a regression model is estimated, allowing a test of the null hypothesis. If the null hypothesis is rejected, the alternative hypothesis can be accepted, resulting in a confirmation of the hypothesis motivating the investigation. Hence, this approach has a bias in favor of confirmation, and only the theoretically uninteresting null hypothesis is tested.

A contrasting approach has been called severe testing by Mayo (2018), an approach positing that disconfirmation is the path forward in science. This approach is consistent with the ideas of Popper (1935/1959), Meehl (1990), and others that we should test our predictions, not null hypotheses, and test them as strongly or severely as possible. Under this approach, an investigator should develop a theoretical model for phenomena in a domain and then, after collecting data, test whether that model fits the data. When testing model predictions, this testing should involve severe testing of predictions or interrogating the fit of the model. If the model fits the data adequately, the model has survived a severe test and gained inductive support. Or, if the model is rejected as having poor fit to the data (i.e., model predictions are disconfirmed), something valuable has been learned – the theory was unable to account for the data, so requires modification to be in better accord with observations.

As applied to the construction of a CR index, the standard NHST approach might test whether a CR index formed as the sum of several risk indicators was a better predictor of a negative developmental outcome than was the single best indicator of risk. The contrasting, severe testing approach would test as directly and severely as possible whether equal weighting of risk indicators was acceptable or should be rejected in favor of a more complex and adequate form of weighting. If equal weighting cannot be rejected, the “equal weights” hypothesis has been interrogated and passed a severe test, supporting this simple weighting scheme.

Summing indicators into a CR: The importance of metric

Once indicators of risk are identified, a researcher is confronted with the question of how to sum or combine multiple indicators into an index of CR. Evans et al. (2013) summarized typical approaches in the form of two options: First, if continuous risk indices are available and are in different metrics, one can standardize and sum the variables, although this makes sense only if risk indicators are correlated. Second, if risk indicators are not highly interrelated, researchers could dichotomize each indicator and then sum these into a cumulative index of risk.

Differing with Evans et al. (2013), I contend that the degree of correlation among risk indicators should play no role in deciding how to transform and sum variables. The sum of a set of indicators – quantitative or discrete, correlated or uncorrelated – can have substantial reliability in terms of stability over time and validity even if the sum has poor internal consistency (or homogeneity) reliability at a given point in time (Revelle & Condon, 2019).

The more important concern is the comparability of the metric of indicators. If one intends to sum a set of indicators, indicators must be on the same metric or approximately the same metric so indicators will have comparable contribution to the sum. The core criterion to satisfy for two measures to be considered to be on the same metric is that a one-unit increase on one variable is comparable or essentially identical to a one-unit increase on the other variable. Standardizing each of a set of quantitative indicators to M = 0, SD = 1 achieves a common metric across indicators, as a one-unit increase on any variable standardized in this fashion is a 1.0 SD increase. But, dichotomously scored variables, if scored in “0 vs. 1” fashion, also fall on a comparable metric across indicators, so the sum of such variables has ready interpretation. Once indicators are on the same metric, summing is reasonable and is often strongly recommended.
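The point about a common metric can be illustrated with a short sketch. The indicator names and raw values below are hypothetical, and standardizing within a sample carries the sample-specificity caveat raised earlier; the aim is only to show why raw scores in different units should not be summed directly.

```python
import statistics

def standardize(scores):
    """Rescale scores to M = 0, SD = 1 within this sample."""
    m = statistics.mean(scores)
    sd = statistics.stdev(scores)
    return [(s - m) / sd for s in scores]

# Hypothetical raw scores on two indicators measured in very different units
anxiety = [12.0, 15.0, 9.0, 20.0, 14.0]
income_deficit = [300.0, 1200.0, 0.0, 2500.0, 800.0]

z_anxiety = standardize(anxiety)
z_income = standardize(income_deficit)

# The raw sum is dominated by the indicator with the larger units
raw_sum = [a + b for a, b in zip(anxiety, income_deficit)]
# After standardizing, a one-unit increase is 1.0 SD on either indicator
z_sum = [a + b for a, b in zip(z_anxiety, z_income)]
```

In the raw sum, nearly all of the variation comes from the second indicator; in the standardized sum, each indicator contributes on the same scale.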

Indeed, summing a set of uncorrelated indicators can have advantages, such as leading to greater simplicity in analyses when predicting an outcome variable. Then, only a single predictor – the summed index of CR – is used as the representation of risk, so only a single regression weight for risk is estimated, rather than one regression weight for each of the separate indicators. To reiterate, the sum of a set of uncorrelated indicators may have substantial reliability in an “over time,” stability sense, and considerable validity, even if little internal consistency. So, regardless of the degree of correlation among risk indicators, the sum of indicators may have important benefits for analysis and interpretation.
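A small simulation can illustrate the claim that a sum of uncorrelated indicators may be a valid predictor despite negligible internal consistency. Everything in this sketch is an assumption made for illustration: the 25% base rate, the equal effect of 3 points per risk, the error SD of 5, and the helper names. It addresses validity only and does not simulate stability over time.

```python
import random
import statistics

def pearson_r(x, y):
    """Pearson correlation between two equal-length lists."""
    mx, my = statistics.fmean(x), statistics.fmean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5

def cronbach_alpha(items):
    """Coefficient alpha; items is a list of k lists, one per indicator."""
    k = len(items)
    totals = [sum(person) for person in zip(*items)]
    item_var = sum(statistics.variance(col) for col in items)
    return (k / (k - 1)) * (1 - item_var / statistics.variance(totals))

rng = random.Random(1)
n, k = 2000, 10
# Ten independent 0/1 risk indicators, each present for about 25% of cases
items = [[1 if rng.random() < 0.25 else 0 for _ in range(n)] for _ in range(k)]
cr_index = [sum(person) for person in zip(*items)]
# Hypothetical outcome: each additional risk lowers the score by 3 points
outcome = [100 - 3 * cr + rng.gauss(0, 5) for cr in cr_index]

alpha = cronbach_alpha(items)              # near zero: indicators are uncorrelated
validity = pearson_r(cr_index, outcome)    # clearly negative: the sum predicts the outcome
```

Under these assumptions coefficient alpha hovers around zero while the summed index correlates substantially with the outcome, which is the pattern the argument above relies on.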

To weight differentially or not: Interrogating a weighting scheme

Equal versus differential weighting of risk indicators when forming a sum is a key issue. If multiple risk indicators are on the same metric, summing the indicators may serve a legitimate scientific purpose regarding equal versus differential weighting of indicators, regardless of their degree of intercorrelation. Consider an outcome variable Y and two risk indicators, $X_1$ and $X_2$, which are assumed to be on the same metric (e.g., standardized to have equal means and equal variances, or both 0–1 dichotomies). A regression equation could be written as:

(1)

$${Y_i} = {B_0} + {B_1}{X_{i1}} + {B_2}{X_{i2}} + {E_i}$$

where $Y_i$ is the score of person i on the outcome variable, $X_{i1}$ and $X_{i2}$ are scores of person i on the two risk indicators, respectively, $B_0$ is the intercept, $B_1$ and $B_2$ are the raw score regression coefficients for the two risk indicators, respectively, and $E_i$ represents error in predicting Y for person i. This equation would have a squared multiple correlation, or R², that indicates the proportion of variance in Y accounted for by the weighted predictors. [Note: to ease presentation, the subscript i for person will be deleted from subsequent equations, with no loss in generality.]

Regardless of the degree of correlation between the two risk indicators, summing the two indicators would lead to the following equation:

(2)

$$Y = {B_0} + {B_C}({X_1} + {X_2}) + {E^*}$$

where ${B_C}$ represents the raw score regression weight constrained to equality across the two risk indicators, $E^*$ represents the prediction error in this equation, and other symbols were defined above. Based on parameter nesting, Equation 2 is nested within Equation 1, as Equation 2 places an equality constraint on the two regression coefficients in Equation 1. Given the nesting, one could test the difference in explained variance for the two equations with the typical F-ratio for nested regression models (cf. Cohen, Cohen, West, & Aiken, 2003, p. 89), as:

(3)

$${F_{\left( {{p_1} - {p_2}, N - {p_1} - 1} \right)}} = {{\left( {R_{Eq1}^2 - R_{Eq2}^2} \right)/({p_1} - {p_2})} \over {(1 - R_{Eq1}^2)/(N - {p_1} - 1)}}$$

where $R_{Eq1}^2$ and $R_{Eq2}^2$ are squared multiple correlations under Equation 1 and Equation 2, respectively, $p_1$ and $p_2$ are the number of regression slopes estimated in the two equations, respectively, and N is sample size. The resulting F ratio has ($p_1 - p_2$) and ($N - p_1 - 1$) degrees of freedom, the latter reflecting the $p_1$ slopes plus the intercept estimated in Equation 1, and tests the hypothesis that the two regression slopes in Equation 1 are equal. If the F ratio were larger than the critical value at a pre-specified level (e.g., α = .05), the hypothesis of equality of regression coefficients could be rejected, and differential weighting of risk indicators is justified. On the other hand, if the F ratio did not exceed the critical value, the hypothesis of equality of regression coefficients cannot be rejected, and equal weighting is appropriate.
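The nested-model F ratio can be computed directly from the two R² values. The sketch below is a minimal illustration; the function name `nested_f` and the R² values are hypothetical. It assumes the denominator degrees of freedom are N minus the number of slopes in the fuller equation minus 1 for the intercept, which agrees with the 9 and (N – 11) degrees of freedom quoted for the 10-indicator case.

```python
def nested_f(r2_full, r2_restricted, p_full, p_restricted, n):
    """F ratio comparing a full regression model with a nested, restricted one.

    p_full and p_restricted count regression slopes (intercept excluded);
    n is the sample size. Returns (F, df1, df2).
    """
    df1 = p_full - p_restricted      # number of equality constraints imposed
    df2 = n - p_full - 1             # residual df of the full model
    f = ((r2_full - r2_restricted) / df1) / ((1.0 - r2_full) / df2)
    return f, df1, df2

# Hypothetical values: two separately weighted indicators vs. their equally weighted sum
f_stat, df1, df2 = nested_f(0.30, 0.28, p_full=2, p_restricted=1, n=200)
```

The returned F would then be compared against the critical value of the F distribution with `df1` and `df2` degrees of freedom at the chosen α level.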

Note that Equations 1 and 2 easily generalize to situations with more than two risk indicators, but only if all indicators are on the same metric. If one had 10 risk indicators, an initial equation would have 10 predictors $X_1$ through $X_{10}$, each with its own regression weight, as:

(4)

$$Y = {B_0} + {B_1}{X_1} + {B_2}{X_2} + \ldots + {B_{10}}{X_{10}} + E$$

where symbols were defined above, and Equation 2 would then become:

(5)

$$Y = {B_0} + {B_C}({X_1} + {X_2} + \ldots + {X_{10}}) + {E^*}$$

where ${B_C}$ represents the raw score regression weight constrained to equality across all 10 risk indices, and other symbols were defined above. The proper adaptation of Equation 3 would then provide an omnibus test of equality of the regression weights across all 10 risk indicators, an F-ratio that, in this case, would have 9 and (N – 11) degrees of freedom. If the resulting F ratio were nonsignificant, the hypothesis that the regression weights were simultaneously equal could not be rejected, justifying equal weighting of all 10 indicators when forming a CR index. Of course, if the F ratio were significant, the hypothesis of equality of regression weights would be rejectable, and a more informed and complex form of weighting might be considered. If equality of all regression weights were rejected, theory or the pattern of weights when all were freely estimated might offer options for more complex, but still restricted weighting schemes.
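Written out for the 10-indicator case, with 10 slopes in Equation 4 and 1 slope in Equation 5, the adapted test would take the following form (the subscripts Eq4 and Eq5 are labels introduced here for illustration):

$${F_{\left( {9, N - 11} \right)}} = {{\left( {R_{Eq4}^2 - R_{Eq5}^2} \right)/9} \over {(1 - R_{Eq4}^2)/(N - 11)}}$$

For instance, if all N = 215 cases from the first study were retained in the analysis, the test would have 9 and 204 degrees of freedom.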

In evaluating comparisons such as those outlined in Equations 1 through 5, sample size and power to reject the hypothesis of equality of regression weights must be considered. As sample size increases, power to reject the “equal weights” hypothesis will increase. With extremely large sample size, an “equal weights” hypothesis might be rejectable via statistical test, even if the differences in the regression weights across predictors are trivial in magnitude. Hence, some notion of practical significance of the difference must also be weighed, whether of the magnitude of the differences in the raw score regression weights or the difference in the R ² values for the equations. Conversely, small sample size and resulting low power may lead to difficulty in rejecting the hypothesis of equality of regression weights even if these weights, in truth, differ in the population, essentially committing a Type II error. But, with small sample size, it may be more appropriate to proceed in the presence of a possible Type II error rather than promote differential weighting that cannot be justified statistically.
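The dependence of the test on sample size can be seen by holding the R² values fixed and varying N. The sketch below uses hypothetical R² values and a hypothetical helper, `nested_f`, with residual degrees of freedom taken as N minus the number of slopes in the fuller model minus 1.

```python
def nested_f(r2_full, r2_restricted, p_full, p_restricted, n):
    """F ratio for a full model versus a nested model (slopes counted without the intercept)."""
    df1 = p_full - p_restricted
    df2 = n - p_full - 1
    return ((r2_full - r2_restricted) / df1) / ((1.0 - r2_full) / df2)

# The same small R-squared difference (.005) evaluated at two sample sizes
f_small_n = nested_f(0.300, 0.295, p_full=10, p_restricted=1, n=215)
f_large_n = nested_f(0.300, 0.295, p_full=10, p_restricted=1, n=200_000)
```

With the smaller sample the F ratio falls below 1, so equal weights could not be rejected; with the very large sample the same trivial difference in R² yields an F ratio in the hundreds. This is why the practical magnitude of the difference needs to be weighed alongside the statistical test.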

Equality versus differential weighting of predictors in regression analysis has surfaced as an issue with regularity over time (e.g., Wilks, 1938; Wainer, 1976; Dawes, 1979). Many have argued that equal weights are expected to lead to little loss in predictive accuracy relative to differential weights in many situations. Regrettably, this literature is beset with conflicting claims, much too voluminous to review here. In a different, though related vein, experts recently have discussed the fungibility, or exchangeability, of coefficients in regression (e.g., Waller, 2008) and structural equation models (e.g., Lee, MacCallum, & Browne, 2018). These researchers have cautioned against having too much faith in precise, optimal regression weights. For example, in a regression equation with three or more predictors, if a small drop in explained variance is allowed, an infinite number of different sets of regression weights all produce the same R², and many of these sets of regression weights may have little similarity to the optimal least squares estimates in the equation that maximizes R². Thankfully, if equal weighting of predictors is justified, the problem of fungible regression weights largely vanishes, as the number of regression weight estimates is reduced and the flexibility of the equation is curtailed.

In prior work on CR indices, equal weighting of risk indicators has virtually always been utilized. Equal weights may not be optimal in a least squares sense, but may be preferred on several grounds, including parsimony, efficiency, and openness to replication in subsequent studies. However, the issue of equal versus differential weighting of risk indicators should be interrogated or subjected to severe test, as this may impact the relations between a CR index and outcomes it should predict, so has a bearing on the construct validity of the CR index.

Three guiding principles

When formulating and then interrogating a cumulative index derived from a number of risk indicators, the weighting and summing of indicators should be guided by justifiable analytic principles. Three principles are here proposed.

Principle one: Do no harm overall

This principle represents the admonition that a model employing a CR index formed with equal weighting of risk indicators should not fit substantially worse than a model in which the risk indicators have differential weights. A CR index can function as a direct (or main) effect or as a component of an interaction when predicting a negative developmental outcome. If a model with an equally weighted CR index has fit similar to a model in which risk indicators have differential weights, no harm overall has occurred. But, if a model with an equally weighted CR index has substantially worse fit than a model with differential weights for risk indicators, overall harm has occurred, and the a priori equal weighting of indicators into the CR index should be reconsidered and rejected.

Principle two: Do no harm in particular

Here, the concern is with each individual component of a CR index. If risk indicators are equally weighted when forming an index of CR, some method should be used to determine whether the equal weighting has distorted the predictive effect of each risk indicator. If one or more risk indicators are compromised by the equal weighting, the formulation of the CR index should be reconsidered.

Principle three: Do some good

As Evans et al. (2013) noted, the use of a CR index has a number of beneficial effects, including simplicity and avoidance of the multicollinearity that could arise if the components of the CR index were used as separate, correlated predictors. Certainly, the use of a CR index will reduce the number of estimated regression slopes – a notable simplification – and may lead to improvements in other aspects of the equation, such as smaller standard errors of parameter estimates. If all or most of these improvements occur, considerable good will have been bought by the use of the CR index.

Example 1: Interrogating an environmental risk index

Background

The study by Sameroff et al. (1987) was one of the first, if not the first, to use a CR index of environmental risk for low intelligence. The sample consisted of 215 mothers and children, and child WPPSI Verbal IQ at age 4 years was the outcome variable. The 10 dichotomously scored risk factors are shown in Table 1. Certain variables (e.g., ethnic status, family support) were scored in direct and unambiguous fashion, and others (e.g., occupation, education) were dichotomized at commonly used points on their respective continua. For the remaining six variables, Sameroff et al. identified cut-scores that would leave about 25% of the sample with the highest risk receiving a risk score of 1, with the remaining 75% being assigned a risk score of 0. Additional details about each of the predictors were provided by Sameroff et al.

Table 1. Ten Risk Factors Used by Sameroff et al. (1987)

Note: Tabled material is adapted from Table 2 of Sameroff et al. with minor modification.

In Table 2, correlations among the 10 risk factors and the WPPSI Verbal IQ outcome variable are shown, along with estimated means and SDs of the variables (see Supplementary Material for how information from Sameroff et al., 1987, was used to develop Table 2). Inspection of Table 2 reveals that all 10 risk factors have negative correlations with WPPSI Verbal IQ, consistent with mean differences between low-risk and high-risk groups reported by Sameroff et al. The risk factors in Table 2 also exhibit a strong positive manifold, with 44 of the 45 correlations among them being positive. The single exception was the small negative correlation, r = −.03, between maternal mental health and ethnic status.

Table 2. Correlations and Descriptive Statistics for Child Verbal IQ and 10 Risk Variables from Sameroff et al. (1987)

Note: N = 215. The correlations in the table above have been re-arranged from those reported by Sameroff et al. (1987). As explained in the Appendix, three risk factor variables were reverse scored, and the correlation between WPPSI Verbal IQ and Occupation of r = −.59 as reported in text of the Sameroff et al. article was used, replacing the r = −.58 in their Table 3.

The risk factors shown in Table 1 are a varied amalgam. The first four risk factors in Table 1 will hereinafter be called the Classic 4,^{Footnote 1} given substantial research published over the past half century or longer on relations of these variables to child intelligence. From early work by Terman (1916) and Brigham (1923), down through work by Broman (1987; Broman et al., 1975), Jensen (1998) and Herrnstein and Murray (1994), and recently Johnson, Brett, and Deary (2010), researchers have studied relations of SES (occupation), education, and ethnicity with intelligence. The fourth indicator in the Classic 4 set – maternal interaction – is also based on substantial work. Yarrow and associates (e.g., Messer, Rachford, McCarthy, & Yarrow, 1987; Yarrow, MacTurk, Vietze, McCarthy, Klein, & McQuiston, 1984; Yarrow, Rubenstein, & Pedersen, 1975) reported strong relations between parental stimulation and interaction during a child’s infancy and young child problem-solving. Other work by Bradley and Caldwell and colleagues (Bradley & Caldwell, 1980; Bradley, Caldwell, & Elardo, 1977; Elardo, Bradley, & Caldwell, 1975) found strong relations (e.g., correlations ranging from .36 to .66) between observed mother-child interaction during the child’s infancy and child IQ at 3 years.

In contrast to the Classic 4, the remaining six risk variables, here referred to as the Modern 6, have received much less research attention with regard to predicting children’s intelligence. Certainly, father absence, high maternal anxiety, high numbers of difficult life events, and the remaining indicators likely confer risk for poorer development in general. But these six indicators have received far less consistent attention as predictors of low child IQ than have the first four risk indicators.

Regression analyses

Because the 215 observations had complete data on all variables, the summary data in Table 2 can be used to perform regression analyses. For all multiple regression analyses reported here, I used ordinary least squares (OLS) estimation in the PROC REG program in SAS (all analysis scripts are contained in Supplementary Material). To evaluate model fit, I used the R² for a model and differences in R² for competing models, the F-test for differences in fit of nested models, the adjusted R² (adj-R², which has a penalty for model complexity), and the Schwarz Bayesian Information Criterion (BIC).^{Footnote 2} BIC values are not on a standardized metric, but lower values indicate better model fit.
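To illustrate why summary data suffice when no observations are missing, the following minimal Python sketch recovers OLS slopes, the intercept, and R² from a correlation matrix, SDs, and means. It is an illustration only, not the SAS scripts in the Supplementary Material; the function names `solve` and `ols_from_summary` and the two-predictor input values are hypothetical and are not taken from Table 2.

```python
def solve(A, b):
    """Solve the linear system A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [list(map(float, row)) + [float(rhs)] for row, rhs in zip(A, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        tail = sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (M[i][n] - tail) / M[i][i]
    return x


def ols_from_summary(R_xx, r_xy, sd_x, sd_y, mean_x, mean_y):
    """Return (raw slopes, intercept, R^2) from correlations, SDs, and means."""
    beta_std = solve(R_xx, r_xy)                             # standardized slopes
    r2 = sum(r * b for r, b in zip(r_xy, beta_std))          # squared multiple correlation
    b_raw = [b * sd_y / s for b, s in zip(beta_std, sd_x)]   # raw-score slopes
    intercept = mean_y - sum(b * m for b, m in zip(b_raw, mean_x))
    return b_raw, intercept, r2


# Hypothetical two-predictor input (illustrative values, NOT taken from Table 2).
b_raw, intercept, r2 = ols_from_summary(
    R_xx=[[1.0, 0.5], [0.5, 1.0]],
    r_xy=[-0.5, -0.4],
    sd_x=[0.5, 0.4],
    sd_y=15.0,
    mean_x=[0.3, 0.2],
    mean_y=100.0,
)
print([round(b, 2) for b in b_raw], round(intercept, 2), round(r2, 3))
```

With a single predictor the same identity reduces to R² = r², so the correlation of −.59 between occupation and Verbal IQ noted under Table 2 implies R² of about .348.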

Replicating results reported by Sameroff et al

Sameroff et al. (1987) reported that the single best predictor of Child Verbal IQ was occupation, R² = .35. They noted that including all 10 risk factors led to a much higher level of explained variance, R² = .51. Unfortunately, Sameroff et al. did not present any additional details, such as parameter estimates and their SEs, for this 10-predictor model, so it is not possible to evaluate their reported results further.

I first wanted to replicate results reported by Sameroff et al. (1987) to verify that data in Table 2 were sufficient to reproduce their results. Model 1, shown in Table 3, used occupation as the sole predictor of child Verbal IQ and led to an R² = .348. Adding the nine remaining risk factors led to Model 2 in Table 3, which had R² = .519. Both of these models replicated closely R² values reported by Sameroff et al. Note that, in Model 2, only the Classic 4 predictors had regression weights significant at p < .05, and none of the Modern 6 met this criterion.

Table 3. Alternative Regression Models for the Sameroff et al. (1987) Data

Note: N = 215. Tabled values are raw score regression coefficients, their SEs in parentheses, and associated t-ratios. The R² is the squared multiple correlation; the adjusted R² is the shrunken estimate of squared multiple correlation that adjusts for model complexity. ^a The t ratio has df equal to ν₂ for the F ratio for the equation, shown below. With 204 or more df, the critical t value at the .05 level is 1.98. ^b For the F ratio for each model, ν₁ = numerator degrees of freedom, and ν₂ = error (or denominator) degrees of freedom. ^c BIC is the Schwarz Bayesian Information Criterion.

Interrogating weightings of all 10 indicators

Two a priori restricted models for the 10 risk indicators were candidates for interrogation or severe testing. The first model was one in which the regression weights for all 10 risk factors were constrained to equality. If one intended to employ an equally weighted sum of the 10 risk factors, as Sameroff et al. (1987) did later in their article, the resulting composite used as predictor in a regression model would lead to results identical to those of a model with 10 predictors, but with regression weights for all 10 predictors constrained to equality. Given multicollinearity among risk indicators, constraining all regression weights to equality could have little effect on model fit, but deserves testing. The PROC REG program has an option for constraining regression weights; if constraints are imposed, tests of constraints or restrictions are also supplied.^{Footnote 3} Constraining all 10 regression weights to equality led to Model 3. Because 10 regression weights were constrained to equality, only a single regression weight was estimated, so 9 constraints or restrictions on weights were imposed. Results are shown in Table 3, which shows that Model 3 had an R² = .442, a noticeable drop in explained variance relative to Model 2, ΔR² = −.077, that was significant, F(9, 204) = 3.66, p = .0003. Some good was done, as the constrained estimate of the raw score regression coefficient, B = −4.65, SE = 0.36, was accompanied by a much smaller standard error than those obtained for predictors in Model 2, but the overall drop in fit was troubling. Moreover, 8 of the 9 tests of constraints (see Restrictions 1 through 9 in Table 3) led to significant t-ratios (p < .05), suggesting that an equality constraint on regression coefficients across all 10 predictors led to too great a restriction on many of the regression coefficients. The upshot was that a model with all 10 regression weights constrained to equality, when tested severely, was rejected.
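The equivalence between regressing on a unit-weighted sum and constraining all slopes to equality can be checked numerically. The following minimal Python sketch does so for three simulated dichotomous predictors; it is an illustration under stated assumptions and not the PROC REG procedure used for Table 3. The data, seed, generating weights, and helper names (`solve`, `normal_equations`) are hypothetical, and the restricted fit is obtained here with Lagrange multipliers, one common way to impose linear restrictions.

```python
import random


def solve(A, b):
    """Solve the linear system A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [list(map(float, row)) + [float(rhs)] for row, rhs in zip(A, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        tail = sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (M[i][n] - tail) / M[i][i]
    return x


def normal_equations(Z, y):
    """Return Z'Z and Z'y for a design matrix Z given as a list of rows."""
    p = len(Z[0])
    ZtZ = [[sum(z[i] * z[j] for z in Z) for j in range(p)] for i in range(p)]
    Zty = [sum(z[i] * yi for z, yi in zip(Z, y)) for i in range(p)]
    return ZtZ, Zty


# Simulated (hypothetical) data: intercept column plus three 0/1 risk indicators.
random.seed(1)
X = [[1.0] + [float(random.random() < 0.3) for _ in range(3)] for _ in range(200)]
y = [100 - 6 * r[1] - 3 * r[2] - 1 * r[3] + random.gauss(0, 5) for r in X]

# Route 1: regress the outcome on the unit-weighted sum of the indicators.
S = [[1.0, r[1] + r[2] + r[3]] for r in X]
a_sum, b_sum = solve(*normal_equations(S, y))

# Route 2: three-predictor model with restrictions b1 = b2 and b2 = b3,
# solved through the Lagrangian system [[X'X, C'], [C, 0]] [b; lambda] = [X'y; 0].
XtX, Xty = normal_equations(X, y)
C = [[0.0, 1.0, -1.0, 0.0], [0.0, 0.0, 1.0, -1.0]]
K = [XtX[i] + [c[i] for c in C] for i in range(4)]
K += [c + [0.0, 0.0] for c in C]
a_con, b1, b2, b3, lam1, lam2 = solve(K, Xty + [0.0, 0.0])

print(round(a_sum, 4), round(b_sum, 4))
print(round(a_con, 4), round(b1, 4), round(b2, 4), round(b3, 4))
```

Both routes should print the same intercept and the same common slope, which is why a composite with k indicators corresponds to k − 1 restrictions on the unconstrained model.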

Given rejection of the first a priori model, the second a priori model to be interrogated was one in which the Classic 4 risk factors had regression weights constrained to equality, the Modern 6 risk factors had weights constrained to equality, but the weights for the Classic 4 and Modern 6 risk factors could differ. This model is termed Model 4, which had an R² = .505. Model 4 is nested within Model 2 because it imposes 8 equality restrictions on Model 2 and so estimates 8 fewer regression weights, and the comparison with Model 2 represents a severe test of the highly restricted Model 4. For Model 4, the drop in explained variance relative to Model 2, ΔR² = −.014, was very small in magnitude and not statistically significant, F(8, 204) = 0.74, p = .66. Moreover, as shown in Table 3, not one of the 8 constraint tests in Model 4 was statistically significant, supporting the contention that this pattern of constraints did not compromise any single regression weight estimate. The adjusted R², adj-R² = .500, for Model 4 was the highest adjusted R² for any of the four models, attesting to the efficiency of Model 4 and its estimates. In addition, the BIC for Model 4 was lower than comparable BIC values for the first three models, implying Model 4 was the optimal model for the data. The superior fit of Model 4 relative to the more highly parameterized Model 2 represents superior return on investment in the efficient estimation of parameters, in the terms outlined by Rodgers (2019).
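The ordering of the four models on adjusted R² and BIC can be approximated from the R² values reported in the text. The minimal Python sketch below assumes the usual adjusted R² formula and a Schwarz criterion of the form n·ln(SSE/n) + (k + 1)·ln(n), written up to an additive constant shared by all models for the same outcome. The exact BIC formula used for Table 3 is defined in Footnote 2 and may differ, so only the ordering of the values is meaningful here, not their magnitudes; the function names are hypothetical.

```python
import math

N = 215  # sample size reported in the article

# R^2 values as reported in the text; k = number of estimated slopes.
models = {
    "Model 1": {"r2": 0.348, "k": 1},
    "Model 2": {"r2": 0.519, "k": 10},
    "Model 3": {"r2": 0.442, "k": 1},
    "Model 4": {"r2": 0.505, "k": 2},
}


def adjusted_r2(r2, k, n=N):
    """Shrunken R^2 with the standard penalty for the number of slopes."""
    return 1.0 - (1.0 - r2) * (n - 1) / (n - k - 1)


def relative_bic(r2, k, n=N):
    """Schwarz criterion up to a constant common to all models (assumed form)."""
    return n * math.log(1.0 - r2) + (k + 1) * math.log(n)


for name, m in models.items():
    print(name, round(adjusted_r2(**m), 3), round(relative_bic(**m), 1))
```

Under these assumptions Model 4 has the largest adjusted R² (about .500) and the smallest relative BIC, in line with the comparison described above.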

The regression coefficients and their SEs for Model 4 are shown in Table 3. The first four risk indicators had rather large coefficients, B = −8.60, SE = 0.83, p < .0001, and the remaining six risk indicators had coefficients about one-sixth as large, B = −1.43, SE = 0.71, p = .044, that just met the α = .05 criterion. So, all regression weights in Model 4 were statistically significant. Consider next the comparison of Models 3 and 4. In Model 4, the Classic 4 and Modern 6 risk indicators had different constrained estimates, whereas in Model 3 these estimates were constrained equal. Because the latter equality constraint led to a rather large and significant drop in fit, F(1, 212) = 27.21, p < .0001, this comparison supports the conclusion that the coefficients for Classic 4 and Modern 6 risk indicators in Model 4 differed significantly at p < .0001.
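The F ratios for the three nested-model comparisons reported above follow from the R² values and degrees of freedom. A minimal Python sketch of the standard formula, F = [(R² full − R² restricted) / q] / [(1 − R² full) / ν₂], is given below; the function name `nested_f` is hypothetical, and p values are omitted because they require an F distribution routine outside the standard library.

```python
def nested_f(r2_full, r2_restricted, n_restrictions, df_error_full):
    """F ratio for the drop in R^2 when restrictions are imposed on the fuller model."""
    numerator = (r2_full - r2_restricted) / n_restrictions
    denominator = (1.0 - r2_full) / df_error_full
    return numerator / denominator


# R^2 values and degrees of freedom as reported in the text.
f_3_vs_2 = nested_f(0.519, 0.442, n_restrictions=9, df_error_full=204)
f_4_vs_2 = nested_f(0.519, 0.505, n_restrictions=8, df_error_full=204)
f_3_vs_4 = nested_f(0.505, 0.442, n_restrictions=1, df_error_full=212)

print(round(f_3_vs_2, 2), round(f_4_vs_2, 2), round(f_3_vs_4, 2))
```

The results are close to, but not identical with, the reported F(9, 204) = 3.66, F(8, 204) = 0.74, and F(1, 212) = 27.21, presumably because the inputs here are R² values rounded to three decimals.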

Comparisons among the four regression models satisfy the three principles set forth earlier. First, Model 4 did no harm overall, as it led to a small and nonsignificant drop in R² relative to the most highly parameterized model, Model 2. Second, Model 4 did no harm in particular, because not one of the tests of parameter restrictions in Model 4 was significant. Thus, the constraints did not affect the ability of any one of the predictors to contribute to prediction of the outcome variable. Third, Model 4 did some good, as the SEs of parameter estimates were much reduced from the values estimated under Model 2, and all regression coefficients were statistically significant.

Structural equation modeling

I conducted analyses comparable to those reported in Table 3 using the structural equation modeling (SEM) programs Mplus (Muthén & Muthén, 1998-2019) and lavaan (Rosseel, 2012) in R. Essentially identical results were obtained; given limitations of space, the full set of analysis scripts and descriptions of SEM results were placed in Supplementary Material.

Discussion

The results of re-analyses of the Sameroff et al. (1987) data have two major implications, one more methodological and the other more substantive. The first, more methodological implication is that a single risk index created as the equally weighted sum of all risk factors may not be optimal for all domains of negative developmental outcome. That is, more differentiated summative indices of CR may have substantial analytic benefits. If researchers were to form the two separate unit-weighted CR indices suggested by the re-analysis, they could evaluate whether the same differential pattern of relations with child intelligence held in other samples. Many additional analytic options readily come to mind. For example, one could form a product of the two indices, representing an interaction of the Classic 4 and the Modern 6, to see if the effect of one of the CR indices varied as a function of the other. Thus, the effect of the CR index with larger predictive influence might be moderated by the number of risks in the other CR index (e.g., father absence, high maternal anxiety) to which the child was exposed.
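As a minimal sketch of how the two unit-weighted indices and their product term might be formed for one child, consider the following Python function. It assumes the 10 dichotomous scores are ordered as in Table 1, with the Classic 4 first; the function name `two_index_terms` and the example scores are hypothetical.

```python
def two_index_terms(risks):
    """Return (classic4, modern6, product) for one child's 10 dichotomous risk scores.

    Assumes the scores are ordered as in Table 1, with the Classic 4 first.
    """
    if len(risks) != 10 or any(r not in (0, 1) for r in risks):
        raise ValueError("expected ten 0/1 risk scores")
    classic4 = sum(risks[:4])   # unit-weighted Classic 4 index
    modern6 = sum(risks[4:])    # unit-weighted Modern 6 index
    return classic4, modern6, classic4 * modern6


# Hypothetical child: two Classic 4 risks and three Modern 6 risks present.
print(two_index_terms([1, 0, 1, 0, 1, 1, 0, 0, 1, 0]))  # (2, 3, 6)
```

The three returned values would then serve as predictors in a regression model, with the product term carrying the interaction of the two indices.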

The second, more substantive implication is that division into more than a single CR index might allow a clearer interpretation of results in the context of prior research. The re-analyses of the Sameroff et al. (1987) data suggested the presence of two sets of risk factors, which I called the Classic 4 and Modern 6, for predicting child IQ. Importantly, the “Classic 4 vs. Modern 6” contrast of the 10 risk indicators may hold only for predicting child IQ or other forms of ability, such as achievement test scores or school grade point average. Developmentalists study a very broad array of consequential outcomes, including child mental health, conduct disorder problems, peer relations, attachment, and so forth. When researching these other domains of behavior, the Modern 6 might have predictive power that is equal to or even substantially stronger than the predictive power of the Classic 4. The relative predictive power of the two separate indices of CR across different domains of child and adolescent behavior might lead to more productive insights into relations between risk factors and child development than forcing all indicators into a single index that offers less flexibility in modeling.