Madole & Harden (M&H) provide an interesting and wide-ranging argument on the senses in which genetically informed studies may be said to produce causal inference. We focus on their discussion of genome-wide association studies (GWASs) of educational attainment (EA). The authors summarize this evidence as:
Genes might cause EA in the sense that genes made some distal difference in level of attainment, but not in the sense that they provide an explanation for how this difference was made.
The paper's claims amount to arguing that (1) GWAS-type studies produce shallow causal evidence and (2) such evidence can help direct research that seeks mechanisms, that is, explanations. Our argument is that both objectives need theory to be successfully met.
From the perspective of statistical/econometric theory, EA is an example of an outcome determined via a system of interactions, that is, it is a dependent variable in a simultaneous equations system. This is evident from the basic logic of economic models of EA. Suppose that individuals come in types $T_i$ determined by genotype $g_i$, an unobservable $\eta_i$ capturing in utero effects, and so on. Suppose that individuals experience family influences $F_i$ and social influences $S_i$, which are determined by one another, by the types, and by associated unobservables $\nu_i$ and $\xi_i$. Together, background factors obey a simultaneous system:
Suppose, in turn, that type, family, and social factors combine with unobservables $\delta_i$ to produce choices $C_i$ (e.g., effort)
and that $EA_i$ is determined by background factors, choices, and unobservables $\psi_i$
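One way to write this system, as a sketch only: the function symbols $f_T$, $f_F$, and $f_S$ are placeholder notation assumed here, with no functional form implied; the equation numbers are inferred from the in-text references; and the "and so on" in the determinants of type is left implicit.

\[
T_i = f_T(g_i, \eta_i) \tag{1}
\]
\[
F_i = f_F(T_i, S_i, \nu_i) \tag{2}
\]
\[
S_i = f_S(T_i, F_i, \xi_i) \tag{3}
\]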
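In generic notation, and as a sketch only ($f_C$ and $f_{EA}$ are placeholder symbols assumed for illustration, with no functional form implied, and the equation numbers are inferred from the in-text references), the choice and attainment equations can be written as:

\[
C_i = f_C(T_i, F_i, S_i, \delta_i) \tag{4}
\]
\[
EA_i = f_{EA}(T_i, F_i, S_i, C_i, \psi_i) \tag{5}
\]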
M&H argue for the informational value of the conditional probability of EA given genotype:
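In symbols, and assuming the notation $\Pr(\cdot \mid \cdot)$ for a conditional distribution (the equation number is inferred from the in-text references), this object can be sketched as:

\[
\Pr(EA_i \mid g_i) \tag{6}
\]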
Variation in (6) across different values of $g_i$ demonstrates, in the authors' sense, a causal relationship between genes and education.
We focus, instead, on what is learned about (1)–(5) from (6). From the perspective of economic theory, (1)–(5) is a simultaneous equations system, while (6) is a reduced form quantity that summarizes aspects of the data. The classic identification problem of simultaneous equations systems asks what features of the structural relations (1)–(5) are revealed by reduced form evidence such as (6). The simultaneous equations perspective, as has been long understood, reveals the need for a priori assumptions to make credible empirical claims about structural relationships from reduced form ones. How does that logic apply to EA?
First, the determination of mechanisms that produce EA requires a priori assumptions on the structure producing the joint density of all observables. Examples of such assumptions include exclusion restrictions that represent ways to delimit the paths that link different endogenous and predetermined variables. This is the first sense in which social science theory is needed: Deep causal claims require credible a priori assumptions, and economic theory provides precisely that. Any search for explanations needs to be theoretically informed.
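A toy simulation may make the identification problem concrete. This is a minimal sketch, assuming a linear system with a single family factor; the function names `simulate` and `ols_slope` and all coefficient values are hypothetical and chosen only for illustration. Two structures that differ in their exclusion restriction (genotype excluded from the family equation versus genotype excluded from the EA equation) generate the same reduced-form relationship between genotype and EA.

```python
import random


def ols_slope(x, y):
    """Slope from a least-squares regression of y on x (with intercept)."""
    n = len(x)
    mx = sum(x) / n
    my = sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    return sxy / sxx


def simulate(direct, via_family, n=20000, seed=0):
    """Toy linear system with paths g -> EA and g -> F -> EA.

    direct:     hypothetical effect of genotype on EA, holding family fixed
    via_family: hypothetical effect of genotype on family environment F
    The effect of F on EA is fixed at 1.0 to keep the example small.
    """
    rng = random.Random(seed)
    g, ea = [], []
    for _ in range(n):
        g_i = rng.gauss(0.0, 1.0)
        # Family equation: F may or may not depend on genotype.
        f_i = via_family * g_i + rng.gauss(0.0, 1.0)
        # Attainment equation: EA may or may not depend directly on genotype.
        ea_i = direct * g_i + 1.0 * f_i + rng.gauss(0.0, 1.0)
        g.append(g_i)
        ea.append(ea_i)
    return g, ea


# Structure A: genotype is excluded from the family equation.
g_a, ea_a = simulate(direct=0.5, via_family=0.0)
# Structure B: genotype is excluded from the EA equation.
g_b, ea_b = simulate(direct=0.0, via_family=0.5)

slope_a = ols_slope(g_a, ea_a)
slope_b = ols_slope(g_b, ea_b)
print(f"reduced-form slope, structure A: {slope_a:.3f}")
print(f"reduced-form slope, structure B: {slope_b:.3f}")
```

Both runs share a seed so that the samples are comparable draw for draw. In this toy system the distribution of EA given genotype is the same under either structure, so nothing in the reduced form can tell them apart; only an a priori restriction on one of the two paths does.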
Second, claims that (6) reveals statistical causality implicitly depend on the structure (1)–(5) that produces (6). M&H draw analogies between randomized controlled trials and genomic analyses, arguing that the genetic lottery acts as a randomization device. But the information involved in (6) does not translate into interpretable objects of any type unless one has made background assumptions about (1)–(5). Conditions for causal inference, such as the stable unit treatment value assumption and strong ignorability, are statements about the properties of a system. Randomized controlled trials, for example, succeed because the assignment mechanism rules out certain pathways linking a treatment to outcomes.
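For reference, the standard potential-outcomes statement of strong ignorability can be sketched as follows. The symbols $D_i$ (a binary treatment), $Y_i(d)$ (the potential outcome under treatment level $d$), and $X_i$ (observed covariates) are notation assumed here and do not appear in M&H's discussion.

\[
\bigl(Y_i(0), Y_i(1)\bigr) \perp D_i \mid X_i, \qquad 0 < \Pr(D_i = 1 \mid X_i) < 1
\]

The first part rules out unobserved common causes of treatment and outcome; the second requires that every covariate profile can receive either treatment. The companion no-interference condition requires that $Y_i$ depend only on unit $i$'s own treatment, $Y_i = Y_i(D_i)$. Each is a restriction on the system that generates the data, not a property of the data alone.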
The necessity of theory is illustrated by comparing GWAS evidence on EA with the M&H example on the causal effect of lithium on depression. They defend empirical claims of causal links between lithium and reduced depression even though the biological pathway from lithium to mental state is not understood. We see essential differences with EA. The lithium evidence is compelling, despite the absence of clear biological pathways, because the randomization can balance family and social factors.
In contrast, computation of polygenic scores for EA does not allow one to conclude that changing the polygenic score of a given person would (in the probabilistic sense of (6)) change their distribution of EA unless one has taken a stance on the family and social factor processes in (1) that are induced by genotypes. As often noted, genomic correlations with EA could reflect discrimination as opposed to some intrinsic academic ability. Without a theory of these pathways, we do not see how (6) answers substantive questions.
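The contrast between the two cases can be sketched in a toy simulation. Everything here is an assumption made for illustration: the linear functional forms, the parameter values, and in particular the link between a child's genotype index and the family environment through a common parental factor. In the randomized case, assignment is independent of family environment, so a difference in means recovers the assumed effect. In the genotype case, the reduced-form slope mixes the assumed own effect with the family pathway.

```python
import random


def mean(xs):
    return sum(xs) / len(xs)


def ols_slope(x, y):
    """Slope from a least-squares regression of y on x (with intercept)."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    return sxy / sxx


rng = random.Random(1)
N = 20000
TRUE_EFFECT = 0.5    # hypothetical effect of the own variable on the outcome
FAMILY_EFFECT = 1.0  # hypothetical effect of family environment on the outcome

# (a) Randomized treatment: a coin flip, independent of family environment.
fam_treated, fam_control, y_treated, y_control = [], [], [], []
for _ in range(N):
    f = rng.gauss(0.0, 1.0)
    d = 1.0 if rng.random() < 0.5 else 0.0
    y = TRUE_EFFECT * d + FAMILY_EFFECT * f + rng.gauss(0.0, 1.0)
    if d == 1.0:
        fam_treated.append(f)
        y_treated.append(y)
    else:
        fam_control.append(f)
        y_control.append(y)

rct_estimate = mean(y_treated) - mean(y_control)
family_gap = mean(fam_treated) - mean(fam_control)  # balance check

# (b) Genotype: correlated with family environment through a shared
#     parental factor (an assumed pathway, not an estimate of anything).
geno, outcome = [], []
for _ in range(N):
    parent = rng.gauss(0.0, 1.0)
    g = 0.5 * parent + rng.gauss(0.0, 1.0)  # child's genotype index
    f = 0.8 * parent + rng.gauss(0.0, 1.0)  # family environment
    y = TRUE_EFFECT * g + FAMILY_EFFECT * f + rng.gauss(0.0, 1.0)
    geno.append(g)
    outcome.append(y)

geno_slope = ols_slope(geno, outcome)

print(f"randomized estimate:      {rct_estimate:.3f}")
print(f"family gap across arms:   {family_gap:.3f}")
print(f"reduced-form genotype slope: {geno_slope:.3f}")
```

With these parameter values the population slope in case (b) is $0.5 + 0.4/1.25 = 0.82$, not the assumed own effect of $0.5$. Recovering the latter from the former requires a stance on the family equation, which is exactly the kind of a priori assumption the reduced form cannot supply.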
M&H may answer that we are conflating the shallow statistical causality concept with the deeper explanation-based causality concept that they acknowledge is not revealed by (6). We see the issues differently: While the lithium experiments produced useful knowledge in the sense of Marschak (1953), we do not see the same applying to EA. First, while the lithium studies were policy-relevant, that is, led to recommendations on treatment, the same is not true for (6). We see no way of mapping (6), in isolation, to any policy implications if one wishes to rectify inequalities, promote fair equality of opportunity, and so on, without knowledge of mechanisms. Second, the same claim holds with respect to whether (6) can lead to useful knowledge about mechanisms per se: this is the classic failure of identification in simultaneous equations systems without a priori assumptions. An individual's genotype, as it is associated with different family and social pathways, does not admit a reduction of the set of potential mechanisms determining EA, let alone their magnitudes, based on (6) alone. While we are inexpert in how heterogeneity in lithium effects facilitated the search for biological pathways, we suspect that prior biological knowledge was required to do this.
We applaud M&H for beginning the process of integrating genomic research with the existing literatures in statistics and econometrics. We endorse their call to use statistical causal findings on genotypes to help guide the search for explanations. Where we differ is that we believe social science areas such as education require social science behavioral models. Genomic data may help with identification, as they provide observables that help reveal unobserved individual types (in the sense of Eq. (1)), but they cannot succeed alone.
Acknowledgment
A.R. acknowledges the U.S. Department of Defense, contract W911NF2010242.
Financial support
This research received no specific grant from any funding agency, commercial, or not-for-profit sectors.
Competing interest
None.