
Causal History, Statistical Relevance, and Explanatory Power

Published online by Cambridge University Press:  13 April 2023

David Kinney*
Affiliation:
Yale University, New Haven, CT, USA

Abstract

In discussions of the power of causal explanations, one often finds a commitment to two premises. The first is that, all else being equal, a causal explanation is powerful to the extent that it cites the full causal history of why the effect occurred. The second is that, all else being equal, causal explanations are powerful to the extent that the occurrence of a cause allows us to predict the occurrence of its effect. This article proves a representation theorem showing that there is a unique family of functions measuring a causal explanation’s power that satisfies these two premises.

Type: Contributed Paper
Creative Commons
This is an Open Access article, distributed under the terms of the Creative Commons Attribution licence (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted re-use, distribution and reproduction, provided the original article is properly cited.
Copyright
© The Author(s), 2023. Published by Cambridge University Press on behalf of the Philosophy of Science Association

1. Introduction

Several authors in philosophy of science have argued that, all else being equal, a causal explanation is good to the extent that it provides a detailed description of the causal history of why the event being explained (i.e., the explanandum) occurred. Consider the example from Railton (1981, 250):

For any given gas, its particular state $S$ at a time $t$ will be determined solely by its molecular constitution, its initial condition, the deterministic laws of classical dynamics operating upon this initial condition, and the boundary conditions to which it has been subject. Therefore, the ideal explanatory text for its being in state $S$ at time $t$ […] will be a complete causal history of the time evolution of that gas.

The idea being expressed here is that the ideal causal explanation of why a gas ends up in state $S$ at a time $t$ is the full causal history of the gas's evolution from some state $S'$, at some previous time $t'$, to its state $S$ at $t$. From this exemplar of an ideal causal explanation, one can make the further inference that causal explanations in general are good or powerful to the extent that they approximate this ideal. One finds a similar idea expressed by Salmon (1984), who holds that in many cases, good explanation “involves the placing of the explanandum in a causal network consisting of relevant causal interactions that occurred previously and suitable causal processes that connect them to the fact-to-be-explained” (269). Similarly, Lewis (1986, 217) defends the thesis that “to explain an event is to provide some information about its causal history.” Keas (2018) also defends the idea that, all else being equal, scientific explanations are good to the extent that they trace the causal history of an event back as far as possible, calling this feature of an explanation “causal history depth.”

On the other hand, there is also widespread agreement in the literature that, all else being equal, explanations are powerful to the extent that learning the facts that explain an explanandum would allow us to predict the occurrence of the explanandum, if we did not already know that it had occurred. This assumption is made explicit in attempts to formalize explanatory power due to Schupbach and Sprenger (2011) and Crupi and Tentori (2012). Moreover, Eva and Stern (2019) provide a specific formalization of the explanatory power of causal explanations by assuming that, all else being equal, a causal explanation is powerful to the extent that learning that an intervention has brought about a particular cause of an event would allow us to predict the occurrence of that event. Let us call this feature of a causal explanation its “causal statistical relevance.”

These two putative good-making features of a causal explanation can be in tension with one another. Consider the following example (Eva and Stern 2019, 1047–8):

Ettie: Ettie’s Dad went to see the local football team play in a crucial end of season match. Unfortunately, Ettie was busy on the day of the game, so she couldn’t go with him. On her way home, she read a newspaper headline saying that the local team had lost. When she got home, she asked him “Dad, why did we lose?”, to which her witty father replied “because we were losing by fifty points when the fourth quarter started.” Understandably, Ettie still wanted to better understand why her team lost, so she asked her Dad why they were down by so much entering the fourth quarter. He replied that their best player was injured in the opening minutes of the game, and, finally, Ettie’s curiosity ran out.

When Ettie’s father explains the team’s loss by their being down fifty points at the start of the fourth quarter, he provides an explanation with high causal statistical relevance; given an intervention on the game such that the local team is down fifty points at the start of the fourth quarter, it is very likely that they will lose. However, Ettie balks at the explanation because it has very low causal history depth; we don’t get much of a story as to why the local team lost. Indeed, it is only once Ettie’s father cites more distant causal factors contributing to the team’s loss that Ettie’s curiosity is satisfied.

The goal of this paper is to formalize the desiderata that an explanation is good to the extent that it possesses causal history depth and causal statistical relevance. I then prove a representation theorem showing that a specific family of functions provides a measure of causal explanatory power that uniquely satisfies both causal history depth and causal statistical relevance, alongside some minimal ancillary desiderata.

2. Formal preliminaries

2.1. Bayesian networks

We begin with the following definition of a causal graph:

Definition 2.1 A causal graph is a pair ${\cal G} = \left( {{\cal V},{\cal R}} \right)$ , where ${\cal V}$ is a set of random variables that are each measurable with respect to a common probability space ${\cal P} = \left( {{\rm{\Omega }},{\cal A},{\rm{Pr}}} \right)$ , and ${\cal R}$ is an acyclic set of ordered pairs of elements of ${\cal V}$ , usually represented pictorially as arrows from one random variable to another.

The fundamental idea behind the Bayes nets approach to representing causal structure is that if there is a chain of arrows from one variable to another, then the first variable is causally relevant to the second. So, for instance, in an epidemiological causal graph there might be a chain of arrows from a variable representing whether or not a patient smokes to a variable representing whether or not the patient develops lung cancer, thus encoding the claim that smoking causes lung cancer.

If there is an arrow from one variable to another, then we say that the first variable is a parent of the second, and the second variable is a child of the first. We can then define the ancestor and descendant relations as the transitive closures of the parent and child relations, respectively. We are now in a position to define the all-important Markov condition:

Definition 2.2 A probability distribution ${\rm{Pr}}$ is Markov with respect to a graph ${\cal G} = \left( {{\cal V},{\cal R}} \right)$ , where all variables in ${\cal V}$ are measurable with respect to some probability space ${\cal P} = \left( {{\rm{\Omega }},{\cal A},{\rm{Pr}}} \right)$ , if and only if, according to ${\rm{Pr}}$ , each ${\bf{X}} \subseteq {\cal V}$ is independent of any subset of the set of non-descendants of ${\bf{X}}$ in ${\cal G}$ , conditional on its parents in ${\cal G}$ .

The Markov condition ensures that once we know the values taken by the direct causes of some variable set ${\bf{X}}$, information about the values taken by any non-effects of ${\bf{X}}$ is uninformative with respect to the probability that ${\bf{X}}$ takes any value. This reflects the intuitive condition that once we know the direct causes of ${\bf{X}}$, information about more distant causes of ${\bf{X}}$, or about other phenomena not causally related to ${\bf{X}}$, should not be relevant for making predictions about ${\bf{X}}$.

Finally, we are in a position to define a Bayesian network:

Definition 2.3 A Bayesian network (or “Bayes net”) is a pair $\left( {{\cal G},{\rm{Pr}}} \right)$ such that ${\cal G}$ is a graph in which all variables in ${\cal V}$ are measurable with respect to some probability space ${\cal P} = \left( {{\rm{\Omega }},{\cal A},{\rm{Pr}}} \right)$ , no variable is an ancestor of itself (i.e., the graph is acyclic), and ${\rm{Pr}}$ is Markov to ${\cal G}$ .

The core idea of the theory of causal Bayes nets is that, for the reasons given above, the causal structure of any system can be represented as a Bayes net $\left( {{\cal G},{\rm{Pr}}} \right)$ . To illustrate, consider the simple causal graph $X \to Y \to Z \leftarrow W$ . If this graph can be paired with the probability distribution ${\rm{Pr}}$ in order to form a Bayes net, then it must be the case that, according to ${\rm{Pr}}$ , $X$ is unconditionally independent of $W$ , $Y$ is independent of $W$ conditional on $X$ , $Z$ is independent of $X$ conditional on $Y$ and $W$ , and $W$ is unconditionally independent of $X$ and $Y$ . These independence claims are individually necessary and jointly sufficient for ${\rm{Pr}}$ being Markov to the graph.
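
To make this concrete, here is a minimal Python sketch (all conditional probability tables are invented for illustration) that builds the joint distribution for $X \to Y \to Z \leftarrow W$ by multiplying each variable's probability conditional on its parents, anticipating the factorization given as equation (1) below, and then verifies one of the independence claims just listed:

```python
# Minimal sketch: a Bayes net on X -> Y -> Z <- W with invented binary CPTs.
# The joint factorizes as Pr(x, y, z, w) = Pr(x) Pr(w) Pr(y | x) Pr(z | y, w),
# which is just the Markov condition applied variable by variable.
from itertools import product

p_x = {0: 0.7, 1: 0.3}                       # Pr(X)
p_w = {0: 0.6, 1: 0.4}                       # Pr(W)
p_y_x = {0: {0: 0.9, 1: 0.1},                # Pr(Y | X), outer key is x
         1: {0: 0.2, 1: 0.8}}
p_z_yw = {(0, 0): {0: 0.95, 1: 0.05},        # Pr(Z | Y, W), key is (y, w)
          (0, 1): {0: 0.50, 1: 0.50},
          (1, 0): {0: 0.40, 1: 0.60},
          (1, 1): {0: 0.10, 1: 0.90}}

def joint(x, y, z, w):
    return p_x[x] * p_w[w] * p_y_x[x][y] * p_z_yw[(y, w)][z]

# Check one Markov independence: X and W are unconditionally independent.
for x, w in product((0, 1), repeat=2):
    pr_xw = sum(joint(x, y, z, w) for y, z in product((0, 1), repeat=2))
    assert abs(pr_xw - p_x[x] * p_w[w]) < 1e-12
print("Pr(x, w) = Pr(x) Pr(w) for all values, as the Markov condition requires.")
```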

2.2. Intervention distributions

Representing the causal structure of a system as a Bayes net allows us to calculate the probability distribution over a variable in that Bayes net, given an intervention on the system. An intervention is an exogenous setting of the values of one or more variables in the Bayes net that does not depend on values taken by any of the other variables. To see how this works, let us begin with a result from Pearl (2000, 15–16), who proves that if ${\cal V} = \left\{ {{V_1}, \ldots, {V_m}} \right\}$ is the set of variables in a Bayes net, and if each variable in ${\cal V}$ has a corresponding value ${v_1}, \ldots, {v_m}$, and if ${\rm{par}}\left( {{V_i}} \right)$ is the vector of values taken by the set of parents of a variable ${V_i}$ in the Bayes net, then the probability ${\rm{Pr}}\left( {{v_1}, \ldots, {v_m}} \right)$ can be factorized as follows:

(1) $${\rm{Pr}}\left( {{v_1}, \ldots, {v_m}} \right) = \mathop \prod \limits_{i = 1}^m {\rm{Pr}}({v_i}\,|\,{\rm{par}}\left( {{V_i}} \right)).$$

Next, suppose that we intervene on a set of variables ${\bf{X}} \subseteq {\cal V}$, setting it to the set of values ${\bf{x}}$. Pearl (2000, 30) and Spirtes et al. (2000, 51) show that in a Bayes net, the interventional conditional probability ${\rm{Pr}}({v_1}, \ldots, {v_m}\,|\,do\left( {\bf{x}} \right))$ can be obtained using the following truncated factorization:

(2) $${\rm{Pr}}({v_1}, \ldots, {v_m}\,|\,do\left( {\bf{x}} \right)) = \mathop \prod \limits_{i = 1}^m {\rm{P}}{{\rm{r}}_{do\left( {\bf{x}} \right)}}({v_i}\,|\,{\rm{par}}\left( {{V_i}} \right)),$$

where each probability ${\rm{P}}{{\rm{r}}_{do\left( {\bf{x}} \right)}}({v_i}\,|\,{\rm{par}}\left( {{V_i}} \right))$ is defined as follows:

(3) $${\rm{Pr}}_{do\left( {\bf{x}} \right)}\left( {{v_i}\,|\,{\rm{par}}\left( {{V_i}} \right)} \right) = \begin{cases} {\rm{Pr}}({v_i}\,|\,{\rm{par}}\left( {{V_i}} \right)) & {\rm{if}}\ {V_i} \notin {\bf{X}}, \\ 1 & {\rm{if}}\ {V_i} \in {\bf{X}}\ {\rm{and}}\ {v_i}\ {\rm{consistent\ with}}\ {\bf{x}}, \\ 0 & {\rm{otherwise}}. \end{cases}$$

Put another way, if we intervene on some set of variables ${\bf{X}} \subseteq {\cal V}$ in a Bayes net, then we make it the case that the values of the variables in ${\bf{X}}$ no longer depend on their parents, but instead depend solely on the intervention. This can be represented graphically by a sub-graph in which all arrows into all variables in ${\bf{X}}$ are removed. This sub-graph is called the pruned sub-graph for an intervention on ${\bf{X}}$. Spirtes et al. (2000) prove that ${\rm{P}}{{\rm{r}}_{do\left( {\bf{x}} \right)}}$ will be Markov to this pruned sub-graph of $\left( {{\cal G},{\rm{Pr}}} \right)$, so that we can calculate the joint probability distribution over the pruned sub-graph created by any intervention on any set of variables ${\bf{X}}$, using equation (2).
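
To illustrate, the following minimal Python sketch computes interventional probabilities on the chain $X \to Y \to Z$ via the truncated factorization of equations (2) and (3); the conditional probability tables are invented for the example:

```python
# Minimal sketch of the truncated factorization (equations (2)-(3)) on the
# chain X -> Y -> Z. All CPT numbers are invented for illustration.
from itertools import product

p_x = {0: 0.5, 1: 0.5}                                # Pr(X)
p_y_x = {0: {0: 0.8, 1: 0.2}, 1: {0: 0.1, 1: 0.9}}    # Pr(Y | X)
p_z_y = {0: {0: 0.7, 1: 0.3}, 1: {0: 0.05, 1: 0.95}}  # Pr(Z | Y)

def pr_z_do(z, do):
    """Pr(Z = z | do(...)): each intervened variable's CPT factor is
    replaced by an indicator that its value matches the intervention."""
    total = 0.0
    for x, y in product((0, 1), repeat=2):
        fx = (1.0 if do['X'] == x else 0.0) if 'X' in do else p_x[x]
        fy = (1.0 if do['Y'] == y else 0.0) if 'Y' in do else p_y_x[x][y]
        total += fx * fy * p_z_y[y][z]
    return total

print(pr_z_do(0, {'X': 1}))  # sums Pr(y | X = 1) Pr(Z = 0 | y) over y
print(pr_z_do(0, {'Y': 1}))  # arrow X -> Y is cut: answer is Pr(Z = 0 | Y = 1)
```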

2.3. Causal distance

Next, we define a measure of causal history depth, in the context of a given Bayes net. Let us begin with some graph-theoretic terminology.

Definition 2.4 For any two variables $X$ and $Y$ in a graph ${\cal G} = \left( {{\cal V},{\cal R}} \right)$ , a directed path from $X$ to $Y$ is a set of edges $\left\{ {{R_1}, \ldots, {R_n}} \right\}$ such that:

  i. Each ${R_i}$ in the set is an element of ${\cal R}$;

  ii. ${R_1} = \left( {X,{V_j}} \right)$, where ${V_j} \in {\cal V}$;

  iii. ${R_n} = \left( {{V_k},Y} \right)$, where ${V_k} \in {\cal V}$; and

  iv. there exists a sequence of distinct variables $\left( {{V_1}, \ldots, {V_{n + 1}}} \right)$ such that, for each ${R_i}$ in the path, ${R_i} = \left( {{V_i},{V_{i + 1}}} \right)$.

Pictorially, there is a directed path from $X$ to $Y$ in a graph if one can follow the edges of the graph to “travel” from $X$ to $Y$ , moving with the direction of the edges, without passing through the same variable more than once. To illustrate, in the graph $X \to Y \to Z \leftarrow W$ , there is a directed path from $X$ to $Z$ , but not from $X$ to $W$ .
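
As a minimal sketch of Definition 2.4 in code, the following Python snippet checks directed-path existence by breadth-first search and reproduces the two verdicts just given:

```python
# Minimal sketch: directed-path existence (Definition 2.4) by breadth-first
# search over the edge set of the example graph X -> Y -> Z <- W.
from collections import deque

edges = {('X', 'Y'), ('Y', 'Z'), ('W', 'Z')}

def has_directed_path(source, target):
    """True iff one can travel from source to target along arrow directions."""
    frontier, seen = deque([source]), {source}
    while frontier:
        node = frontier.popleft()
        for u, v in edges:
            if u == node and v not in seen:
                if v == target:
                    return True
                seen.add(v)
                frontier.append(v)
    return False

print(has_directed_path('X', 'Z'))  # True: X -> Y -> Z
print(has_directed_path('X', 'W'))  # False: both arrows point into Z
```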

Using the cardinalities of directed paths between variables, we define a proximity measure on the variables in a graph in two steps.

Definition 2.5 For any causal graph ${\cal G} = \left( {{\cal V},{\cal R}} \right)$, the causal distance ${\delta _{\cal G}}\left( {X,Y} \right)$ between two variables $X \in {\cal V}$ and $Y \in {\cal V}$ is the cardinality of the directed path from $X$ to $Y$ with minimal cardinality, if such a directed path exists. If no such path exists, then ${\delta _{\cal G}}\left( {X,Y} \right) = {\rm{max}}\left\{ {{\delta _{\cal G}}\left( {{V_i},{V_j}} \right):{V_i},{V_j} \in {\cal V}} \right\}$; that is, ${\delta _{\cal G}}\left( {X,Y} \right)$ is set to the length of the longest shortest directed path between any pair of path-connected variables in ${\cal G}$.

Definition 2.6 For any causal graph ${\cal G} = \left( {{\cal V},{\cal R}} \right)$ , the normalized causal proximity ${\pi _{\cal G}}\left( {{\bf{X}},{\bf{Y}}} \right)$ takes as its arguments any two sets ${\bf{X}} \subseteq {\cal V}$ and ${\bf{Y}} \subseteq {\cal V}$ , and is defined as follows:

$${\pi _{\cal G}}\left( {{\bf{X}},{\bf{Y}}} \right) = {{{\rm{max}}\left\{ {{\delta _{\cal G}}\left( {{V_i},{V_j}} \right):{V_i},{V_j} \in {\cal V}} \right\} - {\rm{max}}\left\{ {{\delta _{\cal G}}\left( {X,Y} \right):X \in {\bf{X}},Y \in {\bf{Y}}} \right\}} \over {{\rm{max}}\left\{ {{\delta _{\cal G}}\left( {{V_i},{V_j}} \right):{V_i},{V_j} \in {\cal V}} \right\}}}.$$

In other words, ${\pi _{\cal G}}\left( {{\bf{X}},{\bf{Y}}} \right)$ returns the normalized difference between the length of the longest shortest directed path between any two variables in the graph ${\cal G}$ and the length of the longest shortest path between a variable in ${\bf{X}}$ and a variable in ${\bf{Y}}$ . The result is a measure of proximity that approaches one as the longest shortest path between a variable in ${\bf{X}}$ and a variable in ${\bf{Y}}$ gets shorter in length, and approaches zero as the longest shortest path between a variable in ${\bf{X}}$ and a variable in ${\bf{Y}}$ becomes longer. To illustrate, in the graph $X \to Y \to Z \leftarrow W$ , ${\pi _{\cal G}}\left( {\left\{ X \right\},\left\{ W \right\}} \right) = 0$ , ${\pi _{\cal G}}\left( {\left\{ {Y,W} \right\},\left\{ Z \right\}} \right) = .5$ , and ${\pi _{\cal G}}\left( {\left\{ {X,Y} \right\},\left\{ Z \right\}} \right) = 0$ . As it will occasionally be more convenient to speak in terms of normalized causal distance rather than normalized causal proximity, we define a normalized causal distance function ${{\rm{\Delta }}_{\cal G}}\left( {{\bf{X}},{\bf{Y}}} \right)$ :

Definition 2.7 For any causal graph ${\cal G} = \left( {{\cal V},{\cal R}} \right)$ , the normalized causal distance ${{\rm{\Delta }}_{\cal G}}\left( {{\bf{X}},{\bf{Y}}} \right)$ is given by the equation ${{\rm{\Delta }}_{\cal G}}\left( {{\bf{X}},{\bf{Y}}} \right) = 1 - {\pi _{\cal G}}\left( {{\bf{X}},{\bf{Y}}} \right)$ .
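
The following minimal Python sketch implements Definitions 2.5–2.7 and reproduces the proximity values computed above for the graph $X \to Y \to Z \leftarrow W$; following the reading of Definition 2.5 noted above, path-disconnected pairs receive the longest shortest directed path length between any path-connected pair:

```python
# Minimal sketch of causal distance (Def. 2.5), normalized causal proximity
# (Def. 2.6), and normalized causal distance (Def. 2.7) for X -> Y -> Z <- W.
from collections import deque

children = {'X': ['Y'], 'Y': ['Z'], 'W': ['Z'], 'Z': []}

def shortest(source, target):
    """Length in edges of the shortest directed path, or None if none exists."""
    frontier, seen = deque([(source, 0)]), {source}
    while frontier:
        node, dist = frontier.popleft()
        for child in children[node]:
            if child == target:
                return dist + 1
            if child not in seen:
                seen.add(child)
                frontier.append((child, dist + 1))
    return None

nodes = list(children)
delta_max = max(d for a in nodes for b in nodes if a != b
                for d in [shortest(a, b)] if d is not None)

def delta(a, b):                      # causal distance, Definition 2.5
    d = shortest(a, b)
    return d if d is not None else delta_max

def proximity(xs, ys):                # pi_G, Definition 2.6
    worst = max(delta(a, b) for a in xs for b in ys)
    return (delta_max - worst) / delta_max

def distance(xs, ys):                 # Delta_G, Definition 2.7
    return 1 - proximity(xs, ys)

print(proximity({'X'}, {'W'}))        # 0.0
print(proximity({'Y', 'W'}, {'Z'}))   # 0.5
print(proximity({'X', 'Y'}, {'Z'}))   # 0.0
print(distance({'X'}, {'Z'}))         # 1.0
```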

3. The representation theorem

In this primary section of the paper, I make good on my promise in the introduction to state a set of desiderata that formalize those cases in which explanatory power requires a trade-off between causal history depth and predictive power, and then prove that a specific family of measures uniquely satisfies these desiderata. In several respects, my proposed desiderata are adapted from those of Schupbach and Sprenger (2011), modified so as to incorporate causal history depth and intervention distributions, neither of which Schupbach and Sprenger consider.

I begin by stating three ancillary desiderata for such a measure. The first is as follows:

D1 (Formal structure) For any Bayes net $\left( {{\cal G},{\rm{Pr}}} \right)$ where the graph ${\cal G} = \left( {{\cal V},{\cal R}} \right)$ is such that each variable in ${\cal V}$ is measurable with respect to the probability space ${\cal P} = \left( {{\rm{\Omega }},{\cal A},{\rm{Pr}}} \right)$, ${\theta _{{\cal P},{\cal G}}}$ is a function from any two sets of values ${\bf{e}}$ and ${\bf{c}}$ of any two sets of variables ${\bf{E}} \subseteq {\cal V}$ and ${\bf{C}} \subseteq {\cal V}$ to a real number ${\theta _{{\cal P},{\cal G}}}\left( {{\bf{e}},{\bf{c}}} \right) \in \left[ { - 1,1} \right]$ that can be represented as a function of ${\rm{Pr}}({\bf{e}}\,|\,do\left( {\bf{c}} \right))$, ${\rm{Pr}}\left( {\bf{e}} \right)$, and ${\pi _{\cal G}}\left( {{\bf{C}},{\bf{E}}} \right)$.

This desideratum ensures that ${\theta _{{\cal P},{\cal G}}}$ takes as input (i) the fact that a set of effect variables ${\bf{E}}$ takes a set of values ${\bf{e}}$ and (ii) the fact that a set of causal variables ${\bf{C}}$ takes a set of values ${\bf{c}}$, and returns a value between $ - 1$ and $1$ representing the power with which the fact that ${\bf{C}} = {\bf{c}}$ explains the fact that ${\bf{E}} = {\bf{e}}$. Moreover, this value is determined solely by the following quantities: (i) the probability that ${\bf{E}} = {\bf{e}}$ given an intervention setting ${\bf{C}}$ to ${\bf{c}}$, (ii) the marginal probability that ${\bf{E}} = {\bf{e}}$, and (iii) the normalized causal proximity between ${\bf{C}}$ and ${\bf{E}}$.

Second, I introduce an additional formal constraint:

D2 (Normality and form) The function ${\theta _{{\cal P},{\cal G}}}\left( {{\bf{e}},{\bf{c}}} \right)$ is a ratio of two functions of ${\rm{Pr}}({\bf{e}}\,|\,do\left( {\bf{c}} \right))$, ${\rm{Pr}}\left( {\bf{e}} \right)$, and ${\pi _{\cal G}}\left( {{\bf{C}},{\bf{E}}} \right)$, each of which is homogeneous in its arguments to the lowest possible degree $k \ge 1$.

The requirement that the function be a ratio of two functions with the same arguments ensures that it is normalized. Following Schupbach and Sprenger, I hold that requiring each function to be homogeneous in its arguments to the lowest possible degree $k \ge 1$ ensures that the resulting measure of explanatory power is maximally simple, in a well-defined sense advocated by Carnap (1950) and Kemeny and Oppenheim (1952). Note that a function $f$ is homogeneous in its arguments ${x_1}, \ldots, {x_n}$ to degree $k$ if, for all $\gamma \in {\mathbb R}$, $f\left( {\gamma {x_1}, \ldots, \gamma {x_n}} \right) = {\gamma ^k}f\left( {{x_1}, \ldots, {x_n}} \right)$.
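
To illustrate the definition with the functional form derived below (this is an instance, not part of the definition): the denominator of the measure in Proposition 3.1 is $x + y + \alpha z$, with $x = {\rm{Pr}}({\bf{e}}\,|\,do\left( {\bf{c}} \right))$, $y = {\rm{Pr}}\left( {\bf{e}} \right)$, and $z = {\pi _{\cal G}}\left( {{\bf{C}},{\bf{E}}} \right)$, and scaling all three arguments by $\gamma$ gives $\gamma x + \gamma y + \alpha \gamma z = {\gamma ^1}\left( {x + y + \alpha z} \right)$, so it is homogeneous to degree $1$.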

Third, I introduce a desideratum aimed at capturing the idea that there is a specific zero point for any measure of explanatory power:

D3 (Neutrality) If ${\rm{Pr}}({\bf{e}}\,|\,do\left( {\bf{c}} \right)) = {\rm{Pr}}\left( {\bf{e}} \right)$ , then ${\theta _{{\cal P},{\cal G}}}\left( {{\bf{e}},{\bf{c}}} \right) = 0$ .

Neutrality ensures that when an intervention setting causal variables to a particular set of values provides no information about the explanandum effect, causal explanatory power is zero.

With these three ancillary desiderata established, I move now to a formalization of causal history depth:

D4 (Causal history depth) Holding fixed the value of ${\rm{Pr}}({\bf{e}}\,|\,do\left( {\bf{c}} \right))$ and ${\rm{Pr}}\left( {\bf{e}} \right)$ , if ${\rm{Pr}}({\bf{e}}\,|\,do\left( {\bf{c}} \right)) \gt {\rm Pr}\left( {\bf{e}} \right)$ , then ${\theta _{{\cal P},{\cal G}}}\left( {{\bf{e}},{\bf{c}}} \right)$ is strictly decreasing in ${\pi _{\cal G}}\left( {{\bf{C}},{\bf{E}}} \right)$ , and if ${\rm{Pr}}({\bf{e}}\,|\,do\left( {\bf{c}} \right)) \lt {\rm Pr}\left( {\bf{e}} \right)$ , then ${\theta _{{\cal P},{\cal G}}}\left( {{\bf{e}},{\bf{c}}} \right)$ is strictly increasing in ${\pi _{\cal G}}\left( {{\bf{C}},{\bf{E}}} \right)$ .

This desideratum encodes the idea that, all else being equal, if an intervention setting ${\bf{C}}$ to ${\bf{c}}$ is positively statistically relevant to the event denoted by ${\bf{E}} = {\bf{e}}$, then ${\bf{C}} = {\bf{c}}$ is explanatorily powerful to the extent that it cites causes that are more causally distant (and so less proximal) with respect to the variables in ${\bf{E}}$. Moreover, it introduces the idea that if an intervention setting ${\bf{C}}$ to ${\bf{c}}$ is negatively statistically relevant to the event denoted by ${\bf{E}} = {\bf{e}}$, then explanatory power is a decreasing function of causal history depth (and so an increasing function of causal proximity). This reflects the assumption that attempted explanations that cite factors that both make the event being explained less likely and are causally far removed from the event being explained are especially bad explanations.

Fifth and finally, I introduce a formalization of causal statistical relevance:

D5 (Causal statistical relevance) Holding fixed the value of ${\pi _{\cal G}}\left( {{\bf{C}},{\bf{E}}} \right)$ , the greater the degree of causal statistical relevance between ${\bf{e}}$ and ${\bf{c}}$ (defined here as the difference ${\rm{Pr}}({\bf{e}}\,|\,do\left( {\bf{c}} \right)) - {\rm{Pr}}\left( {\bf{e}} \right)$ ), the greater the value of ${\theta _{{\cal P},{\cal G}}}\left( {{\bf{e}},{\bf{c}}} \right)$ .

This desideratum says that the more an intervention such that ${\bf{C}} = {\bf{c}}$ makes it likely that ${\bf{E}} = {\bf{e}}$ , the greater the explanatory power of ${\bf{c}}$ with respect to ${\bf{e}}$ .

These five desiderata together determine the general form of a measure of causal explanatory power, as established by the following representation theorem (see the appendix for a proof of this and all subsequent facts and propositions):

Proposition 3.1 Any measure ${\theta _{{\cal P},{\cal G}}}\left( {{\bf{e}},{\bf{c}}} \right)$ that satisfies D1–D5 has the form

$${\theta _{{\cal P},{\cal G}}}\left( {{\bf{e}},{\bf{c}}} \right) = {{{\rm{Pr}}({\bf{e}}\,|\,do\left( {\bf{c}} \right))\,-\, {\rm{Pr}}\left( {\bf{e}} \right)} \over {{\rm{Pr}}({\bf{e}}\,|\,do\left( {\bf{c}} \right))\,+\, {\rm{Pr}}\left( {\bf{e}} \right)\,+\,\alpha {\pi _{\cal G}\left( {{\bf{C}},{\bf{E}}} \right)}}},\;\;\;\;{\rm{where}}\;\;\alpha \gt 0.$$

The equation for ${\theta _{{\cal P},{\cal G}}}\left( {{\bf{e}},{\bf{c}}} \right)$ can be re-written in terms of normalized causal distance as follows:

(4) $${\theta _{{\cal P},{\cal G}}}\left( {{\bf{e}},{\bf{c}}} \right) = {{{\rm{Pr}}({\bf{e}}\,|\,do\left( {\bf{c}} \right)) - {\rm{Pr}}\left( {\bf{e}} \right)} \over {{\rm{Pr}}({\bf{e}}\,|\,do\left( {\bf{c}} \right)) + {\rm{Pr}}\left( {\bf{e}} \right) + \alpha \left[ {1 - {{\rm{\Delta }}_{\cal G}}\left( {{\bf{C}},{\bf{E}}} \right)} \right]}},\;\;\;\;{\rm{where}}\;\;\alpha \gt 0.$$
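
As a minimal sketch, the measure of Proposition 3.1 can be implemented directly, with the interventional probability, the marginal probability, and the normalized causal distance supplied by hand:

```python
# Minimal sketch of equation (4): causal explanatory power from the
# interventional probability, the marginal probability, the normalized
# causal distance Delta_G(C, E), and the weight alpha > 0.
def theta(pr_e_do_c, pr_e, dist, alpha):
    """Returns a value in [-1, 1]; zero when the intervention is irrelevant."""
    proximity = 1 - dist  # pi_G(C, E) = 1 - Delta_G(C, E)
    return (pr_e_do_c - pr_e) / (pr_e_do_c + pr_e + alpha * proximity)

print(theta(0.3, 0.3, 0.5, alpha=1.0))  # 0.0, as neutrality (D3) requires
```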

This result raises the immediate question of the significance of the coefficient $\alpha $ . For a given Bayes net $\left( {{\cal G},{\rm{Pr}}} \right)$ with variable settings ${\bf{C}} = {\bf{c}}$ and ${\bf{E}} = {\bf{e}}$ , let $\phi_{{\cal P},{\cal G}}$ be a function defined as follows:

(5) $${\phi _{{\cal P},{\cal G}}}\left( {\alpha ;{\bf{e}},{\bf{c}}} \right) = {{\left| {\partial {\theta _{{\cal P},{\cal G}}}\left( {{\bf{e}},{\bf{c}}} \right)/\partial {\pi _{\cal G}}\left( {{\bf{C}},{\bf{E}}} \right)} \right|} \over {\left| {\partial {\theta _{{\cal P},{\cal G}}}\left( {{\bf{e}},{\bf{c}}} \right)/\partial {\rm{Pr}}({\bf{e}}\,|\,do\left( {\bf{c}} \right))} \right|}}.$$

If we take the absolute value of the partial derivative of ${\theta _{{\cal P},{\cal G}}}$ with respect to any argument to measure the importance of that argument to the overall measure of causal explanatory power, then ${\phi _{{\cal P},{\cal G}}}$ measures the relative importance of causal proximity/distance, as compared to the statistical relevance of an intervention setting ${\bf{C}}$ to ${\bf{c}}$, for a fixed value of ${\rm{Pr}}\left( {\bf{e}} \right)$.$^1$ The following fact about ${\phi _{{\cal P},{\cal G}}}$ holds:

Fact 3.2 For any Bayes net $\left( {{\cal G},{\rm{Pr}}} \right)$ and any ${\rm{Pr}}({\bf{e}}\,|\,do\left( {\bf{c}} \right))$, ${\rm{Pr}}\left( {\bf{e}} \right)$, and ${\pi _{\cal G}}\left( {{\bf{C}},{\bf{E}}} \right)$, if ${\rm{Pr}}({\bf{e}}\,|\,do\left( {\bf{c}} \right)) \ne {\rm{Pr}}\left( {\bf{e}} \right)$, then $d{\phi _{{\cal P},{\cal G}}}\left( {\alpha ;{\bf{e}},{\bf{c}}} \right)/d\alpha \gt 0$.

Thus, increases in $\alpha $ result in increases in the relative importance of proximity/distance, as compared to causal statistical relevance, for the measure of causal explanatory power, whenever there is some causal statistical relevance, either positive or negative, between the explanans and the explanandum.
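
A quick numeric illustration of Fact 3.2, using the closed form of ${\phi _{{\cal P},{\cal G}}}$ derived in the appendix (equations (17) and (19) combine to $\alpha \left| {{\rm{Pr}}({\bf{e}}\,|\,do\left( {\bf{c}} \right)) - {\rm{Pr}}\left( {\bf{e}} \right)} \right|/\left( {2{\rm{Pr}}\left( {\bf{e}} \right) + \alpha {\pi _{\cal G}}\left( {{\bf{C}},{\bf{E}}} \right)} \right)$) and invented probability values:

```python
# Minimal sketch: phi from equations (17)/(19), evaluated at increasing
# alpha. The probability inputs are invented for illustration.
def phi(alpha, pr_e_do_c, pr_e, proximity):
    return alpha * abs(pr_e_do_c - pr_e) / (2 * pr_e + alpha * proximity)

for alpha in (0.5, 1.0, 2.0, 4.0):
    print(alpha, round(phi(alpha, pr_e_do_c=0.8, pr_e=0.3, proximity=0.5), 3))
# The printed values strictly increase in alpha, as Fact 3.2 states.
```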

To illustrate how this measure works, let us return to Eva and Stern’s Ettie example:

Example 3.3 Consider the simple causal graph $X \to Y \to Z$ , where $X$ is a binary variable denoting whether or not the team’s best player is injured in the first half of the match ( $0$ if not injured, $1$ if injured), $Y$ is a binary variable denoting whether or not the home team is down by more than thirty points at the start of the fourth quarter ( $0$ if they are not, $1$ if they are), and $Z$ is a binary variable denoting whether or not the home team loses ( $0$ if they lose, $1$ if they do not lose). Suppose that ${\rm{Pr}}(Z = 0\,|\,do\left( {X = 1} \right)) = .8$ , ${\rm{Pr}}(Z = 0\,|\,do\left( {Y = 1} \right)) = .99$ , and ${\rm{Pr}}\left( {Z = 0} \right) = .3$ . We know that ${{\rm{\Delta }}_{\cal G}}\left( {\left\{ X \right\},\left\{ Z \right\}} \right) = 1$ and ${{\rm{\Delta }}_{\cal G}}\left( {\left\{ Y \right\},\left\{ Z \right\}} \right) = .5$ . It follows that if $\alpha \gt .456$ , then ${\theta _{{\cal P},{\cal G}}}\left( {Z = 0,X = 1} \right) \gt {\theta _{{\cal P},{\cal G}}}\left( {Z = 0,Y = 1} \right)$ .

Thus, for suitably large $\alpha $ (and so a suitably large emphasis on causal history depth as a determinant of causal explanatory power), my proposed measure of causal explanatory power can deliver verdicts in keeping with Ettie’s intuitions in this vignette.
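
The threshold in Example 3.3 can be checked directly; the following sketch restates the implementation of equation (4) so that it runs on its own:

```python
# Minimal check of Example 3.3: the injury explanation (distance 1) beats the
# fourth-quarter-deficit explanation (distance .5) exactly when alpha > .456.
def theta(pr_e_do_c, pr_e, dist, alpha):
    return (pr_e_do_c - pr_e) / (pr_e_do_c + pr_e + alpha * (1 - dist))

for alpha in (0.4, 0.5, 1.0):
    injury = theta(0.80, 0.3, dist=1.0, alpha=alpha)   # do(X = 1)
    deficit = theta(0.99, 0.3, dist=0.5, alpha=alpha)  # do(Y = 1)
    print(alpha, injury > deficit)
# Prints False at 0.4 and True at 0.5 and 1.0, bracketing the .456 threshold.
```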

One might object at this stage that the formalization of causal history depth presented here only tracks the degree to which an explanation cites a distant cause relative to the explanandum effect, and that this is distinct from the desideratum that an explanation fills in the full causal history of the events leading up to the explanandum effect. In response, I prove a result showing that, necessarily, the function derived above will deliver the result that the explanatory power of a causal explanation is always positively associated with the extent to which that explanation cites the full causal history of an explanandum effect.

Consider any Bayes net $\left( {{\cal G},{\rm{Pr}}} \right)$ whose graph ${\cal G} = \left( {{\cal V},{\cal R}} \right)$ is such that all variables are measurable with respect to some probability space ${\cal P}$. Let ${\bf{E}}$ be some subset of ${\cal V}$, let ${\rm{Pa}}{{\rm{r}}_0}\left( {\bf{E}} \right)$ denote the parents of the variables in ${\bf{E}}$ according to ${\cal G}$, let ${\rm{Pa}}{{\rm{r}}_1}\left( {\bf{E}} \right)$ denote the parents of the parents of the variables in ${\bf{E}}$ according to ${\cal G}$, and so on. Let ${\rm{\Xi }}\left( n \right) = \cup _{i = 0}^n\;{\rm{Pa}}{{\rm{r}}_i}\left( {\bf{E}} \right)$, and let $\xi \left( n \right)$ be a set of values taken by the variables in ${\rm{\Xi }}\left( n \right)$. The following proposition holds:

Proposition 3.4 For all $n \gt 0$ , if $Pa{r_n}\left( \mathbf{E} \right)$ is non-empty, $Pa{r_n}\left( \mathbf{E} \right) \ne Pa{r_{n - 1}}\left( \mathbf{E} \right)$ , and $\mathbf{E} \cap \Xi \left( n \right) = \emptyset $ , then ${\theta _{{\cal P},{\cal G}}}\left( {\mathbf{e},\xi \left( n \right)} \right) \gt {\theta _{{\cal P},{\cal G}}}\left( {\mathbf{e},\xi \left( {n - 1} \right)} \right)$ .

This ensures that, for any set of variables ${\bf{E}}$ , we can generate a more powerful explanation of why ${\bf{E}}$ takes the value that it does by accounting for more of the causal history of the event represented by ${\bf{E}} = {\bf{e}}$ . This shows that when we stipulate as desiderata for a measure of causal explanatory power my formalizations of causal history depth and causal statistical relevance, the measure proposed here captures the idea that, all else being equal, ideal causal explanation involves a maximally perspicuous filling-in of the causal chain of events resulting in the explanandum effect, in keeping with the motivating intuition of this paper.
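
To see Proposition 3.4 at work numerically, consider the chain $W \to X \to Y \to Z$ with ${\bf{E}} = \left\{ Z \right\}$: by the Markov condition the interventional probability is the same for every $n$, so only the proximity term varies. The probability values in this sketch are invented:

```python
# Minimal sketch of Proposition 3.4 on the chain W -> X -> Y -> Z with
# E = {Z}. Xi(0) = {Y}, Xi(1) = {X, Y}, Xi(2) = {W, X, Y}; the interventional
# probability Pr(z | do(xi(n))) equals Pr(z | y) for every n.
def theta(pr_e_do_c, pr_e, proximity, alpha=1.0):
    return (pr_e_do_c - pr_e) / (pr_e_do_c + pr_e + alpha * proximity)

pr_z_do, pr_z = 0.9, 0.3          # invented values of Pr(z | y) and Pr(z)
# Graph diameter is 3; the longest shortest path from Xi(n) to Z grows
# from 1 to 3 edges as n runs from 0 to 2.
proximities = [(3 - 1) / 3, (3 - 2) / 3, (3 - 3) / 3]
print([round(theta(pr_z_do, pr_z, p), 3) for p in proximities])
# Output strictly increases: citing more of the causal history adds power.
```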

4. Conclusion

I conclude by first noting that my goal in this paper has not been to give a formal measure of causal explanatory power that delivers intuitive judgements in all applicable circumstances. Indeed, I take it that no all-things-considered quantitative measure of explanatory power could possibly comport with our intuitions or scientific practices in all cases.$^2$ Instead, my aim has been to examine specifically those cases in which the power of a causal explanation is determined by a trade-off between causal history depth and causal statistical relevance.

Even with this qualification, it could be argued that there is no context in which explanatory power is entirely determined by a trade-off between these two properties, and that instead there is always a wide array of factors that determine causal explanatory power in any given context, such that the concept of explanatory power itself never admits of formal representation. Against this line of argument, I hold that in some cases, the sole primary determinants of causal explanatory power are causal history depth and causal statistical relevance. In these cases, my measure amounts to an explication of explanatory power, in the sense of Carnap (1950). That is, it takes an inherently vague, imprecise notion from the real world and renders it mathematically tractable, while still capturing something close enough to the actual determinants of our judgements of explanatory power.

Acknowledgments

I am grateful to Jonathan Birch, Luc Bovens, Christopher Hitchcock, Christian List, Katie Steele, Reuben Stern, Thalia Vrantsidis, and audiences at the LSE PhD student work-in-progress seminar, the 2018 Explanatory Power Workshop at the University of Geneva, and the Concepts and Cognition Lab at Princeton University for feedback on various drafts of this paper.

A. Proofs and demonstrations

A.1 Proof of Proposition 3.1

Proof. For the sake of concision, let $x = {\rm{Pr}}({\bf{e}}\,|\,do\left( {\bf{c}} \right))$ , let $y = {\rm{Pr}}\left( {\bf{e}} \right)$ , and let $z = {\pi _{\cal G}}\left( {{\bf{C}},{\bf{E}}} \right)$ . By D1, a measure of causal explanatory power must be a function $f\left( {x,y,z} \right)$ . We begin by searching for a function that is homogeneous in its arguments to degree $1$ , in keeping with D2. Such a function has the form

(6) $$f\left( {x,y,z} \right) = {{ax + by + cz} \over {\bar ax + \bar by + \bar cz}}.$$

D3 requires that the numerator is zero whenever $x = y$ . This is achieved by letting $a = - b$ and $c = 0$ , so that we have

(7) $$f\left( {x,y,z} \right) = {{a\left( {x - y} \right)} \over {\bar ax + \bar by + \bar cz}}.$$

Letting $x = 1$ gives us

(8) $$f\left( {x,y,z} \right) = {{a - ay} \over {\bar a + \bar by + \bar cz}}.$$

By D1, D4, and D5, as $y \to 0$ and $z \to 0$ , it must be the case that $f\left( {x,y,z} \right) \to 1$ . This requires that $a = \bar a$ , so that we have

(9) $$f\left( {x,y,z} \right) = {{a\left( {x - y} \right)} \over {ax + \bar by + \bar cz}}.$$

Next, let $x = 0$ so that we have

(10) $$f\left( {x,y,z} \right) = {{ - ay} \over {\bar by + \bar cz}}.$$

By D1, D4, and D5, as $y \to 1$ and $z \to 0$ , it must be the case that $f\left( {x,y,z} \right) \to - 1$ . This requires that $\bar b = a$ , so that we have

(11) $$f\left( {x,y,z} \right) = {{a\left( {x - y} \right)} \over {a\left( {x + y} \right) + \bar cz}}.$$

It remains to determine the sign of $a$ and $\bar c$ . Let $x = 1$ and $y = 0$ , so that

(12) $$f\left( {x,y,z} \right) = {a \over {a + \bar cz}}.$$

If $\bar c \lt 0$ , then $f\left( {x,y,z} \right) \gt 1$ for positive $z$ , in violation of D1. Thus, $\bar c \ge 0$ . Moreover, it must be the case that $\bar c \gt 0$ for D4 to hold in general. Next, let $x = 0$ and $y = 1$ , so that

(13) $$f\left( {x,y,z} \right) = {{ - a} \over {a + \bar cz}}.$$

If $a \lt 0$ , then $f\left( {x,y,z} \right) \lt - 1$ for some $\bar c \gt 0$ , in violation of D1. Thus, $a \ge 0$ . Moreover, it must be the case that $a \gt 0$ for D5 to hold in general. Letting $\alpha = \bar c/a$ , we arrive at the function

(14) $$f\left( {x,y,z} \right) = {{x - y} \over {x + y + \alpha z}},$$

or

(15) $${\theta _{\cal P,G}}\left( {{\bf{e}},{\bf{c}}} \right) = {{{\rm{Pr}}({\bf{e}}\,|\,do\left( {\bf{c}} \right)) - {\rm{Pr}}\left( {\bf{e}} \right)} \over {{\rm{Pr}}({\bf{e}}\,|\,do\left( {\bf{c}} \right)) + {\rm{Pr}}\left( {\bf{e}} \right) + \alpha {\pi _{\cal G}\left( {{\bf{C}},{\bf{E}}} \right)}}},$$

where $\alpha \gt 0$ .

A.2 Demonstration of Fact 3.2

Proof. We proceed by expanding the function ${\phi _{{\cal P},{\cal G}}}$ :

$${\phi _{{\cal P},{\cal G}}}\left( {\alpha ;{\bf{e}},{\bf{c}}} \right) = {{\left| {\partial {\theta _{{\cal P},{\cal G}}}\left( {{\bf{e}},{\bf{c}}} \right)/\partial {\pi _{\cal G}}\left( {{\bf{C}},{\bf{E}}} \right)} \right|} \over {\left| {\partial {\theta _{{\cal P},{\cal G}}}\left( {{\bf{e}},{\bf{c}}} \right)/\partial {\rm{Pr}}({\bf{e}}\,|\,do\left( {\bf{c}} \right))} \right|}}$$
(16) $$ = {{\left| { - \alpha \left[ {{\rm{Pr}}({\bf{e}}\,|\,do\left( {\bf{c}} \right)) - {\rm{Pr}}\left( {\bf{e}} \right)} \right]/{{\left( {{\rm{Pr}}({\bf{e}}\,|\,do\left( {\bf{c}} \right)) + {\rm{Pr}}\left( {\bf{e}} \right) + \alpha {\pi _{\cal G}}\left( {{\bf{C}},{\bf{E}}} \right)} \right)}^2}} \right|} \over {\left| {\left[ {2{\rm{Pr}}\left( {\bf{e}} \right) + \alpha {\pi _{\cal G}}\left( {{\bf{C}},{\bf{E}}} \right)} \right]/{{\left( {{\rm{Pr}}({\bf{e}}\,|\,do\left( {\bf{c}} \right)) + {\rm{Pr}}\left( {\bf{e}} \right) + \alpha {\pi _{\cal G}}\left( {{\bf{C}},{\bf{E}}} \right)} \right)}^2}} \right|}}.$$

Since all terms are positive, the squared denominators cancel, and so if ${\rm{Pr}}({\bf{e}}\,|\,do\left( {\bf{c}} \right)) \gt {\rm{Pr}}\left( {\bf{e}} \right)$, then we have

(17) $${\phi _{\cal P,G}}\left( {\alpha ;{\bf{e}},{\bf{c}}} \right) = {{\alpha [{\rm{Pr}}({\bf{e}}\,|\,do\left( {\bf{c}} \right)) - {\rm{Pr}}\left( {\bf{e}} \right)]} \over {2{\rm{Pr}}\left( {\bf{e}} \right) + \alpha {\pi _{\cal G}\left( {{\bf{C}},{\bf{E}}} \right)}}},$$

in which case

(18) $${{d{\phi _{\cal P,G}}\left( {\alpha ;{\bf{e}},{\bf{c}}} \right)} \over {d\alpha }} = {{2{\rm{Pr}}\left( {\bf{e}} \right)[{\rm{Pr}}({\bf{e}}\,|\,do\left( {\bf{c}} \right)) - {\rm{Pr}}\left( {\bf{e}} \right)]} \over {{{(2{\rm{Pr}}\left( {\bf{e}} \right) + \alpha {\pi _{\cal G}\left( {{\bf{C}},{\bf{E}}} \right))}^2}}}} \,\gt\, 0.$$

If ${\rm{Pr}}({\bf{e}}\,|\,do\left( {\bf{c}} \right)) \lt {\rm{Pr}}\left( {\bf{e}} \right)$, then we have

(19) $${\phi _{\cal P,G}}\left( {\alpha ;{\bf{e}},{\bf{c}}} \right) = {{ - \alpha [{\rm{Pr}}({\bf{e}}\,|\,do\left( {\bf{c}} \right)) - {\rm{Pr}}\left( {\bf{e}} \right)]} \over {2{\rm{Pr}}\left( {\bf{e}} \right) + \alpha {\pi _{\cal G}\left( {{\bf{C}},{\bf{E}}} \right)}}},$$

in which case

(20) $${{d{\phi _{\cal P,G}}\left( {\alpha ;{\bf{e}},{\bf{c}}} \right)} \over {d\alpha }} = {{ - 2{\rm{Pr}}\left( {\bf{e}} \right)[{\rm{Pr}}({\bf{e}}\,|\,do\left( {\bf{c}} \right)) - {\rm{Pr}}\left( {\bf{e}} \right)]} \over {{{(2{\rm{Pr}}\left( {\bf{e}} \right) + \alpha {\pi _{\cal G}\left( {{\bf{C}},{\bf{E}}} \right))}^2}}}} \, \gt \, 0 .$$

Thus, the fact holds in either case.

A.3 Proof of Proposition 3.4

Proof. Since ${\rm{Pa}}{{\rm{r}}_n}\left( {\bf{E}} \right) \ne {\rm{Pa}}{{\rm{r}}_{n - 1}}\left( {\bf{E}} \right)$ , we know that there is at least one $X \in {\rm{Pa}}{{\rm{r}}_n}\left( {\bf{E}} \right)$ such that ${\delta _{\cal G}}\left( {X,E} \right) \gt {\delta _{\cal G}}\left( {Y,E} \right)$ for any $E \in {\bf{E}}$ and any $Y \in {\rm{Pa}}{{\rm{r}}_{n - 1}}\left( {\bf{E}} \right)$ . This entails that

$${\rm{max}}\left\{ {{\delta _{\cal G}}\left( {X,E} \right):X \in {\rm{Pa}}{{\rm{r}}_n}\left( {\bf{E}} \right),E \in {\bf{E}}} \right\} \gt {\rm max}\left\{ {{\delta _{\cal G}}\left( {Y,E} \right):Y \in {\rm{Pa}}{{\rm{r}}_{n - 1}}\left( {\bf{E}} \right),E \in {\bf{E}}} \right\},$$

which entails in turn that

$${\rm{max}}\left\{ {{\delta _{\cal G}}\left( {X,E} \right):X \in {\rm{\Xi }}\left( n \right),E \in {\bf{E}}} \right\} \gt {\rm{max}}\left\{ {{\delta _{\cal G}}\left( {Y,E} \right):Y \in {\rm{\Xi }}\left( {n - 1} \right),E \in {\bf{E}}} \right\},$$

and so ${\pi _{\cal G}}\left( {{\rm{\Xi }}\left( n \right),{\bf{E}}} \right) \lt {\pi _{\cal G}}\left( {{\rm{\Xi }}\left( {n - 1} \right),{\bf{E}}} \right)$. Since ${\bf{E}}$ and ${\rm{\Xi }}\left( n \right)$ have empty intersection, we know from equation (3) that, for any ${\bf{e}}$, ${\rm{Pr}}({\bf{e}}\,|\,do\left( {\xi \left( n \right)} \right)) = {\rm{Pr}}({\bf{e}}\,|\,{\rm{pa}}{{\rm{r}}_0}\left( {\bf{E}} \right))$ for any $n$, and so, for $n \gt 0$, ${\rm{Pr}}({\bf{e}}\,|\,do\left( {\xi \left( n \right)} \right)) = {\rm{Pr}}({\bf{e}}\,|\,do\left( {\xi \left( {n - 1} \right)} \right)) = {\rm{Pr}}({\bf{e}}\,|\,{\rm{pa}}{{\rm{r}}_0}\left( {\bf{E}} \right))$. Together, this entails that, for any $\alpha $,

$${\theta _{{\cal P},{\cal G}}}\left( {{\bf{e}},\xi \left( n \right)} \right) = {{{\rm{Pr}}({\bf{e}}\,|\,do\left( {\xi \left( n \right)} \right)) - {\rm{Pr}}\left( {\bf{e}} \right)} \over {{\rm{Pr}}({\bf{e}}\,|\,do\left( {\xi \left( n \right)} \right)) + {\rm{Pr}}\left( {\bf{e}} \right) + \alpha {\pi _{\cal G}}\left( {{\rm{\Xi }}\left( n \right),{\bf{E}}} \right)}} \gt {{{\rm{Pr}}({\bf{e}}\,|\,do\left( {\xi \left( {n - 1} \right)} \right)) - {\rm{Pr}}\left( {\bf{e}} \right)} \over {{\rm{Pr}}({\bf{e}}\,|\,do\left( {\xi \left( {n - 1} \right)} \right)) + {\rm{Pr}}\left( {\bf{e}} \right) + \alpha {\pi _{\cal G}}\left( {{\rm{\Xi }}\left( {n - 1} \right),{\bf{E}}} \right)}} = {\theta _{{\cal P},{\cal G}}}\left( {{\bf{e}},\xi \left( {n - 1} \right)} \right).$$

Footnotes

1 There is a slight idealization at work here. In practice, ${\pi _{\cal G}}\left( {{\bf{C}},{\bf{E}}} \right)$ can only take rational values in the unit interval, and so the partial derivative $\partial {\theta _{{\cal P},{\cal G}}}/\partial {\pi _{\cal G}}\left( {{\bf{C}},{\bf{E}}} \right)$ is not really well-defined. However, for the purpose of calculating ${\phi _{{\cal P},{\cal G}}}$, we treat ${\pi _{\cal G}}\left( {{\bf{C}},{\bf{E}}} \right)$ as though it can take all real values in the unit interval.

2 See Lange (2022) for an argument to this effect.

References

Carnap, Rudolf. 1950. Logical Foundations of Probability. Chicago, IL: University of Chicago Press.
Crupi, Vincenzo and Tentori, Katya. 2012. “A Second Look at the Logic of Explanatory Power (With Two Novel Representation Theorems)”. Philosophy of Science 79 (3):365–85.
Eva, Benjamin and Stern, Reuben. 2019. “Causal Explanatory Power”. The British Journal for the Philosophy of Science 70 (4):1029–50.
Keas, Michael N. 2018. “Systematizing the Theoretical Virtues”. Synthese 195 (6):2761–93.
Kemeny, John G. and Oppenheim, Paul. 1952. “Degree of Factual Support”. Philosophy of Science 19 (4):307–24.
Lange, Marc. 2022. “Against Probabilistic Measures of Explanatory Quality”. Philosophy of Science 89 (2):252–67.
Lewis, David. 1986. “Causal Explanation”. In Philosophical Papers Vol. II, edited by David Lewis, 214–40. Oxford: Oxford University Press.
Pearl, Judea. 2000. Causality: Models, Reasoning, and Inference. Cambridge: Cambridge University Press.
Railton, Peter. 1981. “Probability, Explanation, and Information”. Synthese 48 (2):233–56.
Salmon, Wesley C. 1984. Scientific Explanation and the Causal Structure of the World. Princeton, NJ: Princeton University Press.
Schupbach, Jonah N. and Sprenger, Jan. 2011. “The Logic of Explanatory Power”. Philosophy of Science 78 (1):105–27.
Spirtes, Peter, Glymour, Clark, and Scheines, Richard. 2000. Causation, Prediction, and Search. Cambridge, MA: MIT Press.