1. Introduction
Malcolm Forster and Elliott Sober’s highly influential paper (1994) introduced to the philosophical literature some of the major developments on the model selection problem in the statistics of the latter half of the twentieth century. They argued that Akaike’s results (1973) on how to correct for over-fitting the data shed light on a number of topics in philosophy of science, especially the problem of explaining why simpler and less ad hoc theories yield better predictions. However, Forster and Sober also posed a potential problem, which they call the “sub-family problem,” that would arise if one uses model selection criteria in an ad hoc way. This is a problem for “any proposal that measures simplicity by the paucity of adjustable parameters,” including the Akaikean one. They then offer a solution by showing how such ad hoc use of the Akaikean criterion is disallowed in the broader Akaikean framework, because of a “meta-theorem” about the error involved in employing Akaike’s results.
Although we find the sub-family problem interesting in itself, we think it can be of deeper philosophical value by illuminating some of the conditions under which respecting simplicity-favoring considerations results in desirable inferences. Our goal in this essay is to establish the following claim.
Independent Motivation Requirement (IMR). Weighing considerations of simplicity against those of goodness-of-fit, as recommended by the Akaike Information Criterion (AIC) or the Bayesian Information Criterion (BIC), results in reliable inferences only if one’s domain of options (i.e., the set of candidate models) consists of models for the inclusion of which one has a positive reason independently of the current data. Optimizing the balance between simplicity and goodness-of-fit can lead one astray if the models are constructed post hoc or if one liberally adds to the number of models.
IMR can be violated in two ways. The first is when one allows knowledge of the extant data to play a role in the design of the model itself (post hoc model construction). The second is when the number of models is unduly large because one doesn’t have a positive reason for taking some of the candidate models into consideration. In both cases, the simplicity-favoring considerations appealed to by model selection criteria cannot correct for systematic over-fitting.
We begin, in section 2, by introducing the Akaikean framework for model selection, AIC, and the sub-family problem. The problem equally applies to structurally similar model selection criteria, such as BIC. We discuss this issue and a solution for the problem in the BIC-based framework in section 3. This solution is a manifestation of our thesis that violating IMR results in untoward inferential practices. Sections 4 and 5 discuss two solutions to the problem in the Akaikean framework. In section 4, we examine Forster and Sober’s solution and argue that, although much of what they say is true, their solution isn’t fully satisfactory. In particular, it appears to follow from their solution that the AIC scores of simple models with excellent goodness-of-fit are “unreliable” (epistemically biased) estimates, even if there are reasons independent of the current data for considering those models (i.e., even if considering them as candidate models does not violate IMR). We argue that this idea is false. In section 5, we offer our own solution, according to which the problem arises because of a violation of IMR. In section 6, we talk about a case in which one can re-introduce essentially the same error involved in the sub-family problem by considering too many candidate models—and thereby violating IMR but not through post hoc construction. We will then explain why such a practice is problematic. In section 7, we conclude our argument for IMR, presenting the main ideas less technically. We end the paper by pointing out an important practical consequence of IMR. The Akaikean framework for model selection tells us why fudged hypotheses that fit the data well (i.e., best-fitting members of complex models) have poor predictions. Footnote 1 We will argue that IMR gives us a clear criterion for determining whether fixing certain parameters in an n-parameter model results in fudged models with poor predictions and must be avoided. Footnote 2
2. The Setup of the Problem
In model selection, one is concerned with the comparison of families of hypotheses (hereafter “models”). We will reserve the term “hypothesis” for individual members of models. For example, “y = 2x+3+N(0,1)” is a hypothesis belonging to the model {y=ax+b+N(0,1)}. Typically, one’s background theory specifies a finite set of candidate models prior to consulting the data and one is interested in comparing them based on the data. For example, background theory may restrict the set of plausible models to polynomials of degrees no more than 5. One’s goal can then be to find some weighted ordering of such models (here, degrees of polynomials) with respect to various desirable features, such as predictive accuracy or posterior probability.
Suppose you have a suitably large set of observational data consisting of ordered pairs of values of two variables, X and Y, generated by an unknown “true” function, T. The data might, for example, record the length of a metallic bar at different temperatures. Your background theory tells you that Xs and Ys are linearly related (though because of the existence of error, the observed values might not exactly fit a line). You want to find the particular linear function that best fits the data. This is a rather straightforward statistical problem and doesn’t involve model selection, because you are considering only one model (you know X and Y are linearly related). The standard solution, on which there is consensus among statisticians, is to find the line with maximum likelihood relative to the data, where the likelihood of a hypothesis is defined as the probability (or probability density) of the data conditional on the hypothesis.
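As a minimal illustration (ours, not from the paper), the Python sketch below fits this single linear model by maximum likelihood. The true line, the temperature grid, and the unit-variance normal error are assumptions made for the sake of the example; with normally distributed error of known variance, maximizing likelihood over the line’s parameters coincides with ordinary least squares.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical setup: length of a bar (y) at temperatures (x), with N(0,1) error.
x = np.linspace(0, 10, 50)
true_a, true_b = 2.0, 3.0                       # assumed "true" line, for illustration only
y = true_a * x + true_b + rng.normal(0, 1, x.size)

# Maximum likelihood for y = a*x + b + N(0,1): with normal error of known variance,
# the likelihood is maximized by the least-squares line.
X = np.column_stack([x, np.ones_like(x)])
(a_hat, b_hat), *_ = np.linalg.lstsq(X, y, rcond=None)

# Log-likelihood of the fitted line (sum of log N(0,1) densities of the residuals).
resid = y - (a_hat * x + b_hat)
loglik = np.sum(-0.5 * np.log(2 * np.pi) - 0.5 * resid**2)

print(f"MLE: a = {a_hat:.3f}, b = {b_hat:.3f}, log-likelihood = {loglik:.2f}")
```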
A much more difficult question arises if the inference problem concerns the choice between models, say between parabolic functions and linear functions. Here, one cannot simply maximize likelihood, because to do so would almost always lead one to choose a parabolic function (or generally, a member of the largest model). More complicated models have more freedom to fit the data. Thus, the likeliest members of those models are more likely to fit the noise, as opposed to the main pattern, in the data. This is called “over-fitting.” There are various model selection techniques for how to avoid over-fitting. Most of them recommend balancing considerations of goodness-of-fit with data against considerations of simplicity, though the “optimal” balance is naturally different for different techniques, since they either pursue different goals or make different assumptions about the inference problem or both.
Before we proceed, some notational conventions must be mentioned. In order to avoid confusion, we refer to random variables by capital letters, single pieces of data by lower-case letters, and data sets by bolded lower-case letters. In order to distinguish between specific data sets and data sets considered as random variables, we denote the former by subscripted, bolded lower-case letters, like y0, and the latter by non-subscripted, bolded lower-case letters, like y. Consider a model, F(θ), defined over a parameter space Θ. For example, if F is the family of linear functions with a normally distributed error with mean 0 and variance 1, it can be characterized as F: {y = ax+b+N(0,1); (a,b) ∈ R²}. Here the parameter space of F is R².Footnote 3 We can associate a likelihood function, ℒ(θ) = ℒ(θ,z) = g(z|θ), with F, where g(z|θ) is the probability density of obtaining the n-tuple data set z = {z1,z2,…,zn} [where zi is the ith observed datum (xi,yi)] conditional on θ being the true parametric value. The value of θ for which ℒ(θ) is maximized is called the Maximum Likelihood Estimate (MLE) of F, and we denote it by θ̂. We denote the member of F obtained by taking θ = θ̂ by L(F).Footnote 4 Note that both θ̂ and L(F) are functions of the data; that is, θ̂ = θ̂(y) and L(F) = L(F,y), where y is the data.
In the Akaikean framework, the goal is to find a plausible estimate of the predictive accuracies of various models. The predictive accuracy of a model, M, is a measure of how close, on average, the predictions of the best fitting member of M, with respect to an initial data set, are to subsequent data generated by the same generating function. Suppose you obtain a data set, y, generated by T and use it to determine L(M,y). Now obtain a new data set, x, and calculate the logarithm of the likelihood (hereafter, “log-likelihood”) of L(M,y) with respect to x. The average value—with respect to both x and y—of this log-likelihood is called the predictive accuracy of M. The following defines A(M), the predictive accuracy of model M.Footnote 5

A(M) =df Ey[Ex[logℒ(L(M,y),x)]] = Ey[Ex[log g(x | θ̂(y))]] (1)

Ey(.) and Ex(.) are both expectations with respect to T. The last term on the right-hand side better captures the nature of predictive accuracy, by emphasizing that it averages over the MLE.Footnote 6
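The double expectation in this definition can be approximated by brute-force simulation. The sketch below is our own illustration, not the paper’s code; the true curve, the fixed grid of x-values, and the unit-variance normal error are all assumptions. It estimates A(M) for the cubic model by repeatedly fitting L(M,y) to one simulated data set and scoring its average log-likelihood on a fresh data set.

```python
import numpy as np

rng = np.random.default_rng(1)
xgrid = np.linspace(-2, 2, 100)          # assumed design points

def true_mean(x):                        # assumed "true" curve T (for illustration only)
    return 0.5 * x**3 - x + 1.0

def sample_data():
    return true_mean(xgrid) + rng.normal(0, 1, xgrid.size)

def loglik(mean, y):
    """Log-likelihood of a fixed curve under N(0,1) error."""
    resid = y - mean
    return np.sum(-0.5 * np.log(2 * np.pi) - 0.5 * resid**2)

def best_fit(y, degree=3):
    """L(M, y): the maximum-likelihood (= least-squares) member of the polynomial model."""
    return np.polyval(np.polyfit(xgrid, y, degree), xgrid)

# A(M) ~ average, over training sets y and fresh sets x, of the log-likelihood of L(M, y) on x.
reps = 2000
estimate = np.mean([loglik(best_fit(sample_data()), sample_data()) for _ in range(reps)])
print(f"Monte Carlo estimate of A(cubic model): {estimate:.1f}")
```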
Akaike showed that the AIC value of a model can be used to provide an estimate of its predictive accuracy. The AIC of model F is defined thus.

AIC(F) = −2logℒ(L(F)) + 2k (2)

logℒ(L(F)) is the log-likelihood of L(F). If error is normally distributed, this becomes the familiar sum of squares of the error terms of L(F). k is the dimension of the parameter space of F, which is usually equal to its number of adjustable parameters.
If one is interested in finding the model with the highest predictive accuracy, the Akaikean frameworkFootnote 7 recommends choosing the model with minimum AIC. However, as Forster and Sober observe, blindly minimizing AIC is problematic. Suppose one wishes to find the predictively most accurate hypothesis among polynomials of degrees 3 or less. One must find the family with minimum AIC and then select the likeliest member of that family. Let Mi be the family of polynomials of degree i−1 with i adjustable parameters. Suppose M3 turns out to have the lowest AIC value. Since the family of parabolic functions is embedded in the family of cubic functions, the likeliest member of the cubic family (L(M4)) has a better (or equally good) fit with the data than the likeliest member of the parabolic family (L(M3)). Now construct an ad hoc family, {L(M4)}, whose only member is L(M4). The number of adjustable parameters in {L(M4)} is 0 (because it is a singleton family) and therefore its AIC value is equal to −2logℒ(L(M4)). We have AIC({L(M4)}) = −2logℒ(L(M4)) ≤ −2logℒ(L(M3)) < AIC(M3), where logℒ(L(M4)) is the log-likelihood of the best fitting member of M4. Indeed, {L(M4)} has the lowest possible AIC value (relative to the extant data) among all families whose members are restricted to polynomials of degrees 3 or less. Thus, if we blindly minimize AIC scores, we must choose {L(M4)} (hereafter, “the sub-family model”) as the predictively most accurate model, which is tantamount to choosing the likeliest hypothesis at our disposal and giving no weight to simplicity. This is what Forster and Sober have called the sub-family problem.Footnote 8
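The construction can be made concrete with a short simulation (ours, not the paper’s; the true curve, design points, and unit error variance are assumptions). Using the standard definition AIC(F) = −2logℒ(L(F)) + 2k, the script computes AIC for M1–M4 and for the ad hoc singleton {L(M4,y)}; since the singleton inherits the cubic fit while paying no complexity penalty, it always comes out lowest.

```python
import numpy as np

rng = np.random.default_rng(2)
xgrid = np.linspace(-2, 2, 100)                  # assumed design points
y = 0.5 * xgrid**3 - xgrid + 1 + rng.normal(0, 1, xgrid.size)   # assumed true curve + N(0,1)

def loglik(fitted, y):
    resid = y - fitted
    return np.sum(-0.5 * np.log(2 * np.pi) - 0.5 * resid**2)

def aic(fitted, y, k):
    """AIC = -2 * log-likelihood of the best-fitting member + 2 * (number of parameters)."""
    return -2 * loglik(fitted, y) + 2 * k

# M_i: polynomials of degree i-1, with i adjustable parameters.
for i in range(1, 5):
    coeffs = np.polyfit(xgrid, y, i - 1)         # L(M_i, y): maximum likelihood member
    print(f"AIC(M{i}) = {aic(np.polyval(coeffs, xgrid), y, k=i):.1f}")

# The ad hoc sub-family {L(M4, y)}: one fixed member, zero adjustable parameters.
best_cubic = np.polyval(np.polyfit(xgrid, y, 3), xgrid)
print(f"AIC(sub-family {{L(M4,y)}}) = {aic(best_cubic, y, k=0):.1f}")
```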
In order to show the importance of IMR, we will contrast the sub-family problem with another inference problem, which is very similar but differs in only one salient way: in that problem, IMR is respected. Suppose you have reasons independently of the extant data (e.g., theoretic reasons) for including the singleton model {5x³+6x²+2x+7+N(0,1)} among your candidate models. Then you obtain a data set, y0, and you observe that L(M4,y0) = 5x³+6x²+2x+7+N(0,1).Footnote 9 This inference problem, which we will call the singleton family problem,Footnote 10 is very similar to the sub-family problem. The set of candidate models (M1, M2, M3, M4, {L(M4,y0)}) and the data (y0) are the same. However, they have an important difference: here IMR is respected because you had reasons to include {5x³+6x²+2x+7+N(0,1)} (which happens to be identical with {L(M4,y0)}) among your candidate models. As we shall see, this makes a big difference.
3. BIC
In the BIC-based framework, the average likelihood of a modelFootnote 11 is estimated by the exponential of −½ times its BIC, which is defined as follows,

BIC(F) = −2logℒ(L(F)) + k·log(n) (3)
where n is the number of data points. Lower BIC scores are better. Since this is a Bayesian framework, its ultimate goal is to determine the posterior probabilities of candidate models. It is customary—though by no means necessary—to assign equal prior probabilities to all candidate models. Thus, the most probable model is often the one with the lowest BIC value. Now, the sub-family model has the lowest possible BIC value, since it has no adjustable parameters and its only member has maximum log-likelihood. Thus, we are faced with the sub-family problem: the sub-family model appears to be the most probable model.
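The same construction goes through for BIC. Continuing the illustrative setup assumed in the AIC sketch above (our own true curve, 100 design points, unit error variance), the sub-family singleton again gets the best score, since it keeps the cubic fit while paying no log(n) penalty.

```python
import numpy as np

rng = np.random.default_rng(2)
xgrid = np.linspace(-2, 2, 100)                  # assumed design points
y = 0.5 * xgrid**3 - xgrid + 1 + rng.normal(0, 1, xgrid.size)
n = xgrid.size

def loglik(fitted, y):
    resid = y - fitted
    return np.sum(-0.5 * np.log(2 * np.pi) - 0.5 * resid**2)

def bic(fitted, y, k):
    """BIC = -2 * log-likelihood of the best-fitting member + k * log(n)."""
    return -2 * loglik(fitted, y) + k * np.log(n)

for i in range(1, 5):
    fit = np.polyval(np.polyfit(xgrid, y, i - 1), xgrid)
    print(f"BIC(M{i}) = {bic(fit, y, k=i):.1f}")

best_cubic = np.polyval(np.polyfit(xgrid, y, 3), xgrid)
print(f"BIC(sub-family) = {bic(best_cubic, y, k=0):.1f}")    # lowest: no penalty, best fit
```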
A referee suggests that some of the general problems for the BIC-based framework might complicate our solution to the sub-family problem. Thus, before offering our solution, we will briefly mention the one that is most relevant. If the models are nested (as in our example of polynomial models), then smaller models entail larger ones. (The set of polynomials of degree at most n−1 is contained in the set of polynomials of degree at most n.) It follows that P(Mn) ≥ P(Mn−1), no matter what the data is. If so, it is difficult, in this framework, to make sense of the fact that scientists sometimes prefer smaller models to larger ones. To the best of our knowledge, there has been no fully satisfactory response to this problem. Forster and Sober (1994) discuss the following way to address this difficulty. Instead of M2, construct M2* = M2 − M1 (and so on for larger models), so that no model entails any other. Then compare those newly formulated models. Forster and Sober find this maneuver unsatisfactory, because it changes the subject. The question was why scientists prefer M1 to M2, not M1 to M2*. This is a fair objection, a proper reply to which (if it exists) goes beyond the scope of this essay. Here we use this mathematical maneuver to offer a solution for the sub-family problem, but we don’t claim to offer a solution for the above problem or any other general problem for BIC. All we wish to establish is that if BIC can be rescued from the general difficulties it faces, it won’t be further subject to the sub-family problem.
Construct the non-nested models (M1 through M4*) as described above. Make sure that the prior probability functions over the members of your non-nested models do not assign positive probability to any single parameter value. That is, if π4(θ) is the prior probability function over the members of M4*, then make sure that for no single value of θ is π4(θ) > 0. (If this is not the case, BIC is not a good approximation of average likelihood. This is another limitation of BIC, again independent of the sub-family problem.) How can you make sure this is the case? If for θ1, π4(θ1) > 0, construct the singleton model {θ1}, where P({θ1}) = π4(θ1). Then redefine M4* in the following way: M4*new = M4*old − {θ1}, with P(M4*new) = P(M4*old) − P({θ1}). Once this is done, BIC can be used as an estimate of the average likelihood of the non-nested models. Now suppose we obtain a data set y0 and {L(M4,y0)} = {5x³+6x²+2x+7+N(0,1)}. There are two possibilities. Either P({5x³+6x²+2x+7+N(0,1)}) > 0, or P({5x³+6x²+2x+7+N(0,1)}) = 0, where P(.) is the prior probability function.
If P({5x³+6x²+2x+7+N(0,1)}) > 0, then you had a positive reason to include {5x³+6x²+2x+7+N(0,1)} among your candidate models independently of the dataFootnote 12 and this is an instance of the singleton family problem—considering {5x³+6x²+2x+7+N(0,1)} does not violate IMR. The likelihood of a model is proportional to the exponential of −½BIC, and since {5x³+6x²+2x+7+N(0,1)} has an exceptionally low BIC, its likelihood will be massively higher than that of the other models. Thus, unless its prior probability is extremely low relative to the other models, it will be by far the most probable model among the non-nested ones. This is a welcome result. If one had theoretic reasons to assign positive probability to a single parameter value, and one subsequently learns that this single hypothesis fits the data excellently well, the data provides very powerful evidence for that hypothesis and ought to make one significantly more confident of its truth. Again, this doesn’t solve the above-mentioned general difficulty about BIC. In the singleton family problem, {5x³+6x²+2x+7+N(0,1)} might end up having a higher posterior probability than M4*, but it can never have a higher probability than M4.
The sub-family problem corresponds to the case where P({5x³+6x²+2x+7+N(0,1)}) = 0. Here you have no reason to include {5x³+6x²+2x+7+N(0,1)} among the candidate models independently of the data. Thus, to choose {5x³+6x²+2x+7+N(0,1)} because of its excellent BIC score involves violating IMR. The BIC-based framework disallows choosing {5x³+6x²+2x+7+N(0,1)}, because if P({5x³+6x²+2x+7+N(0,1)}) = 0, no matter how likely {5x³+6x²+2x+7+N(0,1)} is (no matter how good its BIC score is), its posterior probability will remain zero. As we shall see, although the Akaikean framework doesn’t appeal to model priors, it equally presupposes IMR.
4. Forster and Sober’s Solution
For large data sets, AIC is an approximately unbiased estimator of predictive accuracy. An estimator is statistically unbiased if its expected value equals the value it estimates. The following equation expresses this fact.

Ey[−½AIC(F,y)] = A(F) (4)
where y is a random data set. In this paper, we are not concerned with the approximate nature of this equation. So we will talk about the “unbiasedness” of AIC for convenience.
Forster and Sober argue that the AIC of the sub-family model is an unbiased estimator of its predictive accuracy, but statistical unbiasedness is not the only criterion by which to judge an estimate. They introduce another such criterion called “epistemic unbiasedness” by the following example. Consider a simple measurement of an object’s mass with a kitchen scale. Normally, the measured value is an unbiased estimator of the actual value, because it is just as likely to over-measure the mass by a given amount as it is to under-measure it by that same amount.
But now suppose that we modify this estimate by adding +10 or -10 depending on whether a fair coin lands heads or tails, respectively. Suppose that the measured value of mass was 7 kg, and the fair coin lands heads. Then the new estimate is 17 kg. Surprisingly, this new estimate is also a statistically unbiased estimate of the true mass! The reason is that in an imagined series of repeated instances, the +10 will be subtracted as often as it is added, so that the value of the average value of the modified estimate will still be equal to the true mass value. However, we know that the modified estimate is an overestimate in this instance, because we know that the coin landed heads. If the coin had landed tails, then the estimate would have been -3 kg, and would have been known to be an underestimate. In either case, we say that the modified estimate is epistemically biased. (Forster and Sober 1994, 19)
It is helpful to make a distinction between an “estimator” and an “estimate” here. An estimator is a function of data that yields individual estimates. Statistical unbiasedness is a feature of an estimator, while epistemic bias is a feature of an individual estimate. Forster and Sober argue that AIC of the sub-family model is statistically unbiased (qua estimator) but epistemically biased (qua estimate).Footnote 13 They argue that the AIC of the sub-family model is not a good estimate because it is epistemically biased. To show this, they appeal to the following “meta-theorem” about AIC.
Error[Estimated(A(F))] =df (−½)AIC(F) − A(F) = Residual Fitting Error + Common Error + Sub-family Error (Forster and Sober 1994, 19).
This theorem concerns the error involved in taking −½AIC as an estimate of predictive accuracy. Forster and Sober argue that the first two terms on the right-hand side are both statistically and epistemically unbiased, but the third term, although statistically unbiased, is sometimes epistemically biased, which (given the epistemic unbiasedness of the other two terms) makes the total error sometimes epistemically biased. They further argue that an important occasion in which this happens is in the sub-family problem. Thus, a fuller understanding of the Akaikean framework, which includes this meta-theorem, dissolves the sub-family problem.
Here is why Forster and Sober believe the sub-family error is epistemically biased for the sub-family model. Suppose we embed the parameter spaces of all our models in a larger parameter space (call it K) that contains the truth, T. Forster and Sober state that this space can be considered as a vector space in such a way that i) the closer a point in this space is to truth, the higher its predictive accuracy; and ii) the sub-family error is equal to the scalar product of the following two vectors in this space: the vector that goes from T to the likeliest hypothesis in K, L(K), (i.e., $\overrightarrow {T.{\rm{L}}\left( K \right)} $ ) and the vector $\overrightarrow {T.{{\rm{\theta }}_0}} $ that goes from T to the (unknown) predictively most accurate member of M, which we denote by θ0.Footnote 14 These vectors are shown in figure 1 below.Footnote 15
All three points are in general unknown. However, the scalar product of $\overrightarrow {T.{\rm{L}}\left( K \right)} $ and $\overrightarrow {T.{{\rm{\theta }}_0}} $ is equal to the product of their lengths multiplied by the cosine of the angle between them. Forster and Sober argue that for the sub-family model, the tips of the two vectors tend to be close. Here is their argument.
The Akaike estimate for a low dimensional family whose best fitting case is close to the data (and such families are the dangerous “pretenders,” for they “unfairly” combine high log-likelihoods with small penalties for complexity) exhibits an epistemic bias, as we now explain. The most predictively accurate hypothesis in such small families will also be close to the data, and therefore close to L(K). The danger is that the tips of the two vectors will be close together. Then the cosine factor is close to +1 and the subfamily error is large and positive.Footnote 16 (Forster and Sober 1994, 20–21)
On Forster and Sober’s view, we shouldn’t follow the sub-family policy, because if we do, the AIC values of the models we construct, although statistically unbiased, tend to be epistemically biased (over)estimates of the predictive accuracies of these models.
Before discussing what we find potentially problematic about this argument, we would like to give a summary of what we will claim, in order to avoid confusion. When Forster and Sober say that the AIC of the sub-family model is statistically unbiased, they are referring to the fact that the estimate AIC({L(M 4 ,y 0 )},y 0 ) is an instantiation of the estimator AIC({L(M 4 ,y 0 )},y), which is—we agree—an unbiased estimator. We also agree that in the sub-family problem, AIC({L(M 4 ,y 0 )},y 0 ) is epistemically biased. However, we suggest that a better estimator (than AIC({L(M 4 ,y 0 )},y)) for determining the (de)merits of the estimate AIC({L(M 4 ,y 0 )},y 0 ) in the sub-family problem is AIC({L(M 4 ,y)},y), which is statistically biased. This is because in the sub-family problem, AIC({L(M 4 ,y 0 )},y 0 ) is more similar to other instantiations of AIC({L(M 4 ,y)},y) than to other instantiations of AIC({L(M 4 ,y 0 )},y). The exact opposite situation holds for the singleton family problem. There the statistical bias of AIC({L(M 4 ,y 0 )},y) provides more relevant information (than that of AIC({L(M 4 ,y)},y)) for deciding how good the estimate AIC({L(M 4 ,y 0 )},y 0 ) is. Therefore, we will argue that the Akaikean framework treats the sub-family problem and the singleton family problem differently.
Now we will unpack this. Consider the singleton family problem first. That is, suppose we had reasons independently of y 0 for considering the model {5x3+6x2+2x+7+N(0,1)}. Then we obtained y 0 and observed that L(M 4 ,y 0 )=5x3+6x2+2x+7+N(0,1). Thus, including {5x3+6x2+2x+7+N(0,1)} among the models doesn’t violate IMR. Is AIC({5x3+6x2+2x+7+N(0,1)},y 0 ) an epistemically biased estimate in this case, according to Forster and Sober? We don’t know for sure because they don’t discuss this problem. However, if we are to infer from the literal meaning of their argument, they would have to say “yes,” because in every relevant respect to their solution, the two problems are identical. In their argument, the relevant factors are the low-dimensionality of the model and its closeness to data, which are shared in the two problems. If they are committed to this idea, then this is the only point of disagreement between our account and theirs. We believe that in the singleton family problem, the exceptionally low AIC score of {5x3+6x2+2x+7+N(0,1)} is exceptionally good reason that {5x3+6x2+2x+7+N(0,1)} has high predictive accuracy. There is nothing “unfair” in the AIC score of a low-dimensional model with good fit per se. The AIC scores of such models are “too good to be true” only if the model has these features because it was designed in an ad hoc fashion to have a low AIC score.
For this to be true, something must be missing in Forster and Sober’s account of the sub-family error as applied to the singleton family problem. Here is what is missing. They offer a consideration (hereafter consideration1) that ${{\rm{\theta }}_0}$ tends to be close to L(K), which leads to the sub-family error for {5x3+6x2+2x+7+N(0,1)} being large and positive. However, there is a competing consideration (hereafter, consideration2) that mitigates the effect of consideration1: for models with low AIC scores, ${{\rm{\theta }}_0}$ tends to be close to T. (A low AIC score means high predictive accuracy, which means closeness to truth.) Consideration2 is a reason for sub-family error to be small, because the error is equal to the product of the lengths of the two vectors $\overrightarrow {T.{\rm{L}}\left( K \right)} $ and $\overrightarrow {T.{{\rm{\theta }}_0}} $ times the cosine of the angle between them. Consideration2 is a reason that the length of $\overrightarrow {T.{{\rm{\theta }}_0}} $ is small. In both the sub-family problem and the singleton family problem, AIC({5x3+6x2+2x+7+N(0,1)},y 0 ) is exceptionally low, which is a reason to think that 5x3+6x2+2x+7+N(0,1) is close to truth. This is the case, unless one otherwise knows that AIC({5x3+6x2+2x+7+N(0,1)}) is not a good estimate of the predictive accuracy of {5x3+6x2+2x+7+N(0,1)}. We will argue in the next section that in the sub-family problem, one knows this independently. Thus, consideration2 is irrelevant to the sub-family problem and AIC({5x3+6x2+2x+7+N(0,1)},y 0 ) is epistemically biased in that problem. However, our argument in the next section doesn’t apply to the singleton family problem. In that problem, we are left with two competing considerations bearing on how good an estimate AIC({5x3+6x2+2x+7+N(0,1)},y 0 ) is, and in general we have no way of comparing the relative strengths of these considerations. Notice that consideration2 doesn’t tell us anything about the sign of the sub-family error (whether it is negative or positive). Thus, we must expect to have a positive epistemic error but smaller in absolute value relative to the sub-family problem.
The amount of bias matters a lot. In both the sub-family problem and the singleton family problem, AIC({5x3+6x2+2x+7+N(0,1)},y 0 ) is significantly better than the AIC of other models. For example, AIC({5x3+6x2+2x+7+N(0,1)},y 0 ) =AIC(M 4 ,y 0 ) – 4,Footnote 17 and 4 units of AIC difference is usually a big difference. (To have an intuitive idea for why this is the case, consider the fact that if the data is perfectly linear, M 2 will have only a 2 units AIC advantage relative to M 4 .) In the sub-family problem, the average sub-family error is exactly 4, which means that the entire difference in AIC score (between ({5x3+6x2+2x+7+N(0,1)} and M 4 ) is due to error. However, in the singleton family problem, little of the AIC difference between {5x3+6x2+2x+7+N(0,1)} and M 4 (which is 4 units) is due to epistemic error. Therefore, {5x3+6x2+2x+7+N(0,1)} is in great shape relative to M 4 . This is a welcome result. If you have a singleton model among the candidate models for reasons independently of the extant data, and it fits the data as precisely as the family of cubic functions, you ought to be very confident of its predictive accuracy even if its AIC score is slightly biased.
The following numerical example can help illustrate our point. Here all the data is generated by the function T = 0.2x⁵ − 0.2x⁴ − 3x³ + x² − 1 + N(0,1). Figure 2a depicts the sub-family error of the model {L(M4,y)} for 10⁷ values of y, each consisting of 100 data points.Footnote 18 That is, for each data set, y, L(M4,y) is determined separately and then its corresponding sub-family error is calculated.Footnote 19 This is essentially a 10⁷-fold repetition of the sub-family problem. Evidently, the sub-family error of {L(M4,y)} tends to be positive and large, as Forster and Sober rightly argue. Figure 2b depicts the sub-family error of the fixed singleton model M0: {−3x³ + x² − 1 + N(0,1)}, which is chosen because it is close to T.
The difficult part is how to simulate the singleton family problem, because in that problem you must have a singleton family for reasons independently of the data and then something amazing happens: the only member of that model turns out to be equal to the best fitting member of your largest model. (Even if your singleton family contains the truth, this is unlikely to happen unless the data set is huge.) However, we can approximate this situation. Figure 3a is the histogram of the distance between M0 and {L(M4,y)} for the 10⁷ data sets. Instead of looking at cases where {−3x³ + x² − 1 + N(0,1)} = {L(M4,y)}, we first looked at cases in which M0 is “close” to {L(M4,y)}. This corresponds to cases where the member of the singleton family is not exactly identical with L(M4,y) but is close to it (both in the parameter space and in terms of log-likelihood). In 3b, the sub-family error for M0 is depicted only for those cases in which ||M0 − L(M4,y)|| < 0.3 (cases to the left of the red line in 3a). The choice of 0.3 is arbitrary. We chose this value so that we can still have a large number of cases. (Smaller values result in larger errors, because for them consideration1 becomes stronger, but due to the small number of cases the histograms become very jagged.)
The average sub-family error for the cases depicted in figure 3b is 0.4286. When we took this condition to the limit (i.e., ||M0 − L(M4,y)|| → 0), the error approached a value slightly less than 1. A comparison between 2a and 3b shows the mitigating effect of consideration2. In 2a, the average L(M4,y) is not particularly close to T; thus, the only relevant consideration is consideration1. The average sub-family error was 3.9991 in 2a.Footnote 20 In 3b, the model is still singleton and close to the data (thus consideration1 is still pertinent), but because M0 is close to T, the average sub-family error was 0.4286, considerably smaller than 3.9991. The importance of this fact can be best understood with the help of figure 4, which depicts the histogram of AIC(M4) − AIC(M0) for those cases depicted in 3b. For ||M0 − L(M4,y)|| < 0.3, the average AIC difference was 3.2197.
The important point is that as we focus on data sets for which M0 is closer and closer to L(M4,y), AIC(M0) becomes smaller and smaller (better), but at the same time the sub-family error becomes larger and larger. For ||M0 − L(M4,y)|| → 0, (AIC(M0,y) − AIC(M4,y)) → −4 and the sub-family error approached 1 in this particular case. That is, if you correct for the sub-family error, AIC(M0,y) is still 3 units better than AIC(M4,y). For the average case satisfying ||M0 − L(M4,y)|| < 0.3, if you correct for the sub-family error (i.e., subtract the average sub-family error, 0.4286), you are still left with an AIC advantage of 2.7911. This illustrates the main point of our argument. In the singleton family problem, our model is singleton and very close to the data. However, its excellent AIC score (although slightly biased) is excellent evidence that it is predictively accurate.
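The contrast between figures 2a and 2b can be reproduced at a much smaller scale. The Python sketch below is our own re-implementation, not the authors’ code: it uses the paper’s generating function and 100 data points per set, but only 2 × 10⁴ repetitions, an assumed design grid on [−2, 2], and the sub-family error computed as the −½AIC estimate of the singleton minus its (analytically computed) predictive accuracy. The re-fitted singleton {L(M4,y)} shows a large positive average error, while the fixed singleton M0 shows none.

```python
import numpy as np

rng = np.random.default_rng(3)
xgrid = np.linspace(-2, 2, 100)          # assumed design points (not stated in the paper)

def T_mean(x):                           # the paper's generating curve, without the noise term
    return 0.2 * x**5 - 0.2 * x**4 - 3 * x**3 + x**2 - 1

def loglik(mean, y):
    resid = y - mean
    return np.sum(-0.5 * np.log(2 * np.pi) - 0.5 * resid**2)

def A_singleton(mean):
    """Exact predictive accuracy of a singleton {f} under T with N(0,1) error."""
    gap = T_mean(xgrid) - mean
    return np.sum(-0.5 * np.log(2 * np.pi) - 0.5 * (gap**2 + 1.0))

m0 = -3 * xgrid**3 + xgrid**2 - 1        # the fixed singleton model M0 from the paper

reps = 20000                             # scaled down from the paper's 10^7 repetitions
err_subfamily, err_fixed = [], []
for _ in range(reps):
    y = T_mean(xgrid) + rng.normal(0, 1, xgrid.size)
    best_cubic = np.polyval(np.polyfit(xgrid, y, 3), xgrid)   # L(M4, y)
    # Sub-family error = (-AIC/2 of the singleton, an estimate of accuracy) - (its true accuracy)
    err_subfamily.append(loglik(best_cubic, y) - A_singleton(best_cubic))
    err_fixed.append(loglik(m0, y) - A_singleton(m0))

print(f"mean error, re-fitted singleton {{L(M4,y)}}: {np.mean(err_subfamily):+.3f}  (cf. figure 2a)")
print(f"mean error, fixed singleton M0:            {np.mean(err_fixed):+.3f}  (cf. figure 2b)")
```

Filtering the loop’s results on ||M0 − L(M4,y)|| < 0.3, as in figure 3b, is a straightforward extension of the same script.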
Objection: the difference between the sub-family problem and the singleton family problem is only a historical fact about how the model was constructed. So it cannot affect how good an estimate AIC({5x3+6x2+2x+7+N(0,1)},y 0 ) is.
Answer: certain facts about AIC({5x3+6x2+2x+7+N(0,1)},y 0 ) are not affected by this historical fact, including the definition of A({5x3+6x2+2x+7+N(0,1)}), the value of AIC({5x3+6x2+2x+7+N(0,1)},y 0 ) itself and the statistical bias of the estimator AIC({5x3+6x2+2x+7+N(0,1)},y). However, this list does not exhaust all the relevant information bearing on how good an estimate AIC({5x3+6x2+2x+7+N(0,1)},y 0 ) is. Indeed, Forster and Sober introduced the notion of epistemic bias in order to be able to account for the intuitive idea that in the sub-family problem there is something wrong with AIC({5x3+6x2+2x+7+N(0,1)},y 0 ) as an estimate, which cannot be captured by the items in the above list. We think what is wrong with this estimate is that in the sub-family problem, our model is tailored to y 0 Footnote 21 in an ad hoc fashion in order to have an optimal AIC score. However, this is not the case in the singleton family problem. In that problem, {5x3+6x2+2x+7+N(0,1)} is a contender model for independent reasons. And that such a simple model fits y 0 so well (without us having chosen it because it fits y 0 well) is excellent reason that it is highly predictively accurate. In other words, the historical fact that constitutes the difference between the two problems bears important information about the ad hocness of AIC({5x 3 +6x 2 +2x+7+N(0,1)},y 0 ) as an estimate. Of course, this intuitive idea needs a technical explanation. That is what we will offer in the next section.
5. Our Solution
We think a better solution for the sub-family problem can be given by studying another estimator, namely AIC({L(M4,y)},y) (instead of AIC({L(M4,y0)},y)), one of the instantiations of which is the estimate AIC({L(M4,y0)},y0). But how can one decide which estimator provides better information about the merits/demerits of the individual estimate? In order to answer this question, consider why statisticians study the statistical biases of estimators in the first place. Suppose a1 is an estimate of the quantity a. For example, a1 might be the value one reads on a kitchen scale when one weighs an apple. Naturally, one might wish to know how good an estimate a1 is, but usually one cannot directly talk about how good or bad a single estimate is. However, sometimes one can talk about how good, in general, the estimates obtained by the “same” procedure tend to be. That is, one can talk about various features of an estimator by considering repeated estimates obtained by the “same” procedure. Statistical bias is one such feature. An estimator whose expected value equals the value it estimates is statistically unbiased. But what exactly does this tell us about the individual estimate? Insofar as the individual estimate is similar to the other instantiations of the estimator, the expected error of the estimator (its statistical bias) contains information about how good the estimate is. Importantly, the inherent vagueness in what it means for the procedure to be the “same” makes it the case that a particular estimate can be an instantiation of more than one estimator.Footnote 22 However, it doesn’t follow that the statistical biases of those estimators bear equally valuable information on the merits of the individual estimate. If the estimate a1 is an instantiation of two estimators A and A*, and if other instantiations of A resemble a1 better than other instantiations of A* do, in ways that affect the value of the estimate, then the statistical bias of A bears more pertinent information than the statistical bias of A* on the merits/demerits of a1.
In ordinary inference problems, there is only one estimator one would naturally associate with an individual AIC score (qua estimate). Thus, these considerations are usually unimportant. However in the sub-family problem, things are different because not only the individual AIC score is a function of the data, but the model itself is designed on the basis of the data too. Thus, the AIC score one calculates - AIC({L(M 4 ,y 0 )},y 0 ) - is an instantiation of two estimators: AIC({L(M 4 ,y 0 )},y) and AIC({L(M 4 ,y)},y). Notice that both of these estimators can be understood as the AIC of “the sub-family model.” If you understand “the sub-family model” rigidly to refer to the model constructed after obtaining y 0 , you’ll have AIC({L(M 4 ,y 0 )},y); and if you understand it nonrigidly to refer to whichever model one constructs on the basis of data, you’ll have AIC({L(M 4 ,y)},y). The former is an unbiased estimator, as Forster and Sober correctly claim. The latter is a biased estimator, as we shall argue later. But which one gives us better information about the merits/demerits of the individual estimate AIC({L(M 4 ,y 0 )},y 0 )? In order to answer this question, consider another instantiation of each estimator, say for data set y 1 . The estimator AIC({L(M 4 ,y 0 )},y) yields AIC({L(M 4 ,y 0 )},y 1 ). This keeps the model {5x3+6x2+2x+7+N(0,1)} and calculates the AIC score of that fixed model with respect to y 1 . However AIC({L(M 4 ,y)},y) yields AIC({L(M 4 ,y 1 )},y 1 ). Here one constructs the model {L(M 4 ,y 1 )} and calculates AIC({L(M 4 ,y 1 )}) with respect to y 1 . AIC({L(M 4 ,y 0 )},y 1 ) is dissimilar to AIC({L(M 4 ,y 0 )},y 0 ) in the sub-family problem in an important respect. In AIC({L(M 4 ,y 0 )},y 1 ) the model is chosen because it fits one data set (y 0 ) well but its AIC score is calculated with respect to another data set (y 1 ). Whereas in AIC({L(M 4 ,y 0 )},y 0 ), the model is chosen because it has excellent fit with y 0 and its AIC score is calculated with respect to that same data set. Obviously, AIC({L(M 4 ,y 0 )},y 0 ) and AIC({L(M 4 ,y 1 )},y 1 ) are similar in this respect. Another way one can see this point is by considering the following question: What would have been the AIC estimate if instead of y 0 one had obtained y 1 in the sub-family problem? Clearly the answer is AIC({L(M 4 ,y 1 )},y 1 ). This is the reason we believe the statistical bias of the estimator AIC({L(M 4 ,y)},y) provides better information about the estimate AIC({L(M 4 ,y 0 )},y 0 ) in the sub-family problem.Footnote 23
What about the singleton family problem? Things are quite different in that problem. What would have been the AIC estimate if instead of y 0 one had obtained y 1 in the singleton family problem? Clearly AIC({5x3+6x2+2x+7+N(0,1)},y 1 ), which happens to be equal to AIC({L(M 4 ,y 0 )},y 1 ), because here {5x3+6x2+2x+7+N(0,1)} is a contender model regardless of the fact that 5x3+6x2+2x+7+N(0,1) = L(M 4 ,y 0 ). Therefore, for this problem the estimator AIC({L(M 4 ,y)},y) is irrelevant, because we care about the singleton family {5x3+6x2+2x+7+N(0,1)}. Here the statistical bias of AIC({L(M 4 ,y 0 )},y) gives one the relevant information on how good an estimate AIC({5x3+6x2+2x+7+N(0,1)},y 0 ) (or incidentally, AIC({L(M 4 ,y 0 )},y 0 )) is.
Now we would like to show that AIC({L(M 4 ,y)},y) is in fact a biased estimator. This might appear to contradict Akaike’s results, but it doesn’t. Those results presuppose IMR according to which the set of candidate models must be determined independently of the current data. In fact, Akaike’s results about models that respect IMR help us prove that AIC({L(M 4 ,y)},y) is biased.
In order to show this, we first show that the expectation of −½(AIC({L(M 4 ,y)},y) with respect to y is larger than the predictive accuracy of any singleton family whose only member is a member of M 4 . Suppose f is a variable that ranges over the members of M. For each value of f, we can define the singleton family {f}. Also suppose m is an arbitrary member of M and {m} is the singleton family whose only member is m. By definition of L(M 4 ,y) we have,
If we take expectation with respect to y we have:
The last equality obtains because of (4). In (6), only if m = L(M 4 ) the equality holds. However, L(M 4 ) is a function of data. For any m, such that m = L(M 4 ,y i ), there is (almost always) a data set y j generated by the same generating function such that m is not the likeliest member of M 4 with respect to y j . Thus, in fact we have a stronger result than (6):
It follows from (7) that the average value of −½AIC({L(M4,y)},y) is strictly larger than the predictive accuracy of any singleton family one can construct from the members of M4 (since m is an arbitrary member of M4)—including, of course, {L(M4,y0)}.
Since {L(M4,y)} is a random variable and a function of the data, A({L(M4,y)}) is a random variable too and will vary for different data sets. This is unlike the usual application of the Akaikean framework, where the model is fixed and its predictive accuracy is a fixed number. Indeed, if {L(M4)} were not a function of y, then by (4), −½AIC({L(M4)}) would have been an unbiased estimator of A({L(M4)}). However, regardless of the data at hand, since L(M4) is a member of M4, we have A({L(M4,y)}) ≤ max[A({f}), f ∈ M4]. And since (7) is true for all m ∈ M4, we have,

Ey[−½AIC({L(M4,y)},y)] > max[A({f}), f ∈ M4] ≥ A({L(M4,y)}) (8)
That is, AIC({L(M 4 ,y)},y) is statistically biased. A comparison between equations (4) and (8) shows the difference between AIC({L(M 4 )}) and AIC of ‘normal’ models that are constructed independently of the data. In the same way that equation (4) motivates using AIC scores of ‘normal’ models as an estimate of their predictive accuracies, equation (8) shows why −½ times AIC({L(M 4 ,y 0 )},y 0 ) is not a good estimate of A({L(M 4 ,y 0 )}).
There is another way of looking at AIC({L(M4,y)},y) that makes it obvious why it is a biased estimator. Since {L(M4,y)} is singleton,

−½AIC({L(M4,y)},y) = logℒ(L(M4,y),y) = −½AIC(M4,y) + 4 (9)

The last equality obtains because M4 has 4 adjustable parameters. But by the definition of predictive accuracy (equation (1)) applied to M4, the average predictive accuracy of {L(M4,y)} (i.e., Ey[A({L(M4,y)})]) must equal A(M4). It follows that −½AIC({L(M4,y)},y) is on average 4 units higher than A({L(M4,y)}); that is, it systematically overestimates predictive accuracy. The fact that the estimator AIC({L(M4,y)},y) is biased gives us information about why the individual estimate AIC({L(M4,y0)},y0) is not a good estimate of predictive accuracy. This concludes our technical treatment of the sub-family problem. In the next section, we will talk about another inference problem in which IMR is violated.
6. Too Many Models
Post hoc model construction is not the only way one can violate IMR. Another way is by adding models to one’s set of candidate models without any reason for them to be considered. This simply increases the probability that a model with low predictive accuracy will have a good AIC score because it fits the current data well. An extreme version of this strategy is to add every member of the largest candidate model as a singleton model. In our working example of polynomial models, this would involve adding a singleton family M(a,b,c,d): {y = ax³ + bx² + cx + d + N(0,1)} for all (a,b,c,d) ∈ R⁴. The number of candidate models will be infinite, but the one with the lowest AIC score is obviously {L(M4,y)}. Therefore, in this case one would always choose the same model as one does in the sub-family problem. No doubt this is a problematic practice, but the question is why?
In order to see why, first consider the rather common practice of computing AIC scores for all candidate models, choosing the model with the minimum AIC as the winner, and taking its AIC value as an estimate of its predictive accuracy. Call this policy the minimizing policy. We can define the error involved in this policy as follows.

Error(minimizing) =df −½AIC(Mmin(y),y) − A(Mmin(y))
where M min (y) is the model with the minimum AIC score given the data y. The minimizing policy seems quite unproblematic, but here is an interesting fact: the expectation of Error(minimizing) is positive; that is, the AIC score of M min (y) is a biased estimator of its predictive accuracy. Generally, if $\hat r$ and $\hat s$ are unbiased estimators of r and s, max( $\hat r$ , $\hat s$ ) is usually a biased estimator (in fact over-estimator) of max(r,s). Suppose r is in fact bigger than s. Obviously, max( $\hat r$ , $\hat s$ ) ≥ $\hat r$ , and after taking expectation, E(max( $\hat r$ , $\hat s$ )) ≥ E( $\hat r$ ) = r = max(r,s). However, if P( $\hat r$ < $\hat s$ ) > 0, max( $\hat r$ , $\hat s$ ) > $\hat r$ with positive probability and therefore, E(max( $\hat r$ , $\hat s$ )) > max(r, s). But unless r and s are massively different or $\hat r$ and $\hat s$ are estimators with extremely small variances, P( $\hat r$ < $\hat s$ ) > 0. In fact, in the case of AIC scores, even if model A is significantly predictively more accurate than model B, there is still a positive (however small) probability that one acquires a data set for which AIC(B) < AIC(A).
The expected value of Error(minimizing) is normally small. However, its size grows with an increase in the number of models. Without going into the technical details, we will only gesture towards an explanation for why this happens.Footnote 24 Recall that the AIC score of a model, unlike its predictive accuracy, is a random variable; it varies with different data sets. Equation (4) states that the average value of −½AIC equals predictive accuracy, but it doesn’t say anything else about the distribution of AIC. In the most straightforward cases, −½AIC has, up to an additive constant, an approximately chi-squared distribution. Thus, in an inference problem with N models, M1, M2, …, MN, we can think of the −½AIC scores as N such random variables with means equal to the predictive accuracies, A(M1), A(M2), …, A(MN). For simplicity, suppose that the predictive accuracies are not very different. (Neither this assumption nor the chi-squared distribution is necessary for the validity of our argument, but they make things easier for our explanatory purposes.) Any given one of these variables is unlikely to be significantly larger than its mean; but if N is large, it is very likely that at least one of them, and thereby the maximum of all of them (which corresponds to the minimum AIC score), is significantly larger than its mean. Therefore, the more candidate models you have, the more biased the AIC value of Mmin will be.
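A toy simulation (entirely ours, with assumed numbers) makes the effect visible. It takes N candidate singleton models that all sit at the same distance from an assumed truth, so that they share the same predictive accuracy and each −½AIC score is an unbiased but noisy estimate of it; the best (minimum-AIC) score then drifts further above the true accuracy as N grows.

```python
import numpy as np

rng = np.random.default_rng(4)
n = 100        # data points per data set (assumed)
c = 1.0        # every candidate sits at the same distance c from the truth (assumed)
reps = 3000

# Assumed truth: y_i = 0 + N(0,1). Candidate singletons f_j = c * e_j (j-th unit vector),
# so all of them share the same (analytically known) predictive accuracy:
A_common = -0.5 * n * np.log(2 * np.pi) - 0.5 * (c**2 + n)

def loglik(f, y):
    resid = y - f
    return np.sum(-0.5 * np.log(2 * np.pi) - 0.5 * resid**2)

basis = np.eye(n)
for N in (1, 5, 25, 100):
    models = [c * basis[j] for j in range(N)]
    best = []
    for _ in range(reps):
        y = rng.normal(0, 1, n)
        # Minimum AIC over singleton models = maximum log-likelihood (every k equals 0).
        best.append(max(loglik(f, y) for f in models))
    print(f"N = {N:3d} candidate models: E[best estimate] - A = {np.mean(best) - A_common:+.2f}")
```

With N = 1 the bias is essentially zero; it grows steadily as more equally plausible candidates are thrown into the comparison.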
The fact that AIC is an unbiased estimator of predictive accuracy can mislead one into thinking that one can add to one’s candidate models at will, in the hope that if any model is not plausible it will have a poor AIC score and will thus be discarded. This is a mistaken idea. Choosing the model with minimum AIC score can be a hopelessly misguided practice if one isn’t stingy about which models to consider in the first place.
7. Concluding Remarks
We discussed Forster and Sober’s solution to the sub-family problem. Although we agree with much of what they say, we disagree about a potential implication of their argument concerning the singleton family problem. We offered our own solution for the sub-family problem, which makes the difference between the two problems salient.
Although we find the sub-family problem interesting in itself, we believe a much more important lesson about simplicity-favoring considerations can be learned from our solution to the problem. Here we would like briefly to discuss what is going on beyond the technicalities. The fundamental difference between the sub-family problem and the singleton family problem is that the sub-family model is itself a random variable dependent on and tailored to the data. (Hence the difference between equations [4] and [8].) Simple models that are designed to have excellent goodness-of-fit with the extant data tend to perform poorly in predicting future data.Footnote 25 Here is a nontechnical explanation for this. Consider the very idea behind the Akaikean framework. Why is it a bad idea to use the goodness-of-fit of the best fitting member of a model as an estimate of the model’s predictive accuracy? Because to do so would essentially amount to using the current data twice: both in determining the best fitting member of the model and in determining how close the model is to the data, which is measured in terms of the fit of that same best fitting member. If the data were not used to pick out the representative (best fitting) member of the model (as is the case for singleton models), the fit of the model with the extant data would be a good (unbiased) estimator of its predictive accuracy. Thus, the source of the problem with this proposal is essentially the double-use of the data.Footnote 26 But the more complex the model is, the more effective such double-use will be, because the data will have more power in selecting among the members of the model. That is why the bias in taking the goodness-of-fit of the model as an estimator of its predictive accuracy increases when the complexity of the model increases. The beauty of Akaike’s results is in offering a way to calculate this bias. Now, when one designs one’s model to be simple and to have an excellent degree of fit with the current data, one re-introduces that bias into one’s estimation of predictive accuracy. In such an event, AIC is no longer an unbiased estimator, because the bias introduced by the double-use of the data no longer stems only from the number of adjustable parameters (which AIC corrects for) but also from the construction of the model itself (which AIC does not correct for).
We also discussed another problematic practice, which involves considering too many candidate models. We showed that one can effectively re-introduce the same error involved in the sub-family problem by engaging in an extreme version of this practice. Although post hoc model construction and comparing too many candidate models are problematic for two different reasons, there is a unified solution for both, namely, to respect IMR.
Before ending the paper, we would like to mention an important practical consequence of IMR. Respecting IMR gives rise to a clear criterion for determining whether a given model is gerrymandered or not. In the highly artificial examples in which the sub-family problem or similar problems are usually formulated (such as in Kukla 1995), it is crystal clear which models are fudged or gerrymandered (e.g., the sub-family model). However, in more realistic cases, it is sometimes not so clear. Thus, Douglas and Magnus write: “it would be perverse to do this arbitrarily, but in the general case of n-parameter models it may be possible to motivate specific values for some of the parameters. There is no formal rule for when this is or is not legitimate” (Douglas and Magnus 2013, 583). A comparison between the sub-family problem and the singleton family problem suggests exactly such a rule: models in which certain parameters are held fixed are not fudged just in case there are grounds independently of the current data for holding them so fixed.
Acknowledgments
We are grateful to Adam Elga, Elliott Sober, Erfan Salavati, David Schroeren, two anonymous referees, and an audience at Institute for Research in Fundamental Sciences in Tehran for their helpful feedback.