1 Introduction
Social scientists and philosophers often evaluate judgments against two gold standards: correspondence and coherence (Hammond, 1996, 2007). Measures of correspondence capture the degree to which judgments agree with empirical observations (e.g., Cooksey, 1996; Hammond, 1996), and coherence criteria assess the degree to which judgments are consistent with logical or axiomatic principles. Both standards are widely accepted as important components of good judgment (Dunwoody, 2009; Hammond, 2000).
1.1 How Well do People Meet These Standards?
Previous research suggests that human judgment tends to fall short on both coherence and correspondence. Coherence violations range from base rate neglect and confirmation bias to overconfidence and framing effects (Gilovich, Griffin & Kahneman, 2002; Kahneman, Slovic & Tversky, 1982). Experts are not immune. Statisticians (Christensen-Szalanski & Bushyhead, 1981), doctors (Eddy, 1982), and nurses (Bennett, 1980) neglect base rates. Physicians and intelligence professionals are susceptible to framing effects (Aberegg, Arkes & Terry, 2006; Reyna, Chick, Corbin & Hsia, 2014), and financial investors are prone to overconfidence (Barber & Odean, 2001).
Research on correspondence tells a similar story. Numerous studies show that human predictions are frequently inaccurate and worse than simple linear models in many domains (e.g., Meehl, 1954; Dawes, Faust & Meehl, 1989). Once again, expertise doesn’t necessarily help. Inaccurate predictions have been found in parole officers (Carroll & Payne, 1977), court judges (Ebbesen & Konečni, 1975), investment managers in the US and Taiwan (Olsen, 1997), and politicians (Tetlock, 2005). However, expert predictions are better when the forecasting environment provides regular, clear feedback and there are repeated opportunities to learn (Kahneman & Klein, 2009; Shanteau, 1992). Examples include meteorologists (Murphy & Winkler, 1984), professional bridge players (Keren, 1987), and bookmakers at the racetrack (Bruce & Johnson, 2003), all of whom are well calibrated in their own domains.
1.2 Silver Standards
In many cases, judgment quality is important, but gold standards are unavailable. How “good” is a physician’s diagnosis, for example? Or an instructor’s grade, or a judge’s sentencing decision? Einhorn (1972, 1974) and Weiss & Shanteau (2003) suggested that, at a minimum, good judges (i.e., domain experts) should demonstrate consistency and discrimination in their judgments. In other words, experts should make similar judgments when cases are alike and dissimilar judgments when cases are unalike. Indeed, some would argue that these skills are essential to expertise.
In a test of consistency, Skånér, Strender & Bring (1998) asked 27 general practitioners (GPs) to estimate the probability of patient heart failure from a series of vignettes based on actual patients. In the vignettes, GPs were given diagnostic cues such as age, sex, and history of myocardial infarction, but the patients’ eventual survival status could not be obtained. GPs were presented with 45 cases, five of which were presented twice, and consistency was operationalized as the absolute difference between probability estimates on the five repeated cases. Individual consistency varied greatly. Absolute differences fell between 0 and 10% in 62% of cases, between 11% and 20% in 25% of cases, and above 20% in 13% of cases. In another test of consistency, Dhami & Ayton (2001) found inconsistency in magistrates’ decisions across repeated trials in a laboratory bail-setting task.
1.3 Connections Among Standards
We know of no studies that have examined good judgment using all four of these standards. One study examined three of them (Weiss, Brennan, Thomas, Kirlik & Miller, 2009). Using a golf putting task with experienced golfers, Weiss et al. found a strong correlation between accuracy and their combined measure of consistency and discrimination.
In a handful of studies, researchers have investigated connections between the gold standards, but results have been mixed. Most studies show weak connections (Wright & Ayton, 1987a, 1987b; Wright, Rowe, Bolger & Gammack, 1994; Adam & Reyna, 2005; Weaver & Stewart, 2012; Dunwoody, Goodie & Mahan, 2005), although one could argue that Weiss et al.’s results are an exception. For the most part, however, measures of coherence and correspondence are loosely coupled.
Yet under the right conditions, loose couplings may tighten. In research on forecasting, for example, some studies have given subjects a battery of coherence tasks before having them make forecasts about uncertain events. The results show that aggregate forecasts are more accurate (i.e., more correspondent) when subjects’ predictions are weighted by their scores on the coherence tasks rather than combined with a simple unweighted average. Indeed, coherence-based weighting schemes have been shown to improve the accuracy of the aggregate by more than 30%, relative to a simple mean (Predd, Osherson, Kulkarni & Poor, 2008; Wang, Kulkarni, Poor & Osherson, 2011; Tsai & Kirlik, 2012; Karvetski, Olson, Mandel & Twardy, 2013). Perhaps correspondence and coherence are intertwined, but we’ve been looking in the wrong places.
1.4 Superforecasters
In previous research, we have examined a group of individuals with remarkable correspondence skills in geopolitical forecasting. These individuals were identified over the course of four geopolitical forecasting tournaments sponsored by the Intelligence Advanced Research Projects Activity (IARPA), the research and development wing of the U.S. intelligence community (Mellers, Stone, Murray, Minster, Rohrbaugh, Bishop, Chen, Baker, Hou, Horowitz, Ungar & Tetlock, 2015a; Tetlock & Gardner, 2015). In these tournaments, thousands of people predicted the outcomes of questions such as “Will the U.N. General Assembly recognize a Palestinian state by September 30, 2011?” Each year, the top 2% of subjects were designated “superforecasters” and were assigned to work together in elite teams. In this richer setting, superforecasters became more accurate and resisted regression to the mean, suggesting that their accuracy was driven at least in part by skill rather than luck (Mellers, Stone, Atanasov, Rohrbaugh, Metz, Ungar, Bishop, Horowitz, Merkle & Tetlock, 2015b). Indeed, using Brier scores to measure accuracy, Goldstein, Hartman, Comstock and Baumgarten (2016) found that superforecasters outperformed U.S. intelligence analysts on the same questions by roughly 30%.
Who were these remarkable individuals? Superforecasters were well-educated, all having at least college degrees, and about two thirds with post-graduate degrees. They came from a diverse range of fields, including law, pharmacy, biochemistry, photography, computer science, venture capitalism, and mathematics. Some had training in probability and statistics, but none were subject matter experts because the range of topics in the tournaments was too great for any person to master singlehandedly. One could say that superforecasters were better characterized as super-generalists than as subject matter experts (Reference Tetlock and GardnerTetlock & Gardner, 2015).
What explained superforecasters’ success? Several hypotheses were supported (Mellers et al., 2015a). First, superforecasters were more actively open-minded and had greater fluid and crystallized intelligence than the average forecaster. Second, they were more motivated, as measured by the number of questions they attempted and the rate at which they updated their predictions. Third, they acquired task-specific skills, including scope sensitivity, which is the calibration of probabilities across regional, temporal, and quantitative dimensions (Mellers et al., 2015a). They also expressed their beliefs in a more nuanced way, making more distinctions on the probability scale than other forecasters (Friedman, Baker, Mellers, Tetlock & Zeckhauser, 2017). Fourth, superforecasters, who always worked in teams, had more stimulating environments than others. Compared to regular teams, superforecasters shared more information, engaged in more discussions, offered more encouragement, and conducted more in-depth post mortems to gain insights from their successes and failures.
1.5 The Present Research
We ask a question that follows naturally from the forecasting tournaments discussed earlier. Are superforecasters, who were markedly better on measures of correspondence than their peers, also better on other standards of good judgment, including consistency, discrimination, and coherence? We use data from two online surveys to compare the judgments of superforecasters to those made by a less elite group of forecasters (hereafter called “regular” forecasters) and University of Pennsylvania undergraduates. Regular forecasters participated in the same tournaments as superforecasters, but were not as accurate, received no training in probabilistic reasoning, and worked alone rather than in teams. Undergraduates had no experience with forecasting at all.
1.6 Evaluating the Other Three Benchmarks
There are numerous ways to assess each standard of judgment. We evaluated consistency and discrimination using a task from an early experiment by Wallsten, Budescu, Rapoport, Zwick & Forsyth (1986). Wallsten et al.’s subjects assigned numerical values to verbal uncertainty phrases such as “likely”, “unlikely”, and “possible”. Subjects showed widespread individual differences in their numerical interpretations of the phrases. Moreover, when the uncertainty phrases were paired with events, numerical interpretations varied even more (Cohen, 1986; Mandel, 2015). A “good chance” of having cancer, for example, was judged as likelier than a “good chance” of rain. We will measure consistency using a similar task, focusing on the extent to which numerical interpretations of probability phrases are influenced by events. We will measure discrimination by computing the difference in numerical estimates for the most extreme probability phrases.
Coherence is often assessed using Bayesian reasoning problems (Kahneman & Tversky, 1985; Gigerenzer & Hoffrage, 1995). We used these problems, but also included questions about information needed for updating one’s beliefs, questions about which of two pieces of information is more diagnostic, and questions about whether to use normatively useless information.
The questions about information needed to update one’s beliefs come from a pseudodiagnosticity task developed by Doherty, Mynatt, Tweney and Schiavo (1979). Subjects read about an undersea explorer who found a valuable pot and wanted to return it to its place of origin, either Coral Island or Shell Island. The pot had smooth (vs. rough) clay and a curved (vs. straight) handle. Subjects were asked which information they most wanted to see from a limited set of probabilities. Not being Bayesians, subjects often selected nondiagnostic pairs of information. We also used questions from Baron, Beattie and Hershey (1988) about congruence and information biases. The congruence bias is a preference for less diagnostic information if that information is more likely to give a positive result for the favored hypothesis, and the information bias is a preference for more information, even if it will not change the decision.
1.7 Hypotheses
Following from the conjecture that expertise moderates the strength of the relationships among gold and silver standards of good judgment, we propose that the skills required to excel on correspondence tasks might also manifest themselves in performance on other benchmarks. Consequently, we predicted that superforecasters would be more coherent, more consistent, and better able to discriminate than other groups of subjects. Similarly, we expected regular forecasters to have acquired some degree of correspondence expertise (though perhaps not as much), and undergraduates to have acquired less still. Thus, we expected that regular forecasters would perform second best on the judgment tasks and that undergraduates would perform worst. Obviously, our laboratory tasks are only a small set of the possible ways to assess good judgment, and the groups could be equally skilled at some of them. In sum, our hypothesis was that superforecasters would perform at least as well as the comparison groups on tasks of consistency, discrimination, and coherence.
2 Method
We administered two surveys, six months apart. Table 1 shows a summary of the relevant tasks in each survey that are discussed in greater detail below. In both surveys, tasks were presented in the order shown in Table 1.
Subjects.
In the first survey, responses were obtained from 173 superforecasters (69% response rate), 75 regular forecasters (48% response rate), and 92 undergraduates. In the second survey, responses were obtained from 67 superforecasters (52% response rate), 90 regular forecasters (47% response rate), and 186 undergraduates. Some forecasters participated in both surveys: 52 superforecasters and 35 regular forecasters. Forecasters were, on average, 37.1 years old (SD = 8.3), and all of them had at least a bachelor’s degree. Undergraduates were, on average, 19.7 years old (SD = 0.8), and none of them had yet earned a college degree.
Incentives.
Forecasters were rewarded with an electronic book upon the completion of each survey. Undergraduates were recruited by the Wharton Behavioral Lab at the University of Pennsylvania, and were paid $10 upon the completion of each survey (roughly the cost of an electronic book).
Consistency and Discrimination Tasks.
In each task, we combined five distinct uncertainty phrases with five events drawn from (then) current affairs. We then asked subjects to provide numerical estimates of the “best”, “lowest”, and “highest” plausible interpretations of these phrase-event pairs. This task gave us two measures of consistency. The first was the independence of numerical interpretations across events: if superforecasters were more consistent in their numerical interpretations, they should show a smaller main effect of events than other groups. The second measure of consistency was the width of the plausible interval (“highest” minus “lowest”) assigned to each phrase-event pair. Insofar as consistency implies precision, greater consistency would mean narrower plausible ranges.
This task also gave us a measure of discrimination. Though there is little normative basis for saying that better discriminators “should” show greater differences in their interpretations of phrases such as “doubtful” and “possible”, there are reasonable grounds for assuming that better discriminators would make greater distinctions between more extreme uncertainty phrases. Thus, we measured discrimination by examining the numerical difference in interpretations assigned to “almost certain” and “almost impossible”.
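To make the interval-based measures concrete, the sketch below shows one way the plausible-interval width and the discrimination spread could be computed from response data. The data-frame layout, column names, and toy values are our own illustration, not the survey’s actual data format.

```python
# Minimal sketch of the interval-width and discrimination measures (illustrative data).
import pandas as pd

# One row per subject x phrase, with "lowest", "best", "highest" estimates in [0, 1].
df = pd.DataFrame({
    "subject": [1, 1, 2, 2],
    "phrase":  ["almost impossible", "almost certain", "almost impossible", "almost certain"],
    "lowest":  [0.02, 0.85, 0.10, 0.70],
    "best":    [0.05, 0.95, 0.20, 0.80],
    "highest": [0.10, 0.99, 0.35, 0.90],
})

# Consistency (second measure): average width of the plausible interval per subject.
df["interval_width"] = df["highest"] - df["lowest"]
consistency = df.groupby("subject")["interval_width"].mean()

# Discrimination: difference between "best" estimates for the two extreme phrases.
best = df.pivot_table(index="subject", columns="phrase", values="best")
discrimination = best["almost certain"] - best["almost impossible"]

print(consistency)
print(discrimination)
```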
Stimuli.
In each survey, subjects were presented with a different version of the verbal uncertainty task. They imagined that they were reading an intelligence report that contained statements such as “It is likely that, before the end of 2014, China will seize control of Second Thomas Shoal.” Then they provided numerical estimates of the “best”, “lowest”, and “highest” plausible interpretations of each statement.
Probability phrases and events were combined using a 5 × 5 factorial design. Uncertainty phrases were “almost impossible”, “doubtful”, “possible”, “likely”, and “almost certain.” In the first survey, events were:
1. China will seize control of the Second Thomas Shoal in the South China Sea before the end of 2014.
2. The kidnapped girls in Nigeria will be brought back alive before the end of 2014.
3. North Korea will conduct a new multistage rocket or missile launch before September, 2014.
4. China’s annual GDP growth rate will be less than 7% in the first fiscal quarter of 2015.
5. Russian armed forces will invade or enter East Ukraine before October, 2014.
In the second survey, events were:
1. A woman will be elected president of the United States in the 2016 election.
2. A private spaceflight company will allow members of the public to purchase passage to the moon by January, 2040.
3. Additional U.S. states will legalize the recreational use of marijuana by January, 2017.
4. The U.S. Mint will discontinue production of the penny by January, 2019.
5. An HIV vaccine will be commercially available by January, 2030.
In each survey, subjects were randomly assigned one of five sets of stimulus pairs. Each set consisted of five of the 25 event-phrase pairs, with each probability phrase and each event presented once; an illustration of this counterbalancing appears below. The order of event-phrase pairs differed across sets according to a Latin square design.
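One simple way to generate such sets is a cyclic Latin square. The sketch below is a hypothetical reconstruction of the counterbalancing logic, not the authors’ actual assignment procedure; the abbreviated event labels are ours.

```python
# Illustrative construction of five stimulus sets from a cyclic 5 x 5 Latin square.
phrases = ["almost impossible", "doubtful", "possible", "likely", "almost certain"]
events = ["China/Shoal", "Nigeria", "North Korea", "China GDP", "Ukraine"]  # abbreviated Survey 1 events

# Set k pairs phrase i with event (i + k) mod 5, so each set uses every phrase and
# every event exactly once, and each of the 25 phrase-event pairs appears in exactly one set.
stimulus_sets = [
    [(phrases[i], events[(i + k) % 5]) for i in range(5)]
    for k in range(5)
]

for k, s in enumerate(stimulus_sets, start=1):
    print(f"Set {k}: {s}")
```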
Coherence Tasks.
We used four tasks to assess coherence in our two surveys. In the first survey, we examined pseudodiagnosticity, the congruence bias, and the information bias. To examine pseudodiagnosticity, we created a new version of Doherty et al.’s (1979) scenario called “The Espionage Game”. Instructions read:
Imagine you have intercepted an encoded radio transmission from one of two possible opposing teams — the Red Team or the Blue Team — indicating that a missile strike will be launched against your home base within the hour. The fighter jets under your command can successfully neutralize this attack, but only if you are able to correctly determine which opposing team has ordered the strike.
One method you could use to try to determine this would be to compare the characteristics and content of the intercepted message with what is known about the history of transmissions sent by the two opposing teams. Some of this information is readily available to you, and some you can gain by hacking into the computer systems of the Red and Blue teams. To avoid detection, however, you will only be able to steal a limited amount of information from the opposing teams.
The Red and Blue Teams have sent 1000 messages in the past 24 hours. Here are the characteristics of the intercepted message:
1. It was encoded using the Alpha cipher (not the Omega cipher)
2. It was transmitted using a Low frequency (not a High frequency)
You can now gather information about the radio transmissions sent by the Red and the Blue Teams in the last 24 hours. To do so, you may hack into the opposing team’s computer networks and steal any two of the four available data logs by clicking the appropriate buttons below.
Subjects knew that the intercepted message had been encoded with the Alpha cipher and was transmitted at Low frequency. Their task was to select two of the four pieces of information below, each of which contained complementary conditional probabilities of message characteristics given a particular team.
1. p(Alpha|Red) and p(Omega|Red)
2. p(Alpha|Blue) and p(Omega|Blue)
3. p(Low|Red) and p(High|Red)
4. p(Low|Blue) and p(High|Blue)
Subjects then decided which team had sent the message.
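To see which selections are diagnostic, note that updating on a message characteristic requires the likelihood ratio across the two teams. The following is a sketch of the Bayesian logic (ours, not part of the task instructions):

\[
\frac{p(\text{Red}\mid\text{Alpha})}{p(\text{Blue}\mid\text{Alpha})}
= \frac{p(\text{Red})}{p(\text{Blue})} \times \frac{p(\text{Alpha}\mid\text{Red})}{p(\text{Alpha}\mid\text{Blue})}.
\]

Choosing logs 1 and 2 (or logs 3 and 4) supplies such a ratio. Choosing, say, logs 1 and 3 gives two likelihoods for the same team but no ratio, and is therefore pseudodiagnostic.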
The congruence and information bias tasks from Baron et al. (1988) were presented as vignettes. Both tasks had the same opening paragraph:
You will take on the role of a physician and be asked to make decisions regarding hypothetical medical scenarios. Your task will be to choose the course of action that you feel is best. For each question, you will be given information regarding the symptoms and history of a patient with an unidentified illness. You will then be given the opportunity to perform certain tests to aid in your diagnosis. In all cases, whether you perform the tests or not, you will always treat the most likely disease.
The second paragraph of the congruence bias task read:
A patient has a .8 probability of having Chamber-of-Commerce disease and a .2 probability of Elk’s disease. (He surely has one or the other.) A tetherscopic examination yields a positive result in 90% of patients with Chamber-of-Commerce disease and in 20% of patients without it (including those with some other disease). An intraocular smear yields a positive result in 90% of patients with Elk’s disease and in 10% of patients without it.
A normative analysis of the information above indicates that the intraocular smear is more diagnostic. However, the tetherscopic examination has a greater chance of a positive result for the likelier disease. Subjects who select the tetherscopic examination exhibit the congruence bias.
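A brief worked comparison (ours, not part of the vignette) makes the point concrete. Each test’s diagnosticity can be summarized by its likelihood ratio:

\[
LR_{\text{tetherscopic}} = \frac{p(+\mid \text{Chamber-of-Commerce})}{p(+\mid \text{not Chamber-of-Commerce})} = \frac{.90}{.20} = 4.5,
\qquad
LR_{\text{intraocular}} = \frac{p(+\mid \text{Elk's})}{p(+\mid \text{not Elk's})} = \frac{.90}{.10} = 9.
\]

The intraocular smear therefore moves beliefs further per result, even though the tetherscopic examination is more likely to come back positive for the disease the patient probably has.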
The second paragraph of the information bias task read:
A patient is presenting symptoms and a history that suggest a diagnosis of globoma, with about .8 probability. If it isn’t globoma, it’s either popitis or flapemia. Each disease has its own treatment, which is ineffective against the other two diseases. A test called the ET scan would certainly yield a positive result if the patient had popitis, and a negative result if she has flapemia. If the patient has globoma, a positive and negative result are equally likely. If the ET scan were the only test you could do, should you do it? (Yes/No)
From a Bayesian perspective, results from the ET scan are not sufficiently diagnostic to change the physician’s normative course of action, and the expected marginal utility of the test is, at best, zero. Subjects who respond “Yes” to this question (i.e., subjects who would elect to conduct the ET scan) show the information bias.
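A short worked analysis (ours) illustrates this. Let the prior be .8 for globoma, with popitis and flapemia sharing the remaining .2. Then, up to normalization,

\[
p(\text{globoma}\mid +) \propto .8 \times .5 = .40, \quad p(\text{popitis}\mid +) \le .20, \quad p(\text{flapemia}\mid +) = 0,
\]
\[
p(\text{globoma}\mid -) \propto .8 \times .5 = .40, \quad p(\text{flapemia}\mid -) \le .20, \quad p(\text{popitis}\mid -) = 0.
\]

Globoma remains the most likely disease after either result, so the scan cannot change the treatment decision.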
In the second survey, we assessed coherence using five Bayesian reasoning problems. The first was adapted from Eddy (1982):
Breast Cancer. The probability of breast cancer is 1% for a woman at age 40 who participates in routine screening. If a woman has breast cancer, the probability is 80% she will get a positive mammography. If a woman does not have breast cancer, the probability is 9.6% she will also get a positive mammography. A woman in this age group gets a positive mammography test result in a routine screening. What is the probability she actually has breast cancer?
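For reference, the normative answer (our calculation from the stated figures) is

\[
p(\text{cancer}\mid +) = \frac{.01 \times .80}{.01 \times .80 + .99 \times .096} \approx .078,
\]

or roughly 7.8%.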
The four other Bayesian updating problems were:
Die/Urn. Imagine Sue throws a fair, six-sided die. If it comes up with the number "1," she picks a ball from Urn A. With any other number, she will pick a ball from Urn B. Urn A has 2 black balls and 1 white ball. Urn B has 1 black ball and 2 white balls. A black ball is selected from one of the two urns. What is the probability that Urn A was selected?
Engine. Imagine a device has been invented for screening engine parts for internal cracks. Most parts are without cracks, but past results show that 10 in 1000 parts have a crack. Of parts with cracks, 9 out of 10 will be correctly identified with the screening device. Of parts without cracks, 10 out of 990 will be incorrectly identified as having cracks. An engine part is tested and the result is a crack. What is the probability the engine part really is cracked?
Cookies. Suppose there are two opaque jars, each full of cookies. Jar #1 has 10 chocolate-chip cookies and 30 plain cookies. Jar #2 has 20 cookies of each type. Fred picks one of the jars at random. Then without looking, he randomly takes a cookie from that jar. The cookie is a plain one. Fred wonders which of the two jars he originally selected. The plain cookie is one piece of evidence. Given that the cookie he selected was plain, what is the probability that the cookie came from Jar #1?
HIV. In the 1980’s, a particular HIV test was used in the United States. At the time, 10 out of every 1000 people in the US were infected with HIV, and 990 were not. Imagine that, during this time, public health researchers selected a random sample of 1000 Americans to be tested for HIV. In this sample, all 10 of the people who truly had HIV tested positive, and 40 of the remaining people (i.e. those who did not have HIV) also tested positive. Imagine, now, that a new person takes the test, and has a positive result. If the results from the previous sample hold true for the population at large, what is the probability that this person actually has HIV?
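The normative answers to these four problems follow directly from Bayes’ rule. The short sketch below (our illustration, not part of the survey materials) computes them.

```python
# Normative Bayesian answers for the four updating problems (illustrative calculation).
def posterior(prior_h, likelihood_h, prior_alt, likelihood_alt):
    """p(H | evidence) for two mutually exclusive, exhaustive hypotheses."""
    joint_h = prior_h * likelihood_h
    joint_alt = prior_alt * likelihood_alt
    return joint_h / (joint_h + joint_alt)

# Die/Urn: Urn A with prob 1/6 (die shows 1); p(black | A) = 2/3, p(black | B) = 1/3.
print("Die/Urn:", posterior(1/6, 2/3, 5/6, 1/3))              # ~0.286

# Engine: 10/1000 parts cracked; 9/10 of cracked flagged; 10/990 of uncracked flagged.
print("Engine:", posterior(10/1000, 9/10, 990/1000, 10/990))  # ~0.474

# Cookies: jars equally likely; p(plain | Jar 1) = 30/40, p(plain | Jar 2) = 20/40.
print("Cookies:", posterior(0.5, 30/40, 0.5, 20/40))          # 0.6

# HIV: 10/1000 infected, all test positive; 40 of the 990 uninfected also test positive.
print("HIV:", posterior(10/1000, 1.0, 990/1000, 40/990))      # 0.2
```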
3 Results
Consistency.
Figures 1 and 2 show the numerical interpretations associated with each event-phrase pair in the verbal uncertainty tasks from the first and second surveys, respectively. Each curve represents the interpretations of a given probability phrase; differences between curves represent the effects of probability phrases. The ups and downs of each curve show the effects of events. Accordingly, flatter curves represent greater consistency in numerical interpretation.
The figures show that, in Survey 1, the curves are generally flatter for superforecasters than for other groups. To examine the consistency of phrases over events, we conducted a series of linear regressions predicting numerical estimates from probability phrases, events, and their interaction for each group in each survey.
To evaluate consistency, we assessed the effects of events and the event-by-phrase interaction. (We discuss the effect of phrases later, in the context of discrimination.) Effects of events were roughly the same across groups and did not differ significantly for any comparison (Supers: .017, .016; Regulars: .022, .021; Undergraduates: .005, .019 in Surveys 1 and 2, respectively). The interaction between events and phrases was not consistently greater or smaller across surveys for Supers (.006, .004) than for Regulars (−.013, .010) or Undergraduates (.023, .012). In sum, Supers were neither more nor less sensitive to events than other groups in their numerical interpretations.
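For readers who want to reproduce this style of analysis, the sketch below shows one way to estimate the phrase, event, and interaction effects per group. The file name, column names, and the choice of partial eta-squared as the effect-size metric are our assumptions; the paper does not specify its exact computation.

```python
# Sketch of the consistency regressions: best estimate ~ phrase + event + phrase:event, by group.
import pandas as pd
import statsmodels.formula.api as smf
from statsmodels.stats.anova import anova_lm

df = pd.read_csv("verbal_uncertainty_responses.csv")  # hypothetical: group, subject, phrase, event, best

for group, sub in df.groupby("group"):
    model = smf.ols("best ~ C(phrase) * C(event)", data=sub).fit()
    table = anova_lm(model, typ=2)
    # Partial eta-squared: term sum of squares relative to term + residual sums of squares.
    eta = table["sum_sq"] / (table["sum_sq"] + table.loc["Residual", "sum_sq"])
    print(group, eta.round(3).to_dict())
```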
The second measure of consistency was the width of the plausible interval assigned to each event-phrase pair, averaged over events. Table 2 shows average plausible intervals for each phrase and each group in each survey. Superforecasters tended to assign narrower plausible intervals than the other groups in Survey 1 (5 of 10 pairwise comparisons, p’s < .05) and in Survey 2 (10 of 10 pairwise comparisons, p’s < .023). To summarize, superforecasters were at least as consistent as the other groups on the event-effect measure and more consistent on the plausible-interval measure.
(Table 2 note: asterisks indicate significant differences in plausible interval width relative to superforecasters, Welch’s two-sample t-test; * p < .05, ** p < .01, *** p < .001.)
Discrimination.
In both surveys, superforecasters made greater distinctions among uncertainty phrases than other groups. In the regressions mentioned earlier, the effect of phrase was greater for Supers (.143, .181) than for Regulars (.085, .129) and Undergraduates (.060, .092). All differences were significant at p < .001. We also compared the difference between the two most extreme uncertainty phrases across groups. Superforecasters’ differences were significantly larger than those of regular forecasters and undergraduates in both the first survey (Welch’s two-sample t-tests: t(138.7) = 3.35, p = .001 and t(214.3) = 5.40, p < .001, respectively) and the second (t(152.1) = 3.11, p = .002 and t(149.1) = 6.12, p < .001, respectively). The larger gaps between the most extreme curves for superforecasters, relative to other groups, are visible in Figures 1 and 2.
Coherence.
The first survey measured pseudodiagnosticity, the congruence bias, and the information bias. In the pseudodiagnosticity task, superforecasters selected diagnostic pairs of items slightly more often than regular forecasters and undergraduates (41%, 36%, and 32%, respectively), but the differences were not statistically significant. However, superforecasters were the only group to select diagnostic pairs more often than chance (t(172) = 2.05, p = .041).
Table 3 displays the proportions of correct responses for the congruence and information bias tasks. Superforecasters were more likely to choose the more diagnostic test and avoid the congruence bias relative to regular forecasters and undergraduates (Welch’s two-sample t-test: t(124.6) = 2.19, p = 0.030 and t(160.8) = 3.67, p < .001, respectively). Superforecasters and regular forecasters were equally likely to avoid the information bias, but superforecasters were more likely to avoid the bias than undergraduates (Welch’s two-sample t-test: t(175.5) = 5.27, p < 0.001).
Survey 2 tested coherence using five Bayesian reasoning questions. Table 4 shows the percentage of correct responses (within ±1 percentage point of the normative Bayesian solution) for each group. Superforecasters were significantly more likely to be correct than the other groups on each question (all p’s < .003). We wondered whether superforecasters who were incorrect were still “more” correct than the incorrect respondents in the other groups. To find out, we compared the average absolute errors of regular forecasters and undergraduates to those of superforecasters, both including and excluding subjects who gave correct answers.
Table 5 shows the results. The left side of the table compares the average absolute errors of superforecasters with those of each group using all subjects. Almost all of the tests (9 out of 10) show that superforecasters had smaller errors than the other groups. The right side of the table shows the same tests using only those subjects who were incorrect. If incorrect superforecasters were still better than the others, they too should have smaller errors. That was not the case: in only 1 of the 10 comparisons did superforecasters have smaller absolute errors. Thus, more superforecasters knew (or applied) Bayes’ rule than in the other groups, but those who were incorrect were just as incorrect as the others.
(Table 4 note: asterisks indicate significant differences in proportion correct relative to superforecasters, Welch’s two-sample t-test; * p < .05, ** p < .01, *** p < .001.)
(Table 5 note: asterisks indicate significant differences in absolute error relative to superforecasters, Welch’s two-sample t-test; * p < .05, ** p < .01, *** p < .001.)
Are Good Judgment Skills Related?
Finally, we explore whether performance across the components of good judgment hangs together. One way to address this question is to examine pairwise correlations between measures of skill in discrimination, consistency, and coherence. Suppose the measures from each survey were “items” on a scale of good judgment. We have five items in Survey 1 and three items in Survey 2 (discussed below). All relevant variables were coded such that lower scores indicated higher performance.
Consistency was represented by the average width of an individual’s plausible intervals in the verbal probability task, and discrimination was represented by (one minus) the difference between “best” estimates for extreme probability phrases (“almost impossible” and “almost certain”). In the first survey, coherence was represented by (three minus) the sum of three binary variables: the mean score of diagnostic pairs selected in the pseudodiagnosticity task (original scale: 0 = not diagnostic; 1 = diagnostic); performance on the congruence bias task (original scale: 0 = incorrect; 1 = correct); and performance on the information bias task (original scale: 0 = incorrect; 1 = correct). In the second survey, coherence was represented by a subject’s average absolute error in the Bayesian reasoning task. Table 6 presents correlations between measures for all groups. Although these correlations are not large, they are all positive and five of six differ significantly from zero, suggesting that these skills are at least somewhat interrelated.
(Table 6 note: * p < .05, ** p < .01, *** p < .001.)
Another way to ask whether good judgment generalizes is to compute a single composite score across consistency, discrimination, and coherence measures and ask whether it distinguishes superforecasters from other groups. Using the same coding, but also standardizing each measure after pooling over all groups, we computed a single score for each individual from the five measures in Survey 1 and the three measures in Survey 2. We then computed the biserial correlation between each single composite score and a binary indicator that distinguished superforecasters from others (0 = superforecaster, 1 = other). In Survey 1, the resulting correlation was .46, and in Survey 2, it was .60.
Because we know that the “items” in our scales were not perfectly reliable, we can safely assume that these correlations are lower than they would have been without measurement error. To correct statistically for attenuation, we divide each correlation by the square root of the reliability of the corresponding composite scale (here, Cronbach’s α). In principle, this procedure yields an estimate of the “true” correlation between our composite scores and superforecaster status, had the former been measured with perfect reliability. After applying this correction, the biserial correlations were .73 and .84 in Surveys 1 and 2, respectively. These analyses suggest that good judgment may be a more unitary phenomenon than previously supposed.
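The correction applied here is the standard disattenuation formula,

\[
r_{\text{corrected}} = \frac{r_{\text{observed}}}{\sqrt{\alpha}},
\]

where α is the Cronbach’s alpha of the composite scale. For example, an observed correlation of .46 and a corrected value of .73 together imply a composite reliability of roughly (.46/.73)^2 ≈ .40.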
4 Discussion
To understand whether different facets of good judgment are interconnected, we investigated whether superforecasters, who excelled at correspondence, would also be better than regular forecasters and undergraduates on laboratory tasks designed to measure three other benchmarks of good judgment. Those were consistency and discrimination (silver standards that can be applied when gold standards are elusive), and coherence (often granted deference as one of two gold standards).
Although some may think consistency and discrimination are weak indicators of good judgment, they shouldn’t be underrated. These skills have practical value. An ongoing debate in the U.S. Intelligence Community concerns the relative merits of verbal versus numerical expressions of uncertainty. Although analysts prefer verbal estimates such as “likely” or “almost impossible” when communicating probabilities, the interpretations of these terms can vary enormously, both within and across individuals. As a result, the intent of analysts and the interpretation of their words can often be at odds, leading to the potential for costly errors in policy decisions (Reference Tetlock and GardnerTetlock & Gardner, 2015, chapter 3).
Superforecasters were either best or tied for best on all tasks assessing the other benchmarks. Across the two surveys, we found modest but systematic correlations among benchmarks, ranging from .08 to .39 with an average of .20. We then computed a single composite score of good judgment based on all of the measures in each survey. The composite was strongly related to superforecaster status, with correlations of .46 and .60 in Surveys 1 and 2, respectively. Correcting for attenuation boosted these correlations to .73 and .84.
Previous studies of performance on correspondence and coherence tasks have tended to find weak associations (Wright et al., 1994; Wright & Ayton, 1987a, 1987b). These studies have not typically used groups who varied greatly in their skills on one or more benchmarks, which may explain the difference in results. Correspondence and coherence appear to be more closely related among experts than among novices. In one study that did use more skilled subjects, Weiss et al. (2009) found a higher correlation between benchmarks than is usually obtained.
If the skills involved in good judgment tend to go together, disagreements over the merits of one benchmark versus another may not be as critical as some suppose. Some argue that the coherence benchmark, as measured in psychological tasks, does not reflect rationality: responses to artificial laboratory tasks, made by subjects who may be bored and are in an unfamiliar environment, do not reflect good judgment (Arkes, Gigerenzer & Hertwig, 2016). Defenders of coherence benchmarks respond that the same mistakes have been demonstrated with serious professionals who are not only familiar with the tasks, but stake their reputations on them (Croskerry & Norman, 2008; Dawson & Arkes, 1987; Guthrie, Rachlinski & Wistrich, 2007).
4.1 Is Good Judgment the Same as Intelligence?
Good judgment requires intelligence, so it is reasonable to ask how closely the two are connected. Several studies have investigated the relationship between intelligence and different standards of good judgment. Mellers et al. (2015b) examined correlations between correspondence and a measure of intelligence in forecasters from the geopolitical tournaments discussed earlier. Intelligence was measured using Raven’s Advanced Progressive Matrices (Bors & Stokes, 1998), and correspondence was measured with Brier scores. These variables were positively correlated, but the relationship was modest (r = 0.23).
Stanovich, West and Toplak (2016) investigated the relationship between measures of intelligence and coherence using tasks from their rationality quotient (RQ). Stanovich and West (2014) reported correlations between IQ and RQ ranging from .20 to .35. Weaver and Stewart (2012) measured intelligence, correspondence, and coherence in college students. The average correlation between intelligence and correspondence was .17, and that between intelligence and coherence was .28. Intelligence appears to be related to good judgment, but the two are certainly not identical.
In sum, good judgment takes many forms and can be evaluated using silver or gold standards. Facets of good judgment are not only interlinked, they are also correlated with intelligence. Unlike some aspects of intelligence, good judgment can be learned, honed, and sharpened.
Anyone who believes there is no room for improvement has demonstrably poor judgment.