Published online by Cambridge University Press: 01 April 2022
Can there be good reasons for judging one set of probabilistic assertions more reliable than a second? There are many candidates for measuring “goodness” of probabilistic forecasts. Here, I focus on one such aspirant: calibration. Calibration requires an alignment of announced probabilities and observed relative frequencies, e.g., 50 percent of the events forecast with announced probability .5 occur, 70 percent of the events forecast with probability .7 occur, etc.
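The alignment that calibration requires can be made concrete with a small sketch. The function name `calibration_table` and the data below are hypothetical and serve only as illustration: forecasts are grouped by announced probability, and each group's observed relative frequency is compared with the announcement.

```python
from collections import defaultdict

def calibration_table(forecasts, outcomes):
    """For each announced probability, return (number of forecasts,
    observed relative frequency of the forecast events)."""
    bins = defaultdict(list)
    for p, occurred in zip(forecasts, outcomes):
        bins[p].append(occurred)
    return {p: (len(v), sum(v) / len(v)) for p, v in sorted(bins.items())}

# Hypothetical record: 1 means the forecast event occurred, 0 that it did not.
forecasts = [0.5] * 4 + [0.7] * 10
outcomes = [1, 0, 1, 0] + [1, 1, 1, 1, 1, 1, 1, 0, 0, 0]

table = calibration_table(forecasts, outcomes)
for p, (n, freq) in table.items():
    print(f"announced {p}: {n} forecasts, observed frequency {freq}")
```

In this made-up record the forecaster is perfectly calibrated: 2 of the 4 events announced at .5 occur, and 7 of the 10 announced at .7 occur.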
To summarize the conclusions: (i) Surveys designed to display calibration curves, from which a recalibration is to be calculated, are useless without due consideration for the interconnections between questions (forecasts) in the survey. (ii) Subject to feedback, calibration in the long run is otiose. It gives no ground for validating one coherent opinion over another as each coherent forecaster is (almost) sure of his own long-run calibration. (iii) Calibration in the short run is an inducement to hedge forecasts. A calibration score, in the short run, is improper. It gives the forecaster reason to feign violation of total evidence by enticing him to use the more predictable frequencies in a larger finite reference class than that directly relevant.
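Conclusion (iii) can be illustrated with a toy computation. This is a sketch under stated assumptions, not the paper's own construction. It assumes one common form of calibration score (the forecast-weighted squared gap between each announced probability and the observed relative frequency among forecasts sharing that announcement, with lower being better), two events that the forecaster treats as independent, and hypothetical degrees of belief .6 and .4. The names `calibration_score` and `expected_score` are likewise illustrative.

```python
from itertools import product

def calibration_score(forecasts, outcomes):
    """Weighted squared gap between each announced probability and the
    observed relative frequency in its group (0 is perfect calibration)."""
    bins = {}
    for p, x in zip(forecasts, outcomes):
        bins.setdefault(p, []).append(x)
    n = len(forecasts)
    return sum(len(v) * (p - sum(v) / len(v)) ** 2 for p, v in bins.items()) / n

def expected_score(announced, beliefs):
    """Expected calibration score by the forecaster's own lights,
    assuming (for this sketch) that the events are independent."""
    total = 0.0
    for outcomes in product([0, 1], repeat=len(beliefs)):
        prob = 1.0
        for b, x in zip(beliefs, outcomes):
            prob *= b if x else 1 - b
        total += prob * calibration_score(announced, outcomes)
    return total

beliefs = [0.6, 0.4]                             # hypothetical degrees of belief
honest = expected_score([0.6, 0.4], beliefs)     # announce what is believed
hedged = expected_score([0.5, 0.5], beliefs)     # pool both events at .5
print(f"honest: {honest:.2f}, hedged: {hedged:.2f}")
```

Under these assumptions the honest announcement has expected score .24, while pooling both events under a single announcement of .5 has expected score .12. The forecaster thus expects to do better by misreporting, which is what it means for a score to be improper.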
I thank Jay Kadane and Mark Schervish for helpful discussions about their important work on calibration, and Isaac Levi for his constructive criticism of this and earlier drafts. Also, I have benefited from conversations with M. De Groot and J. K. Ghosh.
Preliminary versions of this paper were delivered at the Meeting of the Society for Philosophy and Psychology, May 13–16, 1982, London, Ontario; and at Session TA10, “Modeling Uncertainty,” of the TIMS/ORSA conference, April 27, 1983, Chicago, Illinois.
Research for this work was sponsored by a Washington University Faculty Research Grant.