
Calibration, Coherence, and Scoring Rules

Published online by Cambridge University Press:  01 April 2022

Teddy Seidenfeld*
Affiliation:
Department of Philosophy, Washington University in St. Louis

Abstract

Can there be good reasons for judging one set of probabilistic assertions more reliable than a second? There are many candidates for measuring “goodness” of probabilistic forecasts. Here, I focus on one such aspirant: calibration. Calibration requires an alignment of announced probabilities and observed relative frequencies, e.g., 50 percent of the events forecast with an announced probability of .5 occur, 70 percent of the events forecast with probability .7 occur, etc.
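
As an illustration only, the calibration requirement can be checked mechanically by grouping forecasts by their announced value and comparing each value with the relative frequency observed in its group. The following is a minimal Python sketch; the function names `calibration_table` and `is_calibrated`, the tolerance parameter, and the forecast record are all hypothetical and are not taken from the paper.

```python
from collections import defaultdict


def calibration_table(forecasts, outcomes):
    """Map each announced probability to (number of forecasts, observed frequency)."""
    if len(forecasts) != len(outcomes):
        raise ValueError("forecasts and outcomes must have the same length")
    counts = defaultdict(lambda: [0, 0])  # announced value -> [forecasts, occurrences]
    for p, occurred in zip(forecasts, outcomes):
        counts[p][0] += 1
        counts[p][1] += int(occurred)
    return {p: (n, k / n) for p, (n, k) in sorted(counts.items())}


def is_calibrated(table, tol=1e-9):
    """True when every announced value matches its observed relative frequency."""
    return all(abs(p - freq) <= tol for p, (_, freq) in table.items())


# Hypothetical record: four forecasts at .5 (two occur), ten at .7 (seven occur).
forecasts = [0.5] * 4 + [0.7] * 10
outcomes = [1, 0, 1, 0] + [1] * 7 + [0] * 3

table = calibration_table(forecasts, outcomes)
for p, (n, freq) in table.items():
    print(f"announced {p:.1f}: {n} forecasts, observed frequency {freq:.2f}")
print("calibrated:", is_calibrated(table))
```

Note that the check looks only at frequencies within each group of equal announcements; it says nothing about how the forecast events are related to one another.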

To summarize the conclusions: (i) Surveys designed to display calibration curves, from which a recalibration is to be calculated, are useless without due consideration for the interconnections between questions (forecasts) in the survey. (ii) Subject to feedback, calibration in the long run is otiose. It gives no ground for validating one coherent opinion over another as each coherent forecaster is (almost) sure of his own long-run calibration. (iii) Calibration in the short run is an inducement to hedge forecasts. A calibration score, in the short run, is improper. It gives the forecaster reason to feign violation of total evidence by enticing him to use the more predictable frequencies in a larger finite reference class than that directly relevant.
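
Conclusion (iii) can be illustrated with a small computation. The sketch below is not the paper's own construction: it assumes two independent events, a forecaster whose personal probabilities for them are .9 and .1, and one simple form of calibration penalty (the mean squared gap between each announced value and the relative frequency observed among forecasts sharing that value). The Brier score is included only as a familiar proper rule for contrast. All function names are hypothetical.

```python
from itertools import product


def calibration_penalty(announced, outcomes):
    """Mean squared gap between each announced value and the observed
    relative frequency among the forecasts that share that value."""
    total = 0.0
    for value in set(announced):
        hits = [x for a, x in zip(announced, outcomes) if a == value]
        freq = sum(hits) / len(hits)
        total += len(hits) * (value - freq) ** 2
    return total / len(announced)


def brier_penalty(announced, outcomes):
    """Mean squared gap between each announced value and its own outcome."""
    return sum((a - x) ** 2 for a, x in zip(announced, outcomes)) / len(announced)


def expected_penalty(score, announced, beliefs):
    """Expected score under the forecaster's beliefs, assuming independent events."""
    expectation = 0.0
    for outcomes in product((0, 1), repeat=len(beliefs)):
        weight = 1.0
        for q, x in zip(beliefs, outcomes):
            weight *= q if x else 1.0 - q
        expectation += weight * score(announced, outcomes)
    return expectation


beliefs = (0.9, 0.1)  # assumed personal probabilities for the two events
honest = beliefs      # announce what is believed
hedged = (0.5, 0.5)   # pool both events into one reference class

for name, score in (("calibration", calibration_penalty), ("Brier", brier_penalty)):
    print(name,
          "honest:", round(expected_penalty(score, honest, beliefs), 3),
          "hedged:", round(expected_penalty(score, hedged, beliefs), 3))
```

Under these assumptions the forecaster expects a smaller calibration penalty from announcing .5 for both events than from announcing .9 and .1, since the frequency in the pooled pair is very likely to be exactly one half. Under the Brier score the honest announcement has the smaller expected penalty.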

Type
Research Article
Copyright
Copyright © 1985 by the Philosophy of Science Association

Footnotes

I thank Jay Kadane and Mark Schervish for helpful discussions about their important work on calibration, and Isaac Levi for his constructive criticism of this and earlier drafts. Also, I have benefited from conversations with M. De Groot and J. K. Ghosh.

Preliminary versions of this paper were delivered at the Meeting of the Society for Philosophy and Psychology, May 13–16, 1982, London, Ontario; and at Session TA10, “Modeling Uncertainty,” of the TIMS/ORSA conference, April 27, 1983, Chicago, Illinois.

Research for this work was sponsored by a Washington University Faculty Research Grant.
