
A class of bandit problems yielding myopic optimal strategies

Published online by Cambridge University Press:  14 July 2016

Jeffrey S. Banks
Affiliation:
University of Rochester
Rangarajan K. Sundaram*
Affiliation:
University of Rochester
*
Postal address for both authors: Department of Economics, Harkness Hall, University of Rochester, Rochester, NY 14627, USA.

Abstract

We consider the class of bandit problems in which each of the n ≧ 2 independent arms generates rewards according to one of the same two reward distributions, and discounting is geometric over an infinite horizon. We show that the dynamic allocation index of Gittins and Jones (1974) in this context is strictly increasing in the probability that an arm is the better of the two distributions. It follows as an immediate consequence that myopic strategies are the uniquely optimal strategies in this class of bandit problems, regardless of the value of the discount parameter or the shape of the reward distributions. Some implications of this result for bandits with Bernoulli reward distributions are given.
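The following is a minimal illustrative sketch (not taken from the paper) of the setting with Bernoulli reward distributions: each arm is secretly one of the same two Bernoulli distributions, and the myopic rule pulls an arm with the highest posterior probability of being the better one, which by the paper's result is optimal for any geometric discount factor. The parameter names (P_GOOD, P_BAD, N_ARMS, HORIZON) and their values are assumptions chosen only for illustration.

```python
import numpy as np

# Illustrative sketch, not the authors' code: n Bernoulli arms, each of which
# is secretly either the "good" distribution (success prob P_GOOD) or the
# "bad" one (success prob P_BAD).  q[i] is the current belief that arm i is
# the good one.  The myopic rule pulls an arm with the largest q[i].
rng = np.random.default_rng(0)
P_GOOD, P_BAD = 0.7, 0.3      # the two possible reward distributions (assumed values)
N_ARMS, HORIZON = 3, 200      # assumed problem size and simulation length

truth = rng.random(N_ARMS) < 0.5    # hidden type of each arm (True = good)
q = np.full(N_ARMS, 0.5)            # prior belief that each arm is good

total_reward = 0.0
for t in range(HORIZON):
    i = int(np.argmax(q))                       # myopic choice: highest belief
    p = P_GOOD if truth[i] else P_BAD
    reward = float(rng.random() < p)            # Bernoulli reward from arm i
    total_reward += reward
    # Bayes update of the belief about arm i given the observed reward
    like_good = P_GOOD if reward else 1.0 - P_GOOD
    like_bad = P_BAD if reward else 1.0 - P_BAD
    q[i] = q[i] * like_good / (q[i] * like_good + (1.0 - q[i]) * like_bad)

print("hidden types:", truth, "final beliefs:", np.round(q, 3))
print("average reward per pull:", total_reward / HORIZON)
```

Since P_GOOD > P_BAD, the expected immediate reward of an arm is increasing in its belief q[i], so maximizing q[i] and maximizing immediate expected reward select the same arm; the paper shows this greedy choice also maximizes the Gittins index in this class of problems.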

Type
Research Papers
Copyright
Copyright © Applied Probability Trust 1992 


Footnotes

Financial support from the National Science Foundation and the Sloan Foundation to the first author is gratefully acknowledged.

References

Berry, D. and Fristedt, B. (1985) Bandit Problems: Sequential Allocation of Experiments. Chapman and Hall, London.
Feldman, D. (1962) Contributions to the two-armed bandit problem. Ann. Math. Statist. 33, 847–856.
Feller, W. (1968) An Introduction to Probability Theory and Its Applications, Vol. I. Wiley, New York.
Fristedt, B. and Berry, D. (1988) Optimality of myopic stopping times for geometric discounting. J. Appl. Prob. 25, 437–443.
Gittins, J. and Jones, D. (1974) A dynamic allocation index for the sequential design of experiments. In Progress in Statistics, ed. Gani, J. et al., pp. 241–266. North-Holland, Amsterdam.
McLennan, A. (1988) Incomplete learning in a repeated statistical decision problem. Mimeo, University of Minnesota.
O'Flaherty, B. (1989) Some results on two-armed bandits when both projects vary. J. Appl. Prob. 26, 655–658.
Rodman, L. (1978) On the many-armed bandit problem. Ann. Prob. 6, 491–498.
Ross, S. (1983) Introduction to Dynamic Programming. Academic Press, New York.