Article contents
Policies without Memory for the Infinite-Armed Bernoulli Bandit under the Average-Reward Criterion
Published online by Cambridge University Press: 27 July 2009
Abstract
We consider a bandit problem with infinitely many Bernoulli arms whose unknown parameters are i.i.d. We present two policies that maximize the almost sure average reward over an infinite horizon. Neither policy ever returns to a previously observed arm after switching to a new one or retains information from discarded arms, and runs of failures indicate the selection of a new arm. The first policy is nonstationary and requires no information about the distribution of the Bernoulli parameter. The second is stationary and requires only partial information; its optimality is established via renewal theory. We also develop ε-optimal stationary policies that require no information about the distribution of the unknown parameter and discuss universally optimal stationary policies.
- Type
- Research Article
- Information
- Probability in the Engineering and Informational Sciences , Volume 10 , Issue 1 , January 1996 , pp. 21 - 28
- Copyright
- Copyright © Cambridge University Press 1996
References
- 6
- Cited by