
MULTI-ARMED BANDITS UNDER GENERAL DEPRECIATION AND COMMITMENT

Published online by Cambridge University Press:  10 October 2014

Wesley Cowan
Affiliation:
Department of Mathematics, Rutgers University, 110 Frelinghuysen Road, Piscataway, NJ 08854, USA E-mail: [email protected]
Michael N. Katehakis
Affiliation:
Department of Management Science and Information Systems, Rutgers Business School, Newark and New Brunswick, 100 Rockafeller Road, Piscataway, NJ 08854, USA E-mail: [email protected]

Abstract


The multi-armed bandit problem has generally been studied in a setting where, at each time step over an infinite horizon, a controller chooses to activate a single process, or bandit, out of a finite collection of independent processes (statistical experiments, populations, etc.) for a single period, receiving a reward that is a function of the activated process and, in doing so, advancing the chosen process. Classically, rewards are discounted by a constant factor β∈(0, 1) per round.
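For context, the classical objective may be written in the following standard form, where R_t denotes the reward received at step t from the bandit activated at that step under an allocation policy π (notation introduced here for illustration only):

$$\sup_{\pi}\;\mathbb{E}^{\pi}\Big[\sum_{t=0}^{\infty}\beta^{t}\,R_{t}\Big].$$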

In this paper, we present a solution to the problem, with potentially non-Markovian, uncountable state-space reward processes, under a framework in which, first, the discount factors may be non-uniform and vary over time, and second, the periods of activation of each bandit may not be fixed or uniform, being subject instead to a possibly stochastic duration of activation before a change to a different bandit is allowed. The solution is based on generalized restart-in-state indices, and it utilizes a view of the problem not as “decisions over state space” but rather “decisions over time”.
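As a point of reference, the classical Gittins restart-in-state index of a Markovian bandit in state x is

$$\nu(x)\;=\;\sup_{\tau\ge 1}\;\frac{\mathbb{E}_{x}\big[\sum_{t=0}^{\tau-1}\beta^{t}\,R(X_{t})\big]}{\mathbb{E}_{x}\big[\sum_{t=0}^{\tau-1}\beta^{t}\big]},$$

where the supremum is taken over stopping times τ. A schematic version of the generalization described above, with the constant discounting β^t replaced by general (possibly random, time-varying) depreciation factors A(t) applied to the rewards R(t) of the activated bandit, is

$$\nu\;=\;\sup_{\tau\ge 1}\;\frac{\mathbb{E}\big[\sum_{t=0}^{\tau-1}A(t)\,R(t)\big]}{\mathbb{E}\big[\sum_{t=0}^{\tau-1}A(t)\big]}.$$

This display is intended only as an illustration of the “decisions over time” viewpoint; the precise index and the treatment of commitment periods are defined in the paper.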

Type
Research Article
Copyright
Copyright © Cambridge University Press 2014 
