
MULTI-ARMED BANDITS UNDER GENERAL DEPRECIATION AND COMMITMENT

Published online by Cambridge University Press: 10 October 2014

Wesley Cowan
Affiliation: Department of Mathematics, Rutgers University, 110 Frelinghuysen Road, Piscataway, NJ 08854, USA. E-mail: [email protected]

Michael N. Katehakis
Affiliation: Department of Management Science and Information Systems, Rutgers Business School, Newark and New Brunswick, 100 Rockafeller Road, Piscataway, NJ 08854, USA. E-mail: [email protected]

Abstract


Generally, the multi-armed bandit problem has been studied under the setting that, at each time step over an infinite horizon, a controller chooses to activate a single process, or bandit, out of a finite collection of independent processes (statistical experiments, populations, etc.) for a single period, receiving a reward that is a function of the activated process and, in doing so, advancing the chosen process. Classically, rewards are discounted by a constant factor β ∈ (0, 1) per round.

In this paper, we present a solution to the problem, with potentially non-Markovian, uncountable state space reward processes, under a framework in which, first, the discount factors may be non-uniform and vary over time, and second, the periods of activation of each bandit need not be fixed or uniform, being subject instead to a possibly stochastic duration of activation before a change to a different bandit is allowed. The solution is based on generalized restart-in-state indices, and it utilizes a view of the problem not as “decisions over state space” but rather as “decisions over time”.
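For orientation only: in the classical special case of a finite-state Markovian bandit with a constant discount factor, the Gittins index admits a restart-in-state computation. The sketch below is a minimal illustration of that classical idea, assuming a given transition matrix P, reward vector r, discount beta, and state s (all names are hypothetical); it does not implement the paper's general depreciation and commitment framework.

```python
import numpy as np

def restart_in_state_index(P, r, beta, s, max_iters=10_000, tol=1e-10):
    """Gittins index of state s for a finite-state Markov bandit with constant
    discount beta, via value iteration on the two-action 'restart in s' MDP
    (classical setting only; illustrative sketch)."""
    P = np.asarray(P, dtype=float)
    r = np.asarray(r, dtype=float)
    V = np.zeros(len(r))
    for _ in range(max_iters):
        cont = r + beta * P @ V            # continue from the current state
        restart = r[s] + beta * P[s] @ V   # restart the process in state s
        V_new = np.maximum(cont, restart)
        if np.max(np.abs(V_new - V)) < tol:
            V = V_new
            break
        V = V_new
    # The index of state s is (1 - beta) times the optimal value at s
    # of the restart-in-s problem.
    return (1.0 - beta) * V[s]

# Toy example: state 0 pays 1 and tends to persist, state 1 pays 0.
P = [[0.9, 0.1], [0.5, 0.5]]
r = [1.0, 0.0]
print(restart_in_state_index(P, r, beta=0.9, s=0))
```

The generalized indices developed in the paper extend this restart viewpoint to time-varying discounting and stochastic activation durations, which the constant-β sketch above does not capture.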

Type: Research Article
Copyright © Cambridge University Press 2014
