
ASYMPTOTICALLY OPTIMAL MULTI-ARMED BANDIT POLICIES UNDER A COST CONSTRAINT

Published online by Cambridge University Press:  05 October 2016

Apostolos Burnetas
Affiliation:
Department of Mathematics, University of Athens, Athens, Greece E-mail: [email protected]
Odysseas Kanavetas
Affiliation:
Faculty of Engineering and Natural Sciences, Sabanci University, Istanbul, Turkey E-mail: [email protected]
Michael N. Katehakis
Affiliation:
Department of Management Science and Information Systems, Rutgers University, NJ, USA E-mail: [email protected]

Abstract

We consider the multi-armed bandit problem under a cost constraint. Successive samples from each population are i.i.d. with unknown distribution, and each sample incurs a known population-dependent cost. The objective is to design an adaptive sampling policy that maximizes the expected sum of n samples subject to the constraint that the average cost does not exceed a given bound on every sample path. We establish an asymptotic lower bound on the regret of feasible, uniformly fast convergent policies and construct a class of policies that achieve this bound. We also provide their explicit form for Normal distributions with unknown means and known variances.
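As a concrete reading of this formulation (the notation below is ours, not necessarily the paper's), write $X^{i}_{t}$ for the $t$-th sample from population $i$, $c_i$ for its known sampling cost, $C_0$ for the given cost bound, and $\pi_t$ for the population a policy $\pi$ selects at time $t$. The design problem described in the abstract is then, roughly,

\[
\max_{\pi}\; \mathbb{E}\Big[\sum_{t=1}^{n} X^{\pi_t}_{t}\Big]
\quad \text{subject to} \quad
\frac{1}{n}\sum_{t=1}^{n} c_{\pi_t} \;\le\; C_0 \quad \text{on every sample path},
\]

with regret understood as the shortfall relative to the best constraint-feasible allocation when the distributions are known.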

Type
Research Article
Copyright
Copyright © Cambridge University Press 2016 

