
Optimal learning with non-Gaussian rewards

Published online by Cambridge University Press: 24 March 2016

Zi Ding*
Affiliation: University of Maryland

Ilya O. Ryzhov*
Affiliation: University of Maryland

* Postal address: Robert H. Smith School of Business, University of Maryland, 4322 Van Munching Hall, College Park, MD 20742, USA.

Abstract

We propose a novel theoretical characterization of the optimal 'Gittins index' policy in multi-armed bandit problems with non-Gaussian, infinitely divisible reward distributions. We first construct a continuous-time, conditional Lévy process which probabilistically interpolates the sequence of discrete-time rewards. When the rewards are Gaussian, this approach enables an easy connection to the convenient time-change properties of a Brownian motion. Although no such device is available in general for the non-Gaussian case, we use optimal stopping theory to characterize the value of the optimal policy as the solution to a free-boundary partial integro-differential equation (PIDE). We provide the free-boundary PIDE in explicit form under the specific settings of exponential and Poisson rewards. We also prove continuity and monotonicity properties of the Gittins index in these two problems, and discuss how the PIDE can be solved numerically to find the optimal index value for a given belief state.
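
As a rough, self-contained point of reference (and not the authors' PIDE method), the sketch below computes a discrete-time Gittins index for the Poisson-reward case via the classical retirement, or calibration, formulation: bisect on the retirement reward at which the decision maker is indifferent between retiring and continuing to sample. The prior is Gamma(a, b) on the unknown Poisson rate, so the belief state updates as (a, b) -> (a + x, b + 1) after observing a count x. The discount factor, horizon, and support truncations (gamma, horizon, xmax), and all function names, are illustrative assumptions rather than anything taken from the paper.

    # A minimal sketch: discrete-time Gittins index for Poisson rewards
    # with a Gamma(a, b) prior, via backward induction plus bisection.
    # Truncation parameters are illustrative; this is not the paper's method.
    import math

    def predictive_pmf(x, a, b):
        # P(X = x) when theta ~ Gamma(a, b) and X | theta ~ Poisson(theta):
        # the negative binomial predictive distribution.
        log_p = (math.lgamma(a + x) - math.lgamma(a) - math.lgamma(x + 1)
                 + a * math.log(b / (b + 1.0)) - x * math.log(b + 1.0))
        return math.exp(log_p)

    def value_of_continuing(a0, b0, lam, gamma, horizon, xmax):
        # Backward induction over reachable belief states. After n draws,
        # the posterior is Gamma(a0 + s, b0 + n), where s is the total count.
        retire = lam / (1.0 - gamma)
        V = [retire] * (horizon * xmax + 1)   # truncate: retire at the horizon
        for n in range(horizon - 1, -1, -1):
            b = b0 + n
            Vn = []
            for s in range(n * xmax + 1):
                a = a0 + s
                probs = [predictive_pmf(x, a, b) for x in range(xmax + 1)]
                norm = sum(probs)             # renormalize the truncated pmf
                future = sum(p * V[s + x] for x, p in enumerate(probs)) / norm
                Vn.append(max(retire, a / b + gamma * future))
            V = Vn
        return V[0]

    def gittins_index(a0, b0, gamma=0.9, horizon=25, xmax=12, tol=1e-6):
        # Bisection on the retirement rate lam: the index is the lam at which
        # retiring and continuing are equally attractive at the root state.
        lo = a0 / b0                          # index >= myopic predictive mean
        hi = lo + 1.0

        def continuing_wins(lam):
            v = value_of_continuing(a0, b0, lam, gamma, horizon, xmax)
            return v > lam / (1.0 - gamma) + 1e-9

        while continuing_wins(hi):            # grow the bracket if needed
            hi = lo + 2.0 * (hi - lo)
        while hi - lo > tol:
            mid = 0.5 * (lo + hi)
            if continuing_wins(mid):
                lo = mid
            else:
                hi = mid
        return 0.5 * (lo + hi)

    print(gittins_index(2.0, 1.0))  # prior mean 2.0; the index exceeds it

Because the horizon and predictive support are truncated, the computed value is a conservative approximation that tightens as the horizon grows; the paper's free-boundary PIDE characterization instead works directly in continuous time via the interpolating conditional Lévy process.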

Type: Research Article
Copyright © Applied Probability Trust 2016
