
INDEXABILITY AND OPTIMAL INDEX POLICIES FOR A CLASS OF REINITIALISING RESTLESS BANDITS

Published online by Cambridge University Press:  16 October 2015

Sofía S. Villar*
Affiliation:
BCAM, Basque Center for Applied Mathematics; MRC Biostatistics Unit, IPH, University Forvie Site, Robinson Way, Cambridge CB2 0SR, UK; and Lancaster University. E-mail: [email protected]

Abstract

Motivated by a class of Partially Observable Markov Decision Processes arising in surveillance systems, in which a set of imperfectly observed state processes is to be inferred from a subset of available observations through a Bayesian approach, we formulate and analyze a special family of multi-armed restless bandit problems. We consider the problem of finding a policy for observing the processes that maximizes the total expected net reward over an infinite time horizon, subject to resource availability. From the Lagrangian relaxation of the original problem, an index policy can be derived, provided the Whittle index exists. We demonstrate that this class of reinitialising bandits, in which a project's state deteriorates while active and, until its completion, resets to its initial state when passive, possesses the structural property of indexability, and we further show how to compute the index in closed form. In general, the Whittle index rule for restless bandit problems does not achieve optimality. However, we show that the proposed Whittle index rule is optimal for the problem under study in the case of stochastically heterogeneous arms under the expected total criterion, and that it is further recovered by a simple, tractable rule referred to as the 1-limited Round Robin rule. Moreover, we illustrate the significant suboptimality of another widely used heuristic, the Myopic index rule, by computing its suboptimality gap in closed form. We present numerical studies that illustrate, for more general instances, the performance advantages of the Whittle index rule over other simple heuristics.
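
As a rough illustration of the two index rules discussed in the abstract, the following Python sketch shows how a Whittle index priority rule and a 1-limited Round Robin rule could be compared by simulation on a generic restless bandit with a fixed number of activations per slot. This is a minimal sketch, not the paper's implementation: the callables whittle_index, transition, and reward are hypothetical placeholders standing in for the paper's closed-form index and reinitialising dynamics, which are not reproduced here.

```python
# Minimal sketch of priority-index and Round Robin activation rules for a
# generic restless bandit. Assumptions: `whittle_index(s)` returns the index
# of an arm in state s, `transition(s, a)` returns the next state, and
# `reward(s, a)` returns the one-slot reward; all three are placeholders.

def make_whittle_policy(whittle_index, m):
    """Priority rule: activate the m arms whose current states have the largest indices."""
    def policy(states, t):
        order = sorted(range(len(states)),
                       key=lambda k: whittle_index(states[k]),
                       reverse=True)
        return set(order[:m])
    return policy

def make_round_robin_policy(num_arms, m):
    """1-limited Round Robin: activate arms in a fixed cyclic order, m per slot."""
    def policy(states, t):
        return {(t * m + j) % num_arms for j in range(m)}
    return policy

def simulate(policy, init_states, transition, reward, horizon, discount=1.0):
    """Accumulate (discounted) rewards over a finite horizon under a given policy."""
    states = list(init_states)
    total, factor = 0.0, 1.0
    for t in range(horizon):
        active = policy(states, t)
        for k, s in enumerate(states):
            a = 1 if k in active else 0        # 1 = observe/activate, 0 = passive
            total += factor * reward(s, a)
            states[k] = transition(s, a)       # e.g. deteriorate if active, reset if passive
        factor *= discount
    return total
```

Both rules are evaluated with the same simulate loop, so their accumulated rewards can be compared directly; in the setting studied in the paper the two rules coincide and are optimal, whereas in more general instances the Whittle rule is only a heuristic.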

Type
Research Article
Copyright
Copyright © Cambridge University Press 2015 

