Appendix 2 - Bandit and tax processes
The seminal works are those of Gittins (1979, 1989), which presented the key insight and unlocked the multi-armed bandit problem, and Klimov (1974, 1978), which developed a frontal attack on the tax problem. Gittins had in fact perceived the essential step to solution by 1968, and described this at the European Meeting of Statisticians in 1972; see Gittins and Jones (1974). By this time Olivier (1972) had independently hit upon much the same approach. A direct dynamic programming proof of the optimality of the Gittins index policy was given in Whittle (1980), and then generalised to the cases of an open process and of the tax problem in Whittle (1981b, 2005).
For present purposes, we shall not follow the historical sequence of steps that suggested the Gittins index, but shall simply give the dynamic programming arguments that confirm optimality of the index policy and evaluate its performance. We consider only the undiscounted case, which shows particular simplifications and covers the needs of the text.
As mentioned in the text, there are at least two levels of aspiration to optimality in the undiscounted case. One is simply to minimise the average cost, and the second is to minimise also the transient cost: the extra cost incurred in passage from a given initial state to the equilibrium regime. As it turns out, the index policy is optimal at both levels.
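As a point of reference, and in notation not taken from the text, one standard way to formalise these two levels is the following. For a stationary policy incurring instantaneous cost $c_t$, the average cost is
\[
\gamma \;=\; \lim_{T \to \infty} \frac{1}{T}\, E\!\left[\sum_{t=0}^{T-1} c_t\right],
\]
and the transient cost from initial state $x$ is the excess cost incurred before equilibrium is reached,
\[
\phi(x) \;=\; \lim_{T \to \infty} E\!\left[\sum_{t=0}^{T-1} \bigl(c_t - \gamma\bigr) \,\Bigm|\, x_0 = x\right],
\]
assuming these limits exist. Optimality at the second level then means minimising $\gamma$ and, among policies attaining the minimal $\gamma$, minimising $\phi(x)$ for every initial state $x$.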
Bandit processes
Consider a Markov decision process with the same dynamics over states as that supposed for items over nodes in Chapter 14.
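For background, and as a definition standard in the literature rather than quoted from the text, the Gittins index of a bandit in state $x$ with reward function $r$ may be written in the undiscounted case as the maximal expected reward per unit of expected time over positive stopping times $\tau$:
\[
\nu(x) \;=\; \sup_{\tau > 0}\;
\frac{E\!\left[\sum_{t=0}^{\tau-1} r(x_t) \,\Bigm|\, x_0 = x\right]}
     {E\!\left[\tau \,\Bigm|\, x_0 = x\right]}.
\]
The index policy engages, at each decision epoch, a bandit of maximal current index.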