Gradient approach for recursive estimation and control in finite Markov chains

Published online by Cambridge University Press:  01 July 2016

Yousri M. El-Fattah*
Affiliation:
Faculté des Sciences, Rabat
* Postal address: Laboratoire d'Electronique et d'Étude des Systèmes Automatiques, Faculté des Sciences, B.P. 1014, Rabat, Morocco.

Abstract

The problem studied is that of controlling a finite Markov chain so as to maximize the long-run expected reward per unit time. The chain's transition probabilities depend on an unknown parameter taking values in a subset [a, b] of R^n. A control policy is defined as the probability of selecting each control action in each state of the chain. A Taylor-like expansion formula is derived for the expected reward in terms of policy variations. Based on that result, a recursive stochastic gradient algorithm is presented for adapting the control policy at consecutive times. The gradient depends on the estimated transition parameter, which is itself updated recursively using the gradient of the likelihood function. Convergence with probability 1 is proved for both the control and the estimation algorithms.
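
The objective described in the abstract can be illustrated with a small numerical sketch. The Python below is not the paper's algorithm: it assumes the transition probabilities are already known (so no parameter estimation takes place), restricts each state to two actions, and replaces the recursive stochastic gradient with a deterministic finite-difference gradient that is clipped to keep the policy inside (0, 1). All names (`stationary`, `average_reward`, `ascend`) and all numbers in `P` and `r` are hypothetical and chosen only for illustration.

```python
def stationary(P, iters=200):
    """Stationary distribution of an ergodic transition matrix (power iteration)."""
    n = len(P)
    pi = [1.0 / n] * n
    for _ in range(iters):
        pi = [sum(pi[i] * P[i][j] for i in range(n)) for j in range(n)]
    return pi


def average_reward(x, P, r):
    """Long-run expected reward per unit time under a randomized policy.

    x[s]       : probability of choosing action 0 in state s (two actions assumed)
    P[a][s][t] : probability of moving from state s to state t under action a
    r[s][a]    : one-step reward for taking action a in state s
    """
    n = len(x)
    probs = [(x[s], 1.0 - x[s]) for s in range(n)]
    # Transition matrix of the chain induced by the randomized policy.
    P_mix = [[sum(probs[s][a] * P[a][s][t] for a in range(2)) for t in range(n)]
             for s in range(n)]
    pi = stationary(P_mix)
    return sum(pi[s] * sum(probs[s][a] * r[s][a] for a in range(2))
               for s in range(n))


def ascend(x, P, r, steps=100, lr=0.5, h=1e-6, eps=0.01):
    """Clipped gradient ascent on the policy (finite-difference gradient)."""
    x = list(x)
    for _ in range(steps):
        base = average_reward(x, P, r)
        grad = []
        for s in range(len(x)):
            bumped = list(x)
            bumped[s] += h
            grad.append((average_reward(bumped, P, r) - base) / h)
        # Keep every action probability inside [eps, 1 - eps].
        x = [min(1.0 - eps, max(eps, x[s] + lr * grad[s])) for s in range(len(x))]
    return x


# Hypothetical two-state, two-action example.
P = [
    [[0.9, 0.1], [0.6, 0.4]],  # transitions under action 0
    [[0.2, 0.8], [0.3, 0.7]],  # transitions under action 1
]
r = [[1.0, 0.0], [0.0, 2.0]]   # r[s][a]

x_init = [0.5, 0.5]
x_final = ascend(x_init, P, r)
print("initial average reward:", average_reward(x_init, P, r))
print("final policy:", x_final)
print("final average reward:", average_reward(x_final, P, r))
```

In the setting of the abstract, the exact gradient would be replaced by a stochastic estimate built from the observed trajectory, and `P` would be replaced by the transition probabilities evaluated at the current parameter estimate.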

Type
Research Article
Copyright
Copyright © Applied Probability Trust 1981 
