Hostname: page-component-cd9895bd7-dk4vv Total loading time: 0 Render date: 2024-12-23T00:12:58.511Z Has data issue: false hasContentIssue false

Maximizing the probability of visiting a set infinitely often for a Markov decision process with Borel state and action spaces

Published online by Cambridge University Press:  22 August 2024

François Dufour*
Affiliation:
Bordeaux INP
Tomás Prieto-Rumeau*
Affiliation:
UNED
*
*Postal address: INRIA Team Astral, 200 avenue de la Vieille Tour, 33405 Talence cedex. Email address: [email protected]
**Postal address: Department of Statistics, Operations Research, and Numerical Calculus, Faculty of Science, UNED, calle Juan del Rosal 10, 28040 Madrid, Spain. Email address: [email protected]
Rights & Permissions [Opens in a new window]

Abstract

We consider a Markov control model with Borel state space, metric compact action space, and transitions assumed to have a density function with respect to some probability measure satisfying some continuity conditions. We study the optimization problem of maximizing the probability of visiting some subset of the state space infinitely often, and we show that there exists an optimal stationary Markov policy for this problem. We endow the set of stationary Markov policies and the family of strategic probability measures with adequate topologies (namely, the narrow topology for Young measures and the $ws^\infty$-topology, respectively) to obtain compactness and continuity properties, which allow us to obtain our main results.

Type
Original Article
Copyright
© The Author(s), 2024. Published by Cambridge University Press on behalf of Applied Probability Trust

1. Introduction

Markov decision processes (MDPs) are a family of controlled stochastic processes which are suitable for the modeling of several sequential decision-making problems under uncertainty. They arise in many applications, such as engineering, computer science, telecommunications, finance, etc. Their study started in the late 1950s and the early 1960s with the seminal works of Bellman, Blackwell, Howard, and Veinott, to name just a few authors; see, e.g., [Reference Bellman8, Reference Howard21]. Both the theoretical foundations and applications of MDPs continue to be areas of active research. From a theoretical point of view, there are several techniques which allow one to establish the existence of optimal controls, as well as to analyze the main features of the optimal value function. Roughly speaking, there exist two families of such techniques. The first is related to the dynamic programming principle established by Bellman, and is also known as the backward induction principle for the finite-horizon case. In this framework, the optimality equation states that the value function is the fixed point of an operator (the so-called Bellman operator) which incorporates the current cost plus an expected value related to future costs. This approach and its various extensions, such as the value iteration and policy iteration algorithms, are studied in detail in the references [Reference Bäuerle and Rieder6, Reference Bertsekas and Tsitsiklis9, Reference Dynkin and Yushkevich16, Reference Hernández-Lerma and Lasserre18Reference Hinderer20, Reference Puterman26, Reference Sennott32] (this is a non-exhaustive list). The second family of methods is based on an equivalent formulation of MDPs in a linear programming context. The key idea is to transform the original dynamic control problem into an infinite-dimensional static optimization problem over a space of finite measures. In this setting, the underlying variables of the linear program are given by the occupation measures of the controlled process, satisfying the so-called characteristic equation, and the value of the linear program is the integral of the one-step cost function with respect to an occupation measure. This last technique is particularly well adapted to problems with constraints; see the references [Reference Borkar10, Reference Hernández-Lerma and Lasserre18, Reference Piunovskiy24Reference Puterman26], among others, for a detailed presentation and analysis of this approach. Although different, these two approaches share a common characteristic: namely, they are well adapted to additive-type performance criteria, but they do not allow for the study of non-additive performance criteria.

It must be emphasized that the non-additive type of criterion has undergone significant development in recent years; see for example the risk-sensitive optimality criterion [Reference Bäuerle and Rieder7, Reference Cavazos-Cadena12, Reference Zhang34] and the so-called gambling houses [Reference Venel and Ziliotto33]. The distinctive feature of these non-additive criteria is that the criterion to be optimized cannot be decomposed as the sum of individual contributions (typically, the rewards earned at each single transition), nor is it linearly related to such sums (as for the classical long-run average-reward optimality criterion). Non-additive criteria introduce some nonlinearity in the objective function via, e.g., a utility function. For instance, in risk-sensitive models, the criterion to be maximized is the expected utility of the total rewards over a finite time horizon or the total discounted rewards over an infinite time horizon. Nonlinearity is introduced precisely by the utility function, which often takes the form of a power function or a logarithmic function. The usual technique for dealing with such problems is to develop an ad hoc dynamic-programming-like approach by introducing an operator which is nonlinear in the reward function. In gambling house problems, the expectation of a long-run average reward is maximized, making this again a nonlinear problem. Conditions under which this problem corresponds, in fact, to an average-reward MDP are studied. In summary, non-additive problems are usually tackled by relating them to other standard additive problems or by using operators resembling the Bellman principle. For the problem we are interested in, which is also of a non-additive nature, Schäl [Reference Schäl31] developed a ‘vanishing discount factor’ approach based on total-expected-reward MDPs (details are given in the next paragraph).

In this paper, we study another non-additive optimality criterion which aims at maximizing the probability of visiting a certain subset of the state space infinitely often. This criterion is also asymptotic; more precisely, the performance function cannot be approximated by finite-horizon criteria, since the distribution of the state process over a finite horizon does not allow one to characterize the whole history of the process (or at least its tails) and contains no relevant information about whether a subset of the state space has been reached infinitely often. It is important to emphasize that this optimality criterion has been very little studied in the literature. To the best of our knowledge, it was first studied by Schäl in [Reference Schäl31], who showed the existence of an optimal deterministic stationary policy in the context of an MDP with finite state space. More recently, in [Reference Dufour and Prieto-Rumeau15], this existence result has been generalized to models with countable state space. The approach proposed by Schäl in [Reference Schäl31] uses properties of continuity of matrix inversion and of fundamental matrices of finite-state Markov chains and is therefore restricted to the finite-state framework. The results obtained in [Reference Dufour and Prieto-Rumeau15] rely heavily on the fact that the state space is countable. It should be noted that in this particular context the space of deterministic stationary policies is compact with respect to pointwise convergence.

The objective of the present paper is to extend these results to a model with a general Borel state space and to show the existence of an optimal randomized stationary policy. This is a continuation of the results presented in [Reference Dufour and Prieto-Rumeau15], in the sense that we take up the idea of introducing an auxiliary control model as a means of analyzing the original control model. This auxiliary model is an MDP with total expected reward that, roughly speaking, can be studied by finding a suitable topology on the set of strategic probability measures to ensure that this set is compact and the expected reward functional is semicontinuous.

In relation to [Reference Dufour and Prieto-Rumeau15], an important difficulty in our paper is to show, in the context of a general state space, the compactness of the space of strategic measures induced by randomized stationary policies; this is unlike the case of a countable state space, for which the proof was straightforward. Our approach relies on the introduction of Young measures to model randomized stationary policies and thus endow this space with a topology. In this framework, the space of stationary strategic measures can be seen as the image of the space of Young measures under a certain map. The difficulty is to show the continuity of this map. Another delicate point is to provide a sufficient condition to ensure the continuity of the criterion for the auxiliary model that is written as a function on the space of strategic measures. In [Reference Dufour and Prieto-Rumeau15], a condition is formulated in terms of the entrance time to the set of recurrent states. This condition is based on the fact that the countable state space of a Markov chain can be split into two classes: recurrent and transient states. In the context of a general state space, this structural property is not that simple. To overcome this difficulty, we introduce a general condition imposing a drift condition by using standard Foster–Lyapunov criteria. Under these conditions, we are able to prove that the problem of maximizing the probability of visiting a set infinitely often has an optimal policy which is Markov stationary. It is worth stressing that the result on the compactness of strategic probability measures for stationary Markov policies (Theorem 2.1) is a new result and is interesting in itself, not just in the context of the specific control model in this paper.

The problem studied in this paper (that of maximizing the probability of visiting a subset of the state space infinitely often) can be applied in population dynamics problems in which the population is controlled so as to be driven to some target set or to avoid some set. Indeed, the symmetric problem of minimizing the probability of being absorbed by some subset of the state space can be interpreted as the problem of preventing the population from being trapped in some region of the state space. Research is currently in progress on a game-theoretic approach to this problem, in which two competing species try to drive each other out of some region of the state space by maximizing or minimizing the probabilities of remaining in that region from some time onward.

Notation. We now introduce some notation used throughout the paper. We will write $\mathbb{N}=\{0,1,2,\ldots\}$ . On a measurable space $(\boldsymbol{\Omega},\boldsymbol{\mathcal{F}})$ we will write $\boldsymbol{\mathcal{P}}(\boldsymbol{\Omega})$ for the set of probability measures, assuming that there is no risk of confusion regarding the $\sigma$ -algebra on $\boldsymbol{\Omega}$ . If $(\boldsymbol{\Omega'},\boldsymbol{\mathcal{F'}})$ is another measurable space, a stochastic kernel on $\boldsymbol{\Omega'}$ given $\boldsymbol{\Omega}$ is a mapping $Q\,:\,\boldsymbol{\Omega}\times\boldsymbol{\mathcal{F'}}\rightarrow[0, 1]$ such that $\omega\mapsto Q(B|\omega)$ is measurable on $(\boldsymbol{\Omega},\boldsymbol{\mathcal{F}})$ for every $B\in\boldsymbol{\mathcal{F}}'$ , and $B\mapsto Q(B|\omega)$ is in $\boldsymbol{\mathcal{P}}(\boldsymbol{\Omega'})$ for every $\omega\in\boldsymbol{\Omega}$ . We denote by $\boldsymbol{\mathcal{B}}(\boldsymbol{\Omega})$ the family of bounded measurable functions $f\,:\,\boldsymbol{\Omega}\rightarrow\mathbb{R}$ (here, $\mathbb{R}$ is endowed with its Borel $\sigma$ -algebra). If Q is a stochastic kernel on $\boldsymbol{\Omega'}$ given $\boldsymbol{\Omega}$ and f is a real-valued measurable function on $\boldsymbol{\Omega'}$ , allowed to take infinite values, we will denote by Qf the $\boldsymbol{\mathcal{F}}$ -measurable function defined by the following (provided that the integral is well defined):

\begin{align*}Qf(\omega)=\int_{\boldsymbol{\Omega'}} f(z)Q(dz|\omega)\quad \textrm{for}\ \omega\in\boldsymbol{\Omega}.\end{align*}

In general, for a product of measurable spaces we will always consider the product $\sigma$ -algebra. The indicator function of a set $C\in\boldsymbol{\mathcal{F}}$ is written $\textbf{I}_C$ .

Throughout this paper, any metric space $\textbf{S}$ will be endowed with its Borel $\sigma$ -algebra $\boldsymbol{\mathfrak{B}}(\textbf{S})$ . Also, when considering the product of a finite family of metric spaces, we will consider the product topology (which makes the product again a metric space). By $\boldsymbol{\mathcal{C}}(\textbf{S})$ we will denote the family of bounded continuous functions $f\,:\,\boldsymbol{\textbf{S}}\rightarrow\mathbb{R}$ . Let $(\boldsymbol{\Omega},\boldsymbol{\mathcal{F}})$ be a measurable space and $\textbf{S}$ a metric space. We say that $f\,:\, \boldsymbol{\Omega}\times \textbf{S}\rightarrow\mathbb{R}$ is a Carathéodory function if $f(\omega,\cdot)$ is continuous on $\textbf{S}$ for every $\omega\in \boldsymbol{\Omega}$ , and $f(\cdot,s)$ is an $\boldsymbol{\mathcal{F}}$ -measurable function for every $s\in\textbf{S}$ . The family of Carathéodory functions thus defined is denoted by ${\textit{Car}}(\boldsymbol{\Omega}\times \textbf{S})$ . The family of Carathéodory functions which, in addition, are bounded is denoted by ${\textit{Car}}_{b}(\boldsymbol{\Omega}\times \textbf{S})$ . When the metric space $\textbf{S}$ is separable, any $f\in{\textit{Car}}(\boldsymbol{\Omega}\times \textbf{S})$ is a jointly measurable function on $(\boldsymbol{\Omega}\times\textbf{S},\boldsymbol{\mathcal{F}}\otimes\boldsymbol{\mathfrak{B}}(\textbf{S}))$ . The Dirac probability measure $\delta_s\in\boldsymbol{\mathcal{P}}(\textbf{S})$ concentrated at some $s\in\textbf{S}$ is given by $B\mapsto \textbf{I}_B(s)$ for $B\in\boldsymbol{\mathfrak{B}}(\textbf{S})$ .

Given $\lambda\in\boldsymbol{\mathcal{P}}(\textbf{S})$ , let $L^1(\textbf{S},\boldsymbol{\mathfrak{B}}(\textbf{S}),\lambda)$ be the family of real-valued measurable functions on $\textbf{S}$ which are $\lambda$ -integrable (as usual, functions which are equal $\lambda$ -almost everywhere are identified). With the usual definition of the 1-norm, we have that $L^1(\textbf{S},\boldsymbol{\mathfrak{B}}(\textbf{S}),\lambda)$ is a Banach space and convergence in norm is

\begin{align*}f_n\stackrel{L^1}{\longrightarrow}f\quad\hbox{if and only if}\quad\int_{\textbf{S}} |f_n-f|\textrm{d}\lambda\rightarrow0.\end{align*}

The family of $\lambda$ -essentially bounded measurable functions is $L^\infty(\textbf{S},\boldsymbol{\mathfrak{B}}(\textbf{S}),\lambda)$ , and the essential supremum norm in $L^\infty$ is denoted by $\|\cdot\|$ . The weak $^{*}$ topology on the space $L^{\infty}(\textbf{S},\boldsymbol{\mathfrak{B}}(\textbf{S}),\lambda)$ , denoted by $\sigma(L^{\infty}(\textbf{S},\boldsymbol{\mathfrak{B}}(\textbf{S}),\lambda),L^{1}(\textbf{S},\boldsymbol{\mathfrak{B}}(\textbf{S}),\lambda))$ , is the coarsest topology for which the mappings

\begin{align*}f\mapsto \int_{\textbf{S}} fh\textrm{d}\lambda\quad\hbox{for $h\in L^{1}(\textbf{S},\boldsymbol{\mathfrak{B}}(\textbf{S}),\lambda)$}\end{align*}

are continuous. We will write $f_n\ {\stackrel{*}{\rightharpoonup}\ } f$ when

\begin{align*} \int_{\textbf{S}} f_nh\textrm{d}\lambda\rightarrow \int_{\textbf{S}} fh\textrm{d}\lambda\quad \textrm{for every}\ h\in L^{1}(\textbf{S},\boldsymbol{\mathfrak{B}}(\textbf{S}),\lambda).\end{align*}

The rest of the paper is organized as follows. In Section 2, we define the original control model $\mathcal{M}$ and state some important preliminary results, the proof of one of these results being postponed to Appendix A, at the end of the paper. A family of auxiliary MDPs $\mathcal{M}_{\beta}$ , parametrized by $0<\beta<1$ , is introduced in Section 3. The main results of the paper are given in Section 4.

2. Definition of the control model $\mathcal{M}$

Elements of the control model $\mathcal{M}$ .

The MDP under consideration consists of the following elements:

  • The state space $\textbf{X}$ and the action space $\textbf{A}$ are both Borel spaces endowed with their respective Borel $\sigma$ -algebras $\boldsymbol{\mathfrak{B}}(\textbf{X})$ and $\boldsymbol{\mathfrak{B}}(\textbf{A})$ .

  • For each $x\in\textbf{X}$ , the nonempty measurable set $\textbf{A}(x)\subseteq \textbf{A}$ stands for the set of available actions in state x. Let $\textbf{K}=\{(x,a)\in\textbf{X}\times\textbf{A}\,:\, a\in \textbf{A}(x)\}$ be the family of feasible state–action pairs. Assumption (A.1) below will ensure that $\textbf{K}\in\boldsymbol{\mathfrak{B}}(\textbf{X})\otimes\boldsymbol{\mathfrak{B}}(\textbf{A})$ .

  • The initial distribution of the system is the probability measure $\nu\in\boldsymbol{\mathcal{P}}(\textbf{X})$ , and the transitions of the system are given by a stochastic kernel Q on $\textbf{X}$ given $\textbf{K}$ .

  • Finally, regarding the performance criterion to be optimized (which we will define later), a nontrivial measurable subset of the state space $G\subseteq\textbf{X}$ is given.

Our main conditions on the control model $\mathcal{M}$ are stated next. In Assumption (A.2) we introduce a reference probability measure $\lambda\in\boldsymbol{\mathcal{P}}(\textbf{X})$ .

Assumption A.

  1. (A.1) The action set $\textbf{A}$ is compact, and the correspondence from $\textbf{X}$ to subsets of $\textbf{A}$ defined by $x\mapsto \textbf{A}(x)$ is weakly measurable with nonempty compact values (see Definition 18.1 in [Reference Aliprantis and Border1]).

  2. (A.2) There exist a reference probability measure $\lambda\in\boldsymbol{\mathcal{P}}(\textbf{X})$ and a measurable function $q\,:\, \textbf{K}\times\textbf{X}\rightarrow\mathbb{R}^+$ such that for every $D\in\boldsymbol{\mathfrak{B}}(\textbf{X})$ and every $(x,a)\in\textbf{K}$ ,

    (2.1) \begin{align}Q(D|x,a)=\int_D q(x,a,y)\lambda(dy).\end{align}
    Also, for every $x\in\textbf{X}$ we have the following continuity condition:
    \begin{align*}\lim_{k\rightarrow\infty} \int_{\textbf{X}} |q(x,b_{k},y)-q(x,b,y)| \lambda(dy)=0\quad\hbox{whenever $b_k\rightarrow b$ in $\textbf{A}(x)$}. \end{align*}
  3. (A.3) The initial distribution $\nu$ is absolutely continuous with respect to $\lambda$ in Assumption (A.2).

Remark 2.1.

  1. (a) Under Assumption (A.1), and as a consequence of Theorem 18.6 in [Reference Aliprantis and Border1], we have that $\textbf{K}$ is indeed a measurable subset of $\textbf{X}\times\textbf{A}$ .

  2. (b) The continuity condition in Assumption (A.2) can equivalently be written as

    \begin{align*}q(x,b_k,\cdot)\stackrel{L^1}{\longrightarrow} q(x,b,\cdot)\quad \textrm{for any}\ x\in\textbf{X}\ \textrm{when}\ b_k\rightarrow b\ \textrm{in}\ \textbf{A}(x).\end{align*}
  3. (c) The condition in Assumption (A.3) is not restrictive at all. Indeed, if it were not true that $\nu\ll\lambda$ , then we would consider the reference probability measure $\eta=\frac{1}{2}(\nu+\lambda)$ in Assumption (A.2), so that (2.1) would become

    \begin{align*}Q(D|x,a)=\int_D q(x,a,y)\frac{d\lambda}{d\eta}(y)\,\eta(dy)\end{align*}
    and the continuity condition would be satisfied as well.
  4. (d) The fact that $\textbf{X}$ is a Borel space will be used to ensure that the $ws^\infty$ -topology on the set of strategic measures (details will be given below) coincides with the weak topology—see [Reference Balder3, Reference Nowak23]—and that the set of strategic probability measures is compact with this topology. Otherwise, no topological properties of $\textbf{X}$ will be used.

Control policies. The space of admissible histories up to time n for the controlled process is denoted by $\textbf{H}_{n}$ for $n\in\mathbb{N}$ . It is defined recursively by $\textbf{H}_0=\textbf{X}$ and $\textbf{H}_n=\textbf{K}\times\textbf{H}_{n-1}$ for any $n\ge1$ , and it is endowed with the corresponding product $\sigma$ -algebra. A control policy $\pi$ is a sequence $\{\pi_n\}_{n\ge0}$ of stochastic kernels on $\textbf{A}$ given $\textbf{H}_n$ , denoted by $\pi_n(da|h_n)$ , such that for each $n\ge0$ and $h_n=(x_0,a_0,\ldots,x_n)\in\textbf{H}_n$ we have $\pi_n(\textbf{A}(x_n)|h_n)=1$ . The set of all policies is denoted by $\boldsymbol{\Pi}$ .

Dynamics of the control model. The canonical sample space of all possible state and actions is $\boldsymbol{\Omega}=(\textbf{X}\times\textbf{A})^\infty$ , and it is endowed with the product $\sigma$ -algebra $\boldsymbol{\mathcal{F}}$ . Thus, a generic element $\omega\in\boldsymbol{\Omega}$ consists of a sequence of the form $(x_0,a_0,x_1,a_1,\ldots)$ , where $x_{n}\in \textbf{X}$ and $a_{n}\in\textbf{A}$ for any $n\in\mathbb{N}$ . For $n\in\mathbb{N}$ , the projection functions $X_{n}$ and $A_{n}$ from $\boldsymbol{\Omega}$ to $\textbf{X}$ and $\textbf{A}$ respectively are defined by $X_{n}(\omega)= x_{n}$ and $A_{n}(\omega)= a_{n}$ for $\omega=(x_0,a_0,\ldots,x_n,a_n,\ldots)$ , and they are called the state and control variables at time n. The history process up to time n is denoted by $H_n=(X_0,A_0,\ldots,X_n)$ .

For any policy $\pi\in\boldsymbol{\Pi}$ , there exists a unique probability measure $\mathbb{P}^{\pi}_{\nu}$ on $(\boldsymbol{\Omega},\boldsymbol{\mathcal{F}})$ supported on $\textbf{K}^\infty$ (that is, $\mathbb{P}^{\pi}_{\nu}( \textbf{K}^\infty)=1)$ which satisfies the following conditions:

  1. (i) We have $\mathbb{P}^{\pi}_{\nu} \{X_0\in D\}=\nu(D)$ for any $D\in \boldsymbol{\mathfrak{B}}(\textbf{X})$ .

  2. (ii) For any $n\ge0$ and $C\in\boldsymbol{\mathfrak{B}}(\textbf{A})$ we have $\mathbb{P}^{\pi}_{\nu}\{ A_n\in C | H_n\} = \pi_n(C|H_n)$ .

  3. (iii) For any $n\ge0$ and $D\in\boldsymbol{\mathfrak{B}}(\textbf{X})$ we have $\mathbb{P}^{\pi}_{\nu}\{ X_{n+1}\in D| H_n,A_n\}=Q(D | X_n,A_n)$ .

The probability measure $\mathbb{P}^{\pi}_{\nu}$ is usually referred to as a strategic probability measure. The corresponding expectation operator is $\mathbb{E}^{\pi}_{\nu}$ . The set of all strategic probability measures $\{\mathbb{P}^{\pi}_{\nu}\}_{\pi\in\boldsymbol{\Pi}}$ is denoted by $\boldsymbol{\mathcal{S}}_{\nu}\subseteq\boldsymbol{\mathcal{P}}(\boldsymbol{\Omega})$ .

We will also denote by $\mathbb{P}^\pi_{\nu,n}$ the distribution of the random variable $H_n$ ; more formally, for any $n\ge0$ we define $\mathbb{P}^\pi_{\nu,n}$ as the pushforward probability measure on $(\textbf{X}\times\textbf{A})^n\times\textbf{X}$ given by $\mathbb{P}^\pi_{\nu}\circ H_n^{-1}$ . Consequently, we have $\mathbb{P}^\pi_{\nu,n}(\textbf{H}_n)=1$ ; note also that $\mathbb{P}^\pi_{\nu,0}=\nu$ .

Optimality criterion. The performance criterion that we want to maximize is the probability of visiting the set G infinitely often. Let us denote by $N_{G}\,:\,\boldsymbol{\Omega}\rightarrow\mathbb{N}\cup\{\infty\}$ the integer-valued process that counts the total number of visits of the state process to the set G; that is,

\begin{align*}N_G(\omega)=\sum_{n=0}^\infty \textbf{I}_{G}(X_n(\omega)).\end{align*}

For notational convenience, if $I\subseteq\mathbb{N}$ is a set of decision epochs, we let $N_G(I)=\sum_{n\in I} \textbf{I}_{G}(X_n)$ , which stands for the number of visits to G at times in I.

For the initial distribution $\nu$ , the value function of the control model $\mathcal{M}$ is given by

(2.2) \begin{equation}V^*(\nu)=\sup_{\pi\in\boldsymbol{\Pi}} \mathbb{P}^{\pi}_{\nu} \{N_G=\infty\}.\end{equation}

A policy attaining the above supremum is said to be optimal. In the sequel, the model $\mathcal{M}$ introduced in this section will be referred to as the primary or original control problem.

Stationary Markov policies. Let $\boldsymbol{\Pi}_{s}$ be the set of stochastic kernels $\pi$ on $\textbf{A}$ given $\textbf{X}$ such that $\pi(\textbf{A}(x)|x)=1$ for each $x\in\textbf{X}$ . We have that $\boldsymbol{\Pi}_{s}$ is nonempty as a consequence of our hypotheses. Indeed, from the Kuratowski–Ryll-Nardzewski selection theorem [Reference Aliprantis and Border1, Theorem 18.13] we derive the existence of a measurable selector for the correspondence $x\mapsto\textbf{A}(x)$ ; that is, there exists a measurable $f\,:\,\textbf{X}\rightarrow\textbf{A}$ such that $f(x)\in\textbf{A}(x)$ for every $x\in\textbf{X}$ . Then we have that $\pi(da|x)=\delta_{f(x)}(da)$ indeed belongs to $\boldsymbol{\Pi}_{s}$ .

We say that $\{\pi_n\}_{n\in\mathbb{N}}\in\boldsymbol{\Pi}$ is a stationary Markov policy if there is some $\pi\in\boldsymbol{\Pi}_{s}$ such that

\begin{align*}\pi_n(\cdot|x_0,a_0,b_0,\ldots,x_n)=\pi(\cdot|x_{n})\quad\hbox{for all $n\ge0$ and $h_n=(x_0,a_0,\ldots,x_n)\in\textbf{H}_n$}.\end{align*}

Without risk of confusion, we will identify the set of all stationary Markov policies with $\boldsymbol{\Pi}_{s}$ itself. The set of all strategic probability measures for the stationary Markov policies $\{\mathbb{P}^{\pi}_{\nu}\}_{\pi\in\boldsymbol{\Pi}_s}$ is denoted by $\boldsymbol{\mathcal{S}}^s_{\nu}$ .

Lemma 2.1. If $\pi,\pi'\in\boldsymbol{\Pi}_s$ coincide $\lambda$ -almost everywhere (that is, $\pi(da|x)=\pi'(da|x)$ for all x in a set of $\lambda$ -probability one), then $\mathbb{P}^{\pi}_{\nu}=\mathbb{P}^{\pi'}_{\nu}$ .

Proof. Suppose that $Y\in\boldsymbol{\mathfrak{B}}(\textbf{X})$ with $\lambda(Y)=1$ is such that $\pi(da|x)=\pi'(da|x)$ for every $x\in Y$ . We will prove by induction on $n\ge0$ that $(X_0,A_0,\ldots,X_n,A_n)$ has the same distribution under $\mathbb{P}^{\pi}_{\nu}$ and under $\mathbb{P}^{\pi'}_{\nu}$ . Letting $n=0$ , and given $D\in\boldsymbol{\mathfrak{B}}(\textbf{X})$ and $C\in\boldsymbol{\mathfrak{B}}(\textbf{A})$ , we have

\begin{eqnarray*}\mathbb{P}^\pi_{\nu}\{X_0\in D,A_0\in C\} &=& \, \int_D \pi(C|x)\frac{d\nu}{d\lambda}(x)\, \lambda(dx)\ =\ \int_{D\cap Y} \pi(C|x)\frac{d\nu}{d\lambda}(x)\, \lambda(dx) \\&=& \, \int_{D} \pi'(C|x)\frac{d\nu}{d\lambda}(x)\, \lambda(dx) \ =\ \mathbb{P}^{\pi'}_{\nu}\{X_0\in D,A_0\in C\}.\end{eqnarray*}

Assuming the result is true for $n\ge0$ , and again letting $D\in\boldsymbol{\mathfrak{B}}(\textbf{X})$ and $C\in\boldsymbol{\mathfrak{B}}(\textbf{A})$ , we have

\begin{align*}\mathbb{P}^\pi_{\nu}\{ X_{n+1}\in D, A_{n+1}\in C\mid X_n=x,A_n=a\}= \int_D \pi(C|y) q(x,a,y)\lambda(dy), \end{align*}

which, by the same argument, is equal to $\mathbb{P}^{\pi'}_{\nu}\{ X_{n+1}\in D, A_{n+1}\in C\mid X_n=x,A_n=a\}$ . The conditional distributions and the distribution of $(X_n,A_n)$ being the same under $\pi$ and $\pi'$ , we conclude the result.

Remark 2.2. It turns out that two stationary Markov policies under the conditions of Lemma 2.1, i.e., which coincide $\lambda$ -almost surely, are indistinguishable, since they yield the same strategic probability measure; in addition, they have the same performance (recall (2.2)), since the criterion to be maximized depends on the strategic probability measures.

Young measures. Let $\boldsymbol{\mathcal{R}}\supseteq\boldsymbol{\Pi}_s$ be the set of stochastic kernels $\gamma$ on $\textbf{A}$ given $\textbf{X}$ such that $\gamma(\textbf{A}(x)|x)=1$ for $\lambda$ -almost every $x\in\textbf{X}$ . On $\boldsymbol{\mathcal{R}}$ we consider the equivalence relation $\approx$ defined as $\gamma\approx\gamma'$ when $\gamma(\cdot|x)=\gamma'(\cdot|x)$ for $\lambda$ -almost all $x\in\textbf{X}$ . The corresponding quotient space $\boldsymbol{\mathcal{R}}\,/\approx$ , denoted by $\boldsymbol{\mathcal{Y}}$ , is the so-called set of Young measures. As a consequence of Lemma 2.1 and Remark 2.2, there is no loss of generality in identifying the set of stationary Markov policies $\boldsymbol{\Pi}_s$ with $\boldsymbol{\mathcal{Y}}$ , since all the elements in the same $\approx$ -equivalence class have the same strategic probability measure. Thus we will indiscriminately use the symbols $\boldsymbol{\Pi}_s$ and $\boldsymbol{\mathcal{Y}}$ to refer to both the stationary Markov policies and the set of Young measures. Hence, we will write

\begin{align*}\boldsymbol{\mathcal{S}}^s_{\nu}=\{\mathbb{P}^\pi_{\nu}\}_{\pi\in\boldsymbol{\mathcal{Y}}}\subseteq\boldsymbol{\mathcal{S}}_{\nu}.\end{align*}

The set of Young measures $\boldsymbol{\mathcal{Y}}$ will be equipped with the narrow (stable) topology. This is the coarsest topology which makes the following mappings continuous:

\begin{align*}\pi\mapsto \int_{\textbf{X}}\int_{\textbf{A}}f(x,a) \pi (da|x)\lambda(dx),\end{align*}

for any $f\in {\textit{Car}}(\textbf{X}\times\textbf{A})$ such that for some $\Phi$ in $L^1(\textbf{X},\boldsymbol{\mathfrak{B}}(\textbf{X}),\lambda)$ we have $|f(x,a)| \leq \Phi(x)$ for every $(x,a)\in\textbf{X}\times\textbf{A}$ ; see [Reference Balder2, Theorem 2.2]. By [Reference Balder4, Lemma 1], the set $\boldsymbol{\mathcal{Y}}$ endowed with the narrow topology becomes a compact metric space. We state this result in the next lemma.

Lemma 2.2. Under Assumption A, the sets $\boldsymbol{\mathcal{Y}}$ of Young measures and $\boldsymbol{\Pi}_s$ of stationary Markov policies are compact metric spaces when endowed with the narrow topology.

The $ws^\infty$ -topology. Finally, we introduce the so-called $ws^\infty$ -topology on $\boldsymbol{\mathcal{S}}_{\nu}$ . Let

\begin{align*}\boldsymbol{\mathcal{U}}_{0}=\boldsymbol{\mathcal{B}}(\textbf{X})\quad\hbox{and}\quad \boldsymbol{\mathcal{U}}_{n}={\textit{Car}}_b(\textbf{X}^{n+1}\times\textbf{A}^n)\quad\hbox{for $n\ge1$},\end{align*}

where the arguments of $g\in\boldsymbol{\mathcal{U}}_n$ will be sorted as $(x_0,a_0,\ldots,x_{n-1},a_{n-1},x_{n})$ . The $ws^\infty$ -topology on the set $\boldsymbol{\mathcal{S}}_{\nu}$ —see [Reference Schäl28]—is the smallest topology that makes the mappings

\begin{align*}\mathbb{P} \mapsto \int_{\boldsymbol{\Omega}} g(H_n) \textrm{d} \mathbb{P} \end{align*}

continuous for any $n\ge 0$ and $g\in\boldsymbol{\mathcal{U}}_{n}$ . Therefore, a sequence $\{\mathbb{P}_k\}_{k\in\mathbb{N}}\subseteq\boldsymbol{\mathcal{S}}_{\nu}$ converges to $\mathbb{P}\in\boldsymbol{\mathcal{S}}_{\nu}$ for the $ws^\infty$ -topology, denoted by $\mathbb{P}_k\Rightarrow \mathbb{P}$ , if and only if

(2.3) \begin{equation}\lim_{k\rightarrow\infty} \int_{\boldsymbol{\Omega}} g(H_n) \textrm{d} \mathbb{P}_k=\int_{\boldsymbol{\Omega}} g(H_n) \textrm{d} \mathbb{P}\end{equation}

for any $n\in\mathbb{N}$ and $g\in\boldsymbol{\mathcal{U}}_{n}$ .

The proof of Lemmas 2.3 and Theorem 2.1 below are fairly involved; they are presented in Appendix A. Note that Lemma 2.3 simply states that the mapping that associates to a Young measure its corresponding strategic probability measure is continuous.

Lemma 2.3. Suppose that Assumption A holds. If $\{\pi_k\}_{k\in\mathbb{N}}$ and $\pi$ in $\boldsymbol{\mathcal{Y}}$ are such that $\pi_k\rightarrow\pi$ , then $\mathbb{P}^{\pi_k}_{\nu}\Rightarrow\mathbb{P}^\pi_{\nu}$ .

Theorem 2.1. Under Assumption A, the sets $\boldsymbol{\mathcal{S}}_{\nu}$ and $\boldsymbol{\mathcal{S}}^s_{\nu}$ are compact metric spaces when endowed with the $ws^\infty$ -topology.

It is worth mentioning that compactness of $\boldsymbol{\mathcal{S}}_{\nu}$ is a known result (see [Reference Balder3] or [Reference Schäl28]), whereas the new result that we establish here is compactness of $\boldsymbol{\mathcal{S}}^s_{\nu}$ .

3. The control model $\mathcal{M}_{\beta}$

By its definition, the performance criterion introduced above is of asymptotic type, and it cannot be written in an additive form. Therefore, it appears difficult to study this type of control problem directly by using the traditional techniques, such as the dynamic programming or linear programming approach. To address this difficulty, we introduced in [Reference Dufour and Prieto-Rumeau15], in the particular context of a countable state space, an auxiliary model $\mathcal{M}_{\beta}$ as a tool for analyzing the original control problem.

Elements of the control model $\mathcal{M}_{\beta}$ . Given $0<\beta<1$ , the auxiliary control model $\mathcal{M}_{\beta}$ , which is based on the primary control model $\mathcal{M}$ , is a ‘total-expected-reward’ MDP defined as follows:

  • The state and action spaces are augmented with isolated points,

    \begin{align*}\hat{\textbf{X}}=\textbf{X}\cup\{\Delta\}\quad\hbox{and}\quad \hat{\textbf{A}}=\textbf{A}\cup\{a_{\Delta}\},\end{align*}
    so that $\hat{\textbf{X}}$ and $\hat{\textbf{A}}$ are again Borel spaces.
  • The set of available actions when the state variable is $x\in\hat{\textbf{X}}$ is the set $\hat{\textbf{A}}(x)\subseteq\hat{\textbf{A}}$ , where

    \begin{align*}\hat{\textbf{A}}(x)=\textbf{A}(x)\ \ \hbox{for $x\in\textbf{X}$}\quad\hbox{and}\quad\hat{\textbf{A}}(\Delta)=\{a_{\Delta}\}.\end{align*}
    Let $\hat{\textbf{K}}\subseteq\hat{\textbf{X}}\times\hat{\textbf{A}}$ be the set of feasible state–action pairs; hence $\hat{\textbf{K}}=\textbf{K}\cup\{(\Delta,a_{\Delta})\}$ .
  • The dynamics of this model is governed by the stochastic kernel $Q_{\beta}$ defined by

    \begin{align*}Q_{\beta}(\Gamma | x,a)=\ & \textbf{I}_{G}(x) \big[ \beta Q(\Gamma\cap\textbf{X} | x,a) + (1-\beta)\delta_{\Delta}(\Gamma) \big] \\ &+ \textbf{I}_{\textbf{X}-G}(x) Q(\Gamma\cap\textbf{X} | x,a) + \textbf{I}_{\{\Delta\}}(x) \delta_{\Delta}(\Gamma)\end{align*}
    for any $\Gamma\in \boldsymbol{\mathfrak{B}}(\hat{\textbf{X}})$ and $(x,a)\in\hat{\textbf{K}}$ .

    The initial probability distribution is the same as for $\mathcal{M}$ , that is, $\nu\in\boldsymbol{\mathcal{P}}(\textbf{X})$ . With a slight abuse of notation, we will also denote by $\nu$ the measure $\nu(\cdot\cap\textbf{X})$ on $\hat{\textbf{X}}$ .

  • The reward function is $\textbf{I}_{G}(x)$ for $(x,a)\in\hat{\textbf{K}}$ , where $G\subseteq\textbf{X}$ is the measurable set given in the definition of $\mathcal{M}$ .

Hence, the model $\mathcal{M}_{\beta}$ is indeed a standard total-expected-reward MDP parametrized by $0<\beta< 1$ . The transitions generated by $Q_{\beta}$ can be interpreted very simply as follows. If the state process is in $\textbf{X}-G$ , then the transition is made according to Q, the stochastic kernel of the original model. If the state process is in G, then either the transition is made according to the original model $\mathcal{M}$ (with probability $\beta$ ), or the state process moves to $\Delta$ (with probability $1-\beta$ ). Finally, it is easy to see that $\Delta$ is an absorbing state under $a_{\Delta}$ , the unique action available at $\Delta$ . However, it should be observed that $\mathcal{M}_{\beta}$ is not necessarily an absorbing MDP in the terminology of [Reference Feinberg and Rothblum17, Section 2], because absorption in $\Delta$ may not occur in a finite expected time.

The definition of $\mathcal{M}_{\beta}$ is somewhat inspired by the well-known equivalent formulation of a discounted MDP, with discount factor $\beta$ , as a total-reward MDP in which the transition probabilities are multiplied by the discount factor $\beta$ , while $1-\beta$ is the probability of killing the process at each transition; see, e.g., [Reference Feinberg and Rothblum17, p. 132]. In our case, however, the $\beta$ factor is incorporated only when the process is in G.

Control policies. The construction and definition of the control policies for $\mathcal{M}_{\beta}$ is very similar to what was presented for the original model in Section 2. Therefore, we will just present the key elements and will skip some details. Letting $\hat{\textbf{H}}_0=\hat{\textbf{X}}$ , for $n\ge1$ we have

\begin{align*}\hat{\textbf{H}}_n=\textbf{H}_n\cup\bigcup_{j=0}^n\big( \textbf{K}^{j}\times \{ (\Delta,a_{\Delta})\}^{n-j}\times\{\Delta\}\big),\end{align*}

which includes the histories of the original control model, plus those paths reaching $\Delta$ at time $0\le j\le n$ and remaining in $\Delta$ thereafter. The set of all admissible histories is given by

(3.1) \begin{equation}\textbf{K}^{\infty}\cup\bigcup_{n=0}^\infty\big( \textbf{K}^{n}\times \{ (\Delta,a_{\Delta})\}^{\infty}\big).\end{equation}

The family of control policies for $\mathcal{M}_{\beta}$ is denoted by $\hat{\boldsymbol{\Pi}}$ , and the family of stationary Markov policies is $\hat{\boldsymbol{\Pi}}_{s}$ . These are defined similarly to those of $\mathcal{M}$ , where we replace $\textbf{X}$ , $\textbf{A}$ , and $\textbf{H}_{n}$ with $\hat{\textbf{X}}$ , $\hat{\textbf{A}}$ , and $\hat{\textbf{H}}_{n}$ , respectively.

Dynamics of the control model. Let $\hat{\boldsymbol{\Omega}}=(\hat{\textbf{X}}\times\hat{\textbf{A}})^\infty$ be the state–action canonical sample space endowed with its product $\sigma$ -algebra $\hat{\boldsymbol{\mathcal{F}}}$ . The state, action, and history process projections are denoted by $\hat{X}_n$ , $\hat{A}_n$ , and $\hat{H}_n$ , and they are defined as in Section 2. The counting processes associated to the visits of the state process to the set G are $\hat{N}_G=\sum_{n=0}^\infty \textbf{I}_{G}(\hat{X}_n)$ and $\hat{N}_G(I)=\sum_{n\in I} \textbf{I}_{G}(\hat{X}_k)$ for $I\subseteq\mathbb{N}$ .

Recall that we were considering a reference probability measure $\lambda\in\boldsymbol{\mathcal{P}}(\textbf{X})$ for the control model $\mathcal{M}$ . For the auxiliary control model, we will use the reference probability measure $\hat\lambda\in\boldsymbol{\mathcal{P}}(\hat{\textbf{X}})$ defined by

\begin{align*}\hat{\lambda}(\Gamma)=\frac{1}{2}\big(\lambda(\Gamma\cap\textbf{X})+\delta_{\Delta}(\Gamma)\big)\quad\hbox{for $\Gamma\in\boldsymbol{\mathfrak{B}}(\hat{\textbf{X}})$}.\end{align*}

Given $\hat{\pi}\in\hat{\boldsymbol{\Pi}}$ and the initial distribution $\nu$ , there exists a unique probability measure $\mathbb{P}^{\beta,\hat{\pi}}_{\nu}$ on $(\hat{\boldsymbol{\Omega}},\hat{\boldsymbol{\mathcal{F}}})$ supported on the set (3.1) of histories that models the dynamics of $\mathcal{M}_{\beta}$ . The corresponding expectation operator is denoted by $ \mathbb{E}^{\beta,\hat{\pi}}_{\nu}$ .

Optimality criterion. For the model $\mathcal{M}_{\beta}$ , the objective is the maximization of the total-expected-reward criterion given by the expected number of visits to the set G: for any $\hat{\pi}\in\hat{\boldsymbol{\Pi}}$ ,

\begin{align*} \mathbb{E}^{\beta,\hat{\pi}}_{\nu} \Big[\sum_{n=0}^\infty \textbf{I}_{G}(\hat{X}_n) \Big]= \mathbb{E}^{\beta,\hat{\pi}}_{\nu} [\hat{N}_G],\end{align*}

and we let

\begin{align*}V^*_{\beta}(\nu)=\sup_{\hat{\pi}\in\hat{\boldsymbol{\Pi}}} \mathbb{E}^{\beta,\hat{\pi}}_{\nu} [\hat{N}_G]\end{align*}

be the value function. A policy $\pi^*\in\hat{\boldsymbol{\Pi}}$ satisfying $\mathbb{E}^{\beta,\pi^{*}}_{\nu} [\hat{N}_G]=V^*_{\beta}(\nu)$ is said to be optimal.

Remark 3.1. Given $\pi=\{\pi_n\}_{n\in\mathbb{N}}$ in $\boldsymbol{\Pi}$ , we can define $\pi^{\Delta}=\{\pi^{\Delta}_n\}_{n\in\mathbb{N}}$ in $\hat{\boldsymbol{\Pi}}$ by setting $\pi^{\Delta}_n(\cdot|h_n)=\pi_n(\cdot|h_n)$ whenever $h_n\in\textbf{H}_n$ and $\pi^{\Delta}_n(\cdot|h_n)=\delta_{a_{\Delta}}(\cdot)$ for $h_n=(x_0,a_0,\ldots,\Delta)\in\hat{\textbf{H}}_n-\textbf{H}_n$ . Conversely, given a control policy $\hat{\pi}=\{\hat{\pi}_n\}_{n\in\mathbb{N}}$ in $\hat{\boldsymbol{\Pi}}$ , we can restrict it to the sample paths in $\textbf{H}_n$ to define a control policy $\hat{\pi}^{|\textbf{X}}=\{\hat{\pi}^{|\textbf{X}}_n\}_{n\in\mathbb{N}}$ in ${\boldsymbol{\Pi}}$ given by $\hat{\pi}^{|\textbf{X}}(\cdot|h_n)=\hat\pi_n(\cdot|h_n)$ when $h_n\in\textbf{H}_n$ .

Hence, there is a bijection between $\hat{\boldsymbol{\Pi}}$ and $\boldsymbol{\Pi}$ . Note that this establishes a bijection between $\boldsymbol{\Pi}_s$ and $\hat{\boldsymbol{\Pi}}_{s}$ as well.

For a complete overview of techniques for solving total-expected-reward MDPs, we refer to [Reference Hernández-Lerma and Lasserre19, Chapter 9]. In that reference, the value iteration algorithm, the policy iteration algorithm, and the linear programming approach are discussed. Other related results on value iteration are described in Theorems 3.5 and 3.7 in [Reference Dufour and Prieto-Rumeau15], and the linear programming (or convex analytic) approach is also studied in [Reference Dufour, Horiguchi and Piunovskiy13, Reference Dufour and Piunovskiy14].

The relation between $\mathcal{M}$ and $\mathcal{M}_{\beta}$ . Next we state a result that allows us to establish a correspondence between the performance functions of the two models $\mathcal{M}$ and $\mathcal{M}_{\beta}$ . Roughly, this will imply that for any pair of control policies on correspondence (recall Remark 3.1), the performance criterion of the model $\mathcal{M}_{\beta}$ multiplied by $1-\beta$ will converge, as the parameter $\beta$ tends towards 1, to the performance criterion of the original model $\mathcal{M}$ . This very important fact establishes the link from $\mathcal{M}$ to a criterion of an additive nature, for which it will be possible to use standard methods of analysis. In our paper, we use the so-called direct approach, along the lines developed in [Reference Balder3, Reference Balder5, Reference Nowak23, Reference Schäl28, Reference Schäl29]; roughly speaking, this consists in finding a suitable topology on the set of strategic probability measures to ensure that it is compact and that the reward functional is semicontinuous.

The next result is proved in [Reference Dufour and Prieto-Rumeau15] in the context of a countable state space. It can easily be generalized to the framework of our paper, namely, a general state space of Borel type. In order to avoid unnecessary repetition we will omit the proof here; we refer the reader to Section 3 of [Reference Dufour and Prieto-Rumeau15]. In the sequel we make the convention that $\beta^\infty=0$ for $\beta\in(0, 1)$ .

Proposition 3.1. For every $0<\beta<1$ , $\pi\in\boldsymbol{\Pi}$ , and $n\in\mathbb{N}$ we have

\begin{align*}\frac{1}{1-\beta}\cdot \mathbb{E}^{\pi}_{\nu} [1-\beta^{N_G([0,n])}]=\mathbb{E}^{\beta,\pi^{\Delta}}_{\nu}[\hat{N}_G([0,n])]\quad and \quad\frac{1}{1-\beta}\cdot \mathbb{E}^{\pi}_{\nu} [1-\beta^{N_G}]=\mathbb{E}^{\beta,\pi^{\Delta}}_{\nu} [\hat{N}_G].\end{align*}

Remark 3.2.

  1. (a) Since $\hat{\boldsymbol{\Pi}}=\big\{ \pi^{\Delta} \,:\, \pi\in\boldsymbol{\Pi}\big\}$ , for every $0<\beta<1$ , $\hat{\pi}\in\hat{\boldsymbol{\Pi}}$ , and $n\in\mathbb{N}$ we have

    \begin{align*}\mathbb{E}^{\beta,\hat{\pi}}_{\nu} [\hat{N}_G([0,n])]=\frac{1}{1-\beta}\cdot \mathbb{E}^{\hat{\pi}^{|\textbf{X}}}_{\nu} [1-\beta^{N_G([0,n])}]\end{align*}
    and
    \begin{align*}\mathbb{E}^{\beta,\hat{\pi}}_{\nu} [\hat{N}_G]=\frac{1}{1-\beta}\cdot \mathbb{E}^{\hat{\pi}^{|\textbf{X}}}_{\nu} [1-\beta^{N_G}].\end{align*}
  2. (b) Since $1-\beta^{N_G}$ decreases to $\textbf{I}_{\{N_{G}=\infty\}}$ as $\beta$ increases to 1, by dominated convergence we have for any $\pi\in\boldsymbol{\Pi}$

    \begin{align*}\lim_{\beta\uparrow1} (1-\beta) \mathbb{E}^{\beta,\pi^{\Delta}}_{\nu} [\hat{N}_G]=\lim_{\beta\uparrow1} \mathbb{E}^{\pi}_{\nu}[1-\beta^{N_G}]=\mathbb{P}^{\pi}_{\nu}\{N_{G}=\infty\}.\end{align*}
    So the performance criterion of $\mathcal{M}_{\beta}$ multiplied by $1-\beta$ converges as $\beta\uparrow1$ to that of $\mathcal{M}$ for each single control policy.

Assumptions on $\mathcal{M}_{\beta}$ . We define the time of entrance of the state process $\{\hat{X}_n\}_{n\in\mathbb{N}}$ into a subset $\mathcal{C}\in\boldsymbol{\mathfrak{B}}(\hat{\textbf{X}})$ of the state space as the random variable $\tau_{\mathcal{C}}$ defined on $\hat{\boldsymbol{\Omega}}$ by

\begin{align*}\tau_{\mathcal{C}}(\omega)=\min\{n\in\mathbb{N}\,:\, \hat{X}_n(\omega)\in\mathcal{C}\},\end{align*}

with the usual convention that the minimum over the empty set is $\infty$ .

For a stationary Markov policy $\pi\in\boldsymbol{\Pi}_s$ , we define the stochastic kernel $Q_{\pi}$ on $\textbf{X}$ given $\textbf{X}$ as

\begin{align*}Q_{\pi}(\Gamma|x) = \int_{\textbf{X}} Q(\Gamma|x,a)\pi(da|x)\quad\hbox{for $x\in\textbf{X}$ and $\Gamma\in\boldsymbol{\mathfrak{B}}(\textbf{X})$}.\end{align*}

For $n\ge1$ , we denote by $Q^n_{\pi}$ the nth composition of $Q_{\pi}$ with itself, and we make the convention that $Q^0_{\pi}(\cdot|x)=\delta_x(\cdot)$ . Similarly, for a stationary Markov policy $\hat{\pi}\in\hat{\boldsymbol{\Pi}}_s$ of the auxiliary control model, we define the stochastic kernel $Q_{\beta,\hat{\pi}}$ on $\hat{\textbf{X}}$ given $\hat{\textbf{X}}$ and $Q^n_{\beta,\hat{\pi}}$ for $n\ge0$ .

Assumption B. For each $0<\beta<1$ and $\hat{\pi}\in\hat{\boldsymbol{\Pi}}_{s}$ , there exists a measurable set $\mathcal{C}^{\beta,\hat{\pi}}\subseteq \hat{\textbf{X}}$ satisfying $\hat{\lambda}(G\cap \mathcal{C}^{\beta,\hat{\pi}})=0$ and $Q_{\beta,\hat{\pi}}(\mathcal{C}^{\beta,\hat{\pi}} |x)=1$ for $\hat{\lambda}$ -almost all $x\in\mathcal{C}^{\beta,\hat{\pi}}$ . In addition, there exists a finite constant $K_{\beta}$ such that for every $\hat\pi\in\hat{\boldsymbol{\Pi}}_s$ ,

(3.2) \begin{align}\mathbb{E}^{\beta,\hat{\pi}}_{x}[\tau_{\mathcal{C}^{\beta,\hat{\pi}}}] \leq K_{\beta}\quad\hbox{for $\hat{\lambda}$-almost all $x\in\hat{\textbf{X}}-\mathcal{C}^{\beta,\hat{\pi}}$.} \end{align}

Observe that, under Assumption B, the control model $\mathcal{M}_{\beta}$ is not necessarily an absorbing MDP in the terminology of [Reference Feinberg and Rothblum17, Section 2], since the set $\mathcal{C}^{\beta,\hat{\pi}}$ depends on the control policy. The stability Assumption B can be informally and very loosely explained as follows. Given any fixed $0<\beta<1$ , for each stationary Markov policy $\hat\pi\in\hat{\boldsymbol{\Pi}}_s$ there exists a set $\mathcal{C}^{\beta,\hat{\pi}}$ which is closed (not in the topological sense, but in the terminology of Markov chains) and disjoint from G, and which is reached in a bounded expected time when starting from outside it (formally, all these statements hold almost surely with respect to the reference measure $\hat\lambda$ ).

We now state some consequences of Assumption B.

Lemma 3.1. Let Assumption B be satisfied. Then for any $x\in\hat{\textbf{X}}$ , $\hat{\pi}\in\hat{\boldsymbol{\Pi}}_{s}$ , and $n\ge1$ we have $Q^n_{\beta,\hat{\pi}}(\cdot|x)\ll \hat{\lambda}$ . In particular, the distribution of $\hat{X}_{n}$ under $\mathbb{P}^{\beta,\hat{\pi}}_{\nu}$ is absolutely continuous with respect to $\hat{\lambda}$ for any $n\in\mathbb{N}$ .

Proof. Indeed, if $\Gamma\in\boldsymbol{\mathfrak{B}}(\hat{\textbf{X}})$ is such that $\hat\lambda(\Gamma)=0$ , then necessarily $\Gamma\subseteq\textbf{X}$ with $\lambda(\Gamma)=0$ . By definition of the stochastic kernel $Q_{\beta}$ , and recalling (2.1), it is then clear that $Q_{\beta,\hat\pi}(\Gamma|x)=0$ for every $x\in\hat{\textbf{X}}$ . Now, supposing again that $\Gamma\in\boldsymbol{\mathfrak{B}}(\hat{\textbf{X}})$ is such that $\hat\lambda(\Gamma)=0$ , and letting $n\ge1$ and $x\in\textbf{X}$ , we have

\begin{align*}Q_{\beta,\hat\pi}^{n+1}(\Gamma|x)=\int_{\hat{\textbf{X}}} Q_{\beta,\hat\pi}(\Gamma|y) Q^n_{\beta,\hat\pi}(dy|x),\end{align*}

which indeed equals 0 because $Q_{\beta,\hat\pi}(\Gamma|y)=0$ .

The second statement is a straightforward consequence of the above results except for the case $n=0$ . In that case, however, the distribution of $\hat{X}_0$ is $\nu\ll\lambda$ .

The fact that $\mathcal{C}^{\beta,\hat\pi}$ is closed $\hat\lambda$ -almost surely under $\hat\pi$ is extended to further transitions.

Lemma 3.2. Suppose that Assumption B holds. Given arbitrary $\hat{\pi}\in\hat{\boldsymbol{\Pi}}_{s}$ and $n\ge1$ , we have that $Q^{n}_{\beta,\hat{\pi}}(\mathcal{C}^{\beta,\hat{\pi}}|x)=1$ for $\hat{\lambda}$ -almost all $x\in\mathcal{C}^{\beta,\hat{\pi}}$ .

Proof. Let us prove the result by induction. By Assumption B, it holds for $n=1$ . Now assume that the claim holds for some $n\ge1$ . Let $C_n\subseteq \mathcal{C}^{\beta,\hat\pi}$ , with $\hat\lambda(\mathcal{C}^{\beta,\hat\pi}-C_n)=0$ , be the set of $x\in\hat{\textbf{X}}$ for which $Q_{\beta,\hat\pi}(\mathcal{C}^{\beta,\hat\pi}|x)=1$ and $Q^n_{\beta,\hat\pi}(\mathcal{C}^{\beta,\hat\pi}|x)=1$ . If $x\in C_n$ we have

\begin{align*}Q^{n+1}_{\beta,\hat{\pi}}(\mathcal{C}^{\beta,\hat{\pi}}|x)\ge\int_{C_n} Q^{n}_{\beta,\hat{\pi}}(\mathcal{C}^{\beta,\hat{\pi}}|y) Q_{\beta,\hat{\pi}}(dy|x)=Q_{\beta,\hat{\pi}}(C_n|x).\end{align*}

Since $Q_{\beta,\hat\pi}(\cdot|x)\ll \hat\lambda$ (recall Lemma 3.1), it follows that

\begin{align*}1=Q_{\beta,\hat\pi}(C_n|x)+Q_{\beta,\hat\pi}(\mathcal{C}^{\beta,\hat\pi}-C_n|x)=Q_{\beta,\hat\pi}(C_n|x),\end{align*}

giving the result.

We conclude this section by proposing a sufficient condition for Assumption B. It imposes a drift towards a closed set with $\hat\lambda$ -null intersection with G by using some standard Foster–Lyapunov criteria. Such conditions are well known in the field of Markov chains. They provide easily verifiable conditions for testing stability (existence of invariant measure, recurrence, ergodicity, etc.).

Proposition 3.2. Suppose that for each $0<\beta<1$ and $\hat{\pi}\in\hat{\boldsymbol{\Pi}}_{s}$ there exist a constant $\gamma^{\beta,\hat{\pi}}\ge0$ and the following:

  • a measurable function $W^{\beta,\hat{\pi}}\,:\, \hat{\textbf{X}} \rightarrow [0,+\infty]$ satisfying

    (3.3) \begin{eqnarray} Q_{\beta,\hat{\pi}} W^{\beta,\hat{\pi}}(x) \leq W^{\beta,\hat{\pi}}(x)+\gamma^{\beta,\hat{\pi}} \quad for\ \hat{\lambda}-almost\ all\ x\in\hat{\textbf{X}} \end{eqnarray}
    and such that the set $\mathcal{C}^{\beta,\hat{\pi}}$ defined as $\mathcal{C}^{\beta,\hat{\pi}}=\{x\in\hat{\textbf{X}}\,:\, W^{\beta,\hat{\pi}}(x) <+\infty\}$ satisfies $\hat{\lambda}(\mathcal{C}^{\beta,\hat{\pi}})>0$ and $\hat{\lambda}(G\cap \mathcal{C}^{\beta,\hat{\pi}})=0$ ;
  • a measurable function $V^{\beta,\hat{\pi}}\,:\, \hat{\textbf{X}} \rightarrow [0,+\infty]$ and a constant $K_{\beta}<+\infty$ satisfying

    (3.4) \begin{eqnarray}V^{\beta,\hat{\pi}}(x) \leq K_{\beta}\quad for\ \hat{\lambda}-almost\ all\ x\in\hat{\textbf{X}}-\mathcal{C}^{\beta,\hat{\pi}} \end{eqnarray}
    and
    (3.5) \begin{eqnarray}Q_{\beta,\hat{\pi}} V^{\beta,\hat{\pi}}(x) \leq V^{\beta,\hat{\pi}}(x)-1+ \gamma^{\beta,\hat{\pi}} \textbf{I}_{\mathcal{C}^{\beta,\hat{\pi}}}(x) \quad for\ \hat{\lambda}-almost\ all\ x\in\hat{\textbf{X}}.\end{eqnarray}

Under these conditions, Assumption B is satisfied.

Proof. Let $0<\beta<1$ and $\hat{\pi}\in\hat{\boldsymbol{\Pi}}_{s}$ . The inequality (3.3) implies that

\begin{align*}\int_{\hat{\textbf{X}}-\mathcal{C}^{\beta,\hat{\pi}}} W^{\beta,\hat{\pi}}(y) Q_{\beta,\hat{\pi}}(dy|x) \leq W^{\beta,\hat{\pi}}(x)+\gamma^{\beta,\hat{\pi}}\quad\hbox{for $\hat{\lambda}$-almost all $x\in\hat{\textbf{X}}$}.\end{align*}

But since $W^{\beta,\hat\pi}(y)$ is infinite when $y\notin\mathcal{C}^{\beta,\hat\pi}$ and $W^{\beta,\hat\pi}(x)$ is finite on $\mathcal{C}^{\beta,\hat\pi}$ , this implies that $Q_{\beta,\hat{\pi}}(\hat{\textbf{X}}-\mathcal{C}^{\beta,\hat{\pi}}|x)=0$ and so $Q_{\beta,\hat{\pi}}(\mathcal{C}^{\beta,\hat{\pi}}|x)=1$ for $\hat{\lambda}$ -almost all $x\in\mathcal{C}^{\beta,\hat{\pi}}$ . This establishes the first part of Assumption B. Now, following the proof of Lemma 11.3.6 in [Reference Meyn and Tweedie22] and combining (2.1) and (3.5), we get $\mathbb{E}^{\beta,\hat{\pi}}_{x}[\tau_{\mathcal{C}^{\beta,\hat{\pi}}}] \leq V^{\beta,\hat{\pi}}(x) $ for $\hat{\lambda}$ -almost all $x\in\hat{\textbf{X}}-\mathcal{C}^{\beta,\hat{\pi}}$ . The stated result follows from (3.4).

We note that the sufficient conditions given in Proposition 3.2 above are, in general, more easily verifiable than Assumption B. Indeed, Proposition 3.2 involves just the one-step transitions of the system (see (3.3) and (3.5)). On the other hand, checking Assumption B directly requires one to compute the expected hitting time of the set $\mathcal{C}^{\beta,\hat\pi}$ , which in principle requires knowledge of the distribution of the Markov chain over the whole time horizon, and then check that these expected hitting times are bounded uniformly in the initial state of the system.

It is worth mentioning that the condition (3.3) yields, loosely, the partition between recurrent and transient states of the Markov chain, while (3.5) is closely related to the computation of the expected absorption time by the set of recurrent states of the Markov chain.

Next, in Example 3.1, we illustrate the verification of the conditions in Proposition 3.2; see also Example 4.1 below.

Example 3.1. (Taken from [Reference Dufour and Prieto-Rumeau15, Example 4.6].) Consider the state space $\textbf{X}=\{0,1,2\}$ with no actions at states 0 and 2, and $\textbf{A}(1)=\{a_1,a_2\}$ . The transition probabilities from state 0 are $q(0,0)=q(0,2)=1/2$ , and from state 2 they are $q(2,2)=1$ (we do not make any action explicit in this notation, since there are no actions at 0 and 2). The transitions from state 1 are $q(1,a_1,0)=2/3$ and $q(1,a_1,1)=1/3$ under $a_1$ , and $q(1,a_2,0)=q(1,a_2,1)=$ $q(1,a_2,2)=1/3$ under $a_2$ . We let $G=\{0\}$ . We can identify $\hat{\boldsymbol{\Pi}}_s$ with the interval [0, 1], where $a\in[0, 1]$ is the probability of selecting $a_1$ and $1-a$ is the probability of choosing action $a_2$ . Let 1 be the initial state of the system.

In order to verify Assumption B, we will use the sufficient conditions in Proposition 3.2. After adding the cemetery state $\Delta$ , the transition matrix for the stationary policy $a\in[0, 1]$ in the model $\mathcal{M}_{\beta}$ is

\begin{align*}\begin{pmatrix} \beta/2 & \quad 0 & \quad \beta/2 & \quad 1-\beta \\(1+a)/3 & \quad 1/3 & \quad (1-a)/3 & \quad 0 \\0 & \quad 0 & \quad 1 & \quad 0 \\0 & \quad 0 & \quad 0 & \quad 1\end{pmatrix},\end{align*}

where the states are ordered $0,1,2,\Delta$ .

Letting $W^{\beta,a}=(+\infty,+\infty,1,1)$ and $\gamma^{\beta,a}=1$ , we observe that (3.3) in Proposition 3.2 holds with, therefore, $\mathcal{C}^{\beta,a}=\{2,\Delta\}$ (roughly, W equals 1 on the recurrent states of the Markov chain, and it is infinite on the transient states). Regarding the inequalities (3.5), we obtain their minimal nonnegative solution, which is

\begin{align*}V^{\beta,a}(0)=\frac{2}{2-\beta},\qquad V^{\beta,a}(1)=\frac{(1+a)}{2-\beta}+\frac{3}{2},\qquad V^{\beta,a}(2)=V^{\beta,a}(\Delta)=0.\end{align*}

This minimal nonnegative solution is precisely the expected time the Markov chain spends in the transient states.

Solving the total-expected-reward MDP model $\mathcal{M}_{\beta}$ yields the optimal policy $\hat\pi^*_{\beta}=1$ with $V^*_{\beta}(1)=\frac{2}{2-\beta}+\frac{3}{2}$ for every $0<\beta<1$ .

4. Main results

In this section we prove our results on the solution of the control model $\mathcal{M}$ .

Proposition 4.1. Under Assumptions A and B, for each $0<\beta<1$ the mapping

\begin{align*}P\mapsto \int_{\boldsymbol{\Omega}} (1-\beta^{N_G})\textrm{d} P\end{align*}

is lower semicontinuous on $\boldsymbol{\mathcal{S}}_{\nu}$ and continuous on $\boldsymbol{\mathcal{S}}_{\nu}^{s}$ .

Proof. First we prove the following preliminary fact:

(4.1) \begin{eqnarray}\lim_{n\rightarrow\infty}\sup_{\hat\pi\in\hat{\boldsymbol{\Pi}}_s} \mathbb{E}^{\beta,\hat{\pi}}_{\nu}[\hat{N}_G([n,\infty))]=0\end{eqnarray}

for any $0<\beta<1$ . Observe that for arbitrary $\hat{\pi}\in\hat{\boldsymbol{\Pi}}_{s}$ and $m\geq n\ge0$ ,

\begin{eqnarray*}\mathbb{P}^{\beta,\hat{\pi}}_{\nu}\{\tau_{\mathcal{C}^{\beta,\hat{\pi}}}=n,\hat{X}_{m}\in G\}&\,\leq\,&\mathbb{P}^{\beta,\hat\pi}_{\nu}\{\hat{X}_{n}\in\mathcal{C}^{\beta,\hat{\pi}},\hat{X}_{m}\in G \}\\&=& \,\mathbb{E}^{\beta,\hat\pi}_{\nu}[ Q^{m-n}_{\beta,\hat{\pi}}(G|\hat{X}_{n}) \textbf{I}_{\mathcal{C}^{\beta,\hat{\pi}}}(\hat{X}_{n})].\end{eqnarray*}

Recalling that $\hat{\lambda}(G\cap\mathcal{C}^{\beta,\hat{\pi}})=0$ , Lemmas 3.1 and 3.2 imply that the above expectation vanishes, and so $\mathbb{P}^{\beta,\hat{\pi}}_{\nu}\{\tau_{\mathcal{C}^{\beta,\hat{\pi}}}=n,\hat{X}_{m}\in G\}=0$ . Therefore, we have

\begin{align*}\hat{N}_G([n,\infty))= \textbf{I}_{\{\tau_{\mathcal{C}^{\beta,\hat{\pi}}}>n\}} \sum_{m=n}^{\tau_{\mathcal{C}^{\beta,\hat{\pi}}}-1}\textbf{I}_{G}(\hat{X}_{m}) \le (\tau_{\mathcal{C}^{\beta,\hat{\pi}}}-n)^+ \quad \mathbb{P}^{\beta,\hat{\pi}}_{\nu}\text{-almost surely,}\end{align*}

and so

\begin{eqnarray*} \mathbb{E}^{\beta,\hat{\pi}}_{\nu}[\hat{N}_G([n,\infty))] &\,\le\, & \sum_{m>n}^\infty(m-n) \mathbb{P}^{\beta,\hat{\pi}}_{\nu} \{\tau_{\mathcal{C}^{\beta,\hat{\pi}}}=m\} \\&=& \, \sum_{m>n}^\infty(m-n) \mathbb{E}^{\beta,\hat{\pi}}_{\nu} \big[ \textbf{I}_{\hat{\textbf{X}}-\mathcal{C}^{\beta,\hat{\pi}}}(\hat{X}_{n}) \;\mathbb{P}^{\beta,\hat{\pi}}_{\hat{X}_{n}} \{\tau_{\mathcal{C}^{\beta,\hat{\pi}}}=m-n\} \big]\end{eqnarray*}

by the Markov property. Then it follows that

\begin{eqnarray*} \mathbb{E}^{\beta,\hat{\pi}}_{\nu}[\hat{N}_G([n,\infty))] &\,\le\, & \mathbb{E}^{\beta,\hat{\pi}}_{\nu} \big[ \textbf{I}_{\hat{\textbf{X}}-\mathcal{C}^{\beta,\hat{\pi}}}(\hat{X}_{n}) \cdot \sum_{m>n}^\infty(m-n) \mathbb{P}^{\beta,\hat{\pi}}_{\hat{X}_{n}} \{\tau_{\mathcal{C}^{\beta,\hat{\pi}}}=m-n\} \big] \\ &=& \, \mathbb{E}^{\beta,\hat{\pi}}_{\nu} \big[ \textbf{I}_{\hat{\textbf{X}}-\mathcal{C}^{\beta,\hat{\pi}}}(\hat{X}_{n}) \cdot\mathbb{E}^{\beta,\hat{\pi}}_{\hat{X}_{n}} [\tau_{\mathcal{C}^{\beta,\hat{\pi}}} ] \big] \\ &\le&\, \mathbb{E}^{\beta,\hat{\pi}}_{\nu} \big[ \textbf{I}_{\hat{\textbf{X}}-\mathcal{C}^{\beta,\hat{\pi}}}(\hat{X}_{n}) \cdot K_{\beta}\big],\end{eqnarray*}

where the last inequality is obtained because $\mathbb{E}^{\beta,\hat{\pi}}_{x} [\tau_{\mathcal{C}^{\beta,\hat{\pi}}} ]\le K_{\beta}$ for $\hat\lambda$ -almost every $x\in\hat{\textbf{X}}-\mathcal{C}^{\beta,\hat\pi}$ —recall (3.2)—and because the distribution of $\hat X_n$ is absolutely continuous with respect to $\hat\lambda$ —see Lemma 3.1. From the Markov inequality we derive that

\begin{align*} \mathbb{E}^{\beta,\hat{\pi}}_{\nu}[\hat{N}_G([n,\infty))] \leq K_{\beta} \cdot \mathbb{P}^{\beta,\hat{\pi}}_{\nu} \{\tau_{\mathcal{C}^{\beta,\hat{\pi}}}>n\} \leq K_{\beta}^2 /n;\end{align*}

note that this bound does not depend on $\hat\pi\in\hat{\boldsymbol{\Pi}}_s$ . This shows (4.1).

We now proceed with the proof. For each fixed $n\ge0$ , the mapping on $(\textbf{X}\times\textbf{A})^{n-1}\times\textbf{X}$ given by

\begin{align*}(x_0,a_0,\ldots,x_n)\mapsto\prod_{0\le j\le n}\beta^{\textbf{I}_{G}(x_j)}=\beta^{N_G([0,n])}\end{align*}

is in $\boldsymbol{\mathcal{U}}_n$ . Therefore, by the definition of the $ws^\infty$ -topology, the function

(4.2) \begin{equation}P\mapsto \int_{\boldsymbol{\Omega}} (1-\beta^{N_G([0,n])})\textrm{d} P\end{equation}

is continuous on $\boldsymbol{\mathcal{S}}_{\nu}$ , and so the function $P\mapsto\int_{\boldsymbol{\Omega}}(1-\beta^{N_G})\textrm{d} P$ is lower semicontinuous on $\boldsymbol{\mathcal{S}}_{\nu}$ , since it is the increasing limit as $n\rightarrow\infty$ of continuous functions.

Now we prove upper semicontinuity on $\boldsymbol{\mathcal{S}}_{\nu}^{s}$ . Fix arbitrary $0\le n\le m$ and $P\in \boldsymbol{\mathcal{S}}_{\nu}^{s}$ such that $P=\mathbb{P}^{\pi}_{\nu}$ for some $\pi\in\boldsymbol{\Pi}_{s}$ . Using Proposition 3.1,

\begin{eqnarray*}\int_{\boldsymbol{\Omega}} (1-\beta^{N_G([0,m])}) \textrm{d}\mathbb{P}^{\pi}_{\nu}\, &=& \, (1-\beta)\mathbb{E}^{\beta,\pi^{\Delta}}_{\nu}[\hat{N}_G([0,m])] \\&\le&\, (1-\beta) \mathbb{E}^{\beta,\pi^{\Delta}}_{\nu}[\hat{N}_G([0,n])] \\&&+\, (1-\beta)\sup_{\hat{\pi}\in\hat{\boldsymbol{\Pi}}_{s}} \mathbb{E}^{\beta,\hat\pi}_{\nu}[\hat{N}_G([n+1,\infty))]\\&=& \, \int_{\boldsymbol{\Omega}} (1-\beta^{N_G([0,n])})\textrm{d} \mathbb{P}^{\pi}_{\nu}+\epsilon_n,\end{eqnarray*}

where we have defined

\begin{align*}\epsilon_n= (1-\beta)\sup_{\hat{\pi}\in\hat{\boldsymbol{\Pi}}_{s}} \mathbb{E}^{\beta,\hat{\pi}}_{\nu}[\hat{N}_G([n+1,\infty))],\end{align*}

with $\epsilon_n\rightarrow0$ as a consequence of (4.1). By the result in [Reference Schäl27, Proposition 10.1], it follows that the limit as $n\rightarrow\infty$ of the function (4.2) is an upper semicontinous function on $\boldsymbol{\mathcal{S}}_{\nu}^{s}$ . This completes the proof of the continuity statement on $\boldsymbol{\mathcal{S}}_{\nu}^{s}$ .

Proposition 4.2. Suppose that Assumptions A and B are satisfied and fix $0<\beta<1$ . There exists a stationary policy $\hat{\pi}^*$ in $\hat{\boldsymbol{\Pi}}_{s}$ that is optimal for the control model $\mathcal{M}_{\beta}$ , i.e.,

\begin{align*}V^*_{\beta} (\nu)=\sup_{\hat{\pi}\in\hat{\boldsymbol{\Pi}}} \mathbb{E}^{\beta,\hat{\pi}}_{\nu} [\hat{N}_G]= \mathbb{E}^{\beta,\hat{\pi}^*}_{\nu}[\hat{N}_G].\end{align*}

Proof. Using the result in [Reference Schäl30, p. 368], we have

\begin{align*}\sup_{\hat{\pi}\in\hat{\boldsymbol{\Pi}}} \mathbb{E}^{\beta,\hat{\pi}}_{\nu}[\hat{N}_G]=\sup_{\hat{\pi}\in\hat{\boldsymbol{\Pi}}_{s}} \mathbb{E}^{\beta,\hat{\pi}}_{\nu}[\hat{N}_G].\end{align*}

Consequently, from Proposition 3.1 we obtain

\begin{align*} \sup_{\pi\in\boldsymbol{\Pi}} \mathbb{E}^{\pi}_{\nu}[1-\beta^{N_G}]=\sup_{\pi\in\boldsymbol{\Pi}_{s}} \mathbb{E}^{\pi}_{\nu}[1-\beta^{N_G}]=\sup_{P\in\boldsymbol{\mathcal{S}}_{\nu}^{s}} \int (1-\beta^{N_G})\textrm{d} P.\end{align*}

But the latter supremum is attained at some $P^*\in\boldsymbol{\mathcal{S}}^s_{\nu}$ , because it is the maximum of a continuous function (Proposition 4.1) on a compact set (Theorem 2.1). If $P^*=\mathbb{P}^{\pi_*}_{\nu}$ for some $\pi_*\in\boldsymbol{\Pi}_s$ , then we have that $\pi_*^\Delta\in\hat{\boldsymbol{\Pi}}_s$ is an optimal stationary Markov policy for the control problem  $\mathcal{M}_{\beta}$ .

We are now in position to prove the main result of this paper, on maximizing the probability of visiting the set G infinitely often.

Theorem 4.1. Suppose that Assumptions A and B hold.

  1. (i) There is some $\pi\in\boldsymbol{\Pi}_{s}$ that is optimal for the control model $\mathcal{M}$ , that is,

    \begin{align*}\mathbb{P}_{\nu}^{\pi}\{N_{G}=\infty\}=\sup_{\gamma\in\boldsymbol{\Pi}} \mathbb{P}_{\nu}^{\gamma}\{N_{G}=\infty\}=V^*(\nu).\end{align*}
  2. (ii) Consider a sequence $\{\beta_k\}_{k\in\mathbb{N}}$ increasing to 1 and let $\hat{\pi}_{k}\in\hat{\boldsymbol{\Pi}}_{s}$ be an optimal policy for the model $\mathcal{M}_{\beta_k}$ . If $\hat{\pi}^{|\textbf{X}}_{k}\rightarrow{\pi}\in\boldsymbol{\Pi}_{s}$ in the narrow topology, then $\pi$ is optimal for $\mathcal{M}$ .

  3. (iii) We have $\lim_{\beta\uparrow1} (1-\beta)V^*_{\beta}(\nu)=V^*(\nu)$ .

Proof. (i) Let $\{\beta_k\}_{k\in\mathbb{N}}$ be a sequence in (0, 1) with $\beta_k\uparrow1$ , and let the policy $\hat{\pi}_{k}\in\hat{\boldsymbol{\Pi}}_{s}$ be optimal for $\mathcal{M}_{\beta_k}$ . Such a policy exists by Proposition 4.2. Consider now the sequence $\{\pi_{k}\}_{k\in\mathbb{N}}$ in $\boldsymbol{\Pi}_{s}$ where $\pi_{k}=\hat{\pi}^{|\textbf{X}}_{k}$ for $k\in\mathbb{N}$ . The set $\boldsymbol{\Pi}_s$ being compact (Lemma 2.2), we have that $\{\pi_{k}\}_{k\in\mathbb{N}}$ has a convergent subsequence, and we can assume without loss of generality that the whole sequence $\{\pi_{k}\}_{k\in\mathbb{N}}$ converges to some $\pi\in\boldsymbol{\Pi}_s$ . In particular, by Lemma 2.3, we also have $\mathbb{P}_{\nu}^{\pi_{k}}\Rightarrow \mathbb{P}_{\nu}^{\pi}$ .

By Proposition 3.1 it follows that for arbitrary $\gamma\in\boldsymbol{\Pi}$ ,

\begin{eqnarray*}\mathbb{E}_{\nu}^{\pi_k}[1-\beta_k^{N_G}] &=& \, (1-\beta_k) \mathbb{E}^{\beta_k,\hat{\pi}_{k}}_{\nu} [\hat{N}_G]\ =\ (1-\beta_k)V^*_{\beta_k}(\nu) \\ &\geq& (1-\beta_k) \mathbb{E}^{\beta_k,\gamma^{\Delta}}_{\nu} [\hat{N}_G]\ =\ \mathbb{E}_{\nu}^{\gamma}[1-\beta_k^{N_G}].\end{eqnarray*}

Now, by the first statement in Remark 3.2(b), observe that

\begin{align*}\mathbb{E}^{\gamma}_{\nu} [1-\beta_k^{N_G}] \ge\mathbb{P}^{\gamma}_{\nu}\{N_{G}=\infty\},\end{align*}

and so for any $j\leq k$ in $\mathbb{N}$ we obtain the inequalities

\begin{align*}\mathbb{E}_{\nu}^{\pi_k}[1-\beta_j^{N_G}]\ge \mathbb{E}_{\nu}^{\pi_k}[1-\beta_k^{N_G}]=(1-\beta_k)V^*_{\beta_k}(\nu)\ge \mathbb{P}^{\gamma}_{\nu}\{N_{G}=\infty\}.\end{align*}

Taking the limit as $k\rightarrow\infty$ with j fixed in the previous inequalities, we get from Proposition 4.1 that

\begin{equation*}\mathbb{E}_{\nu}^{\pi}[1-\beta_j^{N_G}]\ge \limsup_{k\rightarrow\infty}(1-\beta_k)V^*_{\beta_k}(\nu)\ge \liminf_{k\rightarrow\infty} (1-\beta_k)V^*_{\beta_k}(\nu)\ge\mathbb{P}_{\nu}^{\gamma}\{N_{G}=\infty\}.\end{equation*}

In these inequalities, we take the limit as $j\rightarrow\infty$ and the supremum in $\gamma\in\boldsymbol{\Pi}$ to obtain

\begin{equation*}\mathbb{P}_{\nu}^{\pi}\{N_{G}=\infty\}\ge \limsup_{k\rightarrow\infty}(1-\beta_k)V^*_{\beta_k}(\nu)\ge \liminf_{k\rightarrow\infty} (1-\beta_k)V^*_{\beta_k}(\nu)\ge\sup_{\gamma\in\boldsymbol{\Pi}}\mathbb{P}_{\nu}^{\gamma}\{N_{G}=\infty\}.\end{equation*}

Consequently, the above inequalities are all, in fact, equalities:

(4.3) \begin{equation}\mathbb{P}_{\nu}^{\pi}\{N_{G}=\infty\}=V^*(\nu)= \lim_{k\rightarrow\infty}(1-\beta_k)V^*_{\beta_k}(\nu).\end{equation}

This establishes that the stationary Markov policy $\pi\in\boldsymbol{\Pi}_s$ is optimal for the control model $\mathcal{M}$ and that $(1-\beta_k)V^*_{\beta_k}(\nu)\rightarrow V^*(\nu)$ .

(ii) We can obtain the proof of this statement from the proof of (i), just by assuming that the whole sequence $\{\pi_k\}$ converges (and not invoking a convergent subsequence).

(iii) Using Proposition 3.1 we obtain that

\begin{align*}(1-\beta)V^*_{\beta}(\nu)=\sup_{\gamma\in\boldsymbol{\Pi}} \mathbb{E}_{\nu}^{\gamma}[1-\beta^{N_G}].\end{align*}

This implies that $(1-\beta)V^*_{\beta}(\nu)$ decreases as $\beta\uparrow1$ , and so $\lim_{\beta\uparrow1} (1-\beta)V^*_{\beta}(\nu)$ exists. Since we have convergence through some sequence of $\beta_k\uparrow1$ (recall (4.3)), the stated result follows.

Summarizing, there indeed exists a stationary Markov policy in $\boldsymbol{\Pi}_{s}$ that maximizes the probability of visiting the set G infinitely often. This policy is obtained as an accumulation/limit point as $\beta\uparrow1$ in the sense of convergence of Young measures of optimal stationary Markov policies for the total-expected-reward problem $\sup_{\hat{\pi}\in\hat{\boldsymbol{\Pi}}} \mathbb{E}_{\nu}^{\beta,\hat{\pi}}[\hat{N}_G]$ of the control model $\mathcal{M}_{\beta}$ .

Example 4.1. Consider a control model with state space $\textbf{X}=\{0,c,g,g_0,g_1,g_2,\ldots\}$ . There are two actions available at state 0, i.e. $\textbf{A}(0)=\{a_1,a_2\}$ , and there are no actions at the other states. The transition probabilities are $q(0,a_1,g_0)=1$ , $q(0,a_2,c)=q(0,a_2,g)=1/2$ . From state $g_i$ , the transition probabilities are

\begin{align*}q(g_i,g_{i+1})=\frac{1}{i+1}\quad\hbox{and}\quad q(g_i,c)=\frac{i}{i+1}\quad\hbox{for $i\ge0$},\end{align*}

while c and g are absorbing. Let $G=\{g,g_0,g_1,\ldots\}$ and let 0 be the initial state of the system. It should be clear that Assumption A holds.

Consider now the control model $\mathcal{M}_{\beta}$ . We identify $\hat{\boldsymbol{\Pi}}_s$ with the interval [0, 1], where $0\le a\le 1$ stands for the probability of choosing action $a_1$ at state 0. Let us verify the conditions in Proposition 3.2. Concerning (3.3), we put $W^{\beta,a}(c)=W^{\beta,a}(\Delta)=1$ and we let $W^{\beta,a}$ be $+\infty$ in the remaining states, thus obtaining $\mathcal{C}^{\beta,a}=\{c,\Delta\}$ , with $\gamma^{\beta,a}=1$ . This identifies the transient and recurrent states of the Markov chain. For (3.5), we let $V^{\beta,a}(c)=V^{\beta,a}(\Delta)=0$ , and we try to find the minimal solution of (3.5). These inequalities become

\begin{align*} \frac{1}{2}(1-a)V^{\beta,a}(g)+a V^{\beta,a}(g_0)\le V^{\beta,a}(0)-1,\quad \beta V^{\beta,a}(g)\le V^{\beta,a}(g)-1, \end{align*}

and

\begin{align*} \frac{\beta}{i+1}V^{\beta,a}(g_{i+1})\le V^{\beta,a}(g_i)-1\quad\hbox{for $i\ge0$}.\end{align*}

By iteration of the latter inequalities we obtain

\begin{align*}\frac{\beta^{i+1}}{(i+1)!}V^{\beta,a}(g_{i+1})\le V^{\beta,a}(g_0)-\sum_{j=0}^i \frac{\beta^j}{j!}.\end{align*}

The minimal solution is attained when $V^{\beta,a}(0)=e^\beta$ , and thus for any $i\ge0$

\begin{align*}V^{\beta,a}(g_{i})=1+\frac{\beta}{k+1}+\frac{\beta^2}{(k+2)(k+1)}+\frac{\beta^3}{(k+3)(k+2)(k+1)}+\ldots,\end{align*}

$V^{\beta,a}(g)=\frac{1}{1-\beta}$ , and

\begin{align*}V^{\beta,a}(0)=1+\frac{1-a}{2(1-\beta)}+ae^\beta.\end{align*}

We note that this minimal solution $V^{\beta,a}(x)$ is precisely the expected time the process spends in the transient states of the chain when starting from $x\in\textbf{X}$ . We also note that $V^{\beta,a}(x)$ is bounded in x (recall (3.4)): to see this, observe that $V^{\beta,a}(g_i)\le e^\beta$ for every $i\ge0$ . It should be clear that the optimal value function of the control model $\mathcal{M}_{\beta}$ for the initial state 0 is

\begin{align*}V^*_{\beta}(0)=\max\Big\{\frac{1}{2(1-\beta)},e^\beta\Big\}\end{align*}

and that the optimal policy is $\hat\pi^*_{\beta}=1$ for $0<\beta\le \hat\beta$ and $\hat\pi^*_{\beta}=0$ for $\hat\beta\le \beta<1$ , where $\hat\beta\simeq 0.76804$ is the unique solution in (0, 1) of $\frac{1}{2(1-z)}=e^{z}$ . The interesting feature of this example is that the optimal policy $\pi^*_{\beta}$ of $\mathcal{M}_{\beta}$ varies with $\beta$ (cf. Example 3.1).

From Theorem 4.1(ii)–(iii) we obtain that

\[0=\pi^*=\lim_{\beta\uparrow1} \hat\pi^{*|\textbf{X}}_{\beta}\]

is an optimal policy for $\mathcal{M}$ and that $V^*(1)=\lim_{\beta\uparrow1} (1-\beta)V^*_{\beta}(1)=1/2$ .

Appendix A. Proof of Theorem 2.1

This section is mainly devoted to proving the compactness of $\boldsymbol{\mathcal{S}}_{\nu}^s$ , the compactness of $\boldsymbol{\mathcal{S}}_{\nu}$ being a known result (see the proof of Theorem 2.1 below). In what follows, we suppose that Assumption A holds.

Lemma A.1. Suppose that $\textbf{Z}$ is a Borel space, and let $f\,:\,\textbf{Z}\times\textbf{X}\times\textbf{A}\times\textbf{X}\rightarrow\mathbb{R}$ be bounded, measurable, and continuous on $\textbf{A}$ when the remaining variables $z\in\textbf{Z}$ and $x,y\in\textbf{X}$ are fixed. Define the real-valued functions g and h on $\textbf{X}\times\textbf{K}$ and $\textbf{Z}\times\textbf{X}$ respectively by

\begin{align*} g(z,x,a) = \int_{\textbf{X}} f(z,x,a,y)q(x,a,y)\lambda(dy)\quad and \quad h(z,x)= \max_{a\in\textbf{A}(x)} g(z,x,a) .\end{align*}

Under these conditions, the following hold:

  1. (i) The function g is bounded and measurable on $\textbf{Z}\times\textbf{K}$ , and $g(z,x,\cdot)$ is continuous on $\textbf{A}(x)$ for fixed $(z,x)\in\textbf{Z}\times\textbf{X}$ .

  2. (ii) The function h is bounded and measurable on $\textbf{Z}\times\textbf{X}$ .

Proof. (i) It is clear that g is bounded and measurable. To establish the continuity property, we fix $(z,x)\in\textbf{Z}\times\textbf{X}$ and consider a converging sequence $a_n\rightarrow a$ in $\textbf{A}$ . By the dominated convergence theorem, we have $f(z,x,a_n,\cdot) \ {\stackrel{*}{\rightharpoonup}\ } f(z,x,a,\cdot)$ in the weak $^{*}$ topology $\sigma(L^{\infty}(\textbf{X},\boldsymbol{\mathfrak{B}}(\textbf{X}),\lambda),L^{1}(\textbf{X},\boldsymbol{\mathfrak{B}}(\textbf{X}),\lambda))$ . We also have (recall Remark 2.1(b)) the convergence $q(x,a_{n},\cdot)\stackrel{L^{1}}{\longrightarrow}q(x,a,\cdot)$ in norm in $L^{1}(\textbf{X},\boldsymbol{\mathfrak{B}}(\textbf{X}),\lambda)$ . Using the result in, e.g., [Reference Brezis11, Proposition 3.13.iv], we conclude that

\begin{align*}g(z,x,a_n)= \int_{\textbf{X}} f(z,x,a_n,y)q(x,a_n,y)\lambda(dy)\rightarrow g(z,x,a).\end{align*}

This completes the proof of the statement (i).

  1. (iii) By Assumption (A.1) and the maximum measurable theorem [Reference Aliprantis and Border1, Theorem 18.19], we conclude that h is measurable.

Given $k\ge1$ , we say a function $f\,:\,(\textbf{X}\times\textbf{A})^{k}\times\textbf{X}\rightarrow\mathbb{R}$ is in $\boldsymbol{\mathcal{R}}_k$ if it is of the form

\begin{align*}f(x_0,a_0,\ldots,x_{k-1},a_{k-1},x_k)=\bar{f}(x_0,\ldots,x_k)\bar{f}^{(0)}(a_0)\cdots \bar{f}^{(k-1)}(a_{k-1}),\end{align*}

where $\bar{f}\in\boldsymbol{\mathcal{B}}(\textbf{X}^{k+1})$ and $\bar{f}^{(i)}\in\boldsymbol{\mathcal{C}}(\textbf{A})$ for $0\le i<k$ . For $k=0$ we simply let $\boldsymbol{\mathcal{R}}_0=\boldsymbol{\mathcal{B}}(\textbf{X})$ . Clearly, $\boldsymbol{\mathcal{R}}_k\subseteq\boldsymbol{\mathcal{U}}_k$ . For any integer $k\ge1$ , define the operator $\mathfrak{A}_{k}\,:\, \boldsymbol{\mathcal{R}}_{k}\rightarrow \boldsymbol{\mathcal{R}}_{k-1}$ as follows. If $f\in\boldsymbol{\mathcal{R}}_k$ then $\mathfrak{A}_{k}f$ is given by

(A.1) \begin{align}&(\mathfrak{A}_{k}f)(x_{0},a_{0},\ldots,x_{k-1}) =\max_{a\in\textbf{A}(x_{k-1})} \int_{\textbf{X}} f(x_{0},a_{0},\ldots,x_{k-1},a,y) q(x_{k-1},a,y)\lambda(dy)\nonumber \\&= \prod_{j=0}^{k-2} \bar{f}^{(j)}(a_j)\max_{a\in\textbf{A}(x_{k-1})} \int_{\textbf{X}} \bar{f}(x_{0},\ldots,x_{k-1},y) \bar{f}^{(k-1)}(a)q(x_{k-1},a,y)\lambda(dy)\end{align}

for $(x_0,a_0,\ldots,x_{k-1})\in(\textbf{X}\times\textbf{A})^{k-1}\times\textbf{X}$ , where by convention $\prod_{j=0}^{-1} \bar{f}^{(j)}(a_j)=1$ . Note that Lemma A.1(ii) ensures that $\mathfrak{A}_{k}$ indeed maps $\boldsymbol{\mathcal{R}}_{k}$ into $\boldsymbol{\mathcal{R}}_{k-1}$ because the $\max$ in the right-hand-side of (A.1) is a measurable function of the variables $(x_0,\ldots,x_{k-1})\in\textbf{X}^k$ . In addition, we have that $\| \mathfrak{A}_kf\|\le \|f\|$ in their respective norms in $L^\infty$ . The successive composition of these operators is defined as

\begin{align*}\mathfrak{T}_k=\mathfrak{A}_{1}\circ \ldots\circ\mathfrak{A}_{k-1}\circ\mathfrak{A}_{k}\quad\hbox{for $k\ge1$},\end{align*}

so that $\mathfrak{T}_k\,:\,\boldsymbol{\mathcal{R}}_k\rightarrow\boldsymbol{\mathcal{R}}_0$ . By convention, $\mathfrak{T}_0$ will be the identity operator on $\boldsymbol{\mathcal{R}}_0$ .

Lemma A.2. Given any $k\ge0$ , and arbitrary $f\in\boldsymbol{\mathcal{R}}_k$ and $\pi\in\boldsymbol{\Pi}$ , we have

\[ \int_{(\textbf{X}\times\textbf{A})^{k}\times\textbf{X}} f(h_{k}) \mathbb{P}_{\nu,k}^{\pi}(dh_{k}) \leq \int_{\textbf{X}} (\mathfrak{T}_{k}f)(x_0) \nu(dx_0).\]

Proof. The stated result is trivial for $k=0$ . If $k\ge1$ , observe that we can write

\begin{align*}&\int_{(\textbf{X}\times\textbf{A})^{k}\times\textbf{X}} f(h_{k}) \mathbb{P}_{\nu,k}^{\pi}(h_{k}) = \int_{\textbf{H}_{k}} f(h_{k}) \mathbb{P}_{\nu,k}^{\pi}(h_{k}) \nonumber \\& = \int_{\textbf{H}_{k-1}} \int_{\textbf{A}(x_{k-1})}\int_{\textbf{X}} f(h_{k-1},a,y) q(x_{k-1},a,y) \lambda(dy) \pi_{k-1}(da|h_{k-1})\mathbb{P}_{\nu,k-1}^{\pi}(dh_{k-1}) \nonumber \\&\leq \int_{\textbf{H}_{k-1}} \mathfrak{A}_{k}f(h_{k-1}) \mathbb{P}_{\nu,k-1}^{\pi}(d h_{k-1}).\end{align*}

By iterating this inequality and noting that $\mathbb{P}_{\nu,0}^\pi=\nu$ , we obtain the desired result.

Lemma A.3. For $k\ge1$ , suppose that we are given a sequence $\{v_n\}_{n\in\mathbb{N}}$ in $\boldsymbol{\mathcal{R}}_k$ which is of the form

\begin{align*}v_n(x_0,a_0,\ldots,x_k)=\bar{v}_n(x_0,\ldots,x_k) \cdot \bar{v}^{(0)}(a_0)\cdots\bar{v}^{(k-1)}(a_{k-1}),\end{align*}

where $\bar{v}_n\in\boldsymbol{\mathcal{B}}(\textbf{X}^{k+1})$ satisfies $\sup_n\|\bar{v}_n\|<\infty$ in $L^\infty$ and

\begin{align*} \bar{v}_{n}(x_{0},\ldots,x_{k-1},\cdot) \ {\stackrel{*}{\rightharpoonup}\ }0\quad for\ each\ (x_{0},\ldots,x_{k-1})\in\textbf{X}^k,\end{align*}

and the functions $\bar{v}^{(i)}\in\boldsymbol{\mathcal{C}}(\textbf{A})$ for $0\le i<k$ do not depend on n. Under these conditions, $g_n=\mathfrak{A}_kv_n\in\boldsymbol{\mathcal{R}}_{k-1}$ can be written as

\begin{align*}g_n(x_0,a_0,\ldots,x_{k-1})=\bar{g}_n(x_0,\ldots,x_{k-1})\cdot \bar{v}^{(0)}(a_0)\cdots\bar{v}^{(k-2)}(a_{k-2}),\end{align*}

where $\{\bar{g}_n\}_{n\in\mathbb{N}}$ is a sequence of functions in $\boldsymbol{\mathcal{B}}(\textbf{X}^k)$ with $\sup_n\|\bar{g}_n\|<\infty$ which satisfy that $\bar{g}_n(x_0,\ldots,x_{k-1})\rightarrow0$ for any $(x_0,\ldots,x_{k-1})\in\textbf{X}^{k}$ as $n\rightarrow\infty$ and, in particular,

\begin{align*}\bar{g}_n(x_0,\ldots,x_{k-2},\cdot)\ {\stackrel{*}{\rightharpoonup}\ }0\quad for\ any\ (x_0,\ldots,x_{k-2})\in\textbf{X}^{k-1}.\end{align*}

Proof. The expression given for $g_n$ is easily deduced from (A.1). Moreover, the fact that $\{\bar{g}_n\}$ is bounded in the $L^\infty$ norm is also straightforward. To prove the limit property, we proceed by contradiction: we suppose that for some $(x_0,\ldots,x_{k-1})$ and some subsequence $\{n'\}$ there exists $\epsilon>0$ with

\begin{align*}\left|\max_{a\in\textbf{A}(x_{k-1})} \int_{\textbf{X}} \bar{v}_{n'}(x_{0},\ldots,x_{k-1},y) \bar{v}^{(k-1)}(a)q(x_{k-1},a,y)\lambda(dy) \right|\ge\epsilon.\end{align*}

Assuming that the above maximum is attained (recall Lemma A.1(ii)) at some $a^*_{n'}\in\textbf{A}(x_{k-1})$ , we may also suppose that $a^*_{n'}\rightarrow a^*$ for some $a^*\in\textbf{A}(x_{k-1})$ . Then we have the following convergences:

\begin{align*}\bar{v}_{n'}(x_0,\ldots,x_{k-1},\cdot)\ {\stackrel{*}{\rightharpoonup}\ }0\quad\hbox{and}\quad \bar{v}^{(k-1)}(a^*_{n'})q(x_{k-1},a^*_{n'},\cdot)\stackrel{L^1}{\longrightarrow} \bar{v}^{(k-1)}(a^*)q(x_{k-1},a^*,\cdot).\end{align*}

To check the latter convergence, just note that

\begin{align*}&\int_{\textbf{X}} \big|\bar{v}^{(k-1)}(a^*_{n'})q(x_{k-1},a^*_{n'},y)- \bar{v}^{(k-1)}(a^*)q(x_{k-1},a^*,y)\big|\lambda(dy)\\&\le \big|\bar{v}^{(k-1)}(a^*_{n'})-\bar{v}^{(k-1)}(a^*)\big|\int_{\textbf{X}} q(x_{k-1},a^*_{n'},y)\lambda(dy)\\&+\big|\bar{v}^{(k-1)}(a^*)\big| \int_{\textbf{X}}| q(x_{k-1},a^*_{n'},y)-q(x_{k-1},a^*,y)|\lambda(dy),\end{align*}

which indeed converges to 0 as $n'\rightarrow\infty$ . Applying [Reference Brezis11, Proposition 3.13.iv], we obtain

\begin{align*} \int_{\textbf{X}} \bar{v}_{n'}(x_{0},\ldots,x_{k-1},y)\bar{v}^{(k-1)}(a^*_{n'}) q(x_{k-1},a^*_{n'},y)\lambda(dy)\rightarrow0,\end{align*}

which is a contradiction. The last statement concerning the weak $^*$ convergence follows easily from dominated convergence, since $\{\bar{v}_n\}$ is uniformly bounded in the $L^\infty$ norm. This completes the proof.

To summarize this lemma informally, note that if we have a sequence of the form

\begin{align*}v_n=\bar{v}_n\cdot \bar{v}^{(0)}\cdots\bar{v}^{(k-1)}\in\boldsymbol{\mathcal{R}}_k\end{align*}

with $\{\bar{v}_n\}$ bounded and $v_n(x_0,\ldots,x_{k-1},\cdot)\ {\stackrel{*}{\rightharpoonup}\ }0$ , then $g_n=\mathfrak{A}_kv_n$ again satisfies the same properties, since

\begin{align*}g_n=\bar{g}_n\cdot \bar{v}^{(0)}\cdots\bar{v}^{(k-2)}\in\boldsymbol{\mathcal{R}}_{k-1}\end{align*}

with $\{\bar{g}_n\}$ bounded and $g_n(x_0,\ldots,x_{k-2},\cdot)\ {\stackrel{*}{\rightharpoonup}\ }0$ .

Proof of Lemma 2.3. Suppose that we have a sequence $\{\gamma_{n}\}_{n\in\mathbb{N}}$ in $\boldsymbol{\mathcal{Y}}$ converging to some $\gamma\in\boldsymbol{\mathcal{Y}}$ . We will show that $\mathbb{P}^{\gamma_n}_{\nu}\Rightarrow\mathbb{P}^\gamma_{\nu}$ . To prove this result, it suffices to show that for any $k\ge0$ and $f\in\boldsymbol{\mathcal{U}}_k$ , we have (recall (2.3))

(A.2) \begin{eqnarray}\lim_{n\rightarrow\infty} \int_{\textbf{H}_{k}} f(h_{k}) \mathbb{P}_{\nu,k}^{\gamma_{n}}(dh_{k}) = \int_{\textbf{H}_{k}} f(h_{k}) \mathbb{P}_{\nu,k}^{\gamma}(dh_{k}).\end{eqnarray}

Also, by [Reference Schäl28, Theorem 3.7(ii)], it suffices to show (A.2) for functions f which are of the form

(A.3) \begin{equation}f(x_0,a_0,\ldots,x_{k-1},a_{k-1},x_k)=\bar{f}(x_0,\ldots,x_{k-1},x_k) \hat{f}(a_0,\ldots,a_{k-1})\end{equation}

for $\bar{f}\in\boldsymbol{\mathcal{B}}(\textbf{X}^{k+1})$ and $\hat{f}\in\boldsymbol{\mathcal{C}}(\textbf{A}^{k})$ . As a preliminary step, we will establish the limit (A.2) for functions in $\boldsymbol{\mathcal{R}}_k$ , that is, those for which, in addition, the function $\hat{f}$ in (A.3) is the product of k functions in $\boldsymbol{\mathcal{C}}(\textbf{A})$ , denoted by $\bar{f}^{(i)}$ for $0\le i\le k-1$ .

We will prove the statement by induction. The limit in (A.2) trivially holds for $k=0$ because $\mathbb{P}^{\pi}_{\nu,0}=\nu$ for any $\pi\in\boldsymbol{\Pi}$ . Suppose that (A.2) is satisfied for some $k\ge0$ and every function in $\boldsymbol{\mathcal{R}}_k$ , and let us prove it for $k+1$ and any $f\in\boldsymbol{\mathcal{R}}_{k+1}$ . We have

(A.4) \begin{align} & \int_{\textbf{H}_{k+1}} f(h_{k+1}) \mathbb{P}_{\nu,k+1}^{\gamma_{n}}(dh_{k+1}) \nonumber \\& = \int_{\textbf{H}_{k}} \Big[ \int_{\textbf{A}}\int_{\textbf{X}} f(h_{k},a,y) q(x_{k},a,y) \lambda(dy) \gamma(da|x_{k}) \Big]\mathbb{P}_{\nu,k}^{\gamma_{n}}(dh_{k})\nonumber \\& \phantom{=} + \int_{\textbf{H}_{k}}\int_{\textbf{A}}\int_{\textbf{X}} f(h_{k},a,y) q(x_{k},a,y) \lambda(dy) \big[ \gamma_{n}(da|x_{k})-\gamma(da|x_{k}) \big]\mathbb{P}_{\nu,k}^{\gamma_{n}}(d h_{k}).\end{align}

We study the limit of the first term in the right-hand-side of (A.4) as $n\rightarrow\infty$ . Since $f\in\boldsymbol{\mathcal{R}}_{k+1}$ , we deduce from Lemma A.1(i) that the mapping defined on $(\textbf{X}\times\textbf{A})^{k}\times\textbf{X}$ by

\begin{align*} (x_0,a_0,\ldots,x_k) \mapsto \int_{\textbf{A}}\int_{\textbf{X}} f(h_{k},a,y) q(x_{k},a,y) \lambda(dy) \gamma(da|x_{k})\end{align*}

is in $\boldsymbol{\mathcal{R}}_{k}$ . By the induction hypothesis, we can take the limit as $n\rightarrow\infty$ and so obtain

\begin{align*}\lim_{n\rightarrow\infty}&\int_{\textbf{H}_{k}} \Big[\int_{\textbf{A}} \int_{\textbf{X}} f(h_{k},a,y) q(x_{k},a,y) \lambda(dy) \gamma(da|x_{k}) \Big]\mathbb{P}_{\nu,k}^{\gamma_{n}}(dh_{k}) \\& = \int_{\textbf{H}_{k}} \Big[\int_{\textbf{A}} \int_{\textbf{X}} f(h_{k},a,y) q(x_{k},a,y) \lambda(dy) \gamma(da|x_{k}) \Big]\mathbb{P}_{\nu,k}^{\gamma}(dh_{k})\\&= \int_{\textbf{H}_{k+1}} f(h_{k+1})\mathbb{P}_{\nu,k+1}^{\gamma}(dh_{k+1}).\end{align*}

Our goal now is to show that the second term in the right-hand-side of (A.4) converges to zero as $n\rightarrow\infty$ . Observe that the function $v_n$ on $(\textbf{X}\times\textbf{A})^{k}\times\textbf{X}$ defined as

\begin{align*}v_n(h_k)=\int_{\textbf{A}}\int_{\textbf{X}} f(h_{k},a,y) q(x_{k},a,y) \lambda(dy) \big[ \gamma_{n}(da|x_{k})-\gamma(da|x_{k}) \big]\end{align*}

takes the particular form

\begin{align*}\bar{f}^{(0)}(a_0)\cdots \bar{f}^{k-1}(a_{k-1}) \cdot \bar{v}_n(x_0,\ldots,x_k)\end{align*}

(here, recall that $f=\bar{f}\cdot\bar{f}^{(0)}\cdots\bar{f}^{(k)}\in\boldsymbol{\mathcal{R}}_{k+1}$ ), with

\begin{align*}\bar{v}_n(x_0,\ldots,x_k)=\int_{\textbf{A}}\int_{\textbf{X}} \bar{f}(x_0,\ldots,x_{k},y) \bar{f}^{(k)}(a) q(x_{k},a,y) \lambda(dy) \big[ \gamma_{n}(da|x_{k})-\gamma(da|x_{k}) \big].\end{align*}

Then $\{\bar{v}_n\}_{n\in\mathbb{N}}$ is a bounded sequence in $\boldsymbol{\mathcal{B}}(\textbf{X}^{k+1})$ , and also $\bar{v}_n(x_0,\ldots,x_{k-1},\cdot)\ {\stackrel{*}{\rightharpoonup}\ }0$ for any $(x_0,\ldots,x_{k-1})\in\textbf{X}^k$ . Indeed, given arbitrary $\Phi\in L^1(\textbf{X},\boldsymbol{\mathfrak{B}}(\textbf{X}),\lambda)$ we have

\begin{align*}&\int_{\textbf{X}} \bar{v}_n(x_0,\ldots,x_{k-1},z)\Phi(z)\lambda(dz)\\&=\int_{\textbf{X}}\int_{\textbf{A}} \Phi(z)\Big[\int_{\textbf{X}} \bar{f}(x_0,\ldots,z,y) \bar{f}^{(k)}(a) q(z,a,y) \lambda(dy) \Big]\big[ \gamma_{n}(da|z)-\gamma(da|z) \big]\lambda(dz).\end{align*}

Noting that the integral within brackets is a bounded Carathéodory function in $z\in\textbf{X}$ and $a\in\textbf{A}$ (recall Lemma A.1(i)), convergence to zero of the above integral follows from the definition of $\gamma_n\rightarrow\gamma$ in $\boldsymbol{\mathcal{Y}}$ .

We can therefore apply Lemma A.3 repeatedly to obtain that $\{\mathfrak{T}_k v_n\}_{n\in\mathbb{N}}$ is a bounded sequence in $\boldsymbol{\mathcal{B}}(\textbf{X})$ such that $\mathfrak{T}_kv_n(x_0) \rightarrow0$ for every $x_0\in\textbf{X}$ . By Lemma A.2 we obtain that the expression (A.4) satisfies the inequality

\begin{align*}&\int_{\textbf{H}_{k}}\int_{\textbf{A}}\int_{\textbf{X}} f(h_{k},a,y) q(x_{k},a,y) \lambda(dy) \big[ \gamma_{n}(da|x_{k})-\gamma(da|x_{k}) \big]\mathbb{P}_{\nu,k}^{\gamma_{n}}(d h_{k}) \\&=\int_{\textbf{H}_k} v_n(h_k)\mathbb{P}^{\gamma_n}_{\nu,k}(dh_k)\le \int_{\textbf{X}} \mathfrak{T}_k v_n d\nu,\end{align*}

where the right-hand term converges to 0 as $n\rightarrow\infty$ . Using a symmetric argument for the function $-f$ , we obtain that the above left-hand term converges to zero, and so (A.4) indeed tends to zero.

So far, we have established the limit (A.2) for functions in $\boldsymbol{\mathcal{R}}_k$ . Note, however, that we can apply the Stone–Weierstrass theorem to the vector space spanned by the functions of the form $\hat{f}(a_0,\ldots,a_{k-1})=\bar{f}^{(0)}(a_0)\cdots\bar{f}^{(k-1)}(a_{k-1})$ , which indeed separates points in $\textbf{A}^k$ , to obtain that the functions in $\boldsymbol{\mathcal{C}}(\textbf{A}^k)$ can be uniformly approximated by functions in the above-mentioned vector space. Hence it is easy to establish that the limit in (A.2) holds for any function of the form (A.3). This completes the proof of Lemma 2.3.

Proof of Theorem 2.1. By virtue of Assumption A, the control model $\mathcal{M}$ satisfies the conditions (S1) and (S2) in [Reference Balder3]. Therefore, combining Theorem 2.2, Lemma 2.4, and the proof of Proposition 3.2 in [Reference Balder3], it can be shown that the w-topology and the $ws^\infty$ -topology are identical on $\boldsymbol{\mathcal{S}}_{\nu}$ . Consequently, the set $\boldsymbol{\mathcal{S}}_{\nu}$ is compact for the ws $^\infty$ -topology by Theorem 2.1 in [Reference Balder3], and it is metrizable in this topology.

To prove compactness of $\boldsymbol{\mathcal{S}}^s_{\nu}\subseteq\boldsymbol{\mathcal{S}}_{\nu}$ , we show that it is a closed subset of $\boldsymbol{\mathcal{S}}_{\nu}$ . Suppose that we have a sequence $\{\mathbb{P}_{\nu}^{\gamma_{n}}\}_{n\in\mathbb{N}}\subseteq\boldsymbol{\mathcal{S}}^s_{\nu}$ , where $\gamma_{n}\in\boldsymbol{\mathcal{Y}}$ for each $n\in\mathbb{N}$ , converging to some probability measure $\mathbb{P}^\pi_{\nu}\in\boldsymbol{\mathcal{S}}_{\nu}$ for some $\pi\in\boldsymbol{\Pi}$ . But $\boldsymbol{\mathcal{Y}}$ being a compact metric space, there exists a subsequence $\{n'\}$ of $\{\gamma_n\}$ converging to some $\gamma\in\boldsymbol{\mathcal{Y}}$ . Since $\mathbb{P}^{\gamma_{n'}}_{\nu}\Rightarrow\mathbb{P}^\gamma_{\nu}$ (Lemma 2.3), it follows that the limiting strategic probability measure $\mathbb{P}^\gamma_{\nu}=\mathbb{P}^\pi_{\nu}$ is in $\boldsymbol{\mathcal{S}}^s_{\nu}$ . This completes the proof of Theorem 2.1.

Funding information

This work was supported by grant PID2021-122442NB-I00 from the Spanish Ministerio de Ciencia e Innovación.

Competing interests

There were no competing interests to declare which arose during the preparation or publication process of this article

References

Aliprantis, C. D. and Border, K. C. (2006). Infinite Dimensional Analysis, 3rd edn. Springer, Berlin.Google Scholar
Balder, E. J. (1988). Generalized equilibrium results for games with incomplete information. Math. Operat. Res. 13, 265276.CrossRefGoogle Scholar
Balder, E. J. (1989). On compactness of the space of policies in stochastic dynamic programming. Stoch. Process. Appl. 32, 141150.CrossRefGoogle Scholar
Balder, E. J. (1991). On Cournot–Nash equilibrium distributions for games with differential information and discontinuous payoffs. Econom. Theory 1, 339354.CrossRefGoogle Scholar
Balder, E. J. (1992). Existence without explicit compactness in stochastic dynamic programming. Math. Operat. Res. 17, 572580.CrossRefGoogle Scholar
Bäuerle, N. and Rieder, U. (2011). Markov Decision Processes with Applications to Finance. Springer, Heidelberg.CrossRefGoogle Scholar
Bäuerle, N. and Rieder, U. (2014). More risk-sensitive Markov decision processes. Math. Operat. Res. 39, 105120.CrossRefGoogle Scholar
Bellman, R. (1957). Dynamic Programming. Princeton University Press.Google ScholarPubMed
Bertsekas, D. P. and Tsitsiklis, J. N. (1996). Neuro-dynamic Programming. Athena Scientific, Belmont, MA.Google Scholar
Borkar, V. S. (2002). Convex analytic methods in Markov decision processes. In Handbook of Markov Decision Processes, Kluwer Academic Publishers, Boston, pp. 347375.CrossRefGoogle Scholar
Brezis, H. (2011). Functional Analysis, Sobolev Spaces and Partial Differential Equations. Springer, New York.CrossRefGoogle Scholar
Cavazos-Cadena, R. (2018). Characterization of the optimal risk-sensitive average cost in denumerable Markov decision chains. Math. Operat. Res. 43, 10251050.CrossRefGoogle Scholar
Dufour, F., Horiguchi, M. and Piunovskiy, A. B. (2012). The expected total cost criterion for Markov decision processes under constraints: a convex analytic approach. Adv. Appl. Prob. 44, 774793.CrossRefGoogle Scholar
Dufour, F. and Piunovskiy, A. B. (2013). The expected total cost criterion for Markov decision processes under constraints. Adv. Appl. Prob. 45, 837859.CrossRefGoogle Scholar
Dufour, F. and Prieto-Rumeau, T. (2022). Maximizing the probability of visiting a set infinitely often for a countable state space Markov decision process. J. Math. Anal. Appl. 505, paper no. 125639.CrossRefGoogle Scholar
Dynkin, E. B. and Yushkevich, A. A. (1979). Controlled Markov Processes. Springer, Berlin.CrossRefGoogle Scholar
Feinberg, E. A. and Rothblum, U. G. (2012). Splitting randomized stationary policies in total-reward Markov decision processes. Math. Operat. Res. 37, 129153.CrossRefGoogle Scholar
Hernández-Lerma, O. and Lasserre, J. B. (1996). Discrete-Time Markov Control Processes: Basic Optimality Criteria. Springer, New York.CrossRefGoogle Scholar
Hernández-Lerma, O. and Lasserre, J. B. (1999). Further Topics on Discrete-Time Markov Control Processes. Springer, New York.CrossRefGoogle Scholar
Hinderer, K. (1970). Foundations of Non-stationary Dynamic Programming with Discrete Time Parameter. Springer, Berlin.CrossRefGoogle Scholar
Howard, R. A. (1960). Dynamic Programming and Markov Processes. Technology Press of MIT.Google Scholar
Meyn, S. P. and Tweedie, R. L. (1993). Markov Chains and Stochastic Stability. Springer, London.CrossRefGoogle Scholar
Nowak, A. S. (1988). On the weak topology on a space of probability measures induced by policies. Bull. Polish Acad. Sci. Math. 36, 181186.Google Scholar
Piunovskiy, A. B. (1998). Controlled random sequences: methods of convex analysis and problems with functional constraints. Uspekhi Mat. Nauk 53, 129192.Google Scholar
Piunovskiy, A. B. (2004). Multicriteria impulsive control of jump Markov processes. Math. Meth. Operat. Res. 60, 125144.Google Scholar
Puterman, M. L. (1994). Markov Decision Processes: Discrete Stochastic Dynamic Programming. John Wiley, New York.CrossRefGoogle Scholar
Schäl, M. (1975). Conditions for optimality in dynamic programming and for the limit of n-stage optimal policies to be optimal. Z. Wahrscheinlichkeitsth. 32, 179196.CrossRefGoogle Scholar
Schäl, M. (1975). On dynamic programming: compactness of the space of policies. Stoch. Process. Appl. 3, 345364.CrossRefGoogle Scholar
Schäl, M. (1979). On dynamic programming and statistical decision theory. Ann. Statist. 7, 432445.CrossRefGoogle Scholar
Schäl, M. (1983). Stationary policies in dynamic programming models under compactness assumptions. Math. Operat. Res. 8, 366372.CrossRefGoogle Scholar
Schäl, M. (1990). On the chance to visit a goal set infinitely often. Optimization 21, 585592.CrossRefGoogle Scholar
Sennott, L. I. (1999). Stochastic Dynamic Programming and the Control of Queueing Systems. John Wiley, New York.Google Scholar
Venel, X. and Ziliotto, B. (2016). Strong uniform value in gambling houses and partially observable Markov decision processes. SIAM J. Control Optimization 54, 19832008.CrossRefGoogle Scholar
Zhang, Y. (2017). Continuous-time Markov decision processes with exponential utility. SIAM J. Control Optimization 55, 26362660.CrossRefGoogle Scholar