1. Introduction
This work is concerned with Markov decision processes (MDPs) on a denumerable state space endowed with a bounded cost function. It is assumed that the decision maker is risk-seeking with a constant risk-sensitivity coefficient, and the performance of a control policy is measured by a long-run average cost index. In addition to standard continuity–compactness conditions, the framework of the paper is mainly determined by two assumptions on the transition structure: (a) the state process is communicating under each stationary policy, and (b) the simultaneous Doeblin condition holds. In this context, the main objectives of this work can be described as follows:
-
(i) to prove that the optimal inferior and superior limit average cost functions coincide and are constant;
-
(ii) to characterize the optimal average cost.
The results for these problems extend to the present framework conclusions established in [Reference Cavazos-Cadena7] under the condition that the decision maker is risk-averse and, as in that paper, the characterization of the optimal average cost presented in Theorem 1(ii) is a generalized version of the Collatz–Wielandt formula for the largest eigenvalue of a positive matrix. On the other hand, an additional objective of the paper is the following:
-
(iii) to establish the existence of a stationary policy whose average index differs from the optimal one by less than a given tolerance.
The analysis of discrete-time Markov models endowed with a risk-sensitive average criterion can be traced back, at least, to the seminal paper [Reference Howard and Matheson14]. In that paper the Perron–Frobenius theory of positive matrices [Reference Meyer17] was employed to study MDPs with finite state and action spaces, and the optimal average cost was characterized via an optimality equation. On the other hand, motivated by important connections with the theory of large deviations and mathematical finance, risk-sensitive average criteria have recently been studied. Models with a countable state space are considered in [Reference Borkar and Meyn5, Reference Cavazos-Cadena and Fernández-Gaucherand8, Reference Denardo and Rothblum9, Reference Sladký20], whereas MDPs on a general state space are analyzed in [Reference Di Masi and Stettner10, Reference Di Masi and Stettner11, Reference Di Masi and Stettner12, Reference Jaśkiewicz15]. Connections of risk-sensitive average criteria with the theory of large deviations are presented in [Reference Balaji and Meyn2, Reference Kontoyiannis and Meyn16], whereas applications to mathematical finance are given in [Reference Bäuerle and Reider3, Reference Pitera and Stettner18, Reference Stettner21].
On the other hand, there are important differences between the risk-neutral and risk-sensitive average criteria. For instance, in the risk-neutral case the simultaneous Doeblin condition ensures that the optimal average cost is constant and is characterized via an optimality equation [Reference Hernández-Lerma13, Reference Puterman19], a result that is not generally valid in the risk-sensitive case [Reference Cavazos-Cadena6]. For this reason, in this work the simultaneous Doeblin condition will be supplemented with the requirement that the state space is communicating under the action of any stationary policy.
The remainder of the paper is organized as follows: In Section 2 the decision model is formally introduced, the basic assumptions of the paper are formulated, and the risk-sensitive average criteria are defined. Next, in Section 3 the idea of a subsolution of the optimality equation is formulated and the main conclusions of the paper are stated in Theorem 1, which is followed by a brief outline of the strategy used to prove that result. Section 4 contains the basic properties of the family of subsolutions which allow us to establish the fundamental auxiliary result in Theorem 2 of Section 5, establishing that if the space of stationary policies is finite and the cost function is nonpositive with compact support, then the optimality equation has a bounded solution. After those preliminaries, Theorem 1 is finally proved in Section 6, and the paper concludes with some brief comments in Section 7.
1.1. Notation
In the following, the set of nonnegative integers is denoted by $\mathbb{N}$ , and, for a real-valued function h, $\|h\|\,:\!=\, \sup\{|h(x)| \colon \hbox{belongs to the domain of } h\}$ is the corresponding supremum norm. Given a nonempty set S, the space of all bounded real-valued functions defined on S is denoted by $\mathcal{B}(S)$ , i.e. $h\colon S\rightarrow \mathbb{R}$ belongs to $\mathcal{B}(S)$ if and only if $\|h\| \lt\infty$ . On the other hand, for real numbers a and b, $a\land b = \min \{a, b\}$ , and, if $F\subset S$ , the indicator function of the subset F is denoted by $\mathbf{1}_F$ , that is, $\mathbf{1}_F(y) \,:\!=\,1$ , $y\in F$ , and $\mathbf{1}_F(y) = 0$ , $y\in S\setminus F$ . For an event W, the corresponding indicator function is denoted by $\mathbf{1}[W]$ , and every relation between random variables holds almost surely with respect to the underlying probability measure.
2. Decision model
Throughout, $\mathcal{M}=(S, A, \{A(x)\}_{x\in S}, C, P)$ represents an MDP, a model for a dynamical system whose components are as follows:
-
(i) The state space S is a denumerable set endowed with the discrete topology.
-
(ii) The control set A is a metric space.
-
(iii) For each state $x\in S$ , the nonempty set $A(x)\subset A$ is the class of possible actions at x.
-
(iv) $C\colon \mathbb{K}\to \mathbb{R}$ is the cost function, where $\mathbb{K}\,:\!=\, \{(x, a)\mid a\in A(x), x\in S\}$ is the collection of admissible pairs.
-
(v) $P=[p_{x,y}({\cdot})]$ is the controlled transition law.
This model $\mathcal{M}$ is interpreted as follows: At each time $t\in \mathbb{N}$ the state of a dynamical system is observed, say $X_t=x\in S$ , and a decision maker (controller) chooses and applies an action $A_t=a\in A(x)$ . As a consequence of such an intervention, a cost $C(X_t, A_t) = C(x,a)$ is incurred and, regardless of the previous states and actions, the state of the system at time $t+1$ will be $X_{t+1}= y\in S$ with probability $p_{X_t,X_{t+1}}(A_t)= p_{x,y}(a)$ , where $\sum_{y\in S} p_{x, y}(a) = 1$ ; this is the Markov property of the decision model.
Assumption 1.
-
(i) For every $x\in S$ , A(x) is a compact subset of A.
-
(ii) For each $x, y\in S$ , the mappings $a\mapsto p_{x, y}(a)$ and $a\mapsto C(x,a)$ are continuous in $a\in A(x)$ .
-
(iii) The cost function C is bounded, i.e. $C\in\mathcal{B}(\mathbb{K})$ .
The observed history of the process up to time n is denoted by
whereas
is the $\sigma$ -field generated by $H_n$ .
2.1. Policies
A policy is a measurable rule for choosing actions. Formally, a control policy $\pi = \{\pi_n\}_{n\in\mathbb{N}}$ is a (special) sequence of stochastic kernels $\pi_n$ on the action space A given $\mathbb{H}_n$ , where the space $\mathbb{H}_n$ of possible histories up to time n is given by $\mathbb{H}_0 = S$ , and $\mathbb{H}_n = \mathbb{K}^{n}\times S$ for $n\ge 1$ ; the vector $\textbf{h}_n= (x_0, a_0,\ldots, x_{n-1}, a_{n-1}, x_n)$ represents a generic element of $\mathbb{H}_n$ , so that $a_i \in A(x_i)$ for $i \lt n$ and $x_i \in S$ for $i\le n$ . When the controller drives the system using policy $\pi=\{\pi_n\}$ , given that the history $\mathbb{H}_n$ attains the value $\textbf{h}_n\in \mathbb{H}_n$ , the probability of choosing action $A_n$ inside a Borel subset B of $A(x_n)$ is $\pi_n(B\mid\textbf{h}_n)$ , where $\pi_n(A(x_n)\mid\textbf{h}_n) = 1$ . The family of all policies is denoted by $\mathcal{P}$ and, given the initial state $X_0 = x$ and the policy $\pi\in \mathcal{P}$ used to direct the system, the distribution $\mathbb{P}_x^\pi$ of the state–action process $\{(X_k, A_k)\}_{k\in \mathbb{N}}$ is uniquely determined [Reference Hernández-Lerma13, Reference Puterman19] and $\mathbb{E}_x^\pi[\,\cdot\,]$ represents the corresponding expectation operator. Next, set $\mathbb{F}\,:\!=\, \prod_{x\in S}A(x)$ , so that $\mathbb{F}$ is a compact metric space consisting of all functions $f\colon S\to A$ such that $f(x)\in A(x)$ for every $x\in S$ . A policy $\pi$ is stationary if there exists $f\in\mathbb{F}$ such that, under $\pi$ , at each time $t\in\mathbb{N}$ the action $A_t=f(X_t)$ is applied. The class of stationary policies and $\mathbb{F}$ are naturally identified, a convention allowing us to write $\mathbb{F}\subset \mathcal{P}$ .
2.2. Utility function and average criteria
In this paper we suppose that the controller has a constant risk-sensitivity coefficient $\lambda \ne 0$ , so that a random cost Y is assessed via the expected value of $U_{\lambda}(Y)$ , where the (dis-)utility function $U_{\lambda}$ is the strictly increasing mapping given by
notice that
When the decision maker has to choose between two random costs $Y_0$ and $Y_1$ , paying $Y_0$ will be preferred if $\mathbb{E}[U_{\lambda}(Y_1)] \gt\mathbb{E}[U_{\lambda}(Y_0)]$ , whereas the controller will be indifferent between both costs if $\mathbb{E}[U_{\lambda}(Y_1)]= \mathbb{E}[U_{\lambda}(Y_0)]$ . Given a random cost Y such that $\mathbb{E}[U_{\lambda} (Y)]$ is finite, the certainty equivalent of Y with respect to $U_{\lambda}$ is the unique real number $\mathcal{E}_{\lambda}[Y]$ determined by $U_{\lambda}(\mathcal{E}_{\lambda}[Y]) = \mathbb{E}[U_{\lambda}(Y)]$ , so that the controller will be willing to pay the amount $\mathcal{E}_{\lambda}[Y]$ to avoid facing the uncertainty associated with Y; note that
a relation that immediately leads to
Observe that $U_{\lambda}({\cdot})$ is strictly convex (resp. concave) if $\lambda \gt 0$ (resp. $\lambda \lt 0$ ), and in that case Jensen’s inequality yields that $\mathcal{E}_{\lambda}(Y) \gt \mathbb{E}[Y] $ (resp. $\mathcal{E}_{\lambda}(Y) \lt \mathbb{E}[Y]$ ) if Y is a non-constant random variable. When $\lambda \gt 0$ (resp. $\lambda \lt 0$ ) the controller is referred to as risk-averse (resp. risk-seeking). Using the idea of the certainty equivalent, risk-sensitive average criteria can be defined as follows: Let $\pi\in\mathcal{P}$ be the policy used to operate the system starting at $X_0 = x$ . The application of the first n actions $A_0, A_1,\ldots, A_{n-1}$ generates the cost $\sum_{k=0}^{n-1} C(X_k, A_k)$ and, using (5), the associated certainty equivalent is
which represents an average of $J_n(\pi, x)/n$ per step. The superior limit ( $\lambda$ -sensitive) average performance index of policy $\pi$ at state x is given by
and the corresponding optimal value function is
a policy $\pi^{\ast}\in \mathcal{P}$ is $\limsup$ ( $\lambda$ -)average optimal if $J(\pi^{\ast}, x)=J^{\ast}(x)$ for every $x\in S$ . The criterion (8) represents the worst-case point of view about the behavior of the sequence $\{J_n(\pi, x)/n\}$ of averages over a finite horizon. The optimistic perspective assesses the performance of $\pi$ via the smallest limit point of that sequence:
with the corresponding optimal value function given by
and $\pi_{\ast}\in\mathcal{P}$ is $\liminf$ ( $\lambda$ -)average optimal if $J _-(\pi_*, \cdot) = J_*({\cdot})$ . It follows that
From the definitions in (7)–(11), using (6), it is not difficult to verify that
Remark 1. In Section 6, several cost functions are considered simultaneously. In that case, the cost function is explicitly indicated in the above notation for the average criteria and optimal value functions. For instance, $J_*(C, \cdot)$ and $J ^{\ast}(C, \cdot)$ are used instead of $J_*({\cdot})$ and $J^{\ast}({\cdot})$ ; notice that
2.3. Optimality equation
Under appropriate requirements, the characterization of the optimal value functions in (9) and (11) can be based on the following optimality equation:
where $g\in\mathbb{R}$ and $h({\cdot})$ is a function defined on S. Suppose that this equality holds, and set
Remark 2. Direct consequences of (15) are the following facts:
-
(i) Let $\pi = \{\pi_n\}\in\mathcal{P}$ be a fixed policy. From (14), it follows that, for every $n\in \mathbb{N}$ and $a\in A(X_n)$ , $U_{\lambda}( g + h(X_n))\le \sum_{y\in S} p_{X_n, y}(a) U_{\lambda}(C(X_n, a) + h(y))$ ; integrating both sides of this relation with respect to $\pi_n(\text{d} a \mid H_n)$ and using the Markov property, it follows that
(16) \begin{align} U_{\lambda} (h(X_n)) & \le \int_{A(X_n)} \sum_{y\in S} p_{X_n, y}(a) U_{\lambda}(C(X_n, a) + h(y))\pi_n(\text{d} a\mid H_n) \cr & = \mathbb{E}_x^\pi[U_{\lambda}(C(X_n, A_n)- g + h(X_{n+1})) \mid \mathcal{F}_n], \quad x\in S, \ n\in\mathbb{N}. \end{align}Next, using that $\exp\big\{\lambda\sum_{t=0} ^{n-1} (C(X_t, A_t) - g)\big\}$ is $\mathcal{F}_n$ -measurable, (4) and the two previous displays together yield(17) \begin{align} Y_n & = U_{\lambda}\Bigg(\lambda\sum_{t=0}^{n-1}(C(X_t, A_t)-g) + h(X_n)\Bigg) \nonumber \\ & = \exp\Bigg\{\lambda\sum_{t=0}^{n-1}(C(X_t, A_t)-g)\Bigg\}U_{\lambda}(h(X_n)) \nonumber \\ & \leq \exp\Bigg\{\lambda\sum_{t=0}^{n-1}(C(X_t, A_t)-g)\Bigg\} \mathbb{E}_x^\pi[U_{\lambda}(C(X_n, A_n)-g+h(X_{n+1})) \mid \mathcal{F}_n] \nonumber \\ & = \mathbb{E}_x^\pi\Bigg[\exp\Bigg\{\lambda\sum_{t=0}^{n-1}(C(X_t, A_t)-g)\Bigg\} U_{\lambda}(C(X_n, A_n)-g+h(X_{n+1})) \mid \mathcal{F}_n\Bigg] \nonumber \\ & = \mathbb{E}_x^\pi\Bigg[U_{\lambda}\Bigg(\sum_{t=0}^n(C(X_t, A_t)-g)+h(X_{n+1})\Bigg) \mid \mathcal{F}_n\Bigg] = \mathbb{E}_x^\pi[Y_{n+1} \mid \mathcal{F}_n], \end{align}that is, $\{(Y_n,\mathcal{F}_n)\}_{n\in \mathbb{N}}$ is a submartingale with respect to $P_x^\pi$ for every $x\in S$ and $\pi\in\mathcal{P}$ . -
(ii) Suppose that function $h({\cdot})$ in (14) is bounded. In this context, using Assumption 1, it is not difficult to see that, for each $x\in S$ , the term within brackets in (14) is a continuos function of $a\in A(x)$ , and then it has a maximizer $f^{\ast}(x)\in A(x)$ . Thus, $U_{\lambda}(g + h(x)) = \mathbb{E}_x^{f^{\ast}}[U_{\lambda}(C(x, f^{\ast}(x)) + h(x))]$ for every $x\in S$ and, paralleling the argument in part (i), it follows that the equality always holds in (16) and (17) when $\pi$ is replaced by $f^{\ast}$ , so that $\{(Y_n,\mathcal{F}_n)\}_{n\in \mathbb{N}}$ is a martingale with respect to $\mathbb{P}_x^{f^{\ast}}$ for every $x\in S$ .
-
(iii) Maintaining the assumption that $h({\cdot})$ is bounded, observe that, for every $n\in \mathbb{N}\setminus \{0\}$ , $x\in S$ , and $\pi\in\mathcal{P}$ , part (i) implies that $\mathbb{E}_x^\pi[Y_0] \le \mathbb{E}_x^\pi[Y_n]$ , an inequality that, using (15), is equivalent to $U_{\lambda}(h(x)) \le \mathbb{E}_x^\pi\big[U_{\lambda}\big(\sum_{t=0}^{n-1}(C(X_t, A_t)-g) + h(X_{n+1})\big)\big]$ ; recalling that $U_{\lambda}({\cdot})$ is increasing, via (4) and (7) it follows that
\begin{align*} U_{\lambda}(h(x)) & \le \mathbb{E}_x^\pi\Bigg[U_{\lambda}\Bigg(\sum_{t=0}^{n-1}(C(X_t, A_t)-g) + \|h\|\Bigg)\Bigg] \\ & = \text{e}^{\lambda(\|h\| - ng)}\mathbb{E}_x^\pi\Bigg[U_{\lambda}\Bigg(\sum_{t=0}^{n-1}C(X_t, A_t)\Bigg)\Bigg] \\ & = \text{e}^{\lambda(\|h\| - ng)}U_{\lambda}(J_n(\pi, x)) = U_{\lambda}(J_n(\pi, x) - ng + \|h\|). \end{align*}Thus, $h(x) \le J_n(\pi, x) - ng + \|h\|$ , and then $g \le (1/n)(J_n(\pi, x) + \|h\| - h(x))$ for every positive integer n. Therefore,(18) \begin{equation} g \le J_-(\pi, \cdot), \quad \pi\in\mathcal{P}, \qquad \hbox{so that } g \le J_*({\cdot}), \end{equation}by (10) and (11). On the other hand, part (ii) above yields $\mathbb{E}_x^{f^{\ast}}[Y_0] = \mathbb{E}_x^{f^{\ast}}[Y_n]$ , which is equivalent to $U_{\lambda}(h(x)) = \mathbb{E}_x^{f^{\ast}}\big[U_{\lambda}\big(\sum_{t=0}^{n-1}(C(X_t, A_t)-g) + h(X_{n+1})\big)\big]$ ; paralleling the above argument, it follows that
\begin{align*} U_{\lambda}(h(x)) & \geq \mathbb{E}_x^{f^{\ast}}\Bigg[U_{\lambda}\Bigg(\sum_{t=0}^{n-1}(C(X_t, A_t)-g)-\|h\|\Bigg)\Bigg] \\ & = \text{e}^{-\lambda(ng+\|h\|)}\mathbb{E}_x^{f^{\ast}}\Bigg[U_{\lambda}\Bigg(\sum_{t=0}^{n-1}C(X_t, A_t)\Bigg)\Bigg] \\ & = \text{e}^{-\lambda(ng+\|h\|)}U_{\lambda}(J_n(f^{\ast}, x)) = U_{\lambda}({-}(ng+\|h\|) + J_n(f^{\ast}, x)). \end{align*}Therefore, $h(x) \ge J_n(f^{\ast} , x) - (ng + \|h\|)$ , and then $g \ge (1/n)(J_n(f^{\ast}, x) - (\|h\| + h(x)))$ ; via (8) and (9), it follows that $g \ge J(f^{\ast}, x) \ge J^{\ast}(x)$ , $x\in S$ . Combining this relation with (12) and (18), it follows that $g \le J_{*} \le J^{\ast}({\cdot})\le g$ , so that $J_*({\cdot}) = g = J^{\ast}({\cdot})$ and $J_-(f^{\ast}, \cdot) = g = J(f^{\ast}, \cdot)$ , i.e. $g = \lim_{n\to\infty}n^{-1}J_n(f^{\ast},\cdot)$ .
2.4. Assumptions for a constant average cost
Even under strong communication-ergodicity conditions, like those in Assumption 2 below, in the present context of a denumerable state space the existence of a solution of the optimality equation cannot be generally guaranteed for an arbitrary bounded cost function; an (uncontrolled) example illustrating this phenomenon was given in Section 9 of [Reference Cavazos-Cadena7]. However, as we verify later, the strong requirements imposed below ensure that the optimal value functions are constant. To continue, it is convenient to introduce the following notation.
For each set $F \subset S$ , define the first return time to F by
where, by convention, the minimum of the empty set is $\infty$ ; if $F = \{x\}$ is a singleton, the simpler notation
is used. Combining the above definition with (1) and (2), it follows that $[T_F= n]\in\mathcal{F}_n$ for every $n\in\mathbb{N}$ , so that $T_F$ is a stopping time with respect to the filtration $\{\mathcal{F}_n\}$ .
Assumption 2. There exists $z\in S$ such that the following properties hold.
-
(i) Accessibility from z:
Under the action of any stationary policy, every state $y\in S$ is accessible from z, i.e. $\mathbb{P}_{z}^f[T_y \lt \infty] \gt 0$ , $y\in S$ , $f\in \mathbb{F}$ .
-
(ii) Simultaneous Doeblin condition:
The first return time $T_{z}$ satisfies $\sup_{x\in S, f\in\mathbb{F}} \mathbb{E}_x^f[T_{z}] \lt\infty$ .
The proof of the following result can be found in [Reference Cavazos-Cadena7].
Lemma 1. Suppose that Assumptions 1 and 2 hold. Then:
-
(i) For every $y \in S$ , there exists a finite constant $M_{y}$ such that
(21) \begin{equation} \mathbb{E}_x^\pi[T_y] \le M_{y}, \qquad x\in S,\ \pi\in\mathcal{P}. \end{equation} -
(ii) For arbitrary different states $x, y\in S$ , $\mathbb{P}_x^\pi[T_y \lt T_x] \gt 0$ , $\pi\in\mathcal{P}$ .
2.5. The problems
As already observed, under Assumptions 1 and 2, the optimal average cost cannot be generally characterized via the optimality equation (14), since it does not necessarily admit a solution. Thus, looking for a characterization of the optimal average cost is an interesting problem. The main objectives of this paper are:
to show that the optimal value functions $J^{\ast}({\cdot})$ and $J_*({\cdot})$ coincide and are constant;
to characterize the optimal average cost.
These problems were addressed in [Reference Cavazos-Cadena7] for the case $\lambda \gt 0$ of a risk-averse controller, so, from this point onwards, our analysis focuses on the risk-seeking case $\lambda \lt 0$ .
3. Main result
In this section the main conclusions of this work are stated in Theorem 1, which extends to the risk-seeking context results established in [Reference Cavazos-Cadena7] for the risk-averse case. The ideas in the following definition will be useful.
Definition 1.
-
(i) The set $\tilde{\mathcal{G}}$ of subsolutions of the optimality equation (14) for model $\mathcal{M}$ consists of all pairs $(g, h({\cdot}))$ , where $g \in\mathbb{R}$ and the function $h \colon S \to \mathbb{R}$ satisfy
(22) \begin{equation} U_{\lambda}(g + h(x)) \le \inf_{a\in A(x)}\Bigg[\sum_{y\in S}p_{x, y}(a)U_{\lambda}(C(x,a) + h(y))\Bigg], \qquad x\in S. \end{equation} -
(ii) The set $\mathcal{G}$ is the projection of $\tilde{\mathcal{G}}$ on its first coordinate, i.e. $\mathcal{G} = \{g\in\mathbb{R} \mid (g, h) \in \tilde{\mathcal{G}}$ for some $h \colon \mathbb{R} \to \mathbb{R}\}$ .
Remark 3. When two cost functions are simultaneously being examined, the sets $\tilde{\mathcal{G}}$ and $\mathcal{G}$ in the above definition are denoted by $\tilde{\mathcal{G}}(C)$ and $\mathcal{G}(C)$ , respectively, making it clear what cost function is being considered. With this notation, it follows that, for each $\beta\in\mathbb{R}$ ,
whereas
Via (3), it is not difficult to see that if $\lambda \lt 0$ , then inequality (22) is equivalent to
Using the above notation, the main result of the paper can be stated as follows.
Theorem 1. Suppose that $\lambda \lt 0$ , and that Assumptions 1 and 2 hold. In this context, the following assertions are valid:
-
(i) The inclusion $g \in \mathcal{G}$ is equivalent to $g \le J_*({\cdot})$ .
-
(ii) For each $x\in S$ , $J_*(x) = \sup\{g \mid g \in \mathcal{G}\} = J^{\ast}(x)$ .
-
(iii) Set $g^{\ast} = \sup \{g \mid g \in \mathcal{G}\}$ . With this notation, $g^{\ast} \in \mathcal{G}$ . More explicitly, there exists $h^{\ast} \colon S \to \mathbb{R}$ such that $(g^{\ast},h^{\ast}({\cdot})) \in \tilde{\mathcal{G}}$ , that is,
(25) \begin{equation} \text{e}^{\lambda g^{\ast} + \lambda h^{\ast}(x)} \ge \sup_{a\in A(x)} \Bigg[\text{e}^{\lambda C(x,a)}\sum_{y\in S}p_{x, y}(a)\text{e}^{\lambda h^{\ast}(y)}\Bigg], \qquad x\in S. \end{equation} -
(iv) For each $\varepsilon \gt 0$ , there exists a stationary policy $f\in \mathbb{F}$ which is $\varepsilon$ -optimal in the sense that $g^{\ast} \le J_-(f, x) \le J(f, x) \le g^{\ast} + \varepsilon$ , $x\in S$ .
The equality in Theorem 1(ii) is an extension of the Collatz–Wielandt formula for the largest eigenvalue of a positive (finite-dimensional) matrix ([Reference Meyer17]); see, for instance, the discussion on this point presented in Remark 3.2 of [Reference Cavazos-Cadena7]. The rather technical proof of the above result will be presented in Section 6. Roughly, the backbone of the argument consists in ‘approximating’ the optimal average cost of the original model $\mathcal{M}$ , via MDPs with finite spaces of stationary policies and cost functions with finite support. This strategy is developed in the following two sections, and can be briefly outlined as follows: Section 4 concerns basic properties of the class of subsolutions. First, it is shown that if $(g,h({\cdot}))\in\tilde{\mathcal{G}}$ then $h({\cdot})$ is bounded from above, and g is a lower bound for the optimal inferior limit average index; moreover, given that the system is driven by $\pi\in\mathcal{P}$ starting at $x\in S$ , it is shown that h(x) is bounded from above by the certainty equivalent of the total relative cost $\sum_{t=0}^{T_w -1 }(C(X_t, A_t)- g)$ up to the first return time to a given state w. Additionally, such a certainty equivalent is studied for the case in which the system evolves under a stationary policy, and the results in this direction are used in Section 5 to show that, if the space of stationary policies is finite and C has finite support, then the optimality equation (14) admits a solution $(g, h({\cdot}))$ where the mapping $h({\cdot})$ is bounded. After these preliminaries, the proof of Theorem 1 is finally presented in Section 6 before the concluding remarks.
4. Fundamental tools
This section presents auxiliary results that will be used to establish Theorem 1. The starting point is the following lemma, which concerns two basic properties of the pairs in $\tilde{\mathcal{G}}$ , namely, if $(g,h({\cdot})) \in \tilde{\mathcal{G}}$ then (i) the functional part h is bounded from above, and (ii) g is a lower bound of the optimal inferior limit average index.
Lemma 2. Given that $\lambda \lt 0$ , under Assumptions 1 and 2 the following properties hold for each $w\in S$ and $(g,h({\cdot})) \in \tilde{\mathcal{G}}$ :
-
(i) For each $x\in S$ and $\pi\in \mathcal{P}$ ,
(26) \begin{equation} \text{e}^{\lambda h(x) - \lambda h(w)} \ge \mathbb{E}_x^\pi \Bigg[\exp\Bigg\{\lambda\sum_{t=0}^{T_w - 1}(C(X_t, A_t) - g)\Bigg\}\Bigg], \end{equation}and then(27) \begin{equation} 1 \ge \mathbb{E}_w^\pi\Bigg[\exp\Bigg\{\lambda\sum_{t=0}^{T_w - 1}(C(X_t, A_t) - g)\Bigg\}\Bigg]. \end{equation} -
(ii) $h({\cdot}) - h(w) \le \|C-g\| M_w$ , where $M_w$ is as in (21).
-
(iii) $g \le J_*(x)$ for each $x\in S$ .
Proof. For (i), let $\pi\in\mathcal{P}$ and $x\in S$ be arbitrary. Given $n\in \mathbb{N}$ , observe that, combining Definition 1 with (24), it follows that $\text{e}^{\lambda h(X_n)} \ge \sum_{y\in S}\text{e}^{\lambda(C(X_n, a)-g)}p_{X_n,y}(a)\text{e}^{\lambda h(y)}$ for every $a\in A(X_n)$ , and then
where the equality is due to the Markov property. Now set $Z_0 = e^{\lambda h(X_0)}$ , $Z_n = \exp\big\{\lambda\sum_{t=0}^{n-1}(C(X_t,A_t) - g) + \lambda h(X_n)\big\}$ , $n= 1,2,3,\ldots$ Using these and the previous display, and the fact that $\exp\big\{\lambda\sum_{t=0}^{n-1}(C(X_t,A_t) - g)\big\}$ is $\mathcal{F}_n$ -measurable, an argument along the lines of the one used in Remark 2 yields that $\{(Z_n,\mathcal{F}_n)\}$ is a supermartingale with respect to $\mathbb{P}_x^\pi$ , i.e. $Z_n \ge \mathbb{E}_x^\pi[Z_{n+1} \mid \mathcal{F}_n]$ , $n\in\mathbb{N}$ , $\mathbb{P}_x^\pi$ -almost surely (a.s.). Therefore, since $T_w$ is a stopping time with respect to the filtration $\{\mathcal{F}_n\}$ , the optional sampling theorem yields $\mathbb{E}_x^\pi[Z_0] \ge \mathbb{E}_x^\pi[Z_{n\land T_w}]$ (Theorem 7.7.3, p. 304 of [Reference Ash1]; Theorem 35.2, p. 405 of [Reference Billingsley4]), i.e.
Now observe that $\mathbb{P}_x^\pi[T_w \lt \infty] = 1$ , by (21), and that $X_{T_w} = w$ on the event $[T_w \lt\infty]$ , by (19). Therefore, with probability 1 with respect to $\mathbb{P}_x^\pi$ ,
Thus, via Fatou’s lemma, (28) yields
and (26) follows; setting x equal to w leads to (27).
For (ii), from Jensen’s inequality it follows that
where the negativity of $\lambda$ and (21) were used in the last step. The above relation and (26) together imply that $\lambda h({\cdot}) - \lambda h(w) \ge \lambda\|C - g\|M_w$ , and the conclusion follows.
For (iii), let $x\in S$ and $\pi\in\mathcal{P}$ be arbitrary. Recalling that $\{(Z_n,\mathcal{F}_n)\}$ is a supermartingale with respect to $\mathbb{P}_x^\pi$ , for every positive integer n,
where the second inequality is due to part (ii), and (7) was used in the last step. It follows that $\lambda h(x) \ge \lambda J_n(\pi, x) - n\lambda g + \lambda(h(w) + \|C - g\|M_w)$ , and then
Taking the inferior limit as n goes to $\infty$ in both sides of this relation, it follows that $g \le J_-(\pi, x)$ , and then $g \le J_*({\cdot})$ , since $x\in S$ and $\pi\in\mathcal{P}$ are arbitrary in this argument.
The remainder of the section is dedicated to studying inequality (27) for a stationary policy f, and it is convenient to introduce additional notation.
Definition 2. Let $f\in \mathbb{F}$ , $g\in\mathbb{R}$ , and $w\in S$ be arbitrary but fixed, and define the total relative cost function up to the first return time to state w under f by
Notice that this definition and (29) with f instead of $\pi$ together yield $\lambda h_{f,g,w}(x) \ge \lambda\|C-g\|M_w$ , and the negativity of $\lambda$ leads to
On the other hand, a conditional argument combining Definition 2 with the Markov property yields
and then
Next, the following lemma shows that if $h_{f,g,w}(w)$ is finite at some point, then $h_{f,g,w}({\cdot})$ is finite on S, and a fundmental property of the mapping $y \mapsto h_{f, g, y}(y)$ will be established, namely, if that function attains a non-negative value at some point, then it is always non-negative.
Lemma 3. Assume that $\lambda \lt 0$ and that Assumptions 1 and 2 are valid. In this context, the following assertions hold for arbitrary $f \in \mathbb{F}$ , $g \in \mathbb{R}$ , and $w\in S$ .
-
(i) If $h_{f, g, w}(w) \gt -\infty$ , then $h_{f,g,w}(x) $ is finite for every $x\in S$ .
-
(ii) The following assertions are equivalent:
-
(a) $h_{f, g, w}(w) \ge 0$ .
-
(b) $h_{f, g, w}({\cdot})$ is finite and the following Poisson inequality holds:
(32) \begin{equation} \text{e}^{\lambda h_{f,g,w}(x)} \ge \text{e}^{\lambda(C(x,f(x))-g)}\sum_{y\in S}p_{x,y}(f(x))\text{e}^{\lambda h_{f,g,w}(y)}, \qquad x\in S. \end{equation} -
(c) $h_{f, g, y}(y) \ge 0$ for every $y\in S$ .
-
Proof. For (i), given $x\in S\setminus \{w\}$ , using Lemma 1(ii) pick a positive integer k satisfying
and notice that $[T_x = k \lt T_w] = [X_t \notin \{x, w\}, 1\le t \lt k, X_k = x]$ belongs to $\mathcal{F}_k$ (see (2)), so that
where the last equality is due to the Markov property. Since $X_k = x$ on the event $[T_x = k]$ , it follows from Definition 2 that
so that
and then (33) yields
Since $x\in S\setminus \{w\}$ is arbitrary in this argument, this last display and (30) together yield that if $h_{f,g,w}(w)$ is finite then $h_{f,g,w}(x)$ is finite for every $x\in S$ .
For (ii) (c) $\implies$ (a), the implication is clear. For (a) $\implies$ (b), suppose that $h_{f,g,w}(w) \ge 0$ , so that $h_{f,g,w}({\cdot}) $ is a finite function, by part (i), and $1 \ge \text{e}^{\lambda h_{f,g,w}(w)}$ , since $\lambda$ is negative. Thus, from (31),
and (32) follows. For (b) $\implies$ (c), let $y\in S$ be arbitrary but fixed. Consider the new MDP $\mathcal{M}_f$ obtained from $\mathcal{M}$ by setting $ A(x) = \{f(x)\}$ . In this case f is the unique policy in this new model, and (32) establishes that the pair $(g,h_{f,g,w}({\cdot}))$ belongs to the set $\tilde{\mathcal{G}}(\mathcal{M}_f)$ of subsolutions corresponding to $\mathcal{M}_f$ , so that the conclusions of Lemma 2(i) applied to model $\mathcal{M}_f$ with y instead of w hold. In particular, inequality (27) with f instead of $\pi$ is valid, and then $1 \ge \mathbb{E}_y^f\big[\exp\big\{\lambda\sum_{t=0}^{T_y-1}(C(X_t,A_t)-g)\big\}\big] = \text{e}^{\lambda h_{f,g,y}(y)}$ , so that $h_{f,g,y}(y)\ge 0$ .
The next result is a natural complement of Lemma 3(ii).
Lemma 4. Let $f\in \mathbb{F}$ , $g\in\mathbb{R}$ , and $w\in S$ be arbitrary but fixed. Under the conditions in Lemma 3, the following assertions hold.
-
(i) $h_{f,g,w}(w) \gt 0 \iff h_{f,g,y}(y) \gt 0$ for every $y\in S$ .
-
(ii) $h_{f,g,w}(w) = 0 \iff h_{f,g,y}(y) = 0$ for every $y\in S$ .
Proof. For (i), the $\Longleftarrow$ part is clear. To prove the $\implies$ , suppose that $h_{f,g,w}(w) \gt 0$ . In this case $h_{f,g,w}({\cdot})$ is finite by Lemma 3, and, recalling that $\lambda$ is negative, $1 \gt \text{e}^{\lambda h_{f,g,w}(w)} = \mathbb{E}_w^f\big[\exp\big\{\lambda\sum_{t=0}^{T_w-1}(C(X_t,A_t)-g)\big\}\big]$ . Thus,
Next, define the cost function $\hat {C}$ by $\hat C(x,\cdot) = C(x,\cdot)$ , $x\ne w$ , $\hat{C}(w,\cdot) = C(w,\cdot) + \rho$ , and consider the new model $\hat{\mathcal{M}}$ obtained from $\mathcal{M}$ by replacing C by $\hat{C}$ . In this case, if $\hat{h}_{f,g,w}$ is the total relative cost in Definition 2 associated with model $\hat{\mathcal{M}}$ , (34) yields that $\text{e}^{\lambda\hat{h}_{f,g,w}(w)} = 1$ , i.e. $\hat{h}_{f,g,w}(w) = 0$ , and then the equivalence of assertions (a) and (c) in Lemma 3(ii) applied to model $\hat{\mathcal{M}}$ implies that
Next, combining this fact with the inequality $h_{f,g,w}(w) \gt 0$ , we show that $h_{f,g,y}(y) \gt 0$ for $y\in S$ . Since $h_{f,g,w}(w) \gt 0$ , to achieve this goal it is sufficient to establish the above inequality for $y\ne w$ . In this context, observe that (35) yields
where the strict inequality follows by observing that (a) $\exp\big\{\lambda\rho\sum_{t=0}^{T_y-1}\mathbf{1}_{\{w\}}(X_t)\big\} \ge 1$ , since $\lambda$ and $\rho$ are negative, and that (b) $\exp\big\{\lambda\rho\sum_{t=0}^{T_y-1}\mathbf{1}_{\{w\}}(X_t)\big\} \gt 1$ on the event $[T_w \lt T_y]$ , which has positive probability with respect to $\mathbb{P}_y^f$ by Lemma 1(ii), whereas the last equality is due to Definition 2. Thus, $1 \gt \text{e}^{\lambda h_{f,g,y}(y)}$ , and then $h_{f,g,y}(y) \gt 0$ , since $\lambda$ is negative. This completes the proof of part (i).
For (ii), the assertion follows by combining part (i) with the equivalence of properties (a) and (c) in Lemma 3(ii).
5. Optimality equation in a special case
The objective of this section is to provide sufficient conditions to ensure the existence of a bounded solution to the optimality equation (14). The result in this direction requires special conditions on the cost function and the action sets, and is stated in Theorem 2. That result will be established using the following lemma.
Lemma 5. Let $\lambda \lt 0$ and $f\in \mathbb{F}$ be arbitrary but fixed. Suppose that the cost function $C(\cdot, f({\cdot}))$ under f satisfies the following requirements:
-
(a) $C(x,f(x)) \le 0$ for every $x\in S$ ;
-
(b) $C(\cdot,f({\cdot}))$ has finite support, i.e.
(36) \begin{equation} F \,:\!=\, \{x \in S \mid C(x,f(x)) \lt 0\}\ { is\ finite}. \end{equation}
In this case there exists a real number g(f) such that the following properties are valid:
-
(i) $g(f)\le 0$ , and the functions $\{h_{f,g(f),w}\}_{w \in S}$ satisfy
(37) \begin{equation} \text{e}^{\lambda h_{f,g(f),w}(w)} = \mathbb{E}_w^f\Bigg[\exp\Bigg\{\lambda\sum_{t=0}^{T_w-1}(C(X_t,A_t)-g(f))\Bigg\}\Bigg] = 1, \qquad w\in S; \end{equation}see Definition 2. -
(ii) For each $w\in S$ , the function $h_{f,g(f),w}({\cdot})$ is finite and the following Poisson equation holds:
(38) \begin{equation} \text{e}^{\lambda h_{f,g(f),w}(x)} = \text{e}^{\lambda(C(x,f(x))-g(f))}\sum_y p_{x,y}(f(x))\text{e}^{\lambda h_{f,g(f),w}(y)}, \qquad x\in S. \end{equation} -
(iii) The function $h_{f,g(f),w}({\cdot})$ is bounded for each $w\in S$ . More explicitly,
$$-\infty \lt\inf_{x\in S}h_{f,g(f),w}(x) \le h_{f,g(f),w}({\cdot}) \le M_w\|C - g(f)\|.$$ -
(iv) Given $w\in S$ , for $n\in \mathbb{N}$ set
(39) \begin{equation} W_n(f) \,:\!=\, \exp\Bigg\{\lambda\sum_{t=0}^{n\land T_w-1}(C(X_t,A_t)-g(f)) + \lambda h_{f,g,w}(X_{n\land T_w})\Bigg\}. \end{equation}With this notation, $W_n(f) = \mathbb{E}_x^f\big[\exp\big\{\lambda\sum_{t=0}^{T_w-1}(C(X_t,A_t)-g(f))\big\} \mid \mathcal{F}_n\big]$ , $\mathbb{P}_x^f$ -a.s., $x \in S$ , $n\in\mathbb{N}$ ; see (2), (19), and (20).
Proof. Recall that the equality $A_t = f(X_t)$ is always valid when the system is driven using f.
The argument to prove (i) is by induction on $|F|$ , the number of elements of the set F in (36). If $|F| = 0$ then $C(\cdot,f({\cdot})) = 0$ , so that, setting $g(f) = 0$ , it follows that (37) holds for each $x\in S$ . Next, assume that, for some nonnegative integer n, the desired conclusion is valid when the support of $C(\cdot,f({\cdot}))$ has n elements, and suppose that $|F| = n+1$ . Pick $\tilde x \in F$ , so that
and set $\tilde{F} = F \setminus \{\tilde{x}\}$ , $\tilde{C}(x,a) = C(x,a)\mathbf{1}_{\tilde{F}}(x)$ , $(x,a) \in \mathbb{K}$ , so that
It follows that the support of the function $\tilde{C}(\cdot,f({\cdot}))$ is the set $\tilde{F}$ , which has n elements. Thus, by the induction hypothesis there exists $\tilde{g}(f)\le 0$ such that (37) holds with $\tilde{C}$ and $\tilde{g}(f)$ instead of C and g(f), respectively; in particular,
Next, observe that (41) yields
where the second equality was obtained using that $X_t \ne \tilde x$ for $1 \le t \lt T_{\tilde x}$ , by (19) and (20), whereas the relation $\mathbb{P}_{\tilde x}^f[X_0 = \tilde x] = 1$ was used in the last step. Thus, (37) leads to
where, recalling that $\lambda \lt 0$ , the inequailty is due to (40). Now observe that
whereas, since the mapping $\gamma \mapsto \exp\big\{\lambda\sum_{t=0}^{T_{\tilde x}-1}(C(X_t,A_t)-\gamma)\big\}$ is increasing,
Combining (42) with the two previous displays, the dominated convergence theorem implies that $v(\gamma) \,:\!=\, \mathbb{E}_{\tilde x}^f\big[\exp\big\{\lambda\sum_{t=0}^{T_{\tilde x}-1}(C(X_t,A_t) - \gamma)\big\}\big]$ is continuous in $\gamma \in ({-}\infty,\tilde g(f)]$ , and that $v(\gamma) \searrow 0$ as $\gamma \searrow -\infty$ ; since $v(\tilde g(f)) = \text{e}^{\lambda C(\tilde x,f(\tilde x))} \gt 1$ , by (42), the intermediate value property implies that there exists $g(f) \in ({-}\infty,\tilde g(f)) \subset({-}\infty,0)$ such that
From this point, Lemma 4(ii) yields that (37) holds for every $x\in S$ . This completes the induction argument.
For (ii), since $\text{e}^{\lambda h_{f,g(f),w}(w)} = 1$ for every $w\in S$ , by part (i), Lemma 3(i) yields that $h_{f,g(f),w}({\cdot})$ is finite, and (38) follows from (31).
For (iii), recall that the equality $\mathbb{P}_x^\pi[T_w \lt\infty] = 1$ is always valid, by Lemma 1(i). Now, define the finite set $G = \{w\}\cup F$ such that $T_G\le T_w$ and $T_G \le T_F$ , and then
Next, select an initial state
so that
and observe that
where the third equality is due to (43)–(45), and the inequality stems from the negativity of $\lambda$ and g(f). It follows that, for every $x\in S\setminus G$ ,
where, via a conditional argument, the second equality follows from the Markov property. Thus, since $X_{T_G} \in G$ on the event $[T_G \lt \infty]$ and $\mathbb{P}_x^f[T_G \lt \infty] \ge \mathbb{P}_x^f[T_w \lt \infty] = 1$ , it follows that
and then $h_{f,g(f),w}(x) \ge \lambda^{-1}[\log(2) + \max_{y \in G}|\lambda h_{f,g(f),w}(y)|]$ when $x\in S\setminus G$ ; since $h_{f,g(f),w}({\cdot})$ is a finite function and the set G is finite, it follows that $h_{f,g(f),w}({\cdot})$ is bounded from below.
For (iv), looking at (2), (19), and (20) we observe that, for each $x, w\in S$ and $n, k \in\mathbb{N}\setminus \{0\}$ with $k \le n$ , the random variables $\mathbf{1}[T_w = k]$ , $\sum_{t=0}^{k-1}(C(X_t,A_t) - g(f))$ , and $\mathbf{1}[T_w \gt n ]$ are $\mathcal{F}_n$ -measurable, and then
where the Markov property was used to set the third equality. Next, notice that on the event $[T_w \le n]$ the equality $X_{n\land T_w} = w$ holds, and in this case $\text{e}^{\lambda h_{f,g(f),w}(X_{n\land T_w})} = \text{e}^{\lambda h_{f,g(f),w}(w)} = 1$ , by part (i). Combining this fact with the previous display, it follows that
completing the proof.
Next, the previous lemma will be used to establish the main result of the section, which will be derived under the following conditions.
Assumption 3. There exists a finite set $F\subset S$ such that the cost function and the action sets satisfy the following properties.
-
(i) For each $x\in S\setminus F$ , the action set A(x) is a singleton, whereas A(x) is finite for $x\in F$ .
-
(ii) If $x\in S\setminus F$ , $C(x,\cdot) = 0$ and $C(x,\cdot) \le 0$ if $x\in F$ .
Theorem 2. Suppose that $\lambda \lt 0$ , and that Assumptions 2 and 3 are valid. In this context, for each $w\in S$ , the following assertions hold.
-
(i) For each stationary policy $\tilde f$ , the cost function $C(\cdot, \tilde f({\cdot}))$ satisfies conditions (a) and (b) in Lemma 5, and there exists $f\in\mathbb{F}$ such that the pair $(g(f),h_{f,g(f),w}({\cdot})) \in ({-}\infty, 0] \times \mathcal{B}(S)$ satisfies
\begin{align*} \text{e}^{\lambda g(f) + \lambda h_{f,g(f),w}(x)} & = \sup_{a\in A(x)}\Bigg[\text{e}^{\lambda C(x,a)}\sum_{y\in S}p_{x,y}(a)\text{e}^{\lambda h_{f,g(f),w}(y)}\Bigg], \cr & = \text{e}^{\lambda C(x,f(x))}\sum_{y\in S}p_{x,y}(f(x))\text{e}^{\lambda h_{f,g(f),w}(y)}, \qquad x\in S. \end{align*} -
(ii) $J_*({\cdot}) = g(f) = J^{\ast}({\cdot})$ , and $g(f) = \lim_{n\to\infty}({1/n})J_n(f, x)$ , $x\in S$ .
Proof. Starting with part (i), notice that Assumption 3 implies that the space $\mathbb{F}$ of stationary policies is finite and that, for each $\tilde f\in\mathbb{F}$ , the nonpositive cost function $C(\cdot,\tilde f({\cdot}))$ has finite support, so that conditions (a) and (b) in Lemma 5 hold for every $\tilde f\in\mathbb{F}$ ; see (36). It follows that, for each $\tilde f \in \mathbb{F}$ , there exists $g(\tilde f)$ such that, for each state w, the pair $(g(\tilde f),h_{\tilde f,g(\tilde f),w}({\cdot}))\in({-}\infty,0]\times\mathcal{B}(S)$ satisfies the conclusions in Lemma 5. Pick $f\in \mathbb{F}$ such that
Now let $w\in S$ be arbitrary but fixed, and note that (38) holds by Lemma 5(ii), so that
Thus, to establish the first assertion it is sufficient to show that the equality holds in the above relation. To achieve this goal, recall that the space A(y) is finite for every $y\in S$ , which allows us to pick $f^{\ast}\in \mathbb{F}$ such that
Combining these two last displays, it follows that, for each state $x\in S$ ,
and then there exists a function $\Delta\colon S\to\mathbb{R}$ such that
and
notice that $g(f) - g(f^{\ast})\le 0$ , by (46), and then
since $\lambda \lt 0$ . To complete the proof of part (i) it is sufficient to show that
since this equality combined with (48) and (50) yields that the equality occurs in (47). To establish (52), recall that the equality $A_n = f^{\ast}(X_n)$ always holds with probability 1 under $f^{\ast}$ , so that, via (50) and the Markov property, it follows that, for each $x\in S$ and $n\in\mathbb{N}$ ,
Now define the positive random variables
Observing that under $f^{\ast}$ the random variable $\sum_{t=0}^{n}[{-}\Delta(X_t)+\lambda(C(X_t,A_t)-g(f))]$ is $\mathcal{F}_n$ -measurable, and using the Markov property, it follows that, for every $x\in S$ and $n\in\mathbb{N}$ ,
where the last equality is due to (53). This last relation and (54) lead to $\mathbb{E}_x^{f^{\ast}}[V_{n+1}\mid\mathcal{F}_n] = V_n$ , i.e. $\{(V_n,\mathcal{F}_n)\}_{n\in\mathbb{N}}$ is a martingale with respect to $\mathbb{P}_x^{f^{\ast}}$ . Via the optional sampling theorem it follows that, for every $x\in S$ and $n\in\mathbb{N}$ ,
Recalling that $\mathbb{P}_x^{f^{\ast}}[T_w \lt \infty] = 1$ , by Lemma 1(i), and using that $X_{T_w} = w$ on the event $[T_w \lt \infty]$ , it follows that
Next, we compare $V_n$ with the random variable $W_n(f^{\ast})$ introduced in Lemma 5; see (39). Observe that
where
and the inequality is due to (49) and (51). Thus,
where $W_n(f^{\ast})$ is as in (39). Since the equality
holds $\mathbb{P}_x^{f^{\ast}}$ -a.s. for every $x\in S$ and $n\in\mathbb{N}$ , by Lemma 5(iv), it follows that $\{W_n(f^{\ast})\}_{n\in\mathbb{N}}$ is uniformly integrable with respect to $\mathbb{P}_x^{f^{\ast}}$ , i.e. $\mathbb{E}_x^{f^{\ast}}[W_n(f^{\ast})\mathbf{1}[W_n(f^{\ast}) \gt c]] \to 0$ as $c\to\infty$ uniformly in n; note that $W_n(f ^{\ast}) \ge 0$ , and see, for instance, Theorem 7.6.6, p. 300 of [Reference Ash1], or Theorem 16.14, p. 230 of [Reference Billingsley4]. Since $\|h_{f,g(f),w}-h_{f^{\ast},g(f^{\ast}),w}\|$ is finite, by Lemma 5(iii), the relation $0 \le V_{n\land T_w} \le W_n(f^{\ast})\text{e}^{|\lambda|\,\|h_{f,g(f),w}-h_{f^{\ast},g(f^{\ast}),w}\|}$ obtained from the above display immediately implies that $\{V_{n\land T_w}\}_{n\in\mathbb{N}}$ is uniformly integrable with respect to $\mathbb{P}_x^{f^{\ast}}$ . Combining this property with (55), an application of Theorem 7.5.2, p. 295 in [Reference Ash1] allows us to interchange the limit and the expectation to obtain that, for every $x\in S$ ,
Setting $x = w$ , it follows that
and then
where $Q \,:\!=\, T_{w}\lambda(g(f) - g(f^{\ast})) + \sum_{t=0}^{T_w-1}\Delta(X_t) \ge 0$ , and the inequality is due to (49) and (51). Observe now that $1 = \mathbb{E}_w^{f^{\ast}}\big[\exp\big\{\lambda\sum_{t=0}^{T_w-1}(C(X_t,A_t)-g(f^{\ast}))\big\}\big]$ by Lemma 5(i), and combining this equality with the two previous expressions it follows that $1 = \mathbb{P}_w^{f^{\ast}}[Q = 0]$ , an equality that, combined with (49) and (51), implies that $\mathbb{P}_w^{f^{\ast}}\big[\sum_{t=0}^{T_w-1}\Delta(X_t) = 0\big]= 1$ .
To conclude, observe that $\sum_{t=0}^{T_w-1}\Delta(X_t) \ge \Delta(X_0) \ge 0$ and $\sum_{t=0}^{T_w-1}\Delta(X_t) \ge \Delta(x)\mathbf{1}[T_x \lt T_w] \ge 0$ , $x \in S\setminus \{w\}$ , and recall that $\mathbb{P}_w^{f^{\ast}}[T_x \lt T_w] \gt 0$ by Lemma 1(ii), whereas $1 = \mathbb{P}_w[X_0= w]$ . Combining these relations with the two previous expressions, it follows that $\Delta({\cdot}) = 0$ . As already mentioned, this completes the proof of part (i). From this point, part (ii) follows via Remark 2(iii).
6. Proof of Theorem 1
In this section the main result of the paper is established. Essentially, the approach used to prove Theorem 1 consists in ‘approximating’ the original model $\mathcal{M}$ by a sequence $\{\mathcal{M}_n\}$ of MDPs satisfying the conditions in Theorem 2. Throughout the remainder of the section Assumptions 1 and 2, as well as the condition $\lambda \lt 0$ , are enforced. The argument relies heavily on the preliminaries established in Section 5, as well as on Lemma 6. To begin with, suppose that the cost function C satisfies
let $\{F_n\}_{n\in\mathbb{N}}$ be a sequence of subsets of S such that
and define the truncated cost function $C_n$ by
Now, let $f_0\in \mathbb{F}$ be a fixed stationary policy, and recall that, for each $x\in S$ , the action set A(x) is a compact subspace of the metric space A, so that there exists a sequence of $\{D_n(x)\}_{n\in\mathbb{N}}$ of subsets of A(x) such that
With this notation, for each $n\in\mathbb{N}$ define the collection $\{A_n(x)\}_{x\in S}$ of action sets as follows:
Next, let
be the MDP obtained from $\mathcal{M}$ after replacing the actions sets $\{A(x)\}_{x\in S}$ and the cost function C by $\{A_n(x)\}$ and $C_n$ , respectively. It follows from (56)–(61) that, for every $n\in\mathbb{N}$ , the conditions in Theorem 2 are satisfied by $\mathcal{M}_n$ , and then there exists a stationary policy $f_n \in \prod_{x\in S}A_n(x)$ such that, given $w\in S$ , the pair
has the following properties:
by Lemma 5 and (56). It follows that $g_n \in [{-}\|C\|, 0]$ and $h_n(x)b\in ({-}\infty, 2M_w\|C\|]$ , $x\in S$ , $n\in\mathbb{N}$ . Via Cantor’s diagonal method, it is possible to determine a subsequence of $(g_n,h_n({\cdot}))$ (by notational convenience, still denoted by $(g_n,h_n({\cdot}))$ ) which is convergent:
The following lemma, using the conditions and definitions in (56)–(65), is the last step before the proof of Theorem 1.
Lemma 6. The pair $(g^{\ast},h^{\ast}({\cdot}))$ in (65) satisfies the following properties.
-
(i) $h^{\ast}({\cdot})$ is a finite function, and
(66) \begin{equation} \text{e}^{\lambda g^{\ast} + \lambda h^{\ast}(x)} \ge \sup_{a\in A(x)}\Bigg[\text{e}^{\lambda C(x,a)} \sum_{y\in S}p_{x,y}(a)e^{\lambda h^{\ast}(y)}\Bigg], \qquad x\in S, \end{equation}and then $(g^{\ast}, h^{\ast}({\cdot})) \in \tilde{\mathcal{G}}$ ; see Definition 1 and (24). -
(ii) For each $x\in S$ ,
(67) \begin{equation} J_*(x) = g^{\ast} = J^{\ast}(x). \end{equation}
Proof. For (i), let $x\in S$ be arbitrary but fixed and select $n_x\in\mathbb{N}$ such that $x\in F_k$ if $k\ge{n_x}$ . Next, let $d\in D(x)$ be arbitrary, so that there exists $n_{(d)}\in \mathbb{N}$ such that $d\in D_k(x)$ for $k\ge n_{(d)}$ . It follows from (60) that $d\in A_n(x)$ for $n \ge \max\{n_x, n_{(d)}\}$ , and then (63) implies
where the equality is due to the definition of $C_n$ in (58). Taking the inferior limit as n goes to $\infty$ in this relation, (65) and Fatou’s lemma together lead to
Next, given $a\in A(x)$ , observe that (59) implies that there exists a sequence $\{d_k\}_{k\in\mathbb{N}}$ contained in D(x) such that $\lim_{k\to\infty}d_k = a$ ; replacing d by $d_k$ in the above display and taking the inferior limit as k goes to $\infty$ , from Fatou’s lemma and the continuity conditions in Assumption 1, it follows that
Since $x\in S$ and $a\in A(x)$ are arbitrary, it follows that (66) holds. To complete the proof of part (i) we show that $h^{\ast}({\cdot})$ is a finite function. To achieve this goal, pick a stationary policy $f \in \mathbb{F}$ and define the sequence $\{\mathcal{S}_n\}$ of subsets of S as follows:
and notice that $S = \bigcup_{k=0}^\infty \mathcal{S}_k$ , by Lemma 1(ii). Using (65), to show that $h^{\ast}({\cdot})$ is finite it is sufficient to verify that, for each $k \in \mathbb{N}$ ,
a claim that will be proved by induction. Notice that $h^{\ast}(w) = 0$ , by (64) and (65), and then the desired conclusion holds for $k=0$ . Suppose that (68) is valid for some $k\in\mathbb{N}$ , let $\hat y\in\mathcal{S}_{k+1}$ , and pick $\hat x\in \mathcal{S}_k$ such that
Now observe that (66) implies that
Recalling that $\lambda \lt 0$ , the condition $h^{\ast}(\hat x) \gt -\infty$ implies that the left-most term in the above relation is finite, and then (69) allows us to conclude that $\text{e}^{\lambda h^{\ast}(\hat y)} \lt \infty$ , so that $h^{\ast}(\hat y) \gt -\infty$ ; since $\hat y\in \mathcal{S}_{n+1}$ is arbitrary, it follows that (68) holds with $k+1$ instead of k, concluding the induction argument. As already mentioned, this completes the proof of part (i).
For (ii), for each $n\in\mathbb{N}$ let $J^{*, n}({\cdot})$ be the superior limit optimal average cost function for model $\mathcal{M}_n$ . In this case, (62) and part (ii) of Theorem 2 yield that $g_n = J^{*,n}({\cdot})$ . Moreover, if $f_n$ is the stationary policy in (63), then Theorem 2(ii) applied to model $\mathcal{M}_n$ yields
see (7). Observe now that $C_n \ge C$ , since the cost function is negative, so that the above display leads to
see (8) and (9). From this point, (65) leads to $g^{\ast} \ge J^{\ast}({\cdot})$ . Recall now that $(g^{\ast},h^{\ast}({\cdot}))\in\tilde{\mathcal{G}}$ , by Lemma 6(ii), so that $g^{\ast}\le J_*({\cdot})$ , by Lemma 2(iii). Combining these two expressions with the inequality $J_*({\cdot}) \le J^{\ast}({\cdot})$ in (12), it follows that $J_*({\cdot}) = g^{\ast} = J^{\ast}({\cdot})$ .
The proof of the main result is finally presented below.
Proof of Theorem 1. Let $\lambda \lt 0$ be fixed, and suppose that Assumptions 1 and 2 hold. The proof has been split into two cases.
Case 1: $C(x, a)\le 0$ for every $(x, a)\in\mathbb{K}$ . Let $g^{\ast} \in \mathbb{R}$ and $h^{\ast}({\cdot})$ be as in (65). For (i), suppose that $g\in \mathcal{G}$ . In this case, $(g,h({\cdot}))\in\tilde{\mathcal{G}}$ for some function $h({\cdot})$ , and then $g \le J_*({\cdot})$ by Lemma 2(iii). Assume that $g \le J_*({\cdot})$ . Since $J_*({\cdot}) = g^{\ast}$ , by Lemma 6(ii) it follows that $g\le g^{\ast}$ , and then $\lambda g \ge \lambda g^{\ast}$ , since $\lambda$ is negative. Combining this last inequality with (66), and recalling that $h^{\ast}({\cdot})$ is a finite function (by Lemma 6(ii)), it follows that $(g,h^{\ast}) \in \tilde{\mathcal{G}}$ , and then $g \in \mathcal{G}$ ; see Definition 1 and (24).
For (ii), since $J_*({\cdot}) = g^{\ast}$ , by Lemma 6(ii), part (i) implies that $g \in \mathcal{G} \iff g \le g^{\ast}$ ; then $g^{\ast} = \sup\{g \mid g\in\mathcal{G}\}$ , and the desired conclusion follows from equality (67) established in Lemma 6(ii).
For (iii), since $h^{\ast}({\cdot})$ is finite, by Lemma 6(i), and $g^{\ast} = \sup\{g\mid g\in\mathcal{G}\}$ , by part (ii), the conclusion follows from inequality (66) established in Lemma 6(i).
For (iv), consider the model $\mathcal{M}_n$ in (61), let $(g_n, h_n({\cdot})) \in [{-}\|C\|, 0] \times \mathcal{B}(S)$ , and let $f_n\in\mathbb{F}$ be as in (62) and (63). It follows from Remark 2(iii) that $g_n = \lim_{k\to\infty}(1/k)J_k^{(n)}(f_n,x)$ , $x\in S$ , where $J_k^{(n)}(f_n,x)$ is the ( $\lambda$ -)certainty equivalent of the random cost $\sum_{t=0}^{k-1}C_n(X_t,A_t)$ with respect to $\mathbb{P}_x^{f_n}$ , so that, for each initial state x,
We show that policy $f_n$ is $\varepsilon$ -optimal if n is large enough. To achieve this goal, using that $C\le 0$ , observe that $C \le C_n$ , by (58), and then the above display implies that
i.e. $g_n \ge J(f_n,x) \ge J_-(f_n,x) \ge J_*(x)$ , $x\in S$ , $n\in\mathbb{N}$ ; see (7)–(11). Since $J_*({\cdot}) = g^{\ast}$ , by part (iii), using that $g_n\to g^{\ast}$ as $n\to\infty$ , pick $n_*\in\mathbb{N}$ such that $g^{\ast} + \varepsilon \ge g_{n_*}$ to conclude from the above display that $g^{\ast} + \varepsilon \ge J(f_{n^{\ast}},\cdot) \ge J_-(f_{n^{\ast}},\cdot) \ge g^{\ast}$ , so that $f_{n^{\ast}}\in\mathbb{F}$ is $\varepsilon$ -optimal.
Case 2: $C \in \mathcal{B}(\mathbb{K})$ . The notation in Lemma 1 and Remark 3 will be used. For (i), consider the cost function $C - \|C\|$ , which is nonpositive. Using Remark 3, note that
where the second equivalence is due to part (i) in Case 1 applied to the cost function $C - \|C\|$ , and Lemma 1 was used in the last step. It follows that $g \in \mathcal{G}(C) \iff g \le J_*(C,\cdot)$ .
For (ii), an application of part (ii) of Case 1 to the cost function $C - \|C\|\le 0$ yields
a relation that, via (13) and (23), leads to $J_*(C,\cdot) = \sup\{g \mid g \in \mathcal{G}(C)\} = J^{\ast}(C,\cdot)$ .
For (iii), since $C - \|C\| \le 0$ , the third part of Case 1 yields that
since $\mathcal{G}(C-\|C\|) = \mathcal{G}(C) -\|C\|$ , it follows that $\sup\{g \mid g \in \mathcal{G}(C)\} \in \mathcal{G}(C)$ .
For (iv), let $\varepsilon \gt 0$ be fixed. Applying the fourth part of Case 1 to the cost function $C - \|C\| \le 0$ , there exists $f_\varepsilon\in\mathbb{F}$ such that
a relation that, via (13) and (23), leads to $g^{\ast}+ \varepsilon \ge J(C,f_\varepsilon,\cdot) \ge J_-(C,f_\varepsilon,\cdot) \ge g^{\ast}$ , where $g^{\ast} = \sup\{g \mid g \in \mathcal{G}(C)\}$ , showing that $f_\varepsilon$ is $\varepsilon$ -optimal for the cost function C.
Remark 4. Finally, it is important to point out that the literature on risk-seeking is scarce. The problem addressed in this manuscript is interesting by itself. The case $\lambda\lt 0$ is associated with a risk-loving (risk-seeking) controller. A risk-seeking controller is willing to accept greater economic uncertainty in exchange for higher returns. Using a prospect theory approach, it has been found that people are risk-averse in the domain of gains but become risk-seeking in the domain of losses; see, for instance, [Reference Zaleskiewicz22].
7. Concluding remarks
This work has studied Markov decision chains endowed with average cost criteria. Besides standard continuity–compactness conditions, three essential assumptions determined the framework of the paper: (i) the simultaneous Doeblin condition holds, (ii) the state space is communicating under the action of each stationary policy, and (iii) the decision maker driving the system is risk-seeking. Within this context, a characterization of the optimal average cost was provided in Theorem 1, extending conclusions in [Reference Cavazos-Cadena7] obtained under the condition that the controller is risk-averse. Another interesting result stated in part (iv) of the main theorem is the existence of $\varepsilon$ -optimal stationary policies, which was obtained by truncating the cost function, instead of taking a policy that ‘almost optimizes’ the term within brackets of the optimality inequality (25). In this direction there is, at least, an interesting question to be analyzed in the future: Given $\varepsilon \gt 0$ , assume that $f\in \mathbb{F}$ is derived from the optimality inequality in such a way that action f(x) is an $\varepsilon |\lambda|$ -maximizer of the logarithm of the term within brackets in (25), i.e. $\text{e}^{\lambda(g^{\ast}+\varepsilon) + \lambda h^{\ast}(x)} \le \big[\text{e}^{\lambda C(x,f(x))}\sum_{y\in S}p_{x,y}(f(x))\text{e}^{\lambda h^{\ast}(y)}\big]$ for every $x\in S$ . With this notation, establishing that f is $\varepsilon$ -optimal is an interesting problem.
Acknowledgements
The authors are deeply grateful to the reviewers and the Associate Editor for their careful reading of the original manuscript and for their advice to improve the paper.
H. Cruz-Suárez and R. Montes-de-Oca dedicate this article to the memory of their collaborator and the co-author of the present work, Rolando Cavazos-Cadena, whose unfortunate death occurred on May 24, 2023.
Funding information
This work was partially supported by the PSF Organization under Grant No. 0116-18.
Competing interests
There were no competing interests to declare which arose during the preparation or publication process of this article.