Characterization of the optimal average cost in Markov decision chains driven by a risk-seeking controller

Rolando Cavazos-Cadena; Hugo Cruz-Suárez; Raúl Montes-de-Oca

doi:10.1017/jpr.2023.40

Characterization of the optimal average cost in Markov decision chains driven by a risk-seeking controller

Part of: Control systems Stochastic systems and control

Published online by Cambridge University Press: 21 July 2023

Rolando Cavazos-Cadena ,

Hugo Cruz-Suárez

and

Raúl Montes-de-Oca

Show author details

Rolando Cavazos-Cadena*: Affiliation:
Universidad Autónoma Agraria Antonio Narro
Hugo Cruz-Suárez*: Affiliation:
Benemérita Universidad Autónoma de Puebla
Raúl Montes-de-Oca*: Affiliation:
Universidad Autónoma Metropolitana-Iztapalapa
*: *Postal address: Departamento de Estadística y Cálculo, Universidad Autónoma Agraria Antonio Narro, Boulevard Antonio Narro 1923, Buenavista, COAH 25315, México. Email: [email protected]
**Postal address: Facultad de Ciencias Físico-Matemáticas, Benemérita Universidad Autónoma de Puebla, Ave. San Claudio y Río Verde, Col. San Manuel CU, PUE 72570, México. Email: [email protected]
***Postal address: Departamento de Matemáticas, Universidad Autónoma Metropolitana-Iztapalapa, Av. Ferrocarril San Rafael Atlixco 186, Col. Leyes de Reforma Primera Sección, Alcaldía Iztapalapa, CDMX 09310, México. Email: [email protected]

Article contents

Abstract
Introduction
Decision model
Main result
Fundamental tools
Optimality equation in a special case
Proof of Theorem
Concluding remarks
Funding information
Competing interests
References

Rights & Permissions

Abstract

This work concerns Markov decision chains on a denumerable state space endowed with a bounded cost function. The performance of a control policy is assessed by a long-run average criterion as measured by a risk-seeking decision maker with constant risk-sensitivity. Besides standard continuity–compactness conditions, the framework of the paper is determined by the following conditions: (i) the state process is communicating under each stationary policy, and (ii) the simultaneous Doeblin condition holds. Within this framework it is shown that (i) the optimal superior and inferior limit average value functions coincide and are constant, and (ii) the optimal average cost is characterized via an extended version of the Collatz–Wielandt formula in the theory of positive matrices.

Keywords

Risk-lover controller exponential utility optimality inequality extended Collatz–Wielandt formula optional sampling theorem truncated cost function

MSC classification

Primary: 93E20: Optimal stochastic control

Secondary: 93C55: Discrete-time systems

Type: Original Article
Information: Journal of Applied Probability , Volume 61 , Issue 1 , March 2024 , pp. 340 - 367

DOI: https://doi.org/10.1017/jpr.2023.40 [Opens in a new window]
Copyright: © The Author(s), 2023. Published by Cambridge University Press on behalf of Applied Probability Trust

1. Introduction

This work is concerned with Markov decision processes (MDPs) on a denumerable state space endowed with a bounded cost function. It is assumed that the decision maker is risk-seeking with a constant risk-sensitivity coefficient, and the performance of a control policy is measured by a long-run average cost index. In addition to standard continuity–compactness conditions, the framework of the paper is mainly determined by two assumptions on the transition structure: (a) the state process is communicating under each stationary policy, and (b) the simultaneous Doeblin condition holds. In this context, the main objectives of this work can be described as follows:

(i) to prove that the optimal inferior and superior limit average cost functions coincide and are constant;
(ii) to characterize the optimal average cost.

The results for these problems extend to the present framework conclusions established in [Reference Cavazos-Cadena7] under the condition that the decision maker is risk-averse and, as in that paper, the characterization of the optimal average cost presented in Theorem 1(ii) is a generalized version of the Collatz–Wielandt formula for the largest eigenvalue of a positive matrix. On the other hand, an additional objective of the paper is the following:

(iii) to establish the existence of a stationary policy whose average index differs from the optimal one by less than a given tolerance.

The analysis of discrete-time Markov models endowed with a risk-sensitive average criterion can be traced back, at least, to the seminal paper [Reference Howard and Matheson14]. In that paper the Perron–Frobenius theory of positive matrices [Reference Meyer17] was employed to study MDPs with finite state and action spaces, and the optimal average cost was characterized via an optimality equation. On the other hand, motivated by important connections with the theory of large deviations and mathematical finance, risk-sensitive average criteria have recently been studied. Models with a countable state space are considered in [Reference Borkar and Meyn5, Reference Cavazos-Cadena and Fernández-Gaucherand8, Reference Denardo and Rothblum9, Reference Sladký20], whereas MDPs on a general state space are analyzed in [Reference Di Masi and Stettner10, Reference Di Masi and Stettner11, Reference Di Masi and Stettner12, Reference Jaśkiewicz15]. Connections of risk-sensitive average criteria with the theory of large deviations are presented in [Reference Balaji and Meyn2, Reference Kontoyiannis and Meyn16], whereas applications to mathematical finance are given in [Reference Bäuerle and Reider3, Reference Pitera and Stettner18, Reference Stettner21].

On the other hand, there are important differences between the risk-neutral and risk-sensitive average criteria. For instance, in the risk-neutral case the simultaneous Doeblin condition ensures that the optimal average cost is constant and is characterized via an optimality equation [Reference Hernández-Lerma13, Reference Puterman19], a result that is not generally valid in the risk-sensitive case [Reference Cavazos-Cadena6]. For this reason, in this work the simultaneous Doeblin condition will be supplemented with the requirement that the state space is communicating under the action of any stationary policy.

The remainder of the paper is organized as follows: In Section 2 the decision model is formally introduced, the basic assumptions of the paper are formulated, and the risk-sensitive average criteria are defined. Next, in Section 3 the idea of a subsolution of the optimality equation is formulated and the main conclusions of the paper are stated in Theorem 1, which is followed by a brief outline of the strategy used to prove that result. Section 4 contains the basic properties of the family of subsolutions which allow us to establish the fundamental auxiliary result in Theorem 2 of Section 5, establishing that if the space of stationary policies is finite and the cost function is nonpositive with compact support, then the optimality equation has a bounded solution. After those preliminaries, Theorem 1 is finally proved in Section 6, and the paper concludes with some brief comments in Section 7.

1.1. Notation

In the following, the set of nonnegative integers is denoted by $\mathbb{N}$ , and, for a real-valued function h, $\|h\|\,:\!=\, \sup\{|h(x)| \colon \hbox{belongs to the domain of } h\}$ is the corresponding supremum norm. Given a nonempty set S, the space of all bounded real-valued functions defined on S is denoted by $\mathcal{B}(S)$ , i.e. $h\colon S\rightarrow \mathbb{R}$ belongs to $\mathcal{B}(S)$ if and only if $\|h\| \lt\infty$ . On the other hand, for real numbers a and b, $a\land b = \min \{a, b\}$ , and, if $F\subset S$ , the indicator function of the subset F is denoted by $\mathbf{1}_F$ , that is, $\mathbf{1}_F(y) \,:\!=\,1$ , $y\in F$ , and $\mathbf{1}_F(y) = 0$ , $y\in S\setminus F$ . For an event W, the corresponding indicator function is denoted by $\mathbf{1}[W]$ , and every relation between random variables holds almost surely with respect to the underlying probability measure.

2. Decision model

Throughout, $\mathcal{M}=(S, A, \{A(x)\}_{x\in S}, C, P)$ represents an MDP, a model for a dynamical system whose components are as follows:

(i) The state space S is a denumerable set endowed with the discrete topology.
(ii) The control set A is a metric space.
(iii) For each state $x\in S$ , the nonempty set $A(x)\subset A$ is the class of possible actions at x.
(iv) $C\colon \mathbb{K}\to \mathbb{R}$ is the cost function, where $\mathbb{K}\,:\!=\, \{(x, a)\mid a\in A(x), x\in S\}$ is the collection of admissible pairs.
(v) $P=[p_{x,y}({\cdot})]$ is the controlled transition law.

This model $\mathcal{M}$ is interpreted as follows: At each time $t\in \mathbb{N}$ the state of a dynamical system is observed, say $X_t=x\in S$ , and a decision maker (controller) chooses and applies an action $A_t=a\in A(x)$ . As a consequence of such an intervention, a cost $C(X_t, A_t) = C(x,a)$ is incurred and, regardless of the previous states and actions, the state of the system at time $t+1$ will be $X_{t+1}= y\in S$ with probability $p_{X_t,X_{t+1}}(A_t)= p_{x,y}(a)$ , where $\sum_{y\in S} p_{x, y}(a) = 1$ ; this is the Markov property of the decision model.

Assumption 1.

(i) For every $x\in S$ , A(x) is a compact subset of A.
(ii) For each $x, y\in S$ , the mappings $a\mapsto p_{x, y}(a)$ and $a\mapsto C(x,a)$ are continuous in $a\in A(x)$ .
(iii) The cost function C is bounded, i.e. $C\in\mathcal{B}(\mathbb{K})$ .

The observed history of the process up to time n is denoted by

(1)

\begin{equation} H_n \,:\!=\, (X_0, A_0, X_1, A_1, \ldots, X_{n-1}, A_{n-1}, X_n), \qquad n=0,1,2,3,\ldots,\end{equation}

whereas

(2)

\begin{equation}\mathcal{F}_n \,:\!=\, \sigma(H_n)\end{equation}

is the $\sigma$ -field generated by $H_n$ .

2.1. Policies

A policy is a measurable rule for choosing actions. Formally, a control policy $\pi = \{\pi_n\}_{n\in\mathbb{N}}$ is a (special) sequence of stochastic kernels $\pi_n$ on the action space A given $\mathbb{H}_n$ , where the space $\mathbb{H}_n$ of possible histories up to time n is given by $\mathbb{H}_0 = S$ , and $\mathbb{H}_n = \mathbb{K}^{n}\times S$ for $n\ge 1$ ; the vector $\textbf{h}_n= (x_0, a_0,\ldots, x_{n-1}, a_{n-1}, x_n)$ represents a generic element of $\mathbb{H}_n$ , so that $a_i \in A(x_i)$ for $i \lt n$ and $x_i \in S$ for $i\le n$ . When the controller drives the system using policy $\pi=\{\pi_n\}$ , given that the history $\mathbb{H}_n$ attains the value $\textbf{h}_n\in \mathbb{H}_n$ , the probability of choosing action $A_n$ inside a Borel subset B of $A(x_n)$ is $\pi_n(B\mid\textbf{h}_n)$ , where $\pi_n(A(x_n)\mid\textbf{h}_n) = 1$ . The family of all policies is denoted by $\mathcal{P}$ and, given the initial state $X_0 = x$ and the policy $\pi\in \mathcal{P}$ used to direct the system, the distribution $\mathbb{P}_x^\pi$ of the state–action process $\{(X_k, A_k)\}_{k\in \mathbb{N}}$ is uniquely determined [Reference Hernández-Lerma13, Reference Puterman19] and $\mathbb{E}_x^\pi[\,\cdot\,]$ represents the corresponding expectation operator. Next, set $\mathbb{F}\,:\!=\, \prod_{x\in S}A(x)$ , so that $\mathbb{F}$ is a compact metric space consisting of all functions $f\colon S\to A$ such that $f(x)\in A(x)$ for every $x\in S$ . A policy $\pi$ is stationary if there exists $f\in\mathbb{F}$ such that, under $\pi$ , at each time $t\in\mathbb{N}$ the action $A_t=f(X_t)$ is applied. The class of stationary policies and $\mathbb{F}$ are naturally identified, a convention allowing us to write $\mathbb{F}\subset \mathcal{P}$ .

2.2. Utility function and average criteria

In this paper we suppose that the controller has a constant risk-sensitivity coefficient $\lambda \ne 0$ , so that a random cost Y is assessed via the expected value of $U_{\lambda}(Y)$ , where the (dis-)utility function $U_{\lambda}$ is the strictly increasing mapping given by

(3)

\begin{equation} U_{\lambda}(x) = \text{sign}(\lambda)\text{e}^{\lambda x}, \qquad x\in \mathbb{R}; \end{equation}

notice that

(4)

\begin{equation} U_{\lambda}(c+h) = \text{e}^{\lambda c} U_{\lambda}(h), \qquad c, h\in \mathbb{R}.\end{equation}

When the decision maker has to choose between two random costs $Y_0$ and $Y_1$ , paying $Y_0$ will be preferred if $\mathbb{E}[U_{\lambda}(Y_1)] \gt\mathbb{E}[U_{\lambda}(Y_0)]$ , whereas the controller will be indifferent between both costs if $\mathbb{E}[U_{\lambda}(Y_1)]= \mathbb{E}[U_{\lambda}(Y_0)]$ . Given a random cost Y such that $\mathbb{E}[U_{\lambda} (Y)]$ is finite, the certainty equivalent of Y with respect to $U_{\lambda}$ is the unique real number $\mathcal{E}_{\lambda}[Y]$ determined by $U_{\lambda}(\mathcal{E}_{\lambda}[Y]) = \mathbb{E}[U_{\lambda}(Y)]$ , so that the controller will be willing to pay the amount $\mathcal{E}_{\lambda}[Y]$ to avoid facing the uncertainty associated with Y; note that

(5)

\begin{equation} \mathcal{E}_{\lambda}[Y] = U_{\lambda}^{-1}(\mathbb{E}[U_{\lambda}(Y)]) = \frac{1}{\lambda}\log(\mathbb{E}[\text{e}^{\lambda Y}]),\end{equation}

a relation that immediately leads to

(6)

\begin{equation} \mathbb{P}[|Y| \le b]= 1 \implies |\mathcal{E}_{\lambda} (Y)|\le b.\end{equation}

Observe that $U_{\lambda}({\cdot})$ is strictly convex (resp. concave) if $\lambda \gt 0$ (resp. $\lambda \lt 0$ ), and in that case Jensen’s inequality yields that $\mathcal{E}_{\lambda}(Y) \gt \mathbb{E}[Y] $ (resp. $\mathcal{E}_{\lambda}(Y) \lt \mathbb{E}[Y]$ ) if Y is a non-constant random variable. When $\lambda \gt 0$ (resp. $\lambda \lt 0$ ) the controller is referred to as risk-averse (resp. risk-seeking). Using the idea of the certainty equivalent, risk-sensitive average criteria can be defined as follows: Let $\pi\in\mathcal{P}$ be the policy used to operate the system starting at $X_0 = x$ . The application of the first n actions $A_0, A_1,\ldots, A_{n-1}$ generates the cost $\sum_{k=0}^{n-1} C(X_k, A_k)$ and, using (5), the associated certainty equivalent is

(7)

\begin{equation} J_n(\pi, x) \,:\!=\, \frac{1}{\lambda} \log\Bigg(\mathbb{E}_x^\pi \Bigg[\exp\Bigg\{\lambda\sum_{t=0}^{n-1} C(X_t, A_t)\Bigg\}\Bigg]\Bigg), \qquad n=1,2,3,\ldots,\end{equation}

which represents an average of $J_n(\pi, x)/n$ per step. The superior limit ( $\lambda$ -sensitive) average performance index of policy $\pi$ at state x is given by

(8)

\begin{equation} J(\pi, x) \,:\!=\, \limsup_{n\to\infty} \frac{1}{n } J_n(\pi,x),\end{equation}

and the corresponding optimal value function is

(9)

\begin{equation} J^{\ast}(x) \,:\!=\, \inf_{\pi\in\mathcal{P}} J(\pi, x), \qquad x\in S;\end{equation}

a policy $\pi^{\ast}\in \mathcal{P}$ is $\limsup$ ( $\lambda$ -)average optimal if $J(\pi^{\ast}, x)=J^{\ast}(x)$ for every $x\in S$ . The criterion (8) represents the worst-case point of view about the behavior of the sequence $\{J_n(\pi, x)/n\}$ of averages over a finite horizon. The optimistic perspective assesses the performance of $\pi$ via the smallest limit point of that sequence:

(10)

\begin{equation} J_{-}(\pi, x) \,:\!=\, \liminf_{n\to\infty} \frac{1}{n } J_n(\pi, x),\end{equation}

with the corresponding optimal value function given by

(11)

\begin{equation} J_{\ast}(x) \,:\!=\, \inf_{\pi\in\mathcal{P}} J_{-}(\pi, x),\qquad x\in S,\end{equation}

and $\pi_{\ast}\in\mathcal{P}$ is $\liminf$ ( $\lambda$ -)average optimal if $J _-(\pi_*, \cdot) = J_*({\cdot})$ . It follows that

(12)

\begin{equation} J_{-}(\pi, \cdot) \le J(\pi, \cdot), \quad \pi\in\mathcal{P},\qquad \text{and then } J_{\ast}({\cdot})\le J^{\ast}({\cdot}). \end{equation}

From the definitions in (7)–(11), using (6), it is not difficult to verify that

\begin{equation*} -\|C\| \le J_{-}(\pi, \cdot)\le J(\pi, \cdot) \le \|C\|, \quad \pi\in\mathcal{P}, \qquad -\|C\| \le J_*( \cdot)\le J^{\ast}( \cdot) \le \|C\|.\end{equation*}

Remark 1. In Section 6, several cost functions are considered simultaneously. In that case, the cost function is explicitly indicated in the above notation for the average criteria and optimal value functions. For instance, $J_*(C, \cdot)$ and $J ^{\ast}(C, \cdot)$ are used instead of $J_*({\cdot})$ and $J^{\ast}({\cdot})$ ; notice that

(13)

\begin{equation} J_*(C+ \beta, \cdot) = J_*(C, \cdot) + \beta, \quad J^{\ast}(C+ \beta, \cdot) = J^{\ast}(C, \cdot) + \beta, \qquad \beta\in\mathbb{R}. \end{equation}

2.3. Optimality equation

Under appropriate requirements, the characterization of the optimal value functions in (9) and (11) can be based on the following optimality equation:

(14)

\begin{equation} U_{\lambda}(g + h(x)) = \inf_{a\in A(x)}\Bigg[\sum_{y\in S} p_{x, y}(a) U_{\lambda}(C(x, a) + h(y))\Bigg], \quad x\in S, \end{equation}

where $g\in\mathbb{R}$ and $h({\cdot})$ is a function defined on S. Suppose that this equality holds, and set

(15)

\begin{equation} Y_0 = U_{\lambda}(h(X_0)), \qquad Y_ n = U_{\lambda} \Bigg(\sum_{t=0}^{n-1} (C(X_t, A_t) - g) + h(X_n)\Bigg), \quad n=1,2,3,\ldots\end{equation}

Remark 2. Direct consequences of (15) are the following facts:

(i) Let $\pi = \{\pi_n\}\in\mathcal{P}$ be a fixed policy. From (14), it follows that, for every $n\in \mathbb{N}$ and $a\in A(X_n)$ , $U_{\lambda}( g + h(X_n))\le \sum_{y\in S} p_{X_n, y}(a) U_{\lambda}(C(X_n, a) + h(y))$ ; integrating both sides of this relation with respect to $\pi_n(\text{d} a \mid H_n)$ and using the Markov property, it follows that
(16) \begin{align} U_{\lambda} (h(X_n)) & \le \int_{A(X_n)} \sum_{y\in S} p_{X_n, y}(a) U_{\lambda}(C(X_n, a) + h(y))\pi_n(\text{d} a\mid H_n) \cr & = \mathbb{E}_x^\pi[U_{\lambda}(C(X_n, A_n)- g + h(X_{n+1})) \mid \mathcal{F}_n], \quad x\in S, \ n\in\mathbb{N}. \end{align}
Next, using that $\exp\big\{\lambda\sum_{t=0} ^{n-1} (C(X_t, A_t) - g)\big\}$ is $\mathcal{F}_n$ -measurable, (4) and the two previous displays together yield
(17) \begin{align} Y_n & = U_{\lambda}\Bigg(\lambda\sum_{t=0}^{n-1}(C(X_t, A_t)-g) + h(X_n)\Bigg) \nonumber \\ & = \exp\Bigg\{\lambda\sum_{t=0}^{n-1}(C(X_t, A_t)-g)\Bigg\}U_{\lambda}(h(X_n)) \nonumber \\ & \leq \exp\Bigg\{\lambda\sum_{t=0}^{n-1}(C(X_t, A_t)-g)\Bigg\} \mathbb{E}_x^\pi[U_{\lambda}(C(X_n, A_n)-g+h(X_{n+1})) \mid \mathcal{F}_n] \nonumber \\ & = \mathbb{E}_x^\pi\Bigg[\exp\Bigg\{\lambda\sum_{t=0}^{n-1}(C(X_t, A_t)-g)\Bigg\} U_{\lambda}(C(X_n, A_n)-g+h(X_{n+1})) \mid \mathcal{F}_n\Bigg] \nonumber \\ & = \mathbb{E}_x^\pi\Bigg[U_{\lambda}\Bigg(\sum_{t=0}^n(C(X_t, A_t)-g)+h(X_{n+1})\Bigg) \mid \mathcal{F}_n\Bigg] = \mathbb{E}_x^\pi[Y_{n+1} \mid \mathcal{F}_n], \end{align}
that is, $\{(Y_n,\mathcal{F}_n)\}_{n\in \mathbb{N}}$ is a submartingale with respect to $P_x^\pi$ for every $x\in S$ and $\pi\in\mathcal{P}$ .
(ii) Suppose that function $h({\cdot})$ in (14) is bounded. In this context, using Assumption 1, it is not difficult to see that, for each $x\in S$ , the term within brackets in (14) is a continuos function of $a\in A(x)$ , and then it has a maximizer $f^{\ast}(x)\in A(x)$ . Thus, $U_{\lambda}(g + h(x)) = \mathbb{E}_x^{f^{\ast}}[U_{\lambda}(C(x, f^{\ast}(x)) + h(x))]$ for every $x\in S$ and, paralleling the argument in part (i), it follows that the equality always holds in (16) and (17) when $\pi$ is replaced by $f^{\ast}$ , so that $\{(Y_n,\mathcal{F}_n)\}_{n\in \mathbb{N}}$ is a martingale with respect to $\mathbb{P}_x^{f^{\ast}}$ for every $x\in S$ .
(iii) Maintaining the assumption that $h({\cdot})$ is bounded, observe that, for every $n\in \mathbb{N}\setminus \{0\}$ , $x\in S$ , and $\pi\in\mathcal{P}$ , part (i) implies that $\mathbb{E}_x^\pi[Y_0] \le \mathbb{E}_x^\pi[Y_n]$ , an inequality that, using (15), is equivalent to $U_{\lambda}(h(x)) \le \mathbb{E}_x^\pi\big[U_{\lambda}\big(\sum_{t=0}^{n-1}(C(X_t, A_t)-g) + h(X_{n+1})\big)\big]$ ; recalling that $U_{\lambda}({\cdot})$ is increasing, via (4) and (7) it follows that
\begin{align*} U_{\lambda}(h(x)) & \le \mathbb{E}_x^\pi\Bigg[U_{\lambda}\Bigg(\sum_{t=0}^{n-1}(C(X_t, A_t)-g) + \|h\|\Bigg)\Bigg] \\ & = \text{e}^{\lambda(\|h\| - ng)}\mathbb{E}_x^\pi\Bigg[U_{\lambda}\Bigg(\sum_{t=0}^{n-1}C(X_t, A_t)\Bigg)\Bigg] \\ & = \text{e}^{\lambda(\|h\| - ng)}U_{\lambda}(J_n(\pi, x)) = U_{\lambda}(J_n(\pi, x) - ng + \|h\|). \end{align*}
Thus, $h(x) \le J_n(\pi, x) - ng + \|h\|$ , and then $g \le (1/n)(J_n(\pi, x) + \|h\| - h(x))$ for every positive integer n. Therefore,
(18) \begin{equation} g \le J_-(\pi, \cdot), \quad \pi\in\mathcal{P}, \qquad \hbox{so that } g \le J_*({\cdot}), \end{equation}

by (10) and (11). On the other hand, part (ii) above yields $\mathbb{E}_x^{f^{\ast}}[Y_0] = \mathbb{E}_x^{f^{\ast}}[Y_n]$ , which is equivalent to $U_{\lambda}(h(x)) = \mathbb{E}_x^{f^{\ast}}\big[U_{\lambda}\big(\sum_{t=0}^{n-1}(C(X_t, A_t)-g) + h(X_{n+1})\big)\big]$ ; paralleling the above argument, it follows that
\begin{align*} U_{\lambda}(h(x)) & \geq \mathbb{E}_x^{f^{\ast}}\Bigg[U_{\lambda}\Bigg(\sum_{t=0}^{n-1}(C(X_t, A_t)-g)-\|h\|\Bigg)\Bigg] \\ & = \text{e}^{-\lambda(ng+\|h\|)}\mathbb{E}_x^{f^{\ast}}\Bigg[U_{\lambda}\Bigg(\sum_{t=0}^{n-1}C(X_t, A_t)\Bigg)\Bigg] \\ & = \text{e}^{-\lambda(ng+\|h\|)}U_{\lambda}(J_n(f^{\ast}, x)) = U_{\lambda}({-}(ng+\|h\|) + J_n(f^{\ast}, x)). \end{align*}
Therefore, $h(x) \ge J_n(f^{\ast} , x) - (ng + \|h\|)$ , and then $g \ge (1/n)(J_n(f^{\ast}, x) - (\|h\| + h(x)))$ ; via (8) and (9), it follows that $g \ge J(f^{\ast}, x) \ge J^{\ast}(x)$ , $x\in S$ . Combining this relation with (12) and (18), it follows that $g \le J_{*} \le J^{\ast}({\cdot})\le g$ , so that $J_*({\cdot}) = g = J^{\ast}({\cdot})$ and $J_-(f^{\ast}, \cdot) = g = J(f^{\ast}, \cdot)$ , i.e. $g = \lim_{n\to\infty}n^{-1}J_n(f^{\ast},\cdot)$ .

2.4. Assumptions for a constant average cost

Even under strong communication-ergodicity conditions, like those in Assumption 2 below, in the present context of a denumerable state space the existence of a solution of the optimality equation cannot be generally guaranteed for an arbitrary bounded cost function; an (uncontrolled) example illustrating this phenomenon was given in Section 9 of [Reference Cavazos-Cadena7]. However, as we verify later, the strong requirements imposed below ensure that the optimal value functions are constant. To continue, it is convenient to introduce the following notation.

For each set $F \subset S$ , define the first return time to F by

(19)

\begin{equation} T_F \,:\!=\, \min\{n \ge 1 \mid X_n\in F\},\end{equation}

where, by convention, the minimum of the empty set is $\infty$ ; if $F = \{x\}$ is a singleton, the simpler notation

(20)

\begin{equation} T_x\equiv T_{\{x\}}\end{equation}

is used. Combining the above definition with (1) and (2), it follows that $[T_F= n]\in\mathcal{F}_n$ for every $n\in\mathbb{N}$ , so that $T_F$ is a stopping time with respect to the filtration $\{\mathcal{F}_n\}$ .

Assumption 2. There exists $z\in S$ such that the following properties hold.

(i) Accessibility from z:

Under the action of any stationary policy, every state $y\in S$ is accessible from z, i.e. $\mathbb{P}_{z}^f[T_y \lt \infty] \gt 0$ , $y\in S$ , $f\in \mathbb{F}$ .
(ii) Simultaneous Doeblin condition:

The first return time $T_{z}$ satisfies $\sup_{x\in S, f\in\mathbb{F}} \mathbb{E}_x^f[T_{z}] \lt\infty$ .

The proof of the following result can be found in [Reference Cavazos-Cadena7].

Lemma 1. Suppose that Assumptions 1 and 2 hold. Then:

(i) For every $y \in S$ , there exists a finite constant $M_{y}$ such that
(21) \begin{equation} \mathbb{E}_x^\pi[T_y] \le M_{y}, \qquad x\in S,\ \pi\in\mathcal{P}. \end{equation}
(ii) For arbitrary different states $x, y\in S$ , $\mathbb{P}_x^\pi[T_y \lt T_x] \gt 0$ , $\pi\in\mathcal{P}$ .

2.5. The problems

As already observed, under Assumptions 1 and 2, the optimal average cost cannot be generally characterized via the optimality equation (14), since it does not necessarily admit a solution. Thus, looking for a characterization of the optimal average cost is an interesting problem. The main objectives of this paper are:

to show that the optimal value functions $J^{\ast}({\cdot})$ and $J_*({\cdot})$ coincide and are constant;
to characterize the optimal average cost.

These problems were addressed in [Reference Cavazos-Cadena7] for the case $\lambda \gt 0$ of a risk-averse controller, so, from this point onwards, our analysis focuses on the risk-seeking case $\lambda \lt 0$ .

3. Main result

In this section the main conclusions of this work are stated in Theorem 1, which extends to the risk-seeking context results established in [Reference Cavazos-Cadena7] for the risk-averse case. The ideas in the following definition will be useful.

Definition 1.

(i) The set $\tilde{\mathcal{G}}$ of subsolutions of the optimality equation (14) for model $\mathcal{M}$ consists of all pairs $(g, h({\cdot}))$ , where $g \in\mathbb{R}$ and the function $h \colon S \to \mathbb{R}$ satisfy
(22) \begin{equation} U_{\lambda}(g + h(x)) \le \inf_{a\in A(x)}\Bigg[\sum_{y\in S}p_{x, y}(a)U_{\lambda}(C(x,a) + h(y))\Bigg], \qquad x\in S. \end{equation}
(ii) The set $\mathcal{G}$ is the projection of $\tilde{\mathcal{G}}$ on its first coordinate, i.e. $\mathcal{G} = \{g\in\mathbb{R} \mid (g, h) \in \tilde{\mathcal{G}}$ for some $h \colon \mathbb{R} \to \mathbb{R}\}$ .

Remark 3. When two cost functions are simultaneously being examined, the sets $\tilde{\mathcal{G}}$ and $\mathcal{G}$ in the above definition are denoted by $\tilde{\mathcal{G}}(C)$ and $\mathcal{G}(C)$ , respectively, making it clear what cost function is being considered. With this notation, it follows that, for each $\beta\in\mathbb{R}$ ,

$$(g, h({\cdot})) \in \tilde{\mathcal{G}}(C) \iff (g + \beta, h({\cdot})) \in \tilde{\mathcal{G}}(C+\beta),$$

whereas

(23)

\begin{equation} g\in \mathcal{G}(C)\iff g+\beta\in \mathcal{G}(C+\beta). \end{equation}

Via (3), it is not difficult to see that if $\lambda \lt 0$ , then inequality (22) is equivalent to

(24)

\begin{equation} \text{e}^{\lambda g+ \lambda h(x)} \ge \sup_{a\in A(x)} \Bigg[\text{e}^{\lambda C(x, a)}\sum_{y\in S}p_{x, y}(a)\text{e}^{ \lambda h(y)}\Bigg], \qquad x\in S.\end{equation}

Using the above notation, the main result of the paper can be stated as follows.

Theorem 1. Suppose that $\lambda \lt 0$ , and that Assumptions 1 and 2 hold. In this context, the following assertions are valid:

(i) The inclusion $g \in \mathcal{G}$ is equivalent to $g \le J_*({\cdot})$ .
(ii) For each $x\in S$ , $J_*(x) = \sup\{g \mid g \in \mathcal{G}\} = J^{\ast}(x)$ .
(iii) Set $g^{\ast} = \sup \{g \mid g \in \mathcal{G}\}$ . With this notation, $g^{\ast} \in \mathcal{G}$ . More explicitly, there exists $h^{\ast} \colon S \to \mathbb{R}$ such that $(g^{\ast},h^{\ast}({\cdot})) \in \tilde{\mathcal{G}}$ , that is,
(25) \begin{equation} \text{e}^{\lambda g^{\ast} + \lambda h^{\ast}(x)} \ge \sup_{a\in A(x)} \Bigg[\text{e}^{\lambda C(x,a)}\sum_{y\in S}p_{x, y}(a)\text{e}^{\lambda h^{\ast}(y)}\Bigg], \qquad x\in S. \end{equation}
(iv) For each $\varepsilon \gt 0$ , there exists a stationary policy $f\in \mathbb{F}$ which is $\varepsilon$ -optimal in the sense that $g^{\ast} \le J_-(f, x) \le J(f, x) \le g^{\ast} + \varepsilon$ , $x\in S$ .

The equality in Theorem 1(ii) is an extension of the Collatz–Wielandt formula for the largest eigenvalue of a positive (finite-dimensional) matrix ([Reference Meyer17]); see, for instance, the discussion on this point presented in Remark 3.2 of [Reference Cavazos-Cadena7]. The rather technical proof of the above result will be presented in Section 6. Roughly, the backbone of the argument consists in ‘approximating’ the optimal average cost of the original model $\mathcal{M}$ , via MDPs with finite spaces of stationary policies and cost functions with finite support. This strategy is developed in the following two sections, and can be briefly outlined as follows: Section 4 concerns basic properties of the class of subsolutions. First, it is shown that if $(g,h({\cdot}))\in\tilde{\mathcal{G}}$ then $h({\cdot})$ is bounded from above, and g is a lower bound for the optimal inferior limit average index; moreover, given that the system is driven by $\pi\in\mathcal{P}$ starting at $x\in S$ , it is shown that h(x) is bounded from above by the certainty equivalent of the total relative cost $\sum_{t=0}^{T_w -1 }(C(X_t, A_t)- g)$ up to the first return time to a given state w. Additionally, such a certainty equivalent is studied for the case in which the system evolves under a stationary policy, and the results in this direction are used in Section 5 to show that, if the space of stationary policies is finite and C has finite support, then the optimality equation (14) admits a solution $(g, h({\cdot}))$ where the mapping $h({\cdot})$ is bounded. After these preliminaries, the proof of Theorem 1 is finally presented in Section 6 before the concluding remarks.

4. Fundamental tools

This section presents auxiliary results that will be used to establish Theorem 1. The starting point is the following lemma, which concerns two basic properties of the pairs in $\tilde{\mathcal{G}}$ , namely, if $(g,h({\cdot})) \in \tilde{\mathcal{G}}$ then (i) the functional part h is bounded from above, and (ii) g is a lower bound of the optimal inferior limit average index.

Lemma 2. Given that $\lambda \lt 0$ , under Assumptions 1 and 2 the following properties hold for each $w\in S$ and $(g,h({\cdot})) \in \tilde{\mathcal{G}}$ :

(i) For each $x\in S$ and $\pi\in \mathcal{P}$ ,
(26) \begin{equation} \text{e}^{\lambda h(x) - \lambda h(w)} \ge \mathbb{E}_x^\pi \Bigg[\exp\Bigg\{\lambda\sum_{t=0}^{T_w - 1}(C(X_t, A_t) - g)\Bigg\}\Bigg], \end{equation}
and then
(27) \begin{equation} 1 \ge \mathbb{E}_w^\pi\Bigg[\exp\Bigg\{\lambda\sum_{t=0}^{T_w - 1}(C(X_t, A_t) - g)\Bigg\}\Bigg]. \end{equation}
(ii) $h({\cdot}) - h(w) \le \|C-g\| M_w$ , where $M_w$ is as in (21).
(iii) $g \le J_*(x)$ for each $x\in S$ .

Proof. For (i), let $\pi\in\mathcal{P}$ and $x\in S$ be arbitrary. Given $n\in \mathbb{N}$ , observe that, combining Definition 1 with (24), it follows that $\text{e}^{\lambda h(X_n)} \ge \sum_{y\in S}\text{e}^{\lambda(C(X_n, a)-g)}p_{X_n,y}(a)\text{e}^{\lambda h(y)}$ for every $a\in A(X_n)$ , and then

\begin{align*} \text{e}^{\lambda h(X_n)} & \ge \sum_{y\in S}\int_{A(X_n)}\text{e}^{\lambda(C(X_n,a)-g)}p_{x,y}(a)\,\pi_n(\text{d} a \mid H_n)\text{e}^{\lambda h(y)} \\ & = \mathbb{E}_x^\pi\big[\text{e}^{\lambda(C(X_n,A_n)-g) + \lambda h(X_{n+1})} \mid \mathcal{F}_n\big], \end{align*}

where the equality is due to the Markov property. Now set $Z_0 = e^{\lambda h(X_0)}$ , $Z_n = \exp\big\{\lambda\sum_{t=0}^{n-1}(C(X_t,A_t) - g) + \lambda h(X_n)\big\}$ , $n= 1,2,3,\ldots$ Using these and the previous display, and the fact that $\exp\big\{\lambda\sum_{t=0}^{n-1}(C(X_t,A_t) - g)\big\}$ is $\mathcal{F}_n$ -measurable, an argument along the lines of the one used in Remark 2 yields that $\{(Z_n,\mathcal{F}_n)\}$ is a supermartingale with respect to $\mathbb{P}_x^\pi$ , i.e. $Z_n \ge \mathbb{E}_x^\pi[Z_{n+1} \mid \mathcal{F}_n]$ , $n\in\mathbb{N}$ , $\mathbb{P}_x^\pi$ -almost surely (a.s.). Therefore, since $T_w$ is a stopping time with respect to the filtration $\{\mathcal{F}_n\}$ , the optional sampling theorem yields $\mathbb{E}_x^\pi[Z_0] \ge \mathbb{E}_x^\pi[Z_{n\land T_w}]$ (Theorem 7.7.3, p. 304 of [Reference Ash1]; Theorem 35.2, p. 405 of [Reference Billingsley4]), i.e.

(28)

\begin{equation} \text{e}^{\lambda h(x)} \ge \mathbb{E}_x^\pi\Bigg[\exp\Bigg\{\lambda\sum_{t=0}^{n \land T_w - 1}(C(X_t,A_t) - g) + \lambda h(X_{n \land T_w})\Bigg\}\Bigg]. \end{equation}

Now observe that $\mathbb{P}_x^\pi[T_w \lt \infty] = 1$ , by (21), and that $X_{T_w} = w$ on the event $[T_w \lt\infty]$ , by (19). Therefore, with probability 1 with respect to $\mathbb{P}_x^\pi$ ,

\begin{align*} & \lim_{n\to\infty}\exp \Bigg\{\lambda\sum_{t=0}^{n \land T_w - 1}(C(X_t,A_t) - g) + \lambda h(X_{n \land T_w})\Bigg\} \\ & = \exp\Bigg\{\lambda\sum_{t=0}^{T_w-1}(C(X_t,A_t) - g) + \lambda h(X_{T_w})\Bigg\} = \exp\Bigg\{\lambda\sum_{t=0}^{T_w-1}(C(X_t,A_t) - g) + \lambda h(w)\Bigg\}. \end{align*}

Thus, via Fatou’s lemma, (28) yields

\begin{align*} \text{e}^{\lambda h(x)} & \ge \liminf_{n\to\infty}\mathbb{E}_x^\pi\Bigg[ \exp\Bigg\{\lambda\sum_{t=0}^{n \land T_w - 1}(C(X_t,A_t) - g) + \lambda h(X_{n \land T_w})\Bigg\}\Bigg] \\ & \ge \mathbb{E}_x^\pi\Bigg[ \exp\Bigg\{\lambda\sum_{t=0}^{T_w - 1}(C(X_t,A_t) - g) + \lambda h(w)\Bigg\}\Bigg] \\ & = \mathbb{E}_x^\pi\Bigg[\exp\Bigg\{\lambda\sum_{t=0}^{T_w - 1}(C(X_t,A_t) - g)\Bigg\}\Bigg]\text{e}^{\lambda h(w)}, \end{align*}

and (26) follows; setting x equal to w leads to (27).

For (ii), from Jensen’s inequality it follows that

(29)

\begin{align} \mathbb{E}_x^\pi\Bigg[\exp\Bigg\{\lambda\sum_{t=0}^{T_w - 1}(C(X_t,A_t) - g)\Bigg\}\Bigg] & \ge \exp\Bigg\{\mathbb{E}_x^\pi\Bigg[\lambda\sum_{t=0}^{T_w - 1}(C(X_t,A_t) - g)\Bigg]\Bigg\} \nonumber \\ & \ge \text{e}^{-\|\lambda(C - g)\|\mathbb{E}_x^\pi[T_w]} \nonumber \\ & \ge \text{e}^{\lambda\|C - g\|M_w}, \qquad x\in S, \ \pi\in\mathcal{P} , \end{align}

where the negativity of $\lambda$ and (21) were used in the last step. The above relation and (26) together imply that $\lambda h({\cdot}) - \lambda h(w) \ge \lambda\|C - g\|M_w$ , and the conclusion follows.

For (iii), let $x\in S$ and $\pi\in\mathcal{P}$ be arbitrary. Recalling that $\{(Z_n,\mathcal{F}_n)\}$ is a supermartingale with respect to $\mathbb{P}_x^\pi$ , for every positive integer n,

\begin{align*} \text{e}^{\lambda h(x)} = \mathbb{E}_x^\pi[Z_0] & \ge \mathbb{E}_x^\pi[Z_n] \\ & = \mathbb{E}_x^\pi\Bigg[\exp\Bigg\{\lambda\sum_{t=0}^{n-1}(C(X_t,A_t)-g) + \lambda h(X_n)\Bigg\}\Bigg] \\ & \ge \mathbb{E}_x^\pi\Bigg[\exp\Bigg\{\lambda\sum_{t=0}^{n-1}(C(X_t,A_t)-g)\Bigg\}\Bigg] \text{e}^{\lambda(h(w) + \|C - g\|M_w)} \\ & = \mathbb{E}_x^\pi\Bigg[\exp\Bigg\{\lambda\sum_{t=0}^{n-1}C(X_t,A_t)\Bigg\}\Bigg] \text{e}^{-n\lambda g + \lambda(h(w) + \|C - g\|M_w)} \\ & = \exp\{\lambda J_n(\pi, x) - n\lambda g + \lambda( h(w) + \|C- g\|M_w)\} , \end{align*}

where the second inequality is due to part (ii), and (7) was used in the last step. It follows that $\lambda h(x) \ge \lambda J_n(\pi, x) - n\lambda g + \lambda(h(w) + \|C - g\|M_w)$ , and then

$$g \le \frac{1}{n} J_n(\pi, x) + \frac{1}{n}(h(w) - h(x) + \|C - g\|M_w).$$

Taking the inferior limit as n goes to $\infty$ in both sides of this relation, it follows that $g \le J_-(\pi, x)$ , and then $g \le J_*({\cdot})$ , since $x\in S$ and $\pi\in\mathcal{P}$ are arbitrary in this argument.

The remainder of the section is dedicated to studying inequality (27) for a stationary policy f, and it is convenient to introduce additional notation.

Definition 2. Let $f\in \mathbb{F}$ , $g\in\mathbb{R}$ , and $w\in S$ be arbitrary but fixed, and define the total relative cost function up to the first return time to state w under f by

$$h_{f,g,w}(x) \,:\!=\, \frac{1}{\lambda}\log\Bigg(\mathbb{E}_x^f\Bigg[ \exp\Bigg\{\lambda\sum_{t=0}^{T_w - 1}(C(X_t,A_t) - g)\Bigg\}\Bigg]\Bigg), \qquad x\in S.$$

Notice that this definition and (29) with f instead of $\pi$ together yield $\lambda h_{f,g,w}(x) \ge \lambda\|C-g\|M_w$ , and the negativity of $\lambda$ leads to

(30)

\begin{equation} h_{f, g, w}({\cdot}) \le \|C-g\| M_w.\end{equation}

On the other hand, a conditional argument combining Definition 2 with the Markov property yields

\begin{align*} & \mathbb{E}_x^f\Bigg[\exp\Bigg\{\lambda\sum_{t=0}^{T_w-1}(C(X_t,A_t)-g)\Bigg\}\mathbf{1}[X_1\ne w]\mid H_1\Bigg] \\ & = \text{e}^{\lambda(C(X_0,f(X_0)) - g)}\mathbf{1}[X_1\ne w] \mathbb{E}_x^f\Bigg[\exp\Bigg\{\lambda\sum_{t=1}^{T_w-1}(C(X_t,A_t)-g)\Bigg\}\mathbf{1}[X_1\ne w]\mid H_1\Bigg] \\ & = \text{e}^{\lambda(C(X_0,f(X_0)) - g)}\mathbf{1}[X_1\ne w] \mathbb{E}_{X_1}\Bigg[\exp\Bigg\{\lambda\sum_{t=0}^{T_w-1}(C(X_t,A_t)-g)\}\Bigg] \\ & = \text{e}^{\lambda(C(X_0,f(X_0)) - g)}\mathbf{1}[X_1\ne w]\text{e}^{\lambda h_{f, g, w}(X_1)},\end{align*}

and then

(31)

\begin{align} \text{e}^{\lambda h_{f,g,w}(x) } & = \mathbb{E}_x^f\Bigg[\exp\Bigg\{\lambda\sum_{t=0}^{T_w-1}(C(X_t,A_t)-g)\Bigg\}\mathbf{1}[X_1= w]\Bigg] \nonumber \\ & \quad + \mathbb{E}_x^f\Bigg[\exp\Bigg\{\lambda\sum_{t=0}^{T_w-1}(C(X_t,A_t)-g)\Bigg\}\mathbf{1}[X_1\ne w]\Bigg] \nonumber \\ & = \text{e}^{\lambda(C(x,f(x))-g)}\Bigg(p_{x, w}(f(x)) + \sum_{y\ne w}p_{x,y}(f(x))\text{e}^{\lambda h_{f,g,w}(y)}\Bigg).\end{align}

Next, the following lemma shows that if $h_{f,g,w}(w)$ is finite at some point, then $h_{f,g,w}({\cdot})$ is finite on S, and a fundmental property of the mapping $y \mapsto h_{f, g, y}(y)$ will be established, namely, if that function attains a non-negative value at some point, then it is always non-negative.

Lemma 3. Assume that $\lambda \lt 0$ and that Assumptions 1 and 2 are valid. In this context, the following assertions hold for arbitrary $f \in \mathbb{F}$ , $g \in \mathbb{R}$ , and $w\in S$ .

(i) If $h_{f, g, w}(w) \gt -\infty$ , then $h_{f,g,w}(x) $ is finite for every $x\in S$ .
(ii) The following assertions are equivalent:
1. (a) $h_{f, g, w}(w) \ge 0$ .
2. (b) $h_{f, g, w}({\cdot})$ is finite and the following Poisson inequality holds:
  (32) \begin{equation} \text{e}^{\lambda h_{f,g,w}(x)} \ge \text{e}^{\lambda(C(x,f(x))-g)}\sum_{y\in S}p_{x,y}(f(x))\text{e}^{\lambda h_{f,g,w}(y)}, \qquad x\in S. \end{equation}
3. (c) $h_{f, g, y}(y) \ge 0$ for every $y\in S$ .

Proof. For (i), given $x\in S\setminus \{w\}$ , using Lemma 1(ii) pick a positive integer k satisfying

(33)

\begin{equation} \mathbb{P}_w[T_x = k \lt T_w] \gt 0, \end{equation}

and notice that $[T_x = k \lt T_w] = [X_t \notin \{x, w\}, 1\le t \lt k, X_k = x]$ belongs to $\mathcal{F}_k$ (see (2)), so that

\begin{align*} & \mathbb{E}_w^f\Bigg[\exp\Bigg\{\lambda\sum_{t=0}^{T_w-1}(C(X_t,A_t)-g)\Bigg\} \mid \mathcal{F}_k\Bigg] \\ & \ge \mathbb{E}_w^f\Bigg[\mathbf{1}[T_x = k \lt T_w] \exp\Bigg\{\lambda\sum_{t=0}^{T_w-1}(C(X_t,A_t)-g)\Bigg\} \mid \mathcal{F}_k\Bigg] \\ & = \exp\Bigg\{\lambda\sum_{t=0}^{k-1}(C(X_t,A_t)-g)\Bigg\}\mathbf{1}[T_x = k \lt T_w] \mathbb{E}_w^f\Bigg[\exp\Bigg\{\lambda\sum_{t=k}^{T_w-1}(C(X_t,A_t)-g)\Bigg\}\mid\mathcal{F}_k\Bigg] \\ & \ge \text{e}^{\lambda k\|C - g\|}\mathbf{1}[T_x = k \lt T_w] \mathbb{E}_w^f\Bigg[\exp\Bigg\{\lambda\sum_{t=k}^{T_w-1}(C(X_t,A_t)-g)\Bigg\}\mid\mathcal{F}_k\Bigg] \\ & = \text{e}^{\lambda k\|C - g\|}\mathbf{1}[T_x = k \lt T_w] \mathbb{E}_{X_k}^f\Bigg[\exp\Bigg\{\lambda\sum_{t=0}^{T_w-1}(C(X_t,A_t)-g)\Bigg\}\Bigg], \end{align*}

where the last equality is due to the Markov property. Since $X_k = x$ on the event $[T_x = k]$ , it follows from Definition 2 that

\begin{align*} & \mathbb{E}_w^f\Bigg[\exp\Bigg\{\lambda\sum_{t=0}^{T_w-1}(C(X_t,A_t)-g)\Bigg\}\mid\mathcal{F}_k\Bigg] \\ & \ge \text{e}^{\lambda k\|C - g\|}\mathbf{1}[T_x = k \lt T_w] \mathbb{E}_x^f\Bigg[\exp\Bigg\{\lambda\sum_{t=0}^{T_w-1}(C(X_t,A_t)-g)\Bigg\}\Bigg] \\ & = \text{e}^{\lambda k\|C - g\|}\mathbf{1}[T_x = k \lt T_w]\text{e}^{\lambda h_{f,g,w}(x)}, \end{align*}

so that

$$\text{e}^{\lambda h_{f, g, w}(w)} = \mathbb{E}_w^f\Bigg[\exp\Bigg\{\lambda\sum_{t=0}^{T_w-1}(C(X_t,A_t)-g)\Bigg\}\Bigg] \ge \text{e}^{\lambda k\|C - g\|}\text{e}^{\lambda h_{f,g,w}(x)}\mathbb{P}_w^f[T_x = k \lt T_w],$$

and then (33) yields

$$h_{f,g,w}(w) \gt -\infty \implies \lambda h_{f,g,w}(w) \lt\infty \implies \lambda h_{f, g, w}(x) \lt\infty \implies h_{f, g, w}(x) \gt -\infty.$$

Since $x\in S\setminus \{w\}$ is arbitrary in this argument, this last display and (30) together yield that if $h_{f,g,w}(w)$ is finite then $h_{f,g,w}(x)$ is finite for every $x\in S$ .

For (ii) (c) $\implies$ (a), the implication is clear. For (a) $\implies$ (b), suppose that $h_{f,g,w}(w) \ge 0$ , so that $h_{f,g,w}({\cdot}) $ is a finite function, by part (i), and $1 \ge \text{e}^{\lambda h_{f,g,w}(w)}$ , since $\lambda$ is negative. Thus, from (31),

$$\text{e}^{\lambda h_{f,g,w}(x)} \ge \text{e}^{\lambda(C(x,f(x))-g)}\Bigg(p_{x,w}(f(x))\text{e}^{\lambda h_{f,g,w}(w)} + \sum_{y\ne w}p_{x,y}(f(x)) \text{e}^{\lambda h_{f,g,w}(y)}\Bigg), \quad x\in S,$$

and (32) follows. For (b) $\implies$ (c), let $y\in S$ be arbitrary but fixed. Consider the new MDP $\mathcal{M}_f$ obtained from $\mathcal{M}$ by setting $ A(x) = \{f(x)\}$ . In this case f is the unique policy in this new model, and (32) establishes that the pair $(g,h_{f,g,w}({\cdot}))$ belongs to the set $\tilde{\mathcal{G}}(\mathcal{M}_f)$ of subsolutions corresponding to $\mathcal{M}_f$ , so that the conclusions of Lemma 2(i) applied to model $\mathcal{M}_f$ with y instead of w hold. In particular, inequality (27) with f instead of $\pi$ is valid, and then $1 \ge \mathbb{E}_y^f\big[\exp\big\{\lambda\sum_{t=0}^{T_y-1}(C(X_t,A_t)-g)\big\}\big] = \text{e}^{\lambda h_{f,g,y}(y)}$ , so that $h_{f,g,y}(y)\ge 0$ .

The next result is a natural complement of Lemma 3(ii).

Lemma 4. Let $f\in \mathbb{F}$ , $g\in\mathbb{R}$ , and $w\in S$ be arbitrary but fixed. Under the conditions in Lemma 3, the following assertions hold.

(i) $h_{f,g,w}(w) \gt 0 \iff h_{f,g,y}(y) \gt 0$ for every $y\in S$ .
(ii) $h_{f,g,w}(w) = 0 \iff h_{f,g,y}(y) = 0$ for every $y\in S$ .

Proof. For (i), the $\Longleftarrow$ part is clear. To prove the $\implies$ , suppose that $h_{f,g,w}(w) \gt 0$ . In this case $h_{f,g,w}({\cdot})$ is finite by Lemma 3, and, recalling that $\lambda$ is negative, $1 \gt \text{e}^{\lambda h_{f,g,w}(w)} = \mathbb{E}_w^f\big[\exp\big\{\lambda\sum_{t=0}^{T_w-1}(C(X_t,A_t)-g)\big\}\big]$ . Thus,

(34)

\begin{equation} 1 = \text{e}^{\lambda\rho}\mathbb{E}_w^f\Bigg[\exp\Bigg\{\lambda\sum_{t=0}^{T_w-1}(C(X_t,A_t)-g)\Bigg\}\Bigg] \quad \hbox{for some } \rho \lt 0. \end{equation}

Next, define the cost function $\hat {C}$ by $\hat C(x,\cdot) = C(x,\cdot)$ , $x\ne w$ , $\hat{C}(w,\cdot) = C(w,\cdot) + \rho$ , and consider the new model $\hat{\mathcal{M}}$ obtained from $\mathcal{M}$ by replacing C by $\hat{C}$ . In this case, if $\hat{h}_{f,g,w}$ is the total relative cost in Definition 2 associated with model $\hat{\mathcal{M}}$ , (34) yields that $\text{e}^{\lambda\hat{h}_{f,g,w}(w)} = 1$ , i.e. $\hat{h}_{f,g,w}(w) = 0$ , and then the equivalence of assertions (a) and (c) in Lemma 3(ii) applied to model $\hat{\mathcal{M}}$ implies that

(35)

\begin{equation} \hat{h}_{f,g,y}(y) \ge 0 \quad \hbox{for every state } y. \end{equation}

Next, combining this fact with the inequality $h_{f,g,w}(w) \gt 0$ , we show that $h_{f,g,y}(y) \gt 0$ for $y\in S$ . Since $h_{f,g,w}(w) \gt 0$ , to achieve this goal it is sufficient to establish the above inequality for $y\ne w$ . In this context, observe that (35) yields

\begin{align*} 1 \ge \text{e}^{\lambda\hat h_{f,g,y}(y)} & = \mathbb{E}_y^f\Bigg[\exp\Bigg\{\lambda\sum_{t=0}^{T_y-1}(\hat C(X_t,A_t)-g)\Bigg\}\Bigg] \\ & = \mathbb{E}_y^f\Bigg[\exp\Bigg\{\lambda\sum_{t=0}^{T_y-1}(C(X_t,A_t)-g)\Bigg\} \exp\Bigg\{\lambda\rho\sum_{t=0}^{T_y-1}\mathbf{1}_{\{w\}}(X_t)\Bigg\}\Bigg] \\ & \gt \mathbb{E}_y^f\Bigg[\exp\Bigg\{\lambda\sum_{t=0}^{T_y-1}(C(X_t,A_t)-g)\Bigg\}\Bigg] = \text{e}^{\lambda h_{f,g,y}(y)} , \end{align*}

where the strict inequality follows by observing that (a) $\exp\big\{\lambda\rho\sum_{t=0}^{T_y-1}\mathbf{1}_{\{w\}}(X_t)\big\} \ge 1$ , since $\lambda$ and $\rho$ are negative, and that (b) $\exp\big\{\lambda\rho\sum_{t=0}^{T_y-1}\mathbf{1}_{\{w\}}(X_t)\big\} \gt 1$ on the event $[T_w \lt T_y]$ , which has positive probability with respect to $\mathbb{P}_y^f$ by Lemma 1(ii), whereas the last equality is due to Definition 2. Thus, $1 \gt \text{e}^{\lambda h_{f,g,y}(y)}$ , and then $h_{f,g,y}(y) \gt 0$ , since $\lambda$ is negative. This completes the proof of part (i).

For (ii), the assertion follows by combining part (i) with the equivalence of properties (a) and (c) in Lemma 3(ii).

5. Optimality equation in a special case

The objective of this section is to provide sufficient conditions to ensure the existence of a bounded solution to the optimality equation (14). The result in this direction requires special conditions on the cost function and the action sets, and is stated in Theorem 2. That result will be established using the following lemma.

Lemma 5. Let $\lambda \lt 0$ and $f\in \mathbb{F}$ be arbitrary but fixed. Suppose that the cost function $C(\cdot, f({\cdot}))$ under f satisfies the following requirements:

(a) $C(x,f(x)) \le 0$ for every $x\in S$ ;
(b) $C(\cdot,f({\cdot}))$ has finite support, i.e.
(36) \begin{equation} F \,:\!=\, \{x \in S \mid C(x,f(x)) \lt 0\}\ { is\ finite}. \end{equation}

In this case there exists a real number g(f) such that the following properties are valid:

(i) $g(f)\le 0$ , and the functions $\{h_{f,g(f),w}\}_{w \in S}$ satisfy
(37) \begin{equation} \text{e}^{\lambda h_{f,g(f),w}(w)} = \mathbb{E}_w^f\Bigg[\exp\Bigg\{\lambda\sum_{t=0}^{T_w-1}(C(X_t,A_t)-g(f))\Bigg\}\Bigg] = 1, \qquad w\in S; \end{equation}
see Definition 2.
(ii) For each $w\in S$ , the function $h_{f,g(f),w}({\cdot})$ is finite and the following Poisson equation holds:
(38) \begin{equation} \text{e}^{\lambda h_{f,g(f),w}(x)} = \text{e}^{\lambda(C(x,f(x))-g(f))}\sum_y p_{x,y}(f(x))\text{e}^{\lambda h_{f,g(f),w}(y)}, \qquad x\in S. \end{equation}
(iii) The function $h_{f,g(f),w}({\cdot})$ is bounded for each $w\in S$ . More explicitly,
$$-\infty \lt\inf_{x\in S}h_{f,g(f),w}(x) \le h_{f,g(f),w}({\cdot}) \le M_w\|C - g(f)\|.$$
(iv) Given $w\in S$ , for $n\in \mathbb{N}$ set
(39) \begin{equation} W_n(f) \,:\!=\, \exp\Bigg\{\lambda\sum_{t=0}^{n\land T_w-1}(C(X_t,A_t)-g(f)) + \lambda h_{f,g,w}(X_{n\land T_w})\Bigg\}. \end{equation}
With this notation, $W_n(f) = \mathbb{E}_x^f\big[\exp\big\{\lambda\sum_{t=0}^{T_w-1}(C(X_t,A_t)-g(f))\big\} \mid \mathcal{F}_n\big]$ , $\mathbb{P}_x^f$ -a.s., $x \in S$ , $n\in\mathbb{N}$ ; see (2), (19), and (20).

Proof. Recall that the equality $A_t = f(X_t)$ is always valid when the system is driven using f.

The argument to prove (i) is by induction on $|F|$ , the number of elements of the set F in (36). If $|F| = 0$ then $C(\cdot,f({\cdot})) = 0$ , so that, setting $g(f) = 0$ , it follows that (37) holds for each $x\in S$ . Next, assume that, for some nonnegative integer n, the desired conclusion is valid when the support of $C(\cdot,f({\cdot}))$ has n elements, and suppose that $|F| = n+1$ . Pick $\tilde x \in F$ , so that

(40)

\begin{equation} C(\tilde x, f(\tilde x )) \lt 0, \end{equation}

and set $\tilde{F} = F \setminus \{\tilde{x}\}$ , $\tilde{C}(x,a) = C(x,a)\mathbf{1}_{\tilde{F}}(x)$ , $(x,a) \in \mathbb{K}$ , so that

(41)

\begin{equation} F = \tilde{F} \cup \{\tilde{x}\}, \quad C(x,f(x)) = C(\tilde x,f(\tilde x))\mathbf{1}_{\{\tilde x\}}(x) + \tilde C(x,f(x)), \qquad x\in S. \end{equation}

It follows that the support of the function $\tilde{C}(\cdot,f({\cdot}))$ is the set $\tilde{F}$ , which has n elements. Thus, by the induction hypothesis there exists $\tilde{g}(f)\le 0$ such that (37) holds with $\tilde{C}$ and $\tilde{g}(f)$ instead of C and g(f), respectively; in particular,

\begin{equation*} \mathbb{E}_{\tilde x}^f\Bigg[\exp\Bigg\{\lambda\sum_{t=0}^{T_{\tilde x}-1}(\tilde C(X_t,A_t) - \tilde g(f))\Bigg\}\Bigg] = 1. \end{equation*}

Next, observe that (41) yields

\begin{align*} \sum_{t=0}^{T_{\tilde x}-1}(C(X_t,f(X_t)) - \tilde g(f)) & = C(\tilde x,f(\tilde x))\sum_{t=0}^{T_{\tilde x}-1}\mathbf{1}_{\{\tilde x\}}(X_t) + \sum_{t=0}^{T_{\tilde x}-1}(\tilde C(X_t,f(X_t)) - \tilde g(f))) \\ & = C(\tilde x,f(\tilde x))\mathbf{1}_{\{\tilde x\}}(X_0) + \sum_{t=0}^{T_{\tilde x}-1}(\tilde{C}(X_t,f(X_t)) - \tilde g(f))) \\ & = C(\tilde x,f(\tilde x)) + \sum_{t=0}^{T_{\tilde x}-1}(\tilde{C}(X_t,f(X_t)) - \tilde g(f))) \qquad \mathbb{P}_{\tilde x}^f\hbox{-a.s.}, \end{align*}

where the second equality was obtained using that $X_t \ne \tilde x$ for $1 \le t \lt T_{\tilde x}$ , by (19) and (20), whereas the relation $\mathbb{P}_{\tilde x}^f[X_0 = \tilde x] = 1$ was used in the last step. Thus, (37) leads to

(42)

\begin{equation} \mathbb{E}_{\tilde x}^f\Bigg[\exp\Bigg\{\lambda\sum_{t=0}^{T_{\tilde x}-1}(C(X_t,A_t) - \tilde g(f))\Bigg\}\Bigg] = \text{e}^{\lambda C(\tilde x,f(\tilde x))} \gt 1, \end{equation}

where, recalling that $\lambda \lt 0$ , the inequailty is due to (40). Now observe that

$$\exp\Bigg\{\lambda\sum_{t=0}^{T_{\tilde x}-1}(C(X_t,A_t)-\gamma)\Bigg\} \searrow 0 \ \hbox{as } \gamma \searrow -\infty,$$

whereas, since the mapping $\gamma \mapsto \exp\big\{\lambda\sum_{t=0}^{T_{\tilde x}-1}(C(X_t,A_t)-\gamma)\big\}$ is increasing,

$$ \exp\Bigg\{\lambda\sum_{t=0}^{T_{\tilde x}-1}(C(X_t,A_t) - \gamma)\Bigg\} \le \exp\Bigg\{\lambda\sum_{t=0}^{T_{\tilde x}-1}(C(X_t,A_t) - \tilde g(f))\Bigg\}, \qquad \gamma \le \tilde g(f). $$

Combining (42) with the two previous displays, the dominated convergence theorem implies that $v(\gamma) \,:\!=\, \mathbb{E}_{\tilde x}^f\big[\exp\big\{\lambda\sum_{t=0}^{T_{\tilde x}-1}(C(X_t,A_t) - \gamma)\big\}\big]$ is continuous in $\gamma \in ({-}\infty,\tilde g(f)]$ , and that $v(\gamma) \searrow 0$ as $\gamma \searrow -\infty$ ; since $v(\tilde g(f)) = \text{e}^{\lambda C(\tilde x,f(\tilde x))} \gt 1$ , by (42), the intermediate value property implies that there exists $g(f) \in ({-}\infty,\tilde g(f)) \subset({-}\infty,0)$ such that

$$ 1 = v(g(f)) = \mathbb{E}_{\tilde x}^f\Bigg[\exp\Bigg\{\lambda\sum_{t=0}^{T_{\tilde x}-1}(C(X_t,A_t) - g(f))\Bigg\}\Bigg] = \text{e}^{\lambda h_{f,g(f),\tilde x}(\tilde x)}. $$

From this point, Lemma 4(ii) yields that (37) holds for every $x\in S$ . This completes the induction argument.

For (ii), since $\text{e}^{\lambda h_{f,g(f),w}(w)} = 1$ for every $w\in S$ , by part (i), Lemma 3(i) yields that $h_{f,g(f),w}({\cdot})$ is finite, and (38) follows from (31).

For (iii), recall that the equality $\mathbb{P}_x^\pi[T_w \lt\infty] = 1$ is always valid, by Lemma 1(i). Now, define the finite set $G = \{w\}\cup F$ such that $T_G\le T_w$ and $T_G \le T_F$ , and then

(43)

\begin{equation} C(X_t, f(X_t)) = 0, \qquad 1\le t \lt T_G. \end{equation}

Next, select an initial state

(44)

\begin{equation} X_0 = x\in (S\setminus G)\subset (S\setminus F), \end{equation}

so that

(45)

\begin{equation} C(X_0, f(X_0)) = 0, \end{equation}

and observe that

\begin{align*} & \exp\Bigg\{\lambda\sum_{t=0}^{T_w-1}(C(X_t,f(X_t)) - g(f))\Bigg\} \\ & = \mathbf{1}[T_G = T_w]\exp\Bigg\{\lambda\sum_{t=0}^{T_w-1}(C(X_t,f(X_t)) - g(f))\Bigg\} \\ & \quad + \mathbf{1}[T_G \lt T_w]\exp\Bigg\{\lambda\sum_{t=0}^{T_{w}-1}(C(X_t,f(X_t)) - g(f))\Bigg\} \\ & = \mathbf{1}[T_G = T_w]\exp\Bigg\{\lambda\sum_{t=0}^{T_G-1}(C(X_t,f(X_t)) - g(f))\Bigg\} \\ & \quad + \mathbf{1}[T_G \lt T_w]\exp\Bigg\{\lambda\Bigg(\sum_{t=0}^{T_{G}-1}(C(X_t,f(X_t)) - g(f)) + \sum_{t=T_G}^{T_{w}-1}(C(X_t,f(X_t)) - g(f))\Bigg)\Bigg\} \\ & = \mathbf{1}[T_G = T_w]\text{e}^{-\lambda g(f)T_G} + \mathbf{1}[T_G \lt T_w]\exp\Bigg\{{-}\lambda g(f)T_G + \lambda\sum_{t=T_G}^{T_{w}-1}(C(X_t,f(X_t)) - g(f))\Bigg\} \\ & \le \mathbf{1}[T_G = T_w] + \mathbf{1}[T_G \lt T_w]\exp\Bigg\{\lambda\sum_{t=T_G}^{T_{w}-1}(C(X_t,f(X_t)) - g(f))\Bigg\}, \end{align*}

where the third equality is due to (43)–(45), and the inequality stems from the negativity of $\lambda$ and g(f). It follows that, for every $x\in S\setminus G$ ,

\begin{align*} \text{e}^{\lambda h_{f,g(f),w}(x)} & = \mathbb{E}_x^f\Bigg[\exp\Bigg\{\lambda\sum_{t=0}^{T_w-1}(C(X_t,f(X_t)) - g(f))\Bigg\}\Bigg] \\ & \le \mathbb{P}_x^f[T_G=T_w] + \mathbb{E}_x^f\Bigg[\mathbf{1}[T_G \lt T_w]\exp\Bigg\{\lambda\sum_{t=T_G}^{T_{w}-1}(C(X_t,f(X_t))-g(f))\Bigg\}\Bigg] \\ & = \mathbb{P}_x^f[T_G \lt T_w] + \mathbb{E}_x^f\big[\mathbf{1}[T_G \lt T_w]\text{e}^{\lambda h_{f,g(f),w}(X_{T_G})}\big], \end{align*}

where, via a conditional argument, the second equality follows from the Markov property. Thus, since $X_{T_G} \in G$ on the event $[T_G \lt \infty]$ and $\mathbb{P}_x^f[T_G \lt \infty] \ge \mathbb{P}_x^f[T_w \lt \infty] = 1$ , it follows that

\begin{equation*} \text{e}^{\lambda h_{f,g(f),w}(x)} \le 1 + \exp\Big\{\max_{y \in G}|\lambda h_{f, g(f), w}(y)|\Big\} \le 2\exp\Big\{\max_{y \in G} |\lambda h_{f, g(f), w}(y)|\Big\}, \quad x\in S\setminus G, \end{equation*}

and then $h_{f,g(f),w}(x) \ge \lambda^{-1}[\log(2) + \max_{y \in G}|\lambda h_{f,g(f),w}(y)|]$ when $x\in S\setminus G$ ; since $h_{f,g(f),w}({\cdot})$ is a finite function and the set G is finite, it follows that $h_{f,g(f),w}({\cdot})$ is bounded from below.

For (iv), looking at (2), (19), and (20) we observe that, for each $x, w\in S$ and $n, k \in\mathbb{N}\setminus \{0\}$ with $k \le n$ , the random variables $\mathbf{1}[T_w = k]$ , $\sum_{t=0}^{k-1}(C(X_t,A_t) - g(f))$ , and $\mathbf{1}[T_w \gt n ]$ are $\mathcal{F}_n$ -measurable, and then

\begin{align*} & \mathbb{E}_x^f\Bigg[\exp\Bigg\{\lambda\sum_{t=0}^{T_w-1}(C(X_t,A_t)-g(f))\Bigg\}\mid\mathcal{F}_n\Bigg] \\ & = \mathbb{E}_x^f\Bigg[\sum_{k=1}^{n}\mathbf{1}[T_w = k] \exp\Bigg\{\lambda\sum_{t=0}^{k-1}(C(X_t,A_t)-g(f))\Bigg\} \mid \mathcal{F}_n\Bigg] \\ & \quad + \mathbb{E}_x^f\Bigg[\mathbf{1}[T_w \gt n]\exp\Bigg\{\lambda\sum_{t=0}^{n-1}(C(X_t,A_t)-g(f))\Bigg\} \exp\Bigg\{\lambda\sum_{t=n}^{T_w-1}(C(X_t,A_t)-g(f))\Bigg\} \mid \mathcal{F}_n\Bigg] \\ & = \sum_{k=1}^{n}\mathbf{1}[T_w = k]\exp\Bigg\{\lambda\sum_{t=0}^{k-1}(C(X_t,A_t)-g(f))\Bigg\} \\ & \quad + \mathbf{1}[T_w \gt n]\exp\Bigg\{\lambda\sum_{t=0}^{n-1}(C(X_t,A_t)-g(f))\Bigg\} \mathbb{E}_x^f\Bigg[\exp\Bigg\{\lambda\sum_{t=n}^{T_w-1}(C(X_t,A_t)-g(f))\Bigg\}\mid\mathcal{F}_n\Bigg] \\ & = \mathbf{1}[T_w \le n]\exp\Bigg\{\lambda\sum_{t=0}^{n\land T_w-1}(C(X_t,A_t)-g(f))\Bigg\} \\ & \quad + \mathbf{1}[T_w \gt n]\exp\Bigg\{\lambda\sum_{t=0}^{n-1}(C(X_t,A_t)-g(f))\Bigg\}\text{e}^{\lambda h_{f,g(f),w}(X_n)} \\ & = \mathbf{1}[T_w \le n]exp\Bigg\{\lambda\sum_{t=0}^{n\land T_w-1}(C(X_t,A_t)-g(f))\Bigg\} \\ & \quad + \mathbf{1}[T_w \gt n]\exp\Bigg\{\lambda\sum_{t=0}^{n\land T_w-1}(C(X_t,A_t)-g(f))\Bigg\} \text{e}^{\lambda h_{f,g(f),w}(X_{n\land T_w})}, \end{align*}

where the Markov property was used to set the third equality. Next, notice that on the event $[T_w \le n]$ the equality $X_{n\land T_w} = w$ holds, and in this case $\text{e}^{\lambda h_{f,g(f),w}(X_{n\land T_w})} = \text{e}^{\lambda h_{f,g(f),w}(w)} = 1$ , by part (i). Combining this fact with the previous display, it follows that

\begin{align*} & \mathbb{E}_x^f\Bigg[\exp\Bigg\{\lambda\sum_{t=0}^{T_w-1}(C(X_t,A_t)-g(f))\Bigg\}\mid\mathcal{F}_n\Bigg] \\ & = \mathbf{1}[T_w \le n]\exp\Bigg\{\lambda\sum_{t=0}^{n\land T_w-1}(C(X_t,A_t)-g(f))\Bigg\} \text{e}^{\lambda h_{f,g(f),w}(X_{n\land T_w})} \\ & \quad + \mathbf{1}[T_w \gt n]\exp\Bigg\{\lambda\sum_{t=0}^{n\land T_w-1}(C(X_t,A_t)-g(f))\Bigg\} \text{e}^{\lambda h_{f,g(f),w}(X_{n\land T_w})}\\ & = \exp\Bigg\{\lambda\sum_{t=0}^{n\land T_w-1}(C(X_t,A_t)-g(f))\Bigg\} \text{e}^{\lambda h_{f,g(f),w}(X_{n\land T_w})} = W_n(f), \end{align*}

completing the proof.

Next, the previous lemma will be used to establish the main result of the section, which will be derived under the following conditions.

Assumption 3. There exists a finite set $F\subset S$ such that the cost function and the action sets satisfy the following properties.

(i) For each $x\in S\setminus F$ , the action set A(x) is a singleton, whereas A(x) is finite for $x\in F$ .
(ii) If $x\in S\setminus F$ , $C(x,\cdot) = 0$ and $C(x,\cdot) \le 0$ if $x\in F$ .

Theorem 2. Suppose that $\lambda \lt 0$ , and that Assumptions 2 and 3 are valid. In this context, for each $w\in S$ , the following assertions hold.

(i) For each stationary policy $\tilde f$ , the cost function $C(\cdot, \tilde f({\cdot}))$ satisfies conditions (a) and (b) in Lemma 5, and there exists $f\in\mathbb{F}$ such that the pair $(g(f),h_{f,g(f),w}({\cdot})) \in ({-}\infty, 0] \times \mathcal{B}(S)$ satisfies
\begin{align*} \text{e}^{\lambda g(f) + \lambda h_{f,g(f),w}(x)} & = \sup_{a\in A(x)}\Bigg[\text{e}^{\lambda C(x,a)}\sum_{y\in S}p_{x,y}(a)\text{e}^{\lambda h_{f,g(f),w}(y)}\Bigg], \cr & = \text{e}^{\lambda C(x,f(x))}\sum_{y\in S}p_{x,y}(f(x))\text{e}^{\lambda h_{f,g(f),w}(y)}, \qquad x\in S. \end{align*}
(ii) $J_*({\cdot}) = g(f) = J^{\ast}({\cdot})$ , and $g(f) = \lim_{n\to\infty}({1/n})J_n(f, x)$ , $x\in S$ .

Proof. Starting with part (i), notice that Assumption 3 implies that the space $\mathbb{F}$ of stationary policies is finite and that, for each $\tilde f\in\mathbb{F}$ , the nonpositive cost function $C(\cdot,\tilde f({\cdot}))$ has finite support, so that conditions (a) and (b) in Lemma 5 hold for every $\tilde f\in\mathbb{F}$ ; see (36). It follows that, for each $\tilde f \in \mathbb{F}$ , there exists $g(\tilde f)$ such that, for each state w, the pair $(g(\tilde f),h_{\tilde f,g(\tilde f),w}({\cdot}))\in({-}\infty,0]\times\mathcal{B}(S)$ satisfies the conclusions in Lemma 5. Pick $f\in \mathbb{F}$ such that

(46)

\begin{equation} g(f) = \min_{\tilde f\in\mathbb{F}}g(\tilde f). \end{equation}

Now let $w\in S$ be arbitrary but fixed, and note that (38) holds by Lemma 5(ii), so that

(47)

\begin{equation} \text{e}^{\lambda h_{f,g(f),w}(x)} \le \sup_{a\in A(x)}\Bigg[\text{e}^{\lambda (C(x,a)-g(f))} \sum_y p_{x,y}(a)\text{e}^{\lambda h_{f,g(f),w}(y)}\Bigg], \qquad x\in S. \end{equation}

Thus, to establish the first assertion it is sufficient to show that the equality holds in the above relation. To achieve this goal, recall that the space A(y) is finite for every $y\in S$ , which allows us to pick $f^{\ast}\in \mathbb{F}$ such that

(48)

\begin{multline} \text{e}^{\lambda(C(x,f^{\ast}(x))-g(f))}\sum_y p_{x,y}(f^{\ast}(x))\text{e}^{\lambda h_{f,g(f),w}(y)} \\ = \sup_{a\in A(x)}\Bigg[\text{e}^{\lambda(C(x,a)-g(f))}\sum_y p_{x,y}(a)\text{e}^{\lambda h_{f,g(f),w}(y)}\Bigg], \qquad x\in S. \end{multline}

Combining these two last displays, it follows that, for each state $x\in S$ ,

$$\text{e}^{\lambda h_{f,g(f),w}(x)} \le \text{e}^{\lambda(C(x,f^{\ast}(x))-g(f))}\sum_y p_{x,y}(f^{\ast}(x))\text{e}^{\lambda h_{f,g(f),w}(y)},$$

and then there exists a function $\Delta\colon S\to\mathbb{R}$ such that

(49)

\begin{equation} \Delta({\cdot}) \ge 0 \end{equation}

and

(50)

\begin{equation} \text{e}^{\lambda h_{f,g(f),w}(x)} = \text{e}^{-\Delta(x)+\lambda(C(x,f^{\ast}(x))-g(f))}\sum_y p_{x,y}(f^{\ast}(x)) \text{e}^{\lambda h_{f,g(f),w}(y)}, \qquad x\in S; \end{equation}

notice that $g(f) - g(f^{\ast})\le 0$ , by (46), and then

(51)

\begin{equation} \lambda(g(f) - g(f^{\ast}))\ge 0, \end{equation}

since $\lambda \lt 0$ . To complete the proof of part (i) it is sufficient to show that

(52)

\begin{equation} \Delta ({\cdot}) = 0, \end{equation}

since this equality combined with (48) and (50) yields that the equality occurs in (47). To establish (52), recall that the equality $A_n = f^{\ast}(X_n)$ always holds with probability 1 under $f^{\ast}$ , so that, via (50) and the Markov property, it follows that, for each $x\in S$ and $n\in\mathbb{N}$ ,

(53)

\begin{align} \text{e}^{\lambda h_{f,g(f),w}(X_n)} & = \text{e}^{-\Delta(X_n) + \lambda(C(X_n,f^{\ast}(X_n))-g(f))}\sum_y p_{X_n,y}(f^{\ast}(X_n))\text{e}^{\lambda h_{f,g(f),w}(y)} \nonumber \\ & = \text{e}^{-\Delta(X_n)+\lambda(C(X_n,A_n)-g(f))} \mathbb{E}_x^{f^{\ast}}\big[\text{e}^{\lambda h_{f,g(f),w}(X_{n+1})}\mid\mathcal{F}_n\big], \quad \mathbb{P}_x^{f^{\ast}}\hbox{-a.s.} \end{align}

Now define the positive random variables

(54)

\begin{equation} \begin{aligned} V_0 &= \text{e}^{\lambda h_{f,g(f),w}(X_0)}, \\ V_n & = \exp\Bigg\{\sum_{t=0}^{n-1}[{-}\Delta(X_t)+\lambda(C(X_t,A_t)-g(f))] + \lambda h_{f,g(f),w}(X_n)\Bigg\}, \quad n\in\mathbb{N}. \end{aligned} \end{equation}

Observing that under $f^{\ast}$ the random variable $\sum_{t=0}^{n}[{-}\Delta(X_t)+\lambda(C(X_t,A_t)-g(f))]$ is $\mathcal{F}_n$ -measurable, and using the Markov property, it follows that, for every $x\in S$ and $n\in\mathbb{N}$ ,

\begin{align*} & \mathbb{E}_x^{f^{\ast}}\Bigg[\exp\Bigg\{\sum_{t=0}^n[{-}\Delta(X_t)+\lambda(C(X_t,A_t)-g(f))] + \lambda h_{f,g(f),w}(X_{n+1})\Bigg\} \mid \mathcal{F}_n\Bigg] \\ & = \exp\Bigg\{\sum_{t=0}^n[{-}\Delta(X_t)+\lambda(C(X_t,A_t)-g(f))]\Bigg\} \mathbb{E}_x^{f^{\ast}}[\text{e}^{\lambda h_{f,g(f),w}(X_{n+1})} \mid \mathcal{F}_n] \\ & = \exp\Bigg\{\sum_{t=0}^{n-1}[{-}\Delta(X_t)+\lambda(C(X_t,A_t)-g(f))]\Bigg\}\text{e}^{\lambda h_{f,g(f),w}(X_n)}, \end{align*}

where the last equality is due to (53). This last relation and (54) lead to $\mathbb{E}_x^{f^{\ast}}[V_{n+1}\mid\mathcal{F}_n] = V_n$ , i.e. $\{(V_n,\mathcal{F}_n)\}_{n\in\mathbb{N}}$ is a martingale with respect to $\mathbb{P}_x^{f^{\ast}}$ . Via the optional sampling theorem it follows that, for every $x\in S$ and $n\in\mathbb{N}$ ,

\begin{align*} \text{e}^{\lambda h_{f,g(f),w}(x)} & = \mathbb{E}_x^{f^{\ast}}[V_0] = \mathbb{E}_x^{f^{\ast}}[V_{n\land T_w}] \\ & = \mathbb{E}_x^{f^{\ast}}\Bigg[\exp\Bigg\{\sum_{t=0}^{n\land T_w-1}[\lambda(C(X_t,A_t)-g(f))-\Delta(X_t)] + \lambda h_{f,g(f),w}(X_{n\land T_w})\Bigg\}\Bigg]. \end{align*}

Recalling that $\mathbb{P}_x^{f^{\ast}}[T_w \lt \infty] = 1$ , by Lemma 1(i), and using that $X_{T_w} = w$ on the event $[T_w \lt \infty]$ , it follows that

(55)

\begin{align} \lim_{n\to\infty}V_{n\land T_w} & = \lim_{n\to\infty}\exp\Bigg\{\sum_{t=0}^{n\land T_w-1}[\lambda(C(X_t,A_t)-g(f))-\Delta(X_t)] + \lambda h_{f,g(f),w}(X_{n\land T_w})\Bigg\} \nonumber \\ & = \exp\Bigg\{\sum_{t=0}^{T_w-1}[\lambda(C(X_t,A_t)-g(f))-\Delta(X_t)] + \lambda h_{f,g(f),w}(w)\Bigg\}, \quad \mathbb{P}_x^{f^{\ast}}\hbox{-a.s.} \end{align}

Next, we compare $V_n$ with the random variable $W_n(f^{\ast})$ introduced in Lemma 5; see (39). Observe that

$$ \sum_{t=0}^{n\land T_w-1}[{-}\Delta(X_t)+\lambda(C(X_t,A_t)-g(f))] = \sum_{t=0}^{n\land T_w-1}\lambda(C(X_t,A_t)-g(f^{\ast})) - Q_n, $$

where

\begin{equation*} Q_n \,:\!=\, (n\land T_{w})\lambda(g(f)-g(f^{\ast})) + \sum_{t=0}^{n\land T_w-1}\Delta(X_t) \ge 0, \end{equation*}

and the inequality is due to (49) and (51). Thus,

\begin{align*} 0 \le V_{n\land T_w} & = \exp\Bigg\{\sum_{t=0}^{n\land T_w-1}[{-}\Delta(X_t)+\lambda(C(X_t,A_t)-g(f))] + \lambda h_{f,g(f),w}(X_{n\land T_w})\Bigg\} \\ & = \exp\Bigg\{\sum_{t=0}^{n\land T_w-1}[\lambda(C(X_t,A_t)-g(f^{\ast}))] + \lambda h_{f,g(f),w}(X_{n\land T_w})\Bigg\}\text{e}^{-Q_n} \\ & \le \exp\Bigg\{\sum_{t=0}^{n\land T_w-1}[\lambda(C(X_t,A_t)-g(f^{\ast}))] + \lambda h_{f,g(f),w}(X_{n\land T_w})\Bigg\} \\ & \le \exp\Bigg\{\sum_{t=0}^{n\land T_w-1}[\lambda(C(X_t,A_t)-g(f^{\ast}))] + \lambda h_{f^{\ast},g(f^{\ast}),w}(X_{n\land T_w})\Bigg\}\text{e}^{|\lambda|\,\|h_{f,g(f),w}-h_{f^{\ast},g(f^{\ast}),w}\|} \\ & = W_n(f^{\ast})\text{e}^{|\lambda|\,\|h_{f,g(f),w}-h_{f^{\ast},g(f^{\ast}),w}\|}, \end{align*}

where $W_n(f^{\ast})$ is as in (39). Since the equality

$$ W_n(f^{\ast}) = \mathbb{E}_x^{f^{\ast}}\Bigg[\exp\Bigg\{\lambda\sum_{t=0}^{T_w-1}(C(X_t,A_t)-g(f^{\ast}))\Bigg\} \mid \mathcal{F}_n\Bigg] $$

holds $\mathbb{P}_x^{f^{\ast}}$ -a.s. for every $x\in S$ and $n\in\mathbb{N}$ , by Lemma 5(iv), it follows that $\{W_n(f^{\ast})\}_{n\in\mathbb{N}}$ is uniformly integrable with respect to $\mathbb{P}_x^{f^{\ast}}$ , i.e. $\mathbb{E}_x^{f^{\ast}}[W_n(f^{\ast})\mathbf{1}[W_n(f^{\ast}) \gt c]] \to 0$ as $c\to\infty$ uniformly in n; note that $W_n(f ^{\ast}) \ge 0$ , and see, for instance, Theorem 7.6.6, p. 300 of [Reference Ash1], or Theorem 16.14, p. 230 of [Reference Billingsley4]. Since $\|h_{f,g(f),w}-h_{f^{\ast},g(f^{\ast}),w}\|$ is finite, by Lemma 5(iii), the relation $0 \le V_{n\land T_w} \le W_n(f^{\ast})\text{e}^{|\lambda|\,\|h_{f,g(f),w}-h_{f^{\ast},g(f^{\ast}),w}\|}$ obtained from the above display immediately implies that $\{V_{n\land T_w}\}_{n\in\mathbb{N}}$ is uniformly integrable with respect to $\mathbb{P}_x^{f^{\ast}}$ . Combining this property with (55), an application of Theorem 7.5.2, p. 295 in [Reference Ash1] allows us to interchange the limit and the expectation to obtain that, for every $x\in S$ ,

\begin{align*} \text{e}^{\lambda h_{f,g,w}(x)} & = \lim_{n\to\infty}\mathbb{E}_x^{f^{\ast}}[V_{n\land T_w}] = \mathbb{E}_x^{f^{\ast}}\Big[\lim_{n\to\infty}V_{n\land T_w}\Big] \cr & = \mathbb{E}_x^{f^{\ast}}\Bigg[\exp\Bigg\{\sum_{t=0}^{T_w-1}[\lambda(C(X_t,A_t)-g(f)) - \Delta(X_t)] + \lambda h_{f,g(f),w}(w)\Bigg\}\Bigg]. \end{align*}

Setting $x = w$ , it follows that

$$ \text{e}^{\lambda h_{f,g(f),w}(w)} = \mathbb{E}_w^{f^{\ast}}\Bigg[\exp\Bigg\{\sum_{t=0}^{T_w-1}[\lambda(C(X_t,A_t)-g(f)) - \Delta(X_t)] + \lambda h_{f,g(f),w}(w)\Bigg\}\Bigg] $$

and then

\begin{align*} 1 &= \mathbb{E}_w^{f^{\ast}}\Bigg[\exp\Bigg\{\sum_{t=0}^{T_w-1}[\lambda(C(X_t,A_t)-g(f))-\Delta(X_t)]\Bigg\}\Bigg] \\ & = \mathbb{E}_w^{f^{\ast}}\Bigg[\exp\Bigg\{\lambda\sum_{t=0}^{T_w-1}(C(X_t,A_t)-g(f^{\ast}))\Bigg\}\text{e}^{-Q}\Bigg], \end{align*}

where $Q \,:\!=\, T_{w}\lambda(g(f) - g(f^{\ast})) + \sum_{t=0}^{T_w-1}\Delta(X_t) \ge 0$ , and the inequality is due to (49) and (51). Observe now that $1 = \mathbb{E}_w^{f^{\ast}}\big[\exp\big\{\lambda\sum_{t=0}^{T_w-1}(C(X_t,A_t)-g(f^{\ast}))\big\}\big]$ by Lemma 5(i), and combining this equality with the two previous expressions it follows that $1 = \mathbb{P}_w^{f^{\ast}}[Q = 0]$ , an equality that, combined with (49) and (51), implies that $\mathbb{P}_w^{f^{\ast}}\big[\sum_{t=0}^{T_w-1}\Delta(X_t) = 0\big]= 1$ .

To conclude, observe that $\sum_{t=0}^{T_w-1}\Delta(X_t) \ge \Delta(X_0) \ge 0$ and $\sum_{t=0}^{T_w-1}\Delta(X_t) \ge \Delta(x)\mathbf{1}[T_x \lt T_w] \ge 0$ , $x \in S\setminus \{w\}$ , and recall that $\mathbb{P}_w^{f^{\ast}}[T_x \lt T_w] \gt 0$ by Lemma 1(ii), whereas $1 = \mathbb{P}_w[X_0= w]$ . Combining these relations with the two previous expressions, it follows that $\Delta({\cdot}) = 0$ . As already mentioned, this completes the proof of part (i). From this point, part (ii) follows via Remark 2(iii).

6. Proof of Theorem 1

In this section the main result of the paper is established. Essentially, the approach used to prove Theorem 1 consists in ‘approximating’ the original model $\mathcal{M}$ by a sequence $\{\mathcal{M}_n\}$ of MDPs satisfying the conditions in Theorem 2. Throughout the remainder of the section Assumptions 1 and 2, as well as the condition $\lambda \lt 0$ , are enforced. The argument relies heavily on the preliminaries established in Section 5, as well as on Lemma 6. To begin with, suppose that the cost function C satisfies

(56)

\begin{equation} C( x, a ) \le 0, \qquad (x, a)\in \mathbb{K},\end{equation}

let $\{F_n\}_{n\in\mathbb{N}}$ be a sequence of subsets of S such that

(57)

\begin{equation} F_k \hbox{ is finite}, \quad F_k \subset F_{k+1}, \quad k\in\mathbb{N}, \qquad \bigcup_{k\in\mathbb{N}}F_k= S,\end{equation}

and define the truncated cost function $C_n$ by

(58)

\begin{equation} C_n(x, a) \,:\!=\, C(x, a)\mathbf{1}_{F_n}(x), \qquad (x, a) \in \mathbb{K}.\end{equation}

Now, let $f_0\in \mathbb{F}$ be a fixed stationary policy, and recall that, for each $x\in S$ , the action set A(x) is a compact subspace of the metric space A, so that there exists a sequence of $\{D_n(x)\}_{n\in\mathbb{N}}$ of subsets of A(x) such that

(59)

\begin{equation} D_k(x) \hbox{ is finite,} \quad D_k(x) \subset D_{k+1}(x), \quad k\in\mathbb{N}, \qquad D(x) \,:\!=\, \bigcup_{k\in\mathbb{N}}D_k(x) \hbox{ is dense in } A(x).\end{equation}

With this notation, for each $n\in\mathbb{N}$ define the collection $\{A_n(x)\}_{x\in S}$ of action sets as follows:

(60)

\begin{equation} A_n(x) = D_n(x), \quad x\in F_n, \qquad A_n(x) = \{f_0(x)\}, \quad x\in S\setminus F_n . \end{equation}

Next, let

(61)

\begin{equation} \mathcal{M}_n = (S, A, \{A_n(x)\}_{x\in S}, C_n, P), \qquad n\in\mathbb{N},\end{equation}

be the MDP obtained from $\mathcal{M}$ after replacing the actions sets $\{A(x)\}_{x\in S}$ and the cost function C by $\{A_n(x)\}$ and $C_n$ , respectively. It follows from (56)–(61) that, for every $n\in\mathbb{N}$ , the conditions in Theorem 2 are satisfied by $\mathcal{M}_n$ , and then there exists a stationary policy $f_n \in \prod_{x\in S}A_n(x)$ such that, given $w\in S$ , the pair

(62)

\begin{equation} (g_n, h_n({\cdot})) = (g(f_n), h_{f_n,g(f_n),w}({\cdot})) \in ({-}\infty, 0] \times \mathcal{B}(S)\end{equation}

has the following properties:

(63)

\begin{align} \text{e}^{\lambda g_n + \lambda h_n(x)} & = \sup_{a\in A_n(x)}\Bigg[\text{e}^{\lambda C_n(x,a)}\sum_{y\in S}p_{x,y}(a) \text{e}^{\lambda h_n(y)}\Bigg] = \nonumber \\ & = \text{e}^{\lambda C_n(x,f_n(x))}\sum_{y\in S}p_{x, y}(f_n(x))\text{e}^{\lambda h_n(y)}, \qquad x\in S,\end{align}

(64)

\begin{align} h_n({\cdot}) & \le M_w\|C_n- g_n\|, \qquad h_n(w) = 0, \qquad -\|C\| \le -\|C_n\| \le g_n \le 0, \end{align}

by Lemma 5 and (56). It follows that $g_n \in [{-}\|C\|, 0]$ and $h_n(x)b\in ({-}\infty, 2M_w\|C\|]$ , $x\in S$ , $n\in\mathbb{N}$ . Via Cantor’s diagonal method, it is possible to determine a subsequence of $(g_n,h_n({\cdot}))$ (by notational convenience, still denoted by $(g_n,h_n({\cdot}))$ ) which is convergent:

(65)

\begin{equation} \lim_{n\to\infty}g_n \,=\!:\, g^{\ast} \in [{-}\|C\|, 0], \qquad \lim_{n\to\infty}h_n(x) \,=\!:\, h^{\ast}(x) \in [{-}\infty, 2M_w\|C\|], \quad x\in S.\end{equation}

The following lemma, using the conditions and definitions in (56)–(65), is the last step before the proof of Theorem 1.

Lemma 6. The pair $(g^{\ast},h^{\ast}({\cdot}))$ in (65) satisfies the following properties.

(i) $h^{\ast}({\cdot})$ is a finite function, and
(66) \begin{equation} \text{e}^{\lambda g^{\ast} + \lambda h^{\ast}(x)} \ge \sup_{a\in A(x)}\Bigg[\text{e}^{\lambda C(x,a)} \sum_{y\in S}p_{x,y}(a)e^{\lambda h^{\ast}(y)}\Bigg], \qquad x\in S, \end{equation}
and then $(g^{\ast}, h^{\ast}({\cdot})) \in \tilde{\mathcal{G}}$ ; see Definition 1 and (24).
(ii) For each $x\in S$ ,
(67) \begin{equation} J_*(x) = g^{\ast} = J^{\ast}(x). \end{equation}

Proof. For (i), let $x\in S$ be arbitrary but fixed and select $n_x\in\mathbb{N}$ such that $x\in F_k$ if $k\ge{n_x}$ . Next, let $d\in D(x)$ be arbitrary, so that there exists $n_{(d)}\in \mathbb{N}$ such that $d\in D_k(x)$ for $k\ge n_{(d)}$ . It follows from (60) that $d\in A_n(x)$ for $n \ge \max\{n_x, n_{(d)}\}$ , and then (63) implies

\begin{equation*} \text{e}^{\lambda g_n + \lambda h_n(x)} \ge \text{e}^{\lambda C_n(x,d)}\sum_{y\in S}p_{x,y}(d)\text{e}^{\lambda h_n(y)} = \text{e}^{\lambda C(x,d)}\sum_{y\in S}p_{x,y}(d)\text{e}^{\lambda h_n(y)}, \quad n \ge \max\{n_x,n_{(d)}\}, \end{equation*}

where the equality is due to the definition of $C_n$ in (58). Taking the inferior limit as n goes to $\infty$ in this relation, (65) and Fatou’s lemma together lead to

\begin{align*} \text{e}^{\lambda g^{\ast} + \lambda h^{\ast}(x)} & \ge \text{e}^{\lambda C(x,d)}\liminf_{n\to\infty}\sum_{y\in S}p_{x,y}(d)\text{e}^{\lambda h_n(y)} \cr & \ge \text{e}^{\lambda C(x,d)}\sum_{y\in S}p_{x,y}(d)\liminf_{n\to\infty}\text{e}^{\lambda h_n(y)} \cr & = \text{e}^{\lambda C(x,d)}\sum_{y\in S}p_{x,y}(d)\text{e}^{\lambda h^{\ast}(y)}, \qquad d\in D(x). \end{align*}

Next, given $a\in A(x)$ , observe that (59) implies that there exists a sequence $\{d_k\}_{k\in\mathbb{N}}$ contained in D(x) such that $\lim_{k\to\infty}d_k = a$ ; replacing d by $d_k$ in the above display and taking the inferior limit as k goes to $\infty$ , from Fatou’s lemma and the continuity conditions in Assumption 1, it follows that

\begin{align*} \text{e}^{\lambda g^{\ast} + \lambda h^{\ast}(x)} & \ge \liminf_{k\to\infty}\text{e}^{\lambda C(x,d_k)}\sum_{y\in S}p_{x,y}(d_k)\text{e}^{\lambda h^{\ast}(y)} \\ & \ge \text{e}^{\lambda C(x,a)}\sum_{y\in S}\liminf_{k\to\infty}p_{x,y}(d_k)\text{e}^{\lambda h^{\ast}(y)} = \text{e}^{\lambda C(x,a)}\sum_{y\in S}p_{x,y}(a)\text{e}^{\lambda h^{\ast}(y)}. \end{align*}

Since $x\in S$ and $a\in A(x)$ are arbitrary, it follows that (66) holds. To complete the proof of part (i) we show that $h^{\ast}({\cdot})$ is a finite function. To achieve this goal, pick a stationary policy $f \in \mathbb{F}$ and define the sequence $\{\mathcal{S}_n\}$ of subsets of S as follows:

$$ \mathcal{S}_0 = \{w\}, \quad \mathcal{S}_k = \{y\in S \mid p_{x y}(f(x)) \gt 0 \hbox{ for some } x \in \mathcal{S}_{k-1}\}, \qquad k=1,2,3,\ldots, $$

and notice that $S = \bigcup_{k=0}^\infty \mathcal{S}_k$ , by Lemma 1(ii). Using (65), to show that $h^{\ast}({\cdot})$ is finite it is sufficient to verify that, for each $k \in \mathbb{N}$ ,

(68)

\begin{equation} h^{\ast}(x) \gt -\infty \hbox{ for } x\in \mathcal{S}_k, \end{equation}

a claim that will be proved by induction. Notice that $h^{\ast}(w) = 0$ , by (64) and (65), and then the desired conclusion holds for $k=0$ . Suppose that (68) is valid for some $k\in\mathbb{N}$ , let $\hat y\in\mathcal{S}_{k+1}$ , and pick $\hat x\in \mathcal{S}_k$ such that

(69)

\begin{equation} p_{\hat x, \hat y}(f(\hat x)) \gt 0. \end{equation}

Now observe that (66) implies that

$$ \text{e}^{\lambda g^{\ast} + \lambda h^{\ast}(\hat x)} \ge \text{e}^{\lambda C(\hat x,f(\hat x))} \sum_{y\in S}p_{\hat x,y}(f(\hat x))\text{e}^{\lambda h^{\ast}(y)} \ge \text{e}^{\lambda C(\hat x,f(\hat x))} p_{\hat x,\hat y}(f(\hat x))\text{e}^{\lambda h^{\ast}(\hat y)}. $$

Recalling that $\lambda \lt 0$ , the condition $h^{\ast}(\hat x) \gt -\infty$ implies that the left-most term in the above relation is finite, and then (69) allows us to conclude that $\text{e}^{\lambda h^{\ast}(\hat y)} \lt \infty$ , so that $h^{\ast}(\hat y) \gt -\infty$ ; since $\hat y\in \mathcal{S}_{n+1}$ is arbitrary, it follows that (68) holds with $k+1$ instead of k, concluding the induction argument. As already mentioned, this completes the proof of part (i).

For (ii), for each $n\in\mathbb{N}$ let $J^{*, n}({\cdot})$ be the superior limit optimal average cost function for model $\mathcal{M}_n$ . In this case, (62) and part (ii) of Theorem 2 yield that $g_n = J^{*,n}({\cdot})$ . Moreover, if $f_n$ is the stationary policy in (63), then Theorem 2(ii) applied to model $\mathcal{M}_n$ yields

$$ g_n = J^{*, n}(x) = \lim_{k\to\infty}\frac{1}{\lambda k} \log\Bigg(\mathbb{E}_x^{f_n}\Bigg[\exp\Bigg\{\lambda\sum_{t=0}^{k-1}C_n(X_t,A_t)\Bigg\}\Bigg]\Bigg), \qquad x \in S; $$

see (7). Observe now that $C_n \ge C$ , since the cost function is negative, so that the above display leads to

\begin{align*} g_n & \ge \limsup_{k\to\infty}\frac{1}{\lambda k} \log\Bigg(\mathbb{E}_x^{f_n}\Bigg[\exp\Bigg\{\lambda\sum_{t=0}^{k-1}C(X_t,A_t)\Bigg\}\Bigg]\Bigg) \\ & = \limsup_{k\to\infty}\frac{1}{k}J_k(f_n, x) = J(f_n, x) \ge J^{\ast}(x), \qquad x\in S; \end{align*}

see (8) and (9). From this point, (65) leads to $g^{\ast} \ge J^{\ast}({\cdot})$ . Recall now that $(g^{\ast},h^{\ast}({\cdot}))\in\tilde{\mathcal{G}}$ , by Lemma 6(ii), so that $g^{\ast}\le J_*({\cdot})$ , by Lemma 2(iii). Combining these two expressions with the inequality $J_*({\cdot}) \le J^{\ast}({\cdot})$ in (12), it follows that $J_*({\cdot}) = g^{\ast} = J^{\ast}({\cdot})$ .

The proof of the main result is finally presented below.

Proof of Theorem 1. Let $\lambda \lt 0$ be fixed, and suppose that Assumptions 1 and 2 hold. The proof has been split into two cases.

Case 1: $C(x, a)\le 0$ for every $(x, a)\in\mathbb{K}$ . Let $g^{\ast} \in \mathbb{R}$ and $h^{\ast}({\cdot})$ be as in (65). For (i), suppose that $g\in \mathcal{G}$ . In this case, $(g,h({\cdot}))\in\tilde{\mathcal{G}}$ for some function $h({\cdot})$ , and then $g \le J_*({\cdot})$ by Lemma 2(iii). Assume that $g \le J_*({\cdot})$ . Since $J_*({\cdot}) = g^{\ast}$ , by Lemma 6(ii) it follows that $g\le g^{\ast}$ , and then $\lambda g \ge \lambda g^{\ast}$ , since $\lambda$ is negative. Combining this last inequality with (66), and recalling that $h^{\ast}({\cdot})$ is a finite function (by Lemma 6(ii)), it follows that $(g,h^{\ast}) \in \tilde{\mathcal{G}}$ , and then $g \in \mathcal{G}$ ; see Definition 1 and (24).

For (ii), since $J_*({\cdot}) = g^{\ast}$ , by Lemma 6(ii), part (i) implies that $g \in \mathcal{G} \iff g \le g^{\ast}$ ; then $g^{\ast} = \sup\{g \mid g\in\mathcal{G}\}$ , and the desired conclusion follows from equality (67) established in Lemma 6(ii).

For (iii), since $h^{\ast}({\cdot})$ is finite, by Lemma 6(i), and $g^{\ast} = \sup\{g\mid g\in\mathcal{G}\}$ , by part (ii), the conclusion follows from inequality (66) established in Lemma 6(i).

For (iv), consider the model $\mathcal{M}_n$ in (61), let $(g_n, h_n({\cdot})) \in [{-}\|C\|, 0] \times \mathcal{B}(S)$ , and let $f_n\in\mathbb{F}$ be as in (62) and (63). It follows from Remark 2(iii) that $g_n = \lim_{k\to\infty}(1/k)J_k^{(n)}(f_n,x)$ , $x\in S$ , where $J_k^{(n)}(f_n,x)$ is the ( $\lambda$ -)certainty equivalent of the random cost $\sum_{t=0}^{k-1}C_n(X_t,A_t)$ with respect to $\mathbb{P}_x^{f_n}$ , so that, for each initial state x,

$$ g_n = \lim_{k\to\infty}\frac{1}{k\lambda} \log\Bigg(\mathbb{E}_x^{f_n}\Bigg[\exp\Bigg\{\lambda\sum_{t=0}^{k-1}C_n(X_t,A_t)\Bigg\}\Bigg]\Bigg). $$

We show that policy $f_n$ is $\varepsilon$ -optimal if n is large enough. To achieve this goal, using that $C\le 0$ , observe that $C \le C_n$ , by (58), and then the above display implies that

\begin{align*} g_n & \ge \limsup_{k\to\infty}\frac{1}{k\lambda} \log\Bigg(\mathbb{E}_x^{f_n}\Bigg[\exp\Bigg\{\lambda\sum_{t=0}^{k-1}C(X_t,A_t)\Bigg\}\Bigg]\Bigg) \\ & \ge \liminf_{k\to\infty}\frac{1}{k\lambda} \log\Bigg(\mathbb{E}_x^{f_n}\Bigg[\exp\Bigg\{\lambda\sum_{t=0}^{k-1}C(X_t,A_t)\Bigg\}\Bigg]\Bigg), \end{align*}

i.e. $g_n \ge J(f_n,x) \ge J_-(f_n,x) \ge J_*(x)$ , $x\in S$ , $n\in\mathbb{N}$ ; see (7)–(11). Since $J_*({\cdot}) = g^{\ast}$ , by part (iii), using that $g_n\to g^{\ast}$ as $n\to\infty$ , pick $n_*\in\mathbb{N}$ such that $g^{\ast} + \varepsilon \ge g_{n_*}$ to conclude from the above display that $g^{\ast} + \varepsilon \ge J(f_{n^{\ast}},\cdot) \ge J_-(f_{n^{\ast}},\cdot) \ge g^{\ast}$ , so that $f_{n^{\ast}}\in\mathbb{F}$ is $\varepsilon$ -optimal.

Case 2: $C \in \mathcal{B}(\mathbb{K})$ . The notation in Lemma 1 and Remark 3 will be used. For (i), consider the cost function $C - \|C\|$ , which is nonpositive. Using Remark 3, note that

\begin{align*} g\in \mathcal{G}(C) & \iff g-\|C\|\in \mathcal{G}(C-\|C\| ) \\ & \iff g-\|C\| \le J_*(C-\|C\|, \cdot) \\ & \iff g-\|C\| \le J_*(C,\cdot) - \|C\|, \end{align*}

where the second equivalence is due to part (i) in Case 1 applied to the cost function $C - \|C\|$ , and Lemma 1 was used in the last step. It follows that $g \in \mathcal{G}(C) \iff g \le J_*(C,\cdot)$ .

For (ii), an application of part (ii) of Case 1 to the cost function $C - \|C\|\le 0$ yields

$$J_*(C - \|C\|,\cdot) = \sup\{g \mid g \in \mathcal{G}(C - \|C\|)\} = J^{\ast}(C - \|C\|,\cdot),$$

a relation that, via (13) and (23), leads to $J_*(C,\cdot) = \sup\{g \mid g \in \mathcal{G}(C)\} = J^{\ast}(C,\cdot)$ .

For (iii), since $C - \|C\| \le 0$ , the third part of Case 1 yields that

$$\sup\{g \mid g \in \mathcal{G}(C - \|C\|)\} \in \mathcal{G}(C - \|C\|);$$

since $\mathcal{G}(C-\|C\|) = \mathcal{G}(C) -\|C\|$ , it follows that $\sup\{g \mid g \in \mathcal{G}(C)\} \in \mathcal{G}(C)$ .

For (iv), let $\varepsilon \gt 0$ be fixed. Applying the fourth part of Case 1 to the cost function $C - \|C\| \le 0$ , there exists $f_\varepsilon\in\mathbb{F}$ such that

\begin{align*} \sup\{g \mid g \in \mathcal{G}(C - \|C\|)\} + \varepsilon & \ge J(C-\|C\|,f_\varepsilon,\cdot) \cr & \ge J_-(C-\|C\|,f_\varepsilon,\cdot) \ge \sup\{g \mid g \in \mathcal{G}(C-\|C\|)\}, \end{align*}

a relation that, via (13) and (23), leads to $g^{\ast}+ \varepsilon \ge J(C,f_\varepsilon,\cdot) \ge J_-(C,f_\varepsilon,\cdot) \ge g^{\ast}$ , where $g^{\ast} = \sup\{g \mid g \in \mathcal{G}(C)\}$ , showing that $f_\varepsilon$ is $\varepsilon$ -optimal for the cost function C.

Remark 4. Finally, it is important to point out that the literature on risk-seeking is scarce. The problem addressed in this manuscript is interesting by itself. The case $\lambda\lt 0$ is associated with a risk-loving (risk-seeking) controller. A risk-seeking controller is willing to accept greater economic uncertainty in exchange for higher returns. Using a prospect theory approach, it has been found that people are risk-averse in the domain of gains but become risk-seeking in the domain of losses; see, for instance, [Reference Zaleskiewicz22].

7. Concluding remarks

This work has studied Markov decision chains endowed with average cost criteria. Besides standard continuity–compactness conditions, three essential assumptions determined the framework of the paper: (i) the simultaneous Doeblin condition holds, (ii) the state space is communicating under the action of each stationary policy, and (iii) the decision maker driving the system is risk-seeking. Within this context, a characterization of the optimal average cost was provided in Theorem 1, extending conclusions in [Reference Cavazos-Cadena7] obtained under the condition that the controller is risk-averse. Another interesting result stated in part (iv) of the main theorem is the existence of $\varepsilon$ -optimal stationary policies, which was obtained by truncating the cost function, instead of taking a policy that ‘almost optimizes’ the term within brackets of the optimality inequality (25). In this direction there is, at least, an interesting question to be analyzed in the future: Given $\varepsilon \gt 0$ , assume that $f\in \mathbb{F}$ is derived from the optimality inequality in such a way that action f(x) is an $\varepsilon |\lambda|$ -maximizer of the logarithm of the term within brackets in (25), i.e. $\text{e}^{\lambda(g^{\ast}+\varepsilon) + \lambda h^{\ast}(x)} \le \big[\text{e}^{\lambda C(x,f(x))}\sum_{y\in S}p_{x,y}(f(x))\text{e}^{\lambda h^{\ast}(y)}\big]$ for every $x\in S$ . With this notation, establishing that f is $\varepsilon$ -optimal is an interesting problem.

Acknowledgements

The authors are deeply grateful to the reviewers and the Associate Editor for their careful reading of the original manuscript and for their advice to improve the paper.

H. Cruz-Suárez and R. Montes-de-Oca dedicate this article to the memory of their collaborator and the co-author of the present work, Rolando Cavazos-Cadena, whose unfortunate death occurred on May 24, 2023.

Funding information

This work was partially supported by the PSF Organization under Grant No. 0116-18.

Competing interests

There were no competing interests to declare which arose during the preparation or publication process of this article.

References

Ash, R. B. (1972). Real Analysis and Probability. Academic Press, New York.Google Scholar

Balaji, S. and Meyn, S. P (2000). Multiplicative ergodicity and large deviations for an irreducible Markov chain. Stoch. Process. Appl. 90, 123–144.CrossRef Google Scholar

Bäuerle, N. and Reider, U. (2011). Markov Decision Processes with Applications to Finance. Springer, New York.CrossRef Google Scholar

Billingsley, P. (2012). Probability and Measure. Wiley, New York.Google Scholar

Borkar, V. S. and Meyn, S. P. (2002). Risk-sensitive optimal control for Markov decision process with monotone cost. Math. Operat. Res. 27, 192–209.CrossRef Google Scholar

Cavazos-Cadena, R. (2009). Solutions of the average cost optimality equation for finite Markov decision chains: Risk-sensitive and risk-neutral criteria. Math. Meth. Operat. Res. 70, 541–566.CrossRef Google Scholar

Cavazos-Cadena, R. (2018). Characterization of the optimal risk-sensitive average cost in denumerable Markov decision chains. Math. Operat. Res. 43, 1025–1050.CrossRef Google Scholar

Cavazos-Cadena, R. and Fernández-Gaucherand, E. (2002). Risk-sensitive control in communicating average Markov decision chains. In Modelling Uncertainty: An Examination of Stochastic Theory, Methods and Applications, eds. M. Dror, P. L’Ecuyer and F. Szidarovsky. Kluwer, Boston, MA, pp. 525–544.Google Scholar

Denardo, E. V. and Rothblum, U. G. (2006). A turnpike theorem for a risk-sensitive Markov decision process with stopping. SIAM J. Control Optimization 45, 414–431.CrossRef Google Scholar

Di Masi, G. B. and Stettner, L. (2000). Infinite horizon risk sensitive control of discrete time Markov processes with small risk. Systems Control Lett. 40, 305–321.Google Scholar

Di Masi, G. B. and Stettner, L. (2007). Risk-sensitive control of discrete-time Markov processes with infinite horizon. SIAM J. Control Optimization 38, 61–78.CrossRef Google Scholar

Di Masi, G. B. and Stettner, L. (2007). Infinite horizon risk sensitive control of discrete time Markov processes under minorization property. SIAM J. Control Optimization 46, 231–252.CrossRef Google Scholar

Hernández-Lerma, O. (1989). Adaptive Markov Control Processes. Springer, New York.CrossRef Google Scholar

Howard, R. A. and Matheson, J. E. (1972). Risk-sensitive Markov decision processes. Manag. Sci. 18, 356–369.CrossRef Google Scholar

Jaśkiewicz, A. (1989). Average optimality for risk sensitive control with general state space. Ann. Appl. Prob. 17, 654–675.Google Scholar

Kontoyiannis, I. and Meyn, S. P. (2013). Spectral theory and limit theorems for geometrically ergodic Markov processes. Ann. Appl. Prob. 13, 304–362.Google Scholar

Meyer, C. D. (2000). Matrix Analysis and Applied Linear Algebra. SIAM, Philadelphia.CrossRef Google Scholar

Pitera, M. and Stettner, L. (2015). Long run risk sensitive portfolio with general factors. Math. Meth. Operat. Res. 82, 265–293.Google Scholar

Puterman, M. (1994). Markov Decision Processes. Wiley, New York.CrossRef Google Scholar

Sladký, K. (2008). Growth rates and average optimality in risk-sensitive Markov decision chains. Kybernetika 44, 205–226.Google Scholar

Stettner, L. (1999). Risk sensitive portfolio optimization. Math. Meth. Operat. Res. 50, 463–474.CrossRef Google Scholar

Zaleskiewicz, T. (2001). Beyond risk seeking and risk aversion: Personality and the dual nature of economic risk taking. Europ. J. Pers. 15, S105–S122.CrossRef Google Scholar

Article contents

Characterization of the optimal average cost in Markov decision chains driven by a risk-seeking controller

Abstract

Keywords

MSC classification

1. Introduction

1.1. Notation

2. Decision model

2.1. Policies

2.2. Utility function and average criteria

2.3. Optimality equation

2.4. Assumptions for a constant average cost

2.5. The problems

3. Main result

4. Fundamental tools

5. Optimality equation in a special case

6. Proof of Theorem 1

7. Concluding remarks

Acknowledgements

Funding information

Competing interests

References

Save article to Kindle

Save article to Dropbox

Save article to Google Drive

Reply to: Submit a response

Your details

You have entered the maximum number of contributors

Conflicting interests