1. Introduction
Consider a Markov chain $ \{ X_n \} $ with homogeneous transition kernel $ P_\theta $ on a general state-space supported on $\mathbb{R} $ . Moreover, we equip $\mathbb{R}$ with the usual topology, and, if not stated otherwise, measurability is understood with respect to the Borel field. Let $ A_\theta \subset \mathbb{R} $ denote a measurable set, and let $ \tau $ denote the first entry time into $ A_\theta $ when starting the chain in $x_0$ , i.e.,
with $ X_0=x_0$ , where we set $ \tau = \infty $ if the set on the right-hand side is empty. For technical convenience we assume throughout the paper that $ x_0 \not \in A_\theta $ . We consider the problem of estimating
for some measurable cost function $ h \,:\, \mathbb{R} \mapsto \mathbb{R}$ and $ x_0 \not \in A_\theta $ . Generally speaking, the problem arises naturally in analysing a system, modelled by $ P_\theta $ , that is operating until a certain event occurs, here modelled by $ X_n $ entering $ A_\theta $ , upon which some action is taken. For example, in inventory problems, orders are placed once the stock drops below a certain value; in maintenance, preventive maintenance is performed once the age of a component exceeds some threshold; and in service control, the speed of the server may be adjusted in case observed waiting times exceed some quality threshold.
In analysing a model we are interested not only in evaluating its performance but also in improving it. Let $ \theta $ be a control parameter of the model. For example, in an inventory model $ \theta $ may be the level that triggers the replenishing of stock. Changing the order level not only alters the cost, represented by h in the model, but may also have an impact on the demand, as having a low replenishing level may attract more demand. This illustrates that $ \theta $ may simultaneously affect the boundary of $ A_\theta $ , the instantaneous cost h, and the transition probability of the underlying process. In this paper, we will derive derivative expressions for this general case of $ \theta $ influencing all aspects of the model. As we will show, the overall formulas can be translated into efficient unbiased simulation-based estimators.
With the advent of simulation-driven perturbation analysis techniques, such as perturbation analysis [Reference Cao7, Reference Fu and Hu12, Reference Fu13, Reference Glasserman14, Reference Ho and Cao24], the score function method [Reference Rubinstein and Shapiro37], and weak differentiation [Reference Pflug34], efficient unbiased estimators for the expression in (1) have been found for particular instances of the problem. More specifically, for applications of perturbation analysis to (s, S) inventory models with $ \theta $ being either threshold s, S, or the span $ S-s $ , we refer to [Reference Bashyam and Fu5, Reference Bashyam and Fu4]. Alternatively, a variant of the score function method, called the push-out score function, has been developed for this kind of threshold optimization in inventory models; see [Reference Pflug and Rubinstein35], as well as [Reference Rubinstein36] for an early reference. For a perturbation analysis of a maintenance model where $ \theta $ triggers (preventive) maintenance of system components whose age exceeds $ \theta $ , we refer to [Reference Heidergott and Farenhorst-Yuan19, Reference Heidergott16]. Finally, for applications to financial models, where $ \theta $ is the value of a barrier in some option model, we refer to [Reference Glasserman15, Reference Lyuu and Teng31, Reference Heidergott17, Reference Heidergott, Leahu and Volk-Makarewicz21].
General results for the case where $ A_\theta $ and the instantaneous cost are independent on $ \theta $ , and only the underlying stochastic process depends on $ \theta $ , are provided in [Reference Heidergott and Vázquez-Abad22, Reference Heidergott and Vázquez-Abad23]. Moreover, in the setting of Markov chains with discrete state-space, sensitivity analysis of the expected return time to a specified marked state $i^\ast $ (also called time to absorption) with respect to the entries of the Markov transition probabilities is well-studied. Expressed in our setting, in these studies, we have $ h = 1 $ , $ A = \{ i^\ast \} $ contains the marked state, and the transition probability of the random walk $ X_n $ depends on $ \theta $ . See, for example, [Reference Caswell8, Reference Caswell9] for details.
Before laying out the main steps of our approach for providing an unbiased estimator for the derivative in (1), we provide examples illustrating the relevance of the extension to the case $ A_\theta $ . Without loss of generality, we consider $ \theta $ to be scalar, as for the multidimensional case $ \theta \in \mathbb{R}^k $ , $ k > 1 $ , estimation of a gradient can be reduced to that of estimating the partial derivatives separately.
Example 1. (Service quality.) Customers arrive at a service station according to a renewal point process. The interarrival times $ \{ I_n \,:\, n \in \mathbb{N} \}$ are independent and identically distributed (i.i.d.), with density $ f_I ( x ) $ such that $ 0< E [ I_n ] <\infty$ and $\mathbb{P} (I_n=0)=0$ . Customers are served in order of arrival, and consecutive service times are i.i.d. random variables $ \{ S_n( \theta ) \,:\, n \in \mathbb{N} \}$ with density $ f_S ( x ;\, \theta ) $ , for $ \theta \in \Theta \subset \mathbb{R}$ , such that $ 0< E [ S_n ( \theta) ] <\infty$ and $\mathbb{P} (S_n ( \theta ) =0)=0$ . Interarrival times and service times are assumed to be mutually independent. Consider the process of consecutive waiting times $\{X_n \}$ , denoting the time that the corresponding customer has spent at the service station from arrival to beginning of service. The service system starts initially empty. Consecutive waiting times $ X_n $ follow the recursive relation
and $X_1= 0$ . Letting $ A_\theta = [ \theta , \infty ) $ and $ h ( X_i) \equiv 1$ , with
we obtain from (1) the sensitivity of the expected index of the first customer who experiences a waiting time exceeding $ \theta$ .
Example 2. (Investment.) For an investment strategy, a portfolio of m risky assets is considered. The value of asset i after the nth dividend date is denoted by $ Z_{ n } (i) $ , and we let $ X_n = ( Z_n ( 1 ) , \ldots , Z_n ( m ) )$ , $ n\geq 0 $ . From a risk management perspective we are interested in the expected time until the value of the basket drops below $ \theta$ :
for
The threshold $ \theta $ may be used in an investment strategy where the investor is only inclined to restructure the portfolio once the value of the basket drops below $ \theta $ .
Example 3. (Insurance.) Consider an insurance company where the nth claim $ W_n $ arrives at time $ t_n $ . For simplicity, we assume that there is a premium inflow of c per time unit. The capital at time t, denoted by K(t), is then given by
where $ \tau_ 0 = 0 $ , N(t) is the number of claims incurred up to time t, and $ K_0 $ is the initial capital. The time until ruin is given by
The capital K(t) increases between claims, and the only possible time epochs where K(t) can fall below 0 are the jump epochs of N(t). Hence, letting $ X_n = ( \eta_n , W_n , \hat K_n)$ , with $ \eta_n $ the time between the nth claim and the $ (n+1)$ th, $ \hat K_n $ the capital right after the nth claim, and $ X_0 = ( 0 , 0 , K_0)$ , ruin occurs for $ \hat K_n < 0 $ . However, the insurance company will have to act before the zero-threshold is reached; in practice, the insurance company is interested in the time until the capital drops below a certain threshold $ \theta$ . Let
denote the index of the claim causing the capital to drop below $ \theta $ , and let $ \rho_n $ denote the nth jump time of N(t); then $ \rho_{ \tau_\theta} $ yields the time at which the capital drops below $ \theta $ .
For our analysis, we take a Markov chain operator approach. In the following we focus on motivating our approach; formal definitions are postponed to Section 2.1. Specifically, let $ P_\theta $ denote the Markov kernel of $ \{ X_n \} $ . For analysing (1) we make use of the concept of a taboo kernel. More specifically, the taboo kernel of $ P_\theta $ is a truncation of $ P_\theta $ onto the complement of taboo set $A_\theta$ , defined by
for all measurable sets B. In words, the taboo kernel avoids entering the set $A_\theta$ . When it is clear with respect to which set $ A_\theta $ the taboo kernel is constructed, we simplify the notation by letting
This implies
provided the integral exits, where the transition from $ x = X_i $ to $X_{i+1} $ is governed by $ P_\theta $ . Defining powers of transition kernels in the obvious way by $ Q^n = Q^{n-1} Q $ , with $ Q^0$ denoting the identity, we have
for $ x_0 \not \in A_\theta $ and $ {\textbf{T}}^0_\theta $ being the identity operator. For technical convenience we include the indicator $ {\textbf{1}}\{ X_0 \not \in A_\theta \} $ in the indicator product; we may do this without loss of generality, thanks to our assumption that the initial state does not belong to the taboo set $ A_\theta $ .
Since $ x_0 \not \in A_\theta $ , we obtain for the random horizon summation
where
is called the potential of the taboo kernel $ {\textbf{T}}_\theta$ .
The paper is organized as follows. In Section 2 differentiation of the taboo kernel $ {\textbf{T}}_\theta $ is studied. The main result of the paper on the differentiability of the potential $ {\textbf{H}}_\theta $ is provided in Section 3. Section 4 is devoted to translating the overall formulas into gradient estimators. Section 5 addresses applications. We conclude by identifying topics of further research.
2. Differentiation of one-step taboo kernel
2.1. Transition kernels and norms
Let $ ( S , {\mathcal{T}} ) $ be a Polish measurable space, for $ S \subset \mathbb{R}$ . For ease of presentation we restrict our analysis to the one-dimensional case. Let $ {\mathcal{M}} ( S , {\mathcal{T}} ) $ denote the set of finite (signed) measures on $ ( S , {\mathcal{T}}) $ and $ {\mathcal{M}}_1 ( S , {\mathcal{T}}) $ that of probability measures on $ ( S , {{\mathcal{T}} }) $ . Let $ \eta \,:\, \mathbb{R} \rightarrow [ 1 , \infty ) $ be such that $ \inf_x \eta ( x ) =1 $ . The weighted supremum norm with respect to $ \eta $ of a cost function $\psi$ is defined by
and that of a (signed) measure $ \mu \in {\mathcal{M}} ( S , {\mathcal{T}} ) $ by
with $ \mu^+ , \mu^- $ denoting the Hahn–Jordan decomposition; see [Reference Cohn10]. Typically, $ \mu $ has Lebesgue density $ f_\mu $ and we have
Note that the property $ \eta \geq 1 $ implies
moreover,
The mapping $ P \,:\, {{\mathcal{T}} } \times S \rightarrow [0,1] $ is called a (homogeneous) transition kernel on $ ( S , {\mathcal{T}} ) $ if (i) $ P ( \cdot ;\, s ) \in {\mathcal{M}} ( S , {\mathcal{T}} ) $ for all $ s \in S $ , and (ii) $ P ( B ; ) $ is $ {\mathcal{T}} $ measurable for all $ B \in {\mathcal{T}} $ . If, in the condition (i), ${\mathcal{M}} ( S , {\mathcal{T}} ) $ can be replaced by $ {\mathcal{M}}_1 ( S , {\mathcal{T}} ) $ , then P is called a Markov kernel on $ ( S , {\mathcal{T}} ) $ . Denote the set of transition kernels on $ ( S, {\mathcal{T}} ) $ by $ {\mathcal{K}} ( S , {\mathcal{T}} ) $ and the set of Markov kernels on $ ( S , {\mathcal{T}} ) $ by $ {\mathcal{K}}_1 ( S , {\mathcal{T}}) $ . A transition kernel $ P \in {\mathcal{K}} ( S , {\mathcal{T}} ) $ with $ 0 < P (S ;\, s ) < 1 $ for at least one $ s \in S $ is called a defective Markov kernel.
Consider a family of Markov kernels $ (P_\theta \,:\, \theta \in\Theta )$ on $( S, {\mathcal{T}} )$ , with $\Theta\subset \mathbb{R}$ , and let $ {\mathcal{L}}^1 ( P_\theta ;\,\Theta )\subset\mathbb{R}^{S}$ denote the set of measurable mappings $g \,:\, S \rightarrow \mathbb{R} $ such that $ \int_{S} | g ( u )| P_\theta ( du ;\, s ) $ is finite for all $ \theta \in \Theta $ and $ s \in S $ . To simplify the notation, we set
for $ g \in {\mathcal{L}}^1 (P_\theta ;\,\Theta ) $ and $ s \in S $ .
In what follows, we let $ \Theta $ be an open neighbourhood of $ \theta_0 $ and assume that ${{\mathcal{D}}} \subset {\mathcal{L}}^1 ( P_\theta ;\, \Theta )$ .
We call $P_\theta \in {\mathcal{K}} ( S , {\mathcal{T}}) $ differentiable with respect to $ \theta $ for $ {\mathcal{D}} $ , or $ \Theta\hbox{-}{\mathcal{D}} $ -differentiable for short, if $\partial_\theta P_{\theta} \in {\mathcal{K}} ( S , {\mathcal{T}} ) $ exists such that for any $ g \in {\mathcal{D}} $ and any $ s \in S $ ,
If the left-hand side of (6) equals zero for all $ g \in {\mathcal{D}} $ , then we say that $ \partial_\theta P_\theta$ is not significant.
In the same vein, we call $P_\theta \in {\mathcal{K}} ( S , {\mathcal{T}}) $ differentiable with respect to x for $ {\mathcal{D}} $ , or X- $ {\mathcal{D}} $ -differentiable for short, if $ \partial_x P_{\theta} \in {\mathcal{K}} ( S , {\mathcal{T}} ) $ exists such that for any $ g \in {\mathcal{D}} $ and any $ s \in S $ ,
If the left-hand side of (7) equals zero for all $ g \in {\mathcal{D}} $ , then we say that $ \partial_x P_\theta$ is not significant.
Assume that $ P_\theta $ has density $ f_\theta ( y ;\, x ) $ , i.e., $ P_\theta ( B ;\, x ) = \int_B f_\theta ( y ;\, x ) dy $ for all $ x \in S $ and $ B \in {\mathcal{T}}$ , and suppose that the partial derivatives of $ f_\theta ( y ;\, x ) $ with respect to x and $ \theta $ , denoted respectively by $ \partial_x f_\theta ( y ;\, x ) $ and $ \partial_\theta f_\theta ( y ;\, x ) $ , both exist. Provided that for all $ x, y \in S $ it holds that $ \partial_y f_\theta ( y ;\, x ) , \partial_x\, f_\theta ( y ;\, x ) , \partial_\theta f_\theta ( y ;\, x ) > 0 $ implies $ f_\theta ( y ;\, x ) > 0 $ , we let
and
The mapping $ {\rm SF}_X ( y ;\, x ) $ is called the X-score function acting on the conditioning state (and is explained in more detail in Remark 1 below), the mapping $ {\rm SF}_Y ( y ;\, x ) $ is called the Y-score function acting on the resulting state, and $ {\rm SF}_\Theta ( y ;\, x ) $ is called the $ \Theta$ -score function acting on the system parameter. Under appropriate smoothness conditions,
and
We revisit the sojourn time example put forward in Example 1. Denote the transition kernel of the waiting time chain by $ P_\theta $ ; i.e.,
for $ x \geq 0 $ and $ B \subset ( 0 , \infty ) $ a (Borel) measurable set, or, more formally,
and otherwise, for $ B \in [ 0 , \infty )$ with $ 0 \in B $ ,
Inserting $ B = (0 , y ] $ and differentiating with respect to y, we obtain the density of the continuous part of the transition kernel on $ ( 0 ,\infty ) $ as
$ y > 0 $ , where we assume that $ f_I ( 0 ) = 0 $ , and
where $ \delta_z\, f_I ( x+s-y ) $ is shorthand notation for the derivative of $ f_I ( z ) $ with respect to z evaluated at $ z = x + s - y $ . If, for example, interarrival times are exponentially distributed with rate $ \lambda$ , i.e., $ f_I ( x ) = \lambda e^{ - \lambda x }$ , $x \geq 0 $ , then, by $ \delta_x\, f_I ( x ) = - \lambda f_I ( x ) $ ,
so that in this case
Following the same line of argument, we have that
Furthermore, assume that $ S ( \theta ) $ follows a log-normal distribution with location parameter $ \theta $ and scale parameter $ \sigma$ :
It is well known that the log-normal distribution provides a good fit to the length of calls in a service centre; see, for example, [Reference Brown6]. Then it is easily checked that
and it follows straightforwardly that
The above expression cannot be applied in a simulation setting as such. In the following we will therefore rewrite $ \,{\rm SF}_\Theta ( y ;\, x ) $ by conditioning on the interarrival times. More specifically, let z denote the interarrival time and write $ f ( y ;\, x , z ) $ for the density of $ X_{n+1} $ given $X_n=x $ and $ I_{n+1} =z $ ; then
and
When evaluating $ \,{\rm SF}_\Theta $ via simulation, we first observe $ X_n $ , $ X_{n+1} $ , $ I_{n+1} $ , and $ S_n ( \theta ) $ , and we then apply differentiation conditional on $ I_{n+1}$ , which gives
In words, we introduce a sub-Markov chain, where, for given $ X_n $ , first the interarrival time is sampled, giving $ X_n - I_{n+1} $ , and then in a subsequent step the service time is sampled and added.
The $ \eta$ -norm is extended to transition kernels through the operator norm; that is, for $ Q \in {\mathcal{K}} ( S , {\mathcal{T}} ) $ we have
Note that $ || I ||_\eta = 1 $ for I being the identity operator. By algebra,
for all x. In the following we let $ \eta_\alpha ( x ) = \alpha^{ |x |} $ , for $ x \in \mathbb{R}$ , and for $ x \in [ 0 , \infty ) $ it suffices to let $ \eta ( x ) = \alpha^x $ . For a polynomial function $\psi$ , we have $\|\psi\|_{\eta_\alpha}<\infty$ for any $\alpha>1$ . In slight abuse of notation, we will simplify $ || \cdot ||_{\eta_\alpha } $ to $ || \cdot ||_\alpha $ .
Example 4. Let $ A_\theta = [ 0 , \theta ] $ and consider the taboo kernel $ {\textbf{T}}_\theta = [ P_\theta ]_{ [ 0 , \theta ] }$ , with $ P_\theta $ the transition kernel of the waiting times introduced in Example 1. We denote the stability region of the queue by $ \Theta = \{ \theta >0 \,:\, \mathbb{E} [ S ( \theta ) - I ] < 0 \} $ . Then
For $ [ \theta_1 , \theta_2 ] \subset \Theta $ , there exists $ \alpha^\ast $ such that $ \mathbb{E} \big[ \alpha^{ S ( \theta ) - I} \big] $ has negative right-hand-side derivative at $ \alpha = 1 $ and is strictly convex as a mapping in $ \alpha $ on $ [ 1 , \alpha^\ast ) $ , uniformly on $ [ \theta_1 , \theta_2 ]$ . This shows that there exists $ \bar \alpha > 1 $ such that
For details we refer to [Reference Kartashov26, Reference Leahu29].
Note that for given $ S ( \theta ) $ and I, we obtain from the above argument that the expected accumulated cost exists and is finite for all h such that $ || h ||_\alpha < \infty $ , i.e., for all cost functions h that are bounded by $ c \bar \alpha^x $ or some finite constant c. Indeed, letting $ \tau $ denote the first time a waiting time becomes smaller than $ \theta $ , for $ h \geq 0 $ and $ 1 \leq \alpha \leq \bar \alpha $ we obtain, from $ || {\textbf{T}}_\theta||_\alpha < 1 $ together with $ || {\textbf{T}}_\theta^0 ||_\alpha = || {\textbf{T}}_\theta ||_\alpha ^0 = 1 $ and $ || T^i_\theta ||_\alpha \leq || T_\theta ||_\alpha ^i $ for $ i \geq 1 $ , that
provided that $ || h ||_\alpha $ is finite.
2.2. Differentiability of the taboo kernel
We call a taboo kernel $ {\mathcal{D}} $ -differentiable if a linear operator $ \big ( [P_\theta ]_{{A}_{\theta} }\big )^{\prime} $ exists such that for any $ h \in {\mathcal{D}} $ it holds that
If the above holds, then $ \big( [P_\theta ]_{{A}_{\theta} }\big)^{\prime}$ is the $ \Theta \hbox{-}{{\mathcal{D}}}$ -derivative of $ {\textbf{T}}_\theta $ and we let $ {\textbf{T}}^{\prime}_\theta \,:\!=\, \big( [P_\theta ]_{{A}_{\theta} }\big)^{\prime} $ .
Notice that the derivative of the taboo kernel $ [P_\theta ]_{{A}_{\theta}} $ cannot be straightforwardly obtained in a pathwise sense through differentiating the density as the support of the taboo kernel $A_{\theta}$ is parametrized, which introduces discontinuities. In the same way, the derivative of the taboo kernel $ [P_\theta ]_{{A}_{\theta}}$ cannot be obtained by simply taking the taboo kernel of the measure-valued derivative of $ P_\theta$ , as it is in general not true that $ [ \partial_\theta P_\theta ]_{A_\theta} $ is the derivative of $ [ P_\theta ]_{A_\theta} $ ; in formula, $ [\partial_\theta P_\theta ]_{{A}_{\theta}} \not = \big ( [P_\theta ]_{{A}_{\theta}} \big )^{\prime}$ .
For the analysis provided in this section we require $ A_\theta $ to depend on $ \theta $ in a smooth way. This is expressed in the following.
Assumption 1. (Boundary parametrization.) There exists a boundary mapping $ g_\theta ( x )$ , such that
-
• for some set $ \Omega $ , independent of $ \theta $ , we have
\begin{equation*}x \in A_\theta^c \quad { if\ and\ only\ if } \quad g_{\theta} (x ) \in \Omega ;\end{equation*} -
• the partial derivatives of $ g_\theta$ with respect to $ \theta $ and x exist;
-
• the set $ \{ x \,:\, \partial_x g_\theta ( x ) = 0 \} $ has Lebesgue measure zero, so that
\begin{align*}\delta_\theta(x) = - \frac{\partial_{\theta} g_\theta(x) }{ \partial_x g_\theta(x)}\end{align*}is a differentiable and Lebesgue-almost-surely well-defined mapping.
As an illustration of Assumption 1, note that the stopping criterion in (3) is formalized by $ \Omega = [ 0 , 1 ) $ and
In this paper, we show that ${\textbf{T}}^{\prime}_\theta $ can be defined via cost functions $ \psi $ from an appropriate class of functions, to be specified later, so that
where
Suppose that $ P_\theta$ is differentiable in an appropriate way, e.g., weakly differentiable [Reference Heidergott, Hordijk and Weisshaupt20, Reference Heidergott and Vázquez-Abad23], with derivative $ P^{\prime}_\theta $ ; then
would be the first part of the estimator representing the derivative of $ P_\theta $ on the complement of the taboo set. In order to get a derivative for $ {\textbf{T}}_\theta $ rather than just $ P_\theta$ , we need an additional term. This compensator collects the state-space derivatives; if we denote this operator by $ {\textbf{D}}$ , Equation (11) reads
where
and
The kernel $ {\textbf{D}}_\theta $ lives on the complement of the taboo set $ A_\theta$ . We provide an explicit form of $ {\textbf{D}}_\theta $ in the following example.
Example 5. Consider the taboo set $ A_\theta = ( \theta , \infty)$ ; then we may let $ \Omega = [ 0 , 1 ]$ with $ g_\theta ( x ) = x / \theta $ , for $ x \geq 0 $ . For $ \psi ( y ) = y^p $ , for some $ p \geq 1$ , we obtain
Because of the inner differentiation $ \partial y $ under the integral in $ {\textbf{D}}_\theta $ , the operator $ {\textbf{D}}_\theta $ splits naturally into two parts. Indeed, writing $ {\textbf{D}}_\theta $ in explicit form, we obtain the following for $ {\textbf{T}}^{\prime}_\theta $ :
In words, the derivative of $ {\textbf{T}}_\theta $ splits into three parts: (i) the first part assesses the sensitivity of the Markov chain outside the taboo set $ A_\theta $ ; (ii) the second part measures the knock-on effect of continuing the process on states belonging to the boundary of $ A_\theta$ , where the unperturbed process otherwise would have stopped; and (iii) the third part measures the effect perturbing $ \theta $ has on the performance at the next step reached from the state x, where the perturbation is weighted by the impact of $ \theta $ on the boundary. We explain the meaning of the second part in more detail in the following remark.
Remark 1. Let $ \tilde \psi $ be some cost function and denote the one-step-ahead expected value by
Observe that under appropriate smoothness conditions
which shows that differentiation of the cost function $ \psi $ can be captured by differentiation of the transition density, provided that $ \psi $ itself can be interpreted as a conditional expectation. This is exactly what happens when products of Markov kernel operators are differentiated. In the following we provide a more general discussion of this phenomenon. Note that, for $ n \geq 1 $ ,
is the conditional expectation of the reward after n transitions of the taboo chain. Interchanging differentiation and integration leads to
Using the X-score function, this derivative can easily be evaluated as follows:
In words, the differentiation in the part $ {\textbf{T}}^{\prime}_2 $ of the derivative representation in (12) relates to differentiating a conditional expectation with respect to the conditioning value, which can be dealt with by using the X-score function.
Differentiating a Markov chain with taboo set $ A_\theta $ is related to the differentiation of a distribution with $ \theta$ -dependent support. In the following, we compare the application of our approach to the case of a distribution with $ \theta$ -dependent support.
Example 6. Let $ A_\theta = ( \theta, 1 ] $ , for $0 < \theta < 1 $ , and $ f_\theta ( y ;\, x ) = 1 $ on [ 0, 1 ] and zero otherwise. The taboo kernel can now be written as the defective uniform distribution
which is clearly the expected value with respect to the uniform distribution on $ [ 0 , \theta ] $ scaled by $ \theta $ . This is a classical example that cannot be handled by the score function, as the density $ 1_{ [ 0 , \theta ] } ( y ) $ fails to be Lipschitz continuous; here it is well known that the derivative is given by
provided that $ \psi $ is continuous.
Let $ g_\theta ( y )= y / \theta $ , $ \Omega = [ 0 , 1 ] $ , and $ f_\theta ( y ;\, x ) = 1_{ [ 0 , 1]} $ . Noticing that $ \partial_\theta f_\theta ( y ;\, x ) = 0 = \partial_y f_\theta ( y ;\, x ) $ and
we obtain from (12) (where we set the score functions to zero)
For example, in the case of $ \psi ( y ) = y^p $ , for some $ p >1 $ , it is easily seen that
which indeed yields the true derivative.
The uniform distribution on $ [ 0 , \theta ] $ can be retrieved by
Note that, for $ \psi $ continuous, it holds that
for all x, where the right-hand side is an expression for the measure-valued derivative of the uniform distribution. Comparing the above measure-valued derivative with our new one,
obtained via (14), it is apparent that the resulting estimators are different. However, both estimators have as a common part
The difference lies in the fact that the (positive part of the) measure-valued derivative concentrates mass on the boundary of the taboo set, whereas our new estimator leads to integration over the complement of the taboo set. Our approach requires $ \psi $ to be differentiable, whereas for measure-valued differentiation only continuity is required.
That our approach requires stronger conditions on the performance function than classical measure-valued differentiation stems from the fact that we develop a differentiation theory for Markov chains (in contrast to the differentiation of unconditional measures), which requires capturing the knock-on effect of a perturbation on the future development of the Markov chain; see Remark 1.
In the following we provide the theoretical development for the estimator in (11). We denote the Euclidean distance between y and $\bar{y}$ by $ \lambda(y,\bar{y}) $ and define the distance from y to a set A by $ \lambda(y,A ) = \inf \{ \lambda(y,\bar{y}) \,:\, \bar y \in A \} $ . Note that $ \lambda(y,A ) = 0$ in the case $ y \in A $ . Recall that $ \Omega $ denotes the base set from which $ A^c_\theta $ is obtained via $ A^c_\theta = \{ x \,:\, g_\theta ( x ) \in \Omega \} $ ; see Assumption 1. We now define $ \epsilon$ sets around the boundary $\partial \Omega=\overline{\Omega}\setminus \Omega^{o}$ through
and
We let
where, by definition, $\chi_{\epsilon}(y)=0$ for $y\in \Omega_{-\epsilon}$ , $\chi_{\epsilon}(y)=1$ for $y\in \Omega_{\epsilon}^c$ , and $\|\chi_{\epsilon}\|_{\infty}\leq 1$ (where $||\psi||_{\infty}\,:\!=\, \sup_{y\in\mathbb{R}}|\psi(y)|$ for $ \psi\,:\, \mathbb{R} \mapsto \mathbb{R} $ ). It is easy to see that $\chi_{\epsilon}(y)$ converges to ${\textbf{1}}\{y\in\Omega\}$ , except for $ y \in \partial \Omega $ , as $ \epsilon $ tends to zero. Let
where $\phi_{\epsilon}$ is the density of a centred normal distribution with variance $\epsilon^2$ , i.e.,
In words, $ \bar{\chi}_{\epsilon}(y) $ is a smoothed version of $ {\chi}_{\epsilon}(y) $ using a standard normal density as mollifier. Smoothness of the normal density implies
The following lemma summarizes properties of $ \bar{\chi}_{\epsilon} $ .
Lemma 1. The functions $\{ \bar{\chi}_{\epsilon} \}$ are smooth and satisfy
Proof. The smoothness of $\bar{\chi}_{\epsilon}(y)$ comes directly from the construction of the convolution, and $||\bar{\chi}_{\epsilon}||_{\infty}\leq || \chi_{\epsilon}||_{\infty}\leq 1$ , which straightforwardly leads to the smoothness and boundedness of $\bar{\chi}_{\epsilon}$ . In addition,
which leads to the conclusion.
We introduce the following transition kernels:
and
Note that $ {\textbf{U}} $ will be used to establish bounds needed for applying the dominated convergence theorem. The kernels $ {{\widetilde{\textbf{P}}}_\theta } $ , $ {{\widehat{\textbf{P}}}_\theta } $ , and $ {{\bar{\textbf{P}}}_\theta } $ represent different parts of the gradient expression.
Assumption 2. (Boundary integration.) Assume the following:
-
(i) The sets $\partial A_\theta=\overline{A}_\theta\setminus A_\theta^{o}$ and $ \mathcal{N}_\theta = \{ x\in\mathbb{R}\,:\,\partial_x g_{\theta}(x)=0 \}$ have Lebesgue measure zero. Moreover, we assume that $ \mathcal{N}_\theta $ is measurable for all $ \theta $ .
-
(ii) It holds for $ \alpha $ that
\begin{equation*}\lim_{y\to \pm \infty}\alpha^{2|y|}f_\theta(y;\,x)=0 .\end{equation*} -
(iii) The operator $ {\textbf{U}} $ has bounded norm $ \|{\textbf{U}}\|_{\alpha^2}<\infty $ .
-
(iv) The Lebesgue measure $\mathscr{L}\,({\cdot})$ of $A^\epsilon_\theta\setminus A_\theta$ converges uniformly to zero, i.e.,
\begin{equation*}\lim_{\epsilon\to0}\sup_{\theta\in\Theta}\mathscr{L}\,\big(A^\epsilon_\theta\setminus A_\theta\big)=0,\end{equation*}where\begin{equation*}A^{\epsilon}_{\theta}\,:\!=\, \{ x\in \mathbb{R}\,:\, \: g_\theta(x)\in \Omega_{\epsilon} \} \: .\end{equation*} -
(v) We have
\begin{align*}\sup_{\theta \in \Theta} || {\rm SF }_\Theta ||_\alpha &<\infty , \qquad\sup_{\theta \in \Theta} || {\rm SF }_X ||_\alpha<\infty , \\\sup_{\theta \in \Theta} || \nu_\theta ||_\alpha &<\infty , \qquad\sup_{\theta \in \Theta} || \partial_\theta g_\theta ||_\alpha<\infty, \qquad\sup_{\theta \in \Theta} || \delta_\theta ||_\alpha<\infty .\end{align*}
The following results are needed to prove unbiasedness of the derivative estimator. The technical conditions put forward in Assumption 2 guarantee that the mass of the set effected by changing the boundary of $ A_\theta $ is sufficiently smooth; see the conditions (i) and (iv). The remaining conditions establish that all objects needed for the analysis are uniformly (in $ \theta $ ) well-defined in the norm sense. In the case when the process $ X_n $ exhibits some monotonicity in $ \theta $ , these conditions reduce to assuming that the norms are well-defined for the ‘worst’ choice of $ \theta $ .
Lemma 2. Under the conditions (iii)–(v) in Assumption 2, we have
and
Proof. By definition, (v) implies that for some finite constant c,
This implies
by (iii). The rest of the results in the first part of the conclusion can be proved similarly. By definition, for some finite constant c,
Then,
From the conditions (iii) and (iv),
by the absolute continuity of Lebesgue integral [Reference Rudin39]. The rest of the results in the second part of the conclusion can be proved similarly.
Recall the definition of the v-norm $ || \cdot ||_\alpha $ in (23). We denote the set of differentiable mappings with finite $ || \cdot ||_\alpha $ -norm and finite derivative $ || \cdot ||_\alpha $ -norm by
The following theorem establishes our main differentiability result with respect to a single-step transition.
Theorem 1. Under Assumption 1 and Assumption 2, it holds for all $ h \in {\mathcal{D}} $ that
with ${\textbf{T}}^{\prime}_\theta $ as in (11), or, equivalently,
Proof. First, we have for any h in $ {{\mathcal{D}}} $ that
where the interchange of differentiation and integration in the second equality is justified by the dominated convergence theorem, using that
and
which are justified by Lemma 1 and Assumption 1. By the chain rule, we solve for $ \bar{\chi}^{\prime}_{\epsilon}(g_{\theta}(y)) $ from
which implies that for every $y\in \mathbb{R}\setminus \mathcal{N}_\theta $ ,
The value of the function on $\mathcal{N}_\theta $ , which has zero Lebesgue measure by Assumption 2, does not affect the value of the integral. Inserting (17) into (16) yields
Letting
and
and applying the formula for integration by parts, $ - \int u^{\prime} v = - u v + \int u v^{\prime}$ , to the right-hand side of (18) yields
By Assumption 1,
To summarize,
where the integrability of h ′ stems from the assumption that $ h^{\prime} \in {\mathcal{D}} $ . To justify the interchange of derivative and $ \epsilon$ -limit, we have
where
and
Summarizing the above analysis, we have
which is equivalent to
This proves the theorem.
Remark 2. The theory justifying the interchange of limit, derivative, and integral can be found in [Reference Rudin38, Reference Rudin39]. If h is a constant, the estimator in the theorem goes back to the generalized likelihood ratio method in [Reference Peng, Fu, Hu and Heidergott32].
3. Differentiation of the potential kernel
In this section, we derive the measure-valued derivative of the taboo kernel in an n-step transition, which is used to establish the product rule for the differentiation operation defined by (11).
Theorem 2. Under Assumption 1 and Assumption 2, if $ || P_\theta ||_ \alpha < \infty $ , then it holds for all $ h \in {\mathcal{D}} $ that
Proof. For $ k \geq 1 $ we have
where the interchange of differentiation and integration is justified similarly as in Theorem 1 by noticing
see (10). Then,
By Assumption 2,
As for the term $b_2$ ,
We now bound $ [a]+ [b_2]$ , where we take $ [b_2]$ as on the right-hand side above,
where
and
Following the line of argument put forward in the proof (see (19)), we arrive at
The case of $k=0$ has already been treated in Theorem 1.
By algebra,
There exists a $\Delta_0>0$ small enough so that for every $|\Delta|<\Delta_0$ , it holds that $\theta+\Delta\in\Theta$ . For $|\Delta|<\Delta_0$ , there exists $\zeta\in\Theta$ such that
by the mean value theorem. Notice that for $ k \geq 1 $
By the dominated convergence theorem,
which leads to the conclusion.
In order for $( {\textbf{H}_\theta } h) ( x_0 ) $ (the accumulated cost up to the time of hitting the set $ A_\theta $ ) to exist, the process $X_n$ needs to exhibit some kind of drift towards $ A_\theta $ . This is well known from the theory of geometric ergodicity, where the drift towards a recurrent set or state is used to construct a taboo kernel. We refer to [Reference Kartashov26] for details and to Example 4 for an illustration of this principle. In words, $ || {\textbf{T}}_\theta ||_\alpha < 1 $ means that the set $ A_\theta $ carries enough mass of $ P_\theta $ so that removing $ A_\theta $ from the state-space reduces the norm of $ P_\theta $ to less than one.
Theorem 3. Under Assumption 1 and Assumption 2, if
then it holds for all $ h \in {\mathcal{D}} $ that
or, equivalently, $ {\textbf{H}}_\theta $ is $ {\mathcal{D}}$ -differentiable with derivative
Proof. By Assumption 2 we have
Letting
it follows from (10) and the assumption that $ \rho <1 $ that
The proof of the proposition above is a simple application of Theorem 2 and the dominated convergence theorem.
In the case $ A_\theta = A $ , Theorem 3 recovers the known results for measure-valued differentiation and the score function; see [Reference Heidergott and Vázquez-Abad22] and the discussion therein.
4. Estimation procedures
In this section we translate the gradient expression put forward in Theorem 3 into simulation estimators. Let
where x denotes the initial value of the individual transition operators $ {\textbf{T}}^{\prime}_\theta {\textbf{T}}_\theta^k$ .
Suppose that we evaluate $ \big( {\textbf{T}}^{\prime}_\theta {\textbf{T}}_\theta^k \psi\big) ( X_i ) $ at $ X_i \not \in A_\theta$ . This kernel represents the event that we start in $ X_i $ and evaluate the expected value of $ \psi$ along a path that does not enter the taboo set, where the first transition of the path is governed by the derivative kernel $ {\textbf{T}}^{\prime}_\theta $ . Elaborating on (12), we can replace $ {\textbf{T}}^{\prime}_\theta $ by $ {\textbf{T}}_\theta $ with appropriate scaling via the score functions and the derivatives of $ \delta_\theta $ and $ \psi $ . This leads to the following estimator for $ \big( {\textbf{T}}^{\prime}_\theta {\textbf{T}}_\theta^k \psi\big) ( X_i ) $ at $ X_i \not \in A_\theta$ :
where
and, for $k=0$ , we can estimate $ {\textbf{T}}^{\prime}_\theta {\textbf{T}}_\theta^0 \psi ( X_i ) = {\textbf{T}}^{\prime}_\theta \psi ( X_i ) = \mathbb{E} [ \phi_\psi ( X_i ;\, X_{i+1} ) | X_i ]$ , for $ X_i \not \in A_\theta $ , via
Consider a cycle $ \{ X_0 , X_1 , \ldots , X_{\tau ( x ) -1 }\} $ of the Markov chain, with $ X_0 = x \not \in A_\theta $ , where we have expanded the notation for the first entrance time into $ A_\theta $ , denoted so far by $ \tau $ , by explicitly indicating the initial state. Starting the count from 0, the cycle length is $ \tau ( x ) $ . As the transitions from $ X_0 $ to $ X_1 $ up to the transition from $ X_{\tau_{(x ) }-2 } $ to $ X_{ {\tau ( x )} -1 } $ do not enter the taboo set, they may be considered as having been sampled via $ {\textbf{T}}_\theta $ , whereas the transition from $ X_{ \tau ( x ) -1 } $ to $ X_{ \tau (x)} $ cannot be considered as having been driven by $ {\textbf{T}}_\theta $ , since with this transition the Markov chain enters the taboo set. This shows that there are $ \tau ( x ) -1 $ taboo kernel transitions in a cycle of length $ \tau ( x ) $ , and the indices of the states from which a taboo kernel transition is executed run from 0 to $ \tau ( x ) - 2 $ . Following the line of argument put forward in (5),
for $ x \not \in A_\theta $ , where we set the sum to zero in the case where the lower bound is larger than the upper bound (which happens for realizations where $ \tau ( x ) = 1 $ ).
Treating the outer sum in the same way, we obtain from Theorem 3, with $ x_0 \not \in A_\theta $ ,
Simulation offers the opportunity to first sample the time horizon $ \tau ( x ) $ . Then, re-initiating the random number stream, we can exactly repeat the stochastic experiment. Given $ \tau ( x )$ , we can sample $ \sigma $ independently from everything else uniformly distributed on $ \{ 0 , 1 , \ldots ,$ $ \tau( x)-2 \} $ . With this definition we can apply the randomization to the outer $ \tau (x_0)$ sum in (22) and obtain the following alternative version of the estimator:
Notice that the above estimator contains several likelihood ratio terms, namely, ${\rm SF}_\Theta ( y ;\, x ) $ , $ {\rm SF}_X ( y ;\, x ) $ as a term of $\nu_\theta$ , and ${\rm SF}_Y ( y ;\, x ) $ , which may lead to a large variance of the derivative estimator. Following the standard approach in measure-valued differentiation [Reference Heidergott, Hordijk and Weisshaupt20, Reference Heidergott18, Reference Heidergott and Vázquez-Abad23, Reference Pflug33, Reference Pflug34], we may apply the Hahn–Jordan decomposition,
where
by noticing that
The Hahn–Jordan decomposition is often not easy to sample from; typically one can find decompositions that are easier to sample from (see [Reference Heidergott, Hordijk and Weisshaupt20]).
We turn to the state derivative, i.e., the Y -score function $ {\rm SF}_Y ( y ;\, x ) $ . We obtain that
where
by noticing
The estimator also requires differentiation of the conditional expectation of the future costs with respect to the initial value captured by $ {\rm SF}_X ( y ;\, x ) $ . For this we construct the decomposition
where
by noticing that
Elaborating on the above decompositions, the score functions in the estimator can be replaced by appropriate splittings of the sample paths, where the positive and negative parts are given by the Markov kernels corresponding to the above densities. We refer to the literature on measure-valued differentiation for details.
Following the framework provided in [Reference Peng, Fu, Hu and Heidergott32], the results presented in this paper can be extended to the multidimensional case, where $ S \subset \mathbb{R}^d $ , for $ d > 1 $ . Specifically, the weight supremum norm is extended to $ \mathbb{R}^d $ by letting
5. Applications
Consider the following simple capital flow model for an unlimited liability company. Suppose at the nth period a company receives a random profit of $W_n$ . If the company has a positive capital at the nth period, then (i) $rX_n$ is given to the shareholder as a dividend, and (ii) the remaining capital is topped up by $ M > 0 $ , and the excess of capital $\max\!( r X_n - M , 0 ) $ is extracted; otherwise, the company receives $-r X_n$ capital from the shareholders, for some $ r \in ( 0 ,1 ) $ . The company has credit to borrow money up to $\theta\geq 0$ , and if the company’s capital is below $-\theta$ , then it is ruined and no longer has any debt obligations. Then, with $ X_0 = x $ ,
and
for $ X_n \leq M $ and M otherwise. The cash flow of the company is
where $ X_0 = x_0 $ gives the initial value. Let $W_n$ , $n\in\mathbb{N}$ , be a sequence of i.i.d. random variables that follow a normal distribution with mean $\mu=0.1$ and variance $\sigma^2=1$ , and set $x_0=0.3$ , $r=0.3$ , and $\theta=0$ . Let $ P_\theta $ denote the transition kernel of the Markov chain $ X_n $ . Then $ (P_\theta) ( dy , x ) $ has a transition density
on $ ({-} \infty , M ] $ and a point mass
We have $\Omega=({-} \infty , 0 ] $ , $g(x;\,\theta)=x+\theta$ , $ A_\theta = ({-} \infty , - \theta ] $ , and ${\textbf{T}}_\theta = [ P_\theta ]_{ ({-} \infty , - \theta ] }$ .
We are interested in estimating the sensitivity of the total cash flow until ruin,
We have
and letting $ \alpha = 1 $ (which we may do as $ |X_n | \leq M $ is bounded for $ | \theta | < M$ and thus $ \psi ( X_n ) $ also is) gives
for Z a standard normal random variable. It follows that
for any $ \theta \leq 0 $ . Obviously, Assumptions 1 and 2 are satisfied, and by Theorem 3 we obtain
For ease of analysis, we let $ M = \infty $ in the following. This is justified as $ r X_n > M$ is a rare event for the numerical setting of the example, which can be easily justified numerically. Note that $\delta_\theta(x)=-1$ , and
After some algebra, we obtain for the unbiased estimator
We plot the total cash flow and the expected time to ruin in Figure 1. We compare the performance of our estimator to that of a finite-difference estimator (FD) defined by
see [Reference L’Ecuyer and Perron30]. We denote the FD with perturbation size $0.1$ by FD $(0.1)$ , and the FD with perturbation size $0.01$ by FD $(0.01)$ . The means and standard errors of the three derivative estimators are reported in Table 1, where we show results for $ r =0.3 $ and additionally for $ r =0.9 $ , denoting by n the number of independent experiments used for the estimators.
In Table 1, we can see that the estimator in this paper leads to the lowest variance, while FD suffers from the bias–variance dilemma. In addition, the computation time of our estimator is about half that of FD, because the former only needs to simulate one sample path, while the latter needs to simulate two sample paths. The key to applying the common random numbers (CRN) technique is to synchronize the simulations of the two sample paths [Reference Law and Kelton28]. For the random horizon problem, it is not clear how to efficiently synchronize simulation of the sample paths under the perturbed parameter. Thus, it may not be easy to implement CRN to reduce the variance of FD. In this example, there is no need to calculate the Hahn–Jordan decomposition, but we note that one could split the random variables as follows: $\hat{c}=-1/\sqrt{2}$ , $\bar{c}=(1-r)/\sqrt{2\pi}$ , and
where Wei(a, b) denotes a Weibull-distributed random variable with the density
6. Conclusion
In this paper, we have developed an approach for differentiating random horizon problems, where the parameter of interest may affect the stopping criterion as well as the transition dynamics of the underlying stochastic process. We discuss the multidimensional case and various implementations of the obtained derivative expressions.
Acknowledgements
The authors express their gratitude to Professor Michael Fu for fruitful discussions of an earlier draft of the paper, and to anonymous reviewers and the associate editor for their constructive comments.
Competing interests
There were no competing interests to declare which arose during the preparation or publication process for this article.