1. Introduction
Many scenarios in mining, finance, telecommunications, medical research, and other fields involve a situation where the reward collected from a resource fluctuates over time in a stochastic manner. For example, the yield obtained from mineral resource extraction varies as digging persists; the returns from financial investments vary based on many random market effects; the communication bit rate of communication channels varies based on physical conditions; and the findings in medical trials vary over time. Such a theme of randomly varying rewards, or a randomly modulated reward rate, reappears in multiple application areas. Hence designing, managing, and controlling such uncertain situations has been a focal point of stochastic operations research in a variety of contexts. The theory of restless bandits presents one general paradigm for dealing with such problems. See, for example, Gittins et al. [Reference Gittins, Glazebrook and Weber8], Whittle [Reference Whittle21], and Weber and Weiss [Reference Weber and Weiss20] for the general restless bandits problem. The main theme in such research is the efficient selection of channels/projects/arms over time.
In this paper, we add to the body of literature by considering problems with switching costs. Switching costs have been considered in Agrawal et al. [Reference Agrawal, Hegde and Teneketzis3], Banks and Sundaram [Reference Banks and Sundaram6], and Dusonchet and Hongler [Reference Dusonchet and Hongler7] in the context of multi-armed bandit processes, but to the best of our knowledge have not been studied as we do here. The general problem which we discuss is one in which a resource yields random rewards over time, with some known average reward rate and a maximal reward rate. One way to improve performance is to introduce additional independent instances of this resource and to consider a situation where at any time we use a resource of our (dynamic) choice. By doing so, we are potentially able to increase the obtained reward rate from the average reward rate towards the maximal reward rate.
As alluded to above, dynamic resource allocation problems span a variety of applications. However, one key area in which decisions often need to be made at the millisecond or second timescale is communication networks. Specifically dynamic resource allocation problems dealing with opportunistic scheduling as in Aalto et al. [Reference Aalto, Lassila and Taboada1] and further in [Reference Aalto, Lassila and Taboada2] have recently attracted attention. In such a setting, wireless cellular systems utilize random channel quality variations to minimize the flow-level holding costs. Some of this research has utilized the restless bandits formulation, as in Jacko and Villar [Reference Jacko and Villar9] (and further in [Reference Aalto, Lassila and Taboada1,Reference Aalto, Lassila and Taboada2] ). Related is the continuous time paradigm with impulsive control as in Ayesta et al. [Reference Ayesta, Gupta and Verloop5]. Switching costs often need to be taken into account and this is the key contribution of our current paper since the aforementioned works did not take switching costs directly into account.
Other related work in the context of communication networks is in the domain of handover problems. See, for example, the survey Wang et al. [Reference Wang, Chen, Aryafar and Chiang18] as well as Mezzavilla et al. [Reference Mezzavilla, Goyal, Panwar, Rangan and Zorzi17] where general problems dealing with handover and their solution via Markov Decision Processes (MDPs) are formulated. A key aspect of this domain is the potential lack of information with regard to channel quality. This introduces interesting problems and has attracted much work for which a survey describing contributions up to 2015 is in Kuhn and Nazarathy [Reference Kuhn and Nazarathy11]. In more recent years, several additional notable works in which channel information is not fully available are Kaza et al. [Reference Kaza, Meshram, Mehta and Merchant10], Larrnaaga et al. [Reference Larrañaga, Assaad, Destounis and Paschos12], and further in Larrañaga et al. [Reference Larrnaaga, Ayesta and Verloop13]. See also recent work dealing with minimizing the age of information as Maatouk et al. [Reference Maatouk, Kriouile, Assad and Ephremides15] and Meshram and Kaza [Reference Meshram and Kaza16].
As an abstract example of a communication link, assume a link that yields an average bit rate of $4$ Mbit/s and a maximal bit rate of $10$ Mbit/s where the actual instantaneous bit rate varies stochastically over time and achieves the maximal bit rate for random finite durations. By introducing multiple instances of such a link and allowing the system to switch between the instances without constraint, we are able to get an effective average bit rate that exceeds $4$ Mbit/s and potentially nears the maximal bit rate of $10$ Mbit/s. The key is clearly to use the right instance of the channel at “the right time,” so as to on-average use channels that do better than the average bit rate. At the extreme positive case, by introducing an increasing number of instances and assuming their stochastic behavior is independent between instances, we can get arbitrarily close to the $10$ Mbit/s bit rate. The other extreme is not switching between channels at all and settling for the average bit rate of $4$ Mbit/s, obtained from a single channel.
While the usage of such redundant channels can be of benefit, it is clearly not without cost. First, there are the structural costs of setting up additional channels. The exact nature of these costs depends on the application and is not our focus. Then there are costs associated with setting up systems for gaining information about the instantaneous state of all channels. Finally, there are dynamic instantaneous switching costs involved when switching between channels. In this paper, we study the tradeoffs and costs associated with this problem using the simplest example model that we could consider. The elegance of our simple model is that it captures the value of real-time information and at the same time allows us to compare how the addition of more channels to the system increases rewards. Real systems are bound to be more complex than the model which we present; however, the results from our model capture the essential tradeoffs that one can expect.
After reparameterization, our model is that each of $n$ channels yields instantaneous rewards of either $0$ or $1$ where switching between the reward states is according to a two-state continuous time Markov chain. The long-term average reward obtained by a channel is $\gamma \in (0,1)$ and the instantaneous cost of switching between channels is $c$. We distinguish between a case of full observation where the state of all channels is observable, and a case of partial observation where only the state of the current channel is available. For each of these cases, we suggest parameterized channel selection policies, and we are able to analyze and optimize the parameters of these policies in certain situations.
An illustration of the performance of such a model for the case of $\gamma = 0.4$ is in Figure 1. There are two threshold values at $\gamma =0.4$ and at $\gamma ^{2} = 0.16$. We see that in the full observation case, when $c > \gamma$, it is not worthwhile to use additional channels and otherwise the total rewards increases in an affine manner as $c \to 0$. Further, we see that in the partial observation case, usage of redundant channels is only worthwhile if $c < \gamma ^{2}$. In this case, the total reward rate increases in a nonlinear manner as $c \to 0$. In this case, we compare two policies that are described in the sequel, namely call-gapping and cool-off. While we do not have optimality proofs for these policies within the class of all policies, we show that these policies agree in performance for $n=2$ and $n \to \infty$. We further obtain explicit results for the case of $n=2$ and at $n=\infty$. Note that in all cases, as the number of channels increases, the achievable reward rate increases towards the maximal reward for small switching costs $c$. Our results in this paper help understand the tradeoffs between the number of channels, switching costs, information, and policy qualitatively.
The remainder of the paper is structured as follows: In Section 2, we present the model and summarize the main results. In Section 3, we further study the full observation case by considering simple steady state results, the Whittle Index, and the optimality. In Section 4, we derive results associated with the case of partial observations. In Section 5, we present numerical results, and finally, we conclude in Section 6.
2. Model and summary of results
We consider $n$ statistically identical channels $i=1,2,\ldots,n$, with state $X_i(t) \in \{0,1\}$, evolving independently according to a continuous-time, time-homogenous, two-state Markov chain model with generator
where $\lambda, \mu >0$.
At any given time, the controller may only use one of the channels. The controller's choice is indicated through $U(t) \in \{1,\ldots,n\}$, where $U(t)=i$ means that channel $i$ is being used at time $t$. There is a fixed switching cost of $\kappa \geq 0$ for every instant at which $U(t)$ is changed; that is, every time $t$ where $U(t^{-}) \neq U(t)$, an additional cost of $\kappa$ is incurred.
When $X_i(t) = x \in \{0,1\}$, the reward rate of that channel is denoted by $r(x)$, a function assumed identical for all channels. The goal of the controller is then to maximize the average reward,
where $N(t)$ denotes the number of channel changes during the time interval $[0,t]$.
By observing that
w.l.o.g., we are able to set $r(0)=0$ and $r(1)=\zeta >0$. Further, by rescaling time by a factor of $\eta = \lambda ^{-1}$ and setting $\zeta /\eta$ to be $1$, we parameterize the channel model only by $\gamma = \lambda /(\lambda +\mu )$ (the stationary probability of a channel being in state 1) and $c = \kappa /\eta$.
In what follows, we write $I(t) = X_{U(t)}(t)$, which denotes the state of the selected channel at time $t$. Then, the average reward $g$ corresponding to the parametrization just outlined is expressed as:
where $\widetilde {I}(t) = I(\eta t)$ and $\widetilde {N}(t) = N(\eta t)$.
Hence our model parameters are $\gamma \in (0,1)$ and $c \ge 0$ with a rescaled generator
We distinguish two situations:
Full Observation (Case I): The controller fully observes $\{X_1(t),\ldots,X_n(t)\}$.
Partial Observations (Case II): When $U(t) = i$ the controller only observes $X_i(t)$.
In considering this partial observation case, it is useful to consider belief states for $i=1,\ldots,n$, via
and in constructing the belief states, it is useful to denote the last time in channel $i$ via,
This allows the construction of the belief state of channel $i$ via
where $p(t ;\, x) := \mathbb {P}( X_i(t) = 1 \, | \, X_i(0) =x )$ which can be represented explicitly as
Observe that if $U(t) = i$ then ${\mathcal {T}}_i(t) = t$ and then $\omega _i(t) = X_i(t)$ which is either $0$ or $1$.
Control policies: We consider several policies all of which limit channel switching to times $t$ when $X_i(t) = 0$. This means that in the partial observation case (II), the belief state at any time $t$ for $i \neq U(t)$ is strictly monotonically increasing in time according to the first expression in (1). In both case I and case II, one trivial policy denoted $\pi ^{(s)}$ is the policy of never switching channel. Further, in case I, there is the policy $\pi ^{(a)}$ which switches to an arbitrary channel $j \neq U(t)$ if $X_j(t) = 1$ and $X_{U(t)} = 0$. That is, this policy ensures that the current channel is in state $1$ if such a channel exists, and further this policy does not switch excessively if not needed.
In case II, in addition to the policy $\pi ^{(s)}$, we consider a call-gapping policy, $\pi ^{(\tau )}$, and a cool-off policy $\pi ^{(\sigma )}$. In brief, call-gapping does not switch out of a channel prior to a call-gapping duration, parameter $\tau > 0$. Such a policy was analyzed in a different context in Lin and Ross [Reference Lin and Ross14] where it was termed call-gapping. Further, in brief, cool-off does not switch into a channel prior to a cooling-off period, parameter $\sigma > 0$.
Like all our policies, call-gapping does not switch out of a good channel; however, if the current channel $U(t)$ is in bad state, i.e. $X_{U(t)} = 0$ then this policy switches to the channel with the highest belief state as long as the time since the last switch is not less than $\tau$. That is, at time $t$ denote the last switching time via,
then the call-gapping policy switches only if $X_{U(t)} = 0$ and $t- {\mathcal {T}}^{\ell }(t) \ge \tau$. Note that in our context, since the channels are homogenous and $\omega _i(t)$ is monotonically increasing, when the call-gapping policy yields a switch then it is to a channel that was not used for the longest duration. This naturally yields round-robin behavior as w.l.o.g., we switch from channel $i$ to channel $i+1$ (modulo $n$).
As with the call-gapping policy, the cool-off policy does not switch out of a good channel and once switching occurs we switch with a round-robin manner, i.e. to the channel that has been visited longest ago. However, in contrast to the call-gapping policy, we choose to switch into a channel only if it has not been visited for a time that is at least the cool-off parameter $\sigma$. This essentially means that the cool-off policy considers the belief states of all channels and switches into a channel $i$ only if $U(t) = 0$ and $\omega _i(t)$ exceeds the threshold $p(\sigma ~;~0)$. Note that unlike the call-gapping policy, the cool-off policy may potentially incur multiple instantaneous switches. Also note that when $n=2$, the call-gapping and cool-off policies are identical. Both the call-gapping and cool-off policies ensure a strict upper limit on the switching costs, $c \widetilde {N}(t)/t$, as with these policies, almost surely,
Our results for these policies are as follows. First for case I, the performance of the system under $\pi ^{(s)}$ and $\pi ^{(a)}$ is tractable, for any value of $n$. This also allows us to characterize, when each of these policies is preferable as a function of the switching cost $c$ and the parameter $\gamma$.
In contrast, for case II, analysis is more difficult. When $n=2$, we are able to characterize the optimal call-gapping parameter $\tau ^{*}$. For finite $n > 2$, we have not found such an analytic characterization. However, we can consider systems with large $n$. For this, we construct an approximate model, that we label with $n = \infty$. Such a system has only two channels, say $i=1$ and $i=2$. When $U(t) = 1$, then $X_2(t)$ is assumed to be in the steady state. Similarly when $U(t) = 2$ then $X_1(t)$ is assumed to be in the steady state. That is, the current channel $X_{U(t)}$ behaves normally; however, at the moment of switching to the other channel, the state of that channel is drawn from the steady state distribution $[(1-\gamma )\enspace \gamma ]$, and the evolution continuous. Under a round-robin-based policy such as our $\pi ^{(\tau )}$ and $\pi ^{(\sigma )}$, with large $n$, the $n = \infty$ model approximates the system because we can expect the next channel in the round-robin to be at approximate steady state.
The nature of our analytic results is in comparing the different policies (and their parameters) for Case I, and Case II, and for different values of $n$. That is, for Case I, we determine when $\pi ^{(s)}$ is preferable to $\pi ^{(a)}$ and vice versa. We also determine the value of the long-term reward $g$. Similarly, for Case II (under $n=2$ and $n=\infty$), we compare $\pi ^{(\tau )}$, $\pi ^{(\sigma )}$, and $\pi ^{(s)}$, determine the optimal parameters for $\tau$ (or $\sigma$), and obtain expressions for $g$.
The following two theorems are proved in Sections 3 and 4, respectively.
Theorem 2.1. In Case I, the optimal choice between $\pi ^{(a)}$ and $\pi ^{(s)}$ is given by:
Further, under $\pi ^{*}$, the optimal expected average reward is given by:
Moreover, for $n=2$, the policy $\pi ^{*}$ is an optimal policy.
We also conjecture that in Case I, $\pi ^{*}$ is the optimal policy for any $n$. However, we are only able to prove this using computer algebra systems (such as Mathematica) for small $n$ by doing symbolic computations. Additionally, as we show in Section 3, $\pi ^{*}$ also corresponds to the Whittle Index-based policy.
A similar partial result is found in Case II: for any $n$, if $c > \gamma ^{2}$ then the optimal policy is $\pi ^{(s)}$ when compared against $\pi ^{(\tau )}$ and $\pi ^{(\sigma )}$. We now have,
Theorem 2.2. For Case II, we consider two scenarios. Firstly, restrict the policy space to be the set of call-gapping policies and denote the optimal policy therein by $\pi ^{*}_C$. Secondly, restrict the policy space to be the set of all cool-off policies and denote the optimal policy therein by $\pi ^{*}_D$. Then we have for $n \geq 2$:
When $c < \gamma ^{2}$ and $n = 2$, the optimal call-gapping time, $\tau ^{*}$ is the unique non-negative solution of the equation
The optimal cool-off level is given by $\sigma ^{*}=\tau ^{*}$. When $c < \gamma ^{2}$, $n = \infty$,
Further with $\pi ^{*}_C$ or $\pi ^{*}_D$, the optimal expected average reward is given by:
where,
Note that the expression for $g^{*}_C$ or $g^{*}_D$ as in Theorem 2.2 can be used to evaluate the performance for $n=2$ for any value of $\tau$ (alt. $\sigma$), by replacing $\tau ^{*}$ in the expression with $\tau$ (alt. $\sigma$). Also note that empirical evidence, see Figure 1, suggests that when $2 < n <\infty$ the optimal cool-off policy has a better expected average reward than the optimal call-gapping policy.
3. On the Whittle index and optimality for case I
We now prove Theorem 2.1 in two parts; first the expression for $g^{*}$ is derived by using a continuous-time birth and death process. Then we prove that $\pi ^{*}$ is the optimal policy for $n=2$, describe symbolic computations for proving it is optimal for higher order $n$, and conjecture it is optimal for any $n$. We then move on to the Whittle index associated with Case I.
Proof of correctness of g* expression for Theorem 2.1. We consider the process $\{M(t), \, t \ge 0\}$ on the state space $\{0,1,\ldots,n\}$ where,
The fact that the channels are independent and satisfy the same probability law, implies that $M(t)$ is a birth and death continuous time Markov chain, with birth and death rates
for $i \in \{0,\ldots,n-1\}$ and $i \in \{1,\ldots,n\}$, respectively. Such a birth and death process is known as an Ehrenfest model with binomial stationary distribution,
Under the policy $\pi ^{(a)}$, a reward rate of $1$ is accrued at times $t$ during which $M(t) > 0$. Further, switching costs are incurred at certain transition times of $M(t)$. Specifically at times $t$ when $M(t^{-}) = 0$ and $M(t) = 1$, a switching cost $c$ is incurred with probability $(n-1)/n$. This is the probability that the channel that changes to ‘good’ is not the currently selected channel and therefore an immediate channel change takes place. Similarly, at times $t$ when $M(t^{-}) = k \ge 2$ and $M(t) = k-1$, a cost of $c$ is incurred with probability $1/k$. This is the probability of the current channel turning ‘bad’ and hence requiring a switch. Now considering the reward as a Markov reward process, we have that average under policy $\pi ^{(a)}$ is,
where the second line follows from the structure of the birth and death rates (3). Further with $\pi ^{(s)}$, we have that $g$ is trivially $\gamma$. Hence $\pi ^{(a)}$ is preferable only when $c \le \gamma$.
To consider the optimality of $\pi ^{*}$, the processes $\{X_i(t)\}_{i=1}^{n}$ and $U(t)$ can be transformed to $\widetilde {W}(t)$ and $\widetilde {S}(t)$ where $\widetilde {W}(t)= X_{U(t)}(t)$ is the state of the current channel and,
is the number of other channels that are in the good state. The process
can then be represented as part of a continuous time MDP with singular controls, similar to the framework used in Ayesta et al. [Reference Ayesta, Gupta and Verloop5]. The state space is $\{0,1\} \times \{0,\ldots,n-1\}$. The action process $\{\widetilde {A}(t)\}$ takes values in the action space $\{0,1\}$ where $\widetilde {A}(t) = 0$ indicates staying on a channel at time $t$. Further at singular time points, $t$ in which $\widetilde {A}(t)=1$, an instantaneous channel switch is made (and immediately at $t^{+}$, $\widetilde {A}(t^{+}) = 0$) . In the construction of this process, we assume that if a switch is made and there is another channel available of a different state to $\widetilde {W}(t)$, then we switch into such an arbitrary channel and this implies state change. Otherwise (in the border cases in which a switch is made and there is not a different channel), the state is not changed.
To construct the continuous time MDP with singular controls, we use $I^{(a)}(w,s)$ as an indicator if transitions out of state $(\widetilde {W}(t),\widetilde {S}(t))=(w,s)$ are singular or not for action $a \in \{0,1\}$. If $I^{(a)}(w,s) = 0$ then we define transition intensities, $\widetilde {q}^{(a)}((w,s),\cdot )$. Alternatively if $I^{(a)}(w,s) = 1$ then there are transition probabilities, $\widetilde {p}^{(a)}((w,s),\cdot )$. In our case, it is simply that $I^{(a)}(w,s) = a$ and hence for action $a=0$ (do not switch), we have transition intensities and for action $a=1$ (switch), we have transition probabilities. These are constructed via,
where
and $\widetilde {q}^{(0)}((w,s),\cdot ) = 0$ otherwise. Further,
and $\widetilde {p}^{(1)}((w,s),\cdot ) = 0$ otherwise.
This continuous time MDP with singular controls can be converted to a continuous time MDP without singular controls using the same approach as in Ayesta et al. [Reference Ayesta, Gupta and Verloop5]. Further, we can uniformize with a uniformization rate of $n/\gamma$ to obtain a standard discrete time MDP with the same state space and action space. In this MDP, we denote the state process via
and the action sequence denoted $\{A(n)\}$ takes values in $\{0,1\}$. The transition probabilities for this MDP are,
where,
and ${p}^{(a)}((w,s),\cdot ) = 0$ otherwise. Note that the transition above marked via $(*)$ only occurs for systems with $n\ge 3$.
The rewards of this uniformized MDP are,
We are now able to prove that the optimal policy for this MDP is $\pi ^{*}$ in the case of $n=2$.
Proof of optimality of π* for n = 2 in Theorem 2.1. We carry out policy evaluation for this MDP under $\pi ^{*}$ and this yields a relative value function, $V^{*}: \{0,1\} \times \{0,1\} \to {\mathbb {R}}$, interpreted as,
where,
Note that this MDP is a uni-chain and thus $\overline {g}^{*}_{(w,s)}$ does not depend on the initial state and can be denoted as $\overline {g}^{*}$. Now we carry out explicit policy evaluation (under the policy $\pi ^{*}$) where we constrain $V^{*}(0,0) = 0$. This is done separately for $c<\gamma$ (where $\pi ^{*} = \pi ^{(a)}$) and $c \ge \gamma$ (where $\pi ^{*} = \pi ^{(s)}$). The resulting expressions for the relative value function and gains are,
and
Note that as expected, taking $n=2$, we have that $\overline {g}^{*} = ({\gamma }/{n}) g^{*}$, where $g^{*}$ is as in Theorem 2.1.
We now consider the Bellman equation for the MDP,
where $\bf {1}$ is a vector of ones, $V$ and $r^{(\cdot )}$ are the appropriate vectors and $P^{(\cdot )}$ is the appropriate transition probability matrix (using the lexicographic ordering of the state space). Using $V=V^{*}$ from above, expressions for the vector difference $R_1-R_0$ are as follows:
and
The respective signs of these entries are $(+,-,+,+)$ for $c< \gamma$ and $(+,+,+,+)$ for $c \ge \gamma$. These signs agree with $\pi ^{*}$ meaning that the only case for switching is in state $(0,1)$ when $c < \gamma$. Hence, $V^{*}$ is indeed an optimal solution of (4) because it was computed for $\pi ^{*}$ which agrees with these signs. Hence $\pi ^{*}$ is an optimal policy.
We note that using computer algebra systems (Mathematica in our case), we are able to explicitly carry out policy evaluation and obtain expressions for $V^{*}$ for $n=2,3,4,\ldots,8$ in both the $c <\gamma$ and $c \ge \gamma$ cases.Footnote 1 In all these cases, we can also compute $R_1-R_0$ and see that the signs of the entries agree with $\pi ^{*}$ and hence $\pi ^{*}$ is optimal. Note that these computations also involve finding an expression for the gain, $\overline {g}^{*}$. In all cases, we obtain $\overline {g}^{*} = ({\gamma }/{n}) g^{*}$ ($g^{*}$ is as in Theorem 2.1), similarly to the case $n=2$ in the proof above .
Hence for $n \in \{2,\ldots,8\}$, we essentially have a proof of optimality (in addition to the explicit proof written above for $n=2$). However, we were not able to distill explicit expressions for $V^{*}$ using a simple formula and hence we do not have a proof for arbitrary $n$. Nevertheless, we conjecture that $\pi ^{*}$ is optimal in Case I for any $n$. See the paper's associated GitHub repository Wang et al. [Reference Wang, Nazarathy and Taimre19] for the associated Mathematica code.
3.1. The Whittle index
It is also interesting to consider the system as if controlled by the Whittle index (see [Reference Gittins, Glazebrook and Weber8] for an overview). As this is a continuous time system with impulsive controls, we use a setup similar to Ayesta et al. [Reference Ayesta, Gupta and Verloop5].
In general, an index-based policy operates as follows. There is an index function ${\mathcal {I}}: {\chi } \to {\mathbb {R}}$, where ${\chi }$ is the state space of the channel. The index-based policy continuously monitors the index of each channel, say ${\mathcal {I}}_i$ for $i=1,\ldots,n$ by evaluating ${\mathcal {I}}(\cdot )$ on the channel state. It then ensures that the chosen channel is the channel $i$ with the highest index ${\mathcal {I}}_i$. In our continuous time setup, we assume that the policy only switches channels if necessary. That is, if there is a tie, there is not a switch.
The Whittle index is one such index function. A key component for the evaluation of the Whittle index is the one-arm subsidy problem, parameterized via $\nu \in {\mathbb {R}}$. In this problem, instead of considering the full system of $n$ channels we focus on a single channel and assume the controller has a choice between two actions: $a=0$ which means not using the channel and receiving a subsidy at rate $\nu$, or $a=1$ which means using the channel.
In the one-arm subsidy problem for our case, the state space is
where $x$ and $u$ are indicators of whether the channel is in a good state and whether the channel is in use, respectively.
Since we are dealing with impulsive controls, similarly to the $n$ channel continuous time impulsive MDP above, we use $I^{(a)}(x,u)$ as indicators for impulsive control. Here, in the one-arm subsidy problem, $I^{(a)}(x,u) = \mathbb {1}_{\{a \neq u\}}$. With these, we have transition rates (for state-action pairs without impulsive controls),
and $\widetilde {q}^{(\cdot )}((x,u),\cdot ) = 0$ otherwise for all other state-action pairs without impulsive control. Further, the transition probabilities (for state-action pairs with impulsive control) are,
and $\widetilde {p}^{(\cdot )}((x,u),\cdot ) = 0$ otherwise for all other state-action pairs with impulsive control.
In such a one-arm subsidy problem, the continuous time reward is taken as
Further in adapting the one-arm subsidy problem to our general control problem, we set instantaneous rewards at transitions where $I^{(a)}(x,u) =1$ at a value of $-c/2$. Here, the factor of $1/2$ captures the fact that a one-arm subsidy problem is not a real problem being used but rather an abstraction of a single channel vs. all channels together. In the real system, every channel switch would result in a cost of $c$ (reward of $-c$) and then the one-arm subsidy problem of the current channel “becomes” the problem of the next channel.
The continuous time MDP with singular controls of the one-arm subsidy problem can now be converted to a continuous time MDP without singular controls using the same approach as Ayesta et al. [Reference Ayesta, Gupta and Verloop5] and our analysis above. Further, similar to the MDP above, we can uniformize, now using a uniformization rate of $1/\gamma$. This yields a standard discrete time MDP with the same state space and action space. The transition probabilities for this uniformized MDP have for every $(x,u) \in \chi$,
and $p^{(\cdot )}((x,u),\cdot ) = 0$ otherwise. Further, the rewards are now,
Observe that the first term captures the uniformized continuous time reward and the second term captures the impulsive costs.
Now for any fixed $\nu \in {\mathbb {R}}$, consider the Bellman equation for the relative value function $V^{\nu }(x,u)$,
where as previously, $\bf {1}$ is a vector of ones, $V^{\nu }$ and $r^{(\cdot )}$ are the appropriate $4$-vectors and $P^{(\cdot )}$ is the appropriate transition probability matrix (using the lexicographic ordering of the state space) using (6), and $g^{\nu }$ is the (scalar) long-term average reward for the problem parameterized by $\nu$.
It can now be verified that the Whittle index for $c < \gamma$ is:
And further, the Whittle index for $c \ge \gamma$ is:
That is, for each state $(x,u) \in \chi$, $\nu (x,u)$ in (8) or (9) defines a Bellman equation (7) which has a solution in which $R_0^{\nu }(x,u) = R_1^{\nu }(x,u)$ (again using the lexicographic ordering for the vectors $R_0^{\nu }$ and $R_1^{\nu }$). This can be verified, also by noting that the associated value function solutions $(V^{\nu },g^{\nu })$ of (7) are as follows for $c < \gamma$:
And when $c \ge \gamma$ these are,
With these Whittle index expressions at hand, we now observe that a Whittle index-based policy agrees with $\pi ^{*}$ of Theorem 2.1.
Proposition 1. Consider Case I, and assume the system is controlled via a Whittle index policy, which ensures that at any time $t$ the chosen channel is $U(t)$ with the highest index as given by (8) for $c < \gamma$ or (9) for $c \ge \gamma$, and performs switching only when necessary. Then the system operating under this policy is equivalent to a system operating under $\pi ^{*}$ of Theorem 2.1.
Proof. Observe that both in the $c < \gamma$ and $c \ge \gamma$ cases,
Hence the system controlled by a Whittle index policy never switches out of a good channel $(X_{U(t)}(t) = 1)$.
Further, at time $t$ if $X_{U(t^{-})}(t^{-}) = 1$ and then $X_{U(t)}(t) = 0$ (the state of the current channel turns from the good state to the bad state), then if for all the other channels $j \neq U(t)$, $X_j(t) = 0$, then the index-based policy chooses to stay with $U(t)$, since both in the $c < \gamma$ and $c \ge \gamma$ cases, $\nu (0,0) < \nu (0,1)$.
Conversely in such a situation, if there is at least one good channel in the system, say $i \neq U(t)$ with $X_i(t) = 1$, then the index-based policy chooses to switch to channel $i$ (or any other good channel) only if $c < \gamma$, and otherwise stays. This is because, $\nu (0,1) < \nu (1,0)$ only if $c < \gamma$ and $\nu (0,1) = \nu (1,0)$ when $c \ge \gamma$.
4. Renewal reward analysis for case II
In this section, we prove Theorem 2.2 via a regenerative analysis for $n=2$ and later for $n=\infty$. We consider the regenerative structure based on regeneration points where switching has just occurred into a bad channel ($=0$). That is, take the time-axis and consider time points at which the controller switched (from a bad channel) into a bad channel. These are regeneration points which we denote by $T_0=0, T_1, T_2,\ldots$. We can assume that at time $t=0$ the system starts at such a point because we are looking at the infinite horizon average cost. Note that between $T_{k-1}$ and $T_k$ it is possible that there were other switches—namely switches into a channel that is in a good state ($=1$).
The net reward obtained during the interval $(T_{k-1},\,T_k]$, is given by
That is the total time during the interval when the channel was in a good state less the switching costs.
From the regenerative property of the process, it turns out that the sequence $(T_{k}-T_{k-1},\, V_k )$ is an i.i.d. sequence. We denote $W$ as a generic random variable distributed as $T_{k}-T_{k-1}$ and $V$ as a generic random variable distributed as $V_k$. It then follows from the “Renewal Reward Theorem,” see for example Theorem 1.2, Chap VI [Reference Asmussen4], that the average reward is given by
Note that by construction $W$ is non-lattice and both $V$ and $W$ have a finite mean. Hence our goal is now to compute the expectation of $V$ and $W$ under a call gaping policy with parameter $\tau$. For $n=2$ or $n=\infty$ this is equivalent to the cool-off policy with $\sigma = \tau$.
We now construct two generic random variables via their probability distributions that come to aid. We denote these as $W_0$ and $W_1$ and their CDFs by $F_{W_i}(t)$ for $i=0,1$. We have,
These random variables are mixtures of a mass at $\tau$ and a shifted exponential. The random variable $W_i$ denotes the time of switching under a policy with call-gapping parameter $\tau$ when at time $0$ a switch occurred into a channel in state $i$. It is constructed by observing that if at time $t=\tau$ the state is $0$ we switch with probability $1-p(\tau ;\, i)$; otherwise we wait an exponentially distributed duration until switching. Note that,
With the generic random variables $W_0$ and $W_1$ at hand, we can analyze a complete regenerative cycle of duration $W$. For this denote,
with $\widetilde {W}_i$ denoting the inter-switching times within the cycle. Here $M$ is a random variable with support $0,1,2,\ldots$, denoting the number of times we switch into a good channel within such a cycle. It then holds by construction that,
This follows since our regeneration points are the time points at which the controller switched from a bad channel into a bad channel. Then each cycle starts with a duration distributed as $W_0$. It is then followed by $M$ additional durations which are switches from bad states to good states, each distributed as $W_1$.
The following two lemmas yield explicit expressions for the denominator and numerator of (10).
Lemma 4.1. The denominator of (10) can be represented as,
Proof. Observe that,
where the indicator, $\widetilde {I}_i$, for $i=0,1,2,\ldots$ is equal to $1$ if the jump at the end of the interval associated with $\widetilde {W}_i$ was to a good channel. Note that the random variables $\widetilde {I}_i$ for $i=1,2,\ldots$ are identically distributed and $\widetilde {I}_0$ follows a different distribution. We can now denote generic random variables, $I_0$ and $I_1$ satisfying
For these two random variables, by conditioning on $W_i$ (for $i=0,1$), it holds that
where the second step follows because every switch is out of a bad channel and the third step follows from (11).
Observe that the sequence $\{\widetilde {I}_i\}$ is a mutually independent sequence of random variables and for each $i$, $\widetilde {I}_i$ is independent of $\widetilde {W}_j$ for all $j \neq i$. Based on the fact that ${ \mathbb {E}}[\widetilde {W}_i]$ is same as ${ \mathbb {E}}[W_1]$ for $i=1,2,\ldots$, we have
After manipulation using (1), (12), and (13), the result follows.
Lemma 4.2. The numerator, ${ \mathbb {E}}[V]$ of (10) can be represented as
where,
and,
Proof. The proof follows similar lines to the previous proof. We have that $V = R - c , (M+1)$ with $M$ as defined above and the reward $R$ being the net time during $[0,W]$ in which a good channel is selected. As with the analysis of the cycle duration, $W$, we denote,
where $\widetilde {R}_i$ is the reward accumulated over the period corresponding to $\widetilde {W}_i$. Similar to the previous analysis, we construct generic random variables $R_0$ and $R_1$ with
The expectations of these can be computed to be
Using $\widetilde {I}_i$ as in the previous proof, we observe,
where again for any $i$, $\widetilde {I}_i$ is independent of $\widetilde {R}_j$ for $j \neq i$. Now a similar geometric series to (14) is applied. The expression for $M$ is computed in a similar manner by observing
We can now prove Theorem 2.2 for $n=2$.
Proof. Combining and manipulating the explicit expressions from the lemmas above we obtain that under a policy $\pi ^{(\tau ^{*})}$ (alt. $\pi ^{(\sigma ^{*})}$), the reward as in (10) as a function of $\tau$ (alt. $\sigma$) is,
where $A_i(\cdot,\cdot )$, $i=1,2,3$ are as defined in Theorem 2.2. We consider first the case $c \ge \gamma ^{2}$. Observe that at $c = \gamma ^{2}$,
and hence the $\pi ^{(s)}$ policy is preferable to the $\pi ^{(\tau )}$ policy for any finite $\tau$ (note that as $\tau \to \infty$ the inequality above becomes an equality). Further $g(\tau )$ is monotonically decreasing in $c$ and hence for $c > \gamma ^{2}$ it remains optimal to use $\pi ^{(s)}$.
Moving to the $c < \gamma ^{2}$ case, we optimize the (continuous and differentiable) function $g(\tau )$ over $(0,\infty )$ to obtain Eq. (2) from Theorem 2.2. We do this by representing the derivative as $g'(\tau ) = h(\tau ) f(\tau )$ where,
and,
It can be shown that for $c < \gamma ^{2}$ the derivative $f'(\tau ) < 0$. Now since $h(\cdot )>0$ we have that, $g'(\tau ) = 0$ if and only if $f(\tau ) = 0$ and Eq. (2) is $f(\tau ^{*}) = 0$. Since $f(0)> 0$ and $\lim _{\tau \to \infty }f(\tau )<0$, it is evident that for $c<\gamma ^{2}$ there is a single root to $f(\tau ) = 0$ and further at the solution $\tau ^{*}$ the second order conditions hold,
Note that at the extremes of $\tau$, the reward $g(\tau )$ as in (15) yields the expected results. With zero switching costs,
and further without switching,
In the case of $n=\infty$, a similar (yet simpler) analysis to all of the above sections pursues. As described in Section 2, the system may be viewed as a two-channel system where every transition is always to a steady state channel. The usage of the transient probabilities $p(\tau ~;~i)$ as in (11) is then replaced by $\gamma$. This results in correspondently simpler expressions in all the cases where transient probabilities are used ($n=2$) and can now be replaced by $\gamma$. The resulting reward expression is then,
with the derivative,
This shows that $g(\cdot )$ is monotonic decreasing in $\tau$ if $c <\gamma ^{2}$ and monotonic increasing in $\tau$ if $c > \gamma ^{2}$. Hence for $c < \gamma ^{2}$ the optimal switching parameter is $\tau ^{*} = 0$ and yields reward as in Theorem 2.2. Further, since $\lim _{\tau \rightarrow \infty }g(\tau ) = \gamma$, for $c > \gamma ^{2}$ it is not optimal to use $\pi ^{(\tau )}$ for any finite $\tau$, and $\pi ^{(s)}$ is preferable.
5. Numerical results for case II
We now carry out numerical experiments for our system under various policies, focusing on Case II. We first describe the numerical computations and simulation experiment used for Figure 1 and then expand with further numerical results.
The Case I curves in the figure are obtained directly from Theorem 2.1. The Case II curves for $n=\infty$ are created directly using Theorem 2.2 and the $n=2$ curves are created by numerically solving Eq. (2) of Theorem 2.2 (using the bisection method). The remaining curves for $n=3$ and $n=4$ using both the call-gapping and the cool-off policies are obtained via Monte–Carlo simulation and optimization to find the optimal $\tau$ and $\sigma$ parameters for these policies.
This Monte–Carlo simulation (as well as other code) can be found in the GitHub repository Wang et al. [Reference Wang, Nazarathy and Taimre19]. We simulated the system over a grid of $c$ in the range $[0,\gamma ^{2}]$ where $\gamma ^{2} = 0.16$, with a spacing of $0.005$. Then for each value of $c$, we simulated both the call-gapping and cool-off policies over respective grids of $\tau$ and $\sigma$ in the range $[0,1.55]$, each with a spacing of $0.001$. In each of these individual simulation runs we kept a constant seed to reduce variability in the curves using common random numbers, and simulated for a time horizon of $T=10^{5}$, with an arbitrary fixed initial condition. We then obtained the average reward $g$ using a crude Monte–Carlo estimator over the continuous time $[0,T]$ and then plotted the reward for the optimal $\tau$ and $\sigma$ in each case.
In addition to the simulations leading to Figure 1, we also carried out extensive additional numerical experiments for more parameter values. Key results from these experiments are summarized in Table 1. Our purpose is to compare the cool-off and call-gapping policies for finite $n > 2$. As we observe from Figure 1, the cool-off policy is able to achieve a slightly higher average reward for $\gamma = 0.4$. We now considered the system for several additional $\gamma$ values and several values of $n$.
In each case, the cool-off policy is indeed slightly better than call-gapping. The numerical results in Table 1 report the maximal gap, as well as error estimates, over a range of $c$ values between the rewards of the policies. We also present an estimate of the $c$ value in which the gap is maximal and highlight in bold the values of $n$ for which the gap is largest.Footnote 2 As an overarching summary, we see that while the cool-off policy is slightly better, the difference between the rewards is never high and hence in practice using the simpler call-gapping policy is probably preferable. As expected from Figure 1, for each value of $\gamma$ there is some finite $n$ in which the maximal gap is highest (for $n=2$ and $n \to \infty$ there is no gap as the policies are identical). It is also evident that for low $\gamma$ values, the worst case $n$ is higher than that of higher $\gamma$ values. This observation informed our selection of the range of $n$ for which to run the experiments.
The computational experiments required non-negligible compute time because for any system setting $(\gamma,n,c)$ optimization over the best $\tau$ (resp. $\sigma$) parameter was required. We thus relied on an empirical observation associated with the nature of the system. We observed that for any fixed $\gamma$ and $c$, as $n$ increases, the optimal parameter ($\tau ^{*}$ or $\sigma ^{*}$) decreases with $n$ (with the optimal value tending to $0$ as $n \to \infty$). This allowed us to use Theorem 2.2 to first compute the optimal parameter for $n=2$ and use it as an upper bound for the (stochastic) search for the optimal parameter for $n=3$. Further for any $n=k \ge 4$, we used the estimated optimal parameter of $n=k-1$ to determine an upper bound. This allowed to use a fixed size search grid of size $20$ for the optimal $\tau$ (resp. $\sigma$) of each level $n \ge 3$. Other elements of the simulation involved the search grid over $c$. For this, after initial experimentation indicating the location of the maximal gap, we always considered $c \in [0,\gamma ^{2}/2]$ and searched over $25$ points within this grid. Each sample path involved a time horizon of $10^{3}$ time units, and for each parameter setting $(\gamma,n,c)$, we simulated $100$ repetitions, allowing us to obtain $95\%$ error bounds using a standard normal approximation. The simulation duration was in the order of $24$ hours on a standard laptop using the (compiled) Julia language.
In addition to the simulations, we also carried out a robustness analysis for Case II with $n=2$. In this case, the optimal call-gapping (and cool-off) parameter $\tau$ is easily obtained via Eq. (2). However, we also wish to see how misspecification of the system parameter $\gamma$, leading to a misspecification of the optimal $\tau$, affects performance. For this, we again consider the case where $\gamma = 0.4$; however, we now assume that the system is perceived to operate at a potentially different system parameter, namely $\widehat {\gamma }$. The controller then optimizes $\tau$ based on (2) and the perceived value, $\widehat {\gamma }$. We then compare the ratio of the actual reward, $\widehat {g}$ and the ideal reward, ${g}$ obtained using $\gamma = 0.4$. Both are obtained via the expressions in Theorem 2.2. The ratio is plotted in Figure 2 where we considered cost values $c = 0.04$, 0.08, 0.12, and $0.16$. As we see in Figure 2, as we might expect, for larger deviations of $\widehat {\gamma }$ from $\gamma = 0.4$ we get larger relative losses of the reward. However, the relative loss at a few percentage points in most cases. The plateaus observed on each of the curves are due to values of $\widehat {\gamma }$ during which $c > \widehat {\gamma }^{2}$ in which case the $\pi ^{(s)}$ policy is employed.
6. Conclusion and extensions
Our aim with this study was to obtain a qualitative view on the interaction of the number of channels, the available information, and the switching costs in a channel selection context. In such a setting, system designers can consider the value of information as well as the effect of switching costs (the value of efficient switching) in lieu of various control policies. Through a simple model, we obtained a qualitative relationship between the various system factors and in certain cases, we were able to explicitly analyze the system.
The case of full observation is straightforward to analyze; however, for the case of partial observation, to the best of our knowledge, explicit analysis is only attainable with $n=2$ and $n=\infty$ as we have done here. This is for the call-gapping and cool-off policies that we introduced. Further numerical analysis was carried out for other small finite $n$ illustrating that in practice there are only minor efficiency gains to be had by using cool-off instead of call-gapping. This is despite the fact that call-gapping policy only requires local information of the current state of the channel while the cool-off policy requires information for all channels. We complement our analysis with a numerical robustness experiment hinting that inexact knowledge of system parameters would not hinder performance greatly when using a call-gapping policy.
Our work did not focus on optimality of the policies discussed within the class of all possible policies. We conjecture that for $n=2$ call-gapping (equivalent to cool-off) is optimal in the case of partial observation. However, proving this remains a challenge for a future study. Further, real systems will typically exhibit a more complicated structure than two-state Markov chains. For such systems, adaptations of the call-gapping and cool-off policies may still be employed; however, the analysis is more complicated. Nevertheless, we believe that our qualitative and quantitative results based on two-state Markov chains may help serve as a guide for the tradeoffs between information, switching costs, policy complexity, and the number of channels in a system.
Acknowledgments
Jiesen Wang would like to thank the University of Melbourne for supporting her work through the Melbourne Research Scholarship and the Albert Shimmins Fund. Thomas Taimre is supported by the Australian Research Council (ARC) Centre of Excellence for the Mathematical and Statistical Frontiers (ACEMS). Yoni Nazarathy and Thomas Taimre are supported by ARC grant DP180101602.