1. Introduction
We are interested in maximizing a function $U\,:\,\mathbb{R}^{d}\to\mathbb{R}$ which is unknown. However, we can observe a sequence $J(\theta,X_{n})$ , $n\geq 1$ , where $J\,:\,\mathbb{R}^d\times\mathbb{R}^m\to\mathbb{R}$ is measurable,
and $X_{n}$ , $n\geq 1$ , is an $\mathbb{R}^m$ -valued stationary process in the strong sense. The stochastic representations $J(\theta,X_{n})$ are often interpreted as noisy measurements of $U(\theta)$ . In this paper we focus on applications to mathematical finance, described in Section 6 below, where $J(\theta,X_t)$ are functionals of observed economic variables $X_t$ and $\theta$ determines an investor’s portfolio strategy. In that context, stochasticity does not come from measurement errors; rather, it is an intrinsic property of the system. Maximizing U serves to find the best investment policy in an online, adaptive manner.
We study a recursive algorithm employing finite differences, as proposed by Kiefer and Wolfowitz in [Reference Kiefer and Wolfowitz14]. This is a variant of the Robbins–Monro stochastic gradient method [Reference Robbins and Monro19] where, instead of the objective function itself, its gradient is assumed to admit a stochastic representation.
The novelty in our work is that we do not assume differentiability, nor even continuity, of $\theta\to J(\theta,{\cdot})$ , and the sequence $X_n$ may well be dependent as long as it satisfies a mixing condition. The only result in such a setting that we are aware of is in [Reference Laruelle and Pagès16], which, however, studies only almost sure convergence, without a convergence rate. Our purpose is not to find the weakest possible hypotheses but to arouse keen interest in the given problem that may lead to further, more general results. Our work is also a continuation of [Reference Fort, Moulines, Schreck and Vihola7, Reference Chau, Kumar, Rásonyi and Sabanis4], where discontinuous stochastic gradient procedures were treated.
The main theorems are stated in Section 2 and proved in Section 3. Section 4 recalls earlier results that we are relying on. A numerical example is provided in Section 5. We explain the significance of our results for algorithmic trading in Section 6.
2. Set-up and results
For real-valued quantities X, Y, the notation $X=O(Y)$ means that there is a constant $C>0$ such that $|X|\leq CY$ . We will always work on a fixed probability space $(\Omega,\mathcal{F},P)$ equipped with a filtration $\mathcal{F}_{n}$ , $n\in\mathbb{N}$ , such that $\mathcal{F}_{0}=\{\emptyset,\Omega\}$ . A decreasing sequence of sigma-algebras $\mathcal{F}^{+}_{n}$ , $n\in\mathbb{N}$ , is also given, such that, for each $n\in\mathbb{N}$ , $\mathcal{F}_n$ and $\mathcal{F}_n^+$ are independent and $X_n$ is adapted to $\mathcal{F}_n$ . The notation $\mathbb{E}[X]$ refers to the expectation of a real-valued random variable X, while $\mathbb{E}_k[X]$ is a shorthand notation for $\mathbb{E}[X\vert\mathcal{F}_k]$ , $k\in\mathbb{N}$ . $P_k(A)$ refers to the conditional probability $P(A\vert\mathcal{F}_k)$ . We denote by $\unicode{x1D7D9}_A$ the indicator of a set A. The notation $\omega$ refers to a generic element of $\Omega$ . For $r\geq 1$ , we refer to the set of random variables with finite rth moments as $L^r$ . Vertical bars $|{\cdot}|$ denote the Euclidean norm in $\mathbb{R}^k$ , where k may vary according to the context.
For $i=1,\ldots,d$ , let $\textbf{e}_{i}\in\mathbb{R}^d$ denote the vector in which the ith coordinate is 1 and the other coordinates are 0. For two vectors $v,w\in\mathbb{R}^m$ , the relation $v\leq w$ expresses that $v^i\leq w^i$ for all the components $i=1,\ldots,m$ . Let $B_r\,:\!=\,\big\{\theta\in\mathbb{R}^d:\,|\theta|\leq r\big\}$ denote the ball of radius r, for $r\geq 0$ .
Let the function $U\,:\,\mathbb{R}^d\to\mathbb{R}$ have a unique maximum at the point $\theta^{*}\in\mathbb{R}^{d}$ . Consider the following recursive stochastic approximation scheme for finding $\theta^{*}$ :
starting from some initial (deterministic) guess $\theta_0\in\mathbb{R}^d$ , where H is an estimator of the gradient of J, defined as
for all $\theta\in\mathbb{R}^d$ , $x\in\mathbb{R}^m$ , and $c>0$ .
The sequences $(\lambda_k)_{k\in\mathbb{N}}$ and $(c_k)_{k\in\mathbb{N}}$ appearing in (2) will consist of positive real numbers, which are to be specified later. We will distinguish the cases where $\lambda_k$ , $c_{k}$ tend to zero and where they are kept constant, the former being called decreasing-gain approximation and the latter fixed-gain approximation.
Remark 2.1. Our results below could easily be formulated in a more general setting where $J\big(\theta_k+c_k\textbf{e}_i,X_{k+1}(i)\big)$ and $J\big(\theta_k-c_k\textbf{e}_i,X^{\prime}_{k+1}(i)\big)$ , $i=1,\ldots,d$ , are considered with distinct $X_{k+1}(i)$ and $X^{\prime}_{k+1}(i)$ . In the applications that motivate us this is not the case; hence, for reasons of simplicity, we stay in the present setting.
Assumption 2.1. U is continuously differentiable with unique maximum $\theta^*\in\mathbb{R}^d$ . Let $$G\left( \theta \right) = \nabla U\left( \theta \right)$$ . The function G is assumed Lipschitz-continuous with Lipschitz constant $L_G$ .
We assume in the sequel that the function J in (1) has a specific form. Note that though J is not continuous, U can nonetheless be continuously differentiable, by the smoothing effect of randomness.
Assumption 2.2. Let the function J be of the following specific form:
where $l_{i}\,:\,\mathbb{R}^{d}\times\mathbb{R}^{m}\to\mathbb{R}^{d}$ are Lipschitz-continuous (in both variables) for $i=1,\ldots,m_{s}$ , and for some $m_p,m^{\prime}_p\in\mathbb{N}$ ,
with Lipschitz-continuous functions $g^{j}_{i},h_i^j\,:\,\mathbb{R}^d\to\mathbb{R}^m$ . Furthermore, $A_{0}(x)\,:\!=\,\mathbb{R}^{d}\setminus \cup_{i=1}^{m_{s}}A_{i}(x)$ and
for some $D>0$ . The function $l_0$ is twice continuously differentiable, and there are constants $L^{\prime\prime}_1,L^{\prime\prime}_2$ such that
where I is the $d\times d$ identity matrix.
Remark 2.2. Assumption 2.2 implies that $\nabla l_0$ grows linearly, and hence $l_0$ itself is locally Lipschitz with linearly growing Lipschitz coefficient; that is,
with some $L_0>0$ , for all $\theta_1,\theta_2\in\mathbb{R}^d$ .
In plain English, we consider J which is smooth on a finite number of bounded domains (the interior of the constraint sets $A_{i}(x)$ , $i=1,\ldots, m_{s}$ ) but may have discontinuities at the boundaries. Furthermore, J (and hence also U) is required to be quadratic ‘near infinity’ (on $A_{0}(x)$ ).
We briefly explain why such a hypothesis is not restrictive for real-life applications. Normally, there is a compact set Q (e.g. a cube or a ball) such that only parameters from Q are relevant, i.e. U is defined only on Q. Assume it has some stochastic representation
and a unique maximum $\theta^*\in Q$ . Assume that $Q\subset B_D$ for some D. Extend U outside $B_D$ as $U(\theta)=-A|\theta|^2+B$ for suitable A, B. Extend U and J to $B_D\setminus Q$ as well in such a way that U is continuously differentiable, and $U(\theta)<U(\theta^*)$ for all $$\theta \ne {\theta ^*}$$ (see Section 4 of [Reference Chau5] for a rigorous construction of this kind). Set $J\,:\!=\,U$ outside Q. Then our maximization procedure can be applied to this setting for finding $\theta^*$ .
Defining $U=l_0$ to be (essentially) quadratic outside a compact set is one way of solving the problem that such procedures often leave their effective domain Q. Other solutions are resetting (see e.g. [Reference Gerencsér9]) or an analysis of the probability of divergence (see e.g. [Reference Benveniste, Métivier and Priouret2]).
The next assumption postulates that the process X should be bounded and the conditional laws of $X_{k+1}$ should be absolutely continuous with a bounded density.
Assumption 2.3. For each $k\in\mathbb{N}$ ,
for some measurable $p_{k}\,:\,\mathbb{R}^{d}\times\Omega\to\mathbb{R}_{+}$ , and there is a fixed constant F such that $p_{k}(u,\omega)\leq F$ holds for all k, $\omega$ , u. The random variable $X_0$ satisfies $|X_0|\leq K_0$ for some constant $K_0$ .
Note that, by strong stationarity, the process $X_k$ is uniformly bounded under Assumption 2.3.
We will assume a certain mixing property about the process $X_n$ which we recall now. A family of $\mathbb{R}^d$ -valued random variables $Z_{i}$ , $i\in \mathcal{I}$ , is called $L^r$ -bounded for some $r\geq 1$ if $\sup_{i\in \mathcal{I}}\mathbb{E}|Z_i|^r<\infty$ ; here $\mathcal{I}$ may be an arbitrary index set.
For a random field $Y_n(\theta)$ , $n\in \mathbb{N}$ , $\theta\in\mathbb{R}^d$ , bounded in $L^r$ for some $r\geq 1$ , we define, for all $n\in\mathbb{N}$ ,
These quantities clearly make sense also for any $L^r$ -bounded stochastic process $Y_n$ , $n\in\mathbb{N}$ (the essential suprema disappear in this case). $M^n_r(Y)$ measures the (conditional) moments of Y, while $\Gamma_r^n(Y)$ describes its dependence structure (like covariance decay). In particular, one can define $M^n_r(X)$ , $\Gamma^n_r(X)$ . We clearly have $M^n_r(X)\leq K_0$ under Assumption 2.3. The quantities $\Gamma^n_r(X)$ will figure in certain estimates later.
Assumption 2.4. For some $\epsilon>0$ , $\gamma_3^n(\tau,X)=O\big((1+\tau)^{-4-\epsilon}\big)$ , where the constant of $O({\cdot})$ is independent of $\omega$ , $\tau$ , and n. Furthermore,
where the constant of $O({\cdot})$ is independent of n, k.
Both requirements in Assumption 2.4 are about how the effect of the past on the present decreases as we go back farther in time.
Example 2.1. Let $\varepsilon_n$ , $n\in\mathbb{N}$ , be a bounded sequence of independent and identically distributed (i.i.d.) random variables in $\mathbb{R}^m$ with bounded density w.r.t. the Lebesgue measure, and choose $\mathcal{F}_k\,:\!=\,\sigma(\varepsilon_j,\ j\leq k)$ and $\mathcal{F}_k^+\,:\!=\,\sigma(\varepsilon_j,\ j\geq k+1)$ . Then $X_n\,:\!=\,\varepsilon_n$ , $n\in\mathbb{N}$ , satisfies Assumptions 2.3 and 2.4. A causal infinite moving average process whose coefficients decay sufficiently fast is another pertinent example. Indeed, using the argument of Lemma 4.2 of [Reference Chau, Kumar, Rásonyi and Sabanis4], one can show that $X_n\,:\!=\,\sum_{j=0}^{\infty}s_j\varepsilon_{n-j}$ , $n\in\mathbb{N}$ , satisfies Assumption 2.4, where the $\varepsilon_i$ are as above, $${s_0} \ne 0$$ , and $|s_j|\leq (1+j)^{-\beta}$ holds for some $\beta>9/2$ . Assumption 2.3 is also clearly satisfied in that model.
Remark 2.3. A random field $Y_n(\theta)$ , $n\in \mathbb{N}$ , is called uniformly conditionally L-mixing if $Y_n(\theta)$ is adapted to the filtration $\mathcal{F}_n$ , $n\in\mathbb{N}$ , for all $\theta$ , and the sequences $M_r^n(Y)$ , $\Gamma_r^n(Y)$ , $n\in\mathbb{N}$ , are bounded in $L^r$ for each $r\geq 1$ . Our Assumption 2.4 thus requires a sort of related mixing property. Conditional L-mixing was introduced in [Reference Chau, Kumar, Rásonyi and Sabanis4], inspired by [Reference Gerencsér8].
2.1. Decreasing-gain stochastic approximation
The usual assumption on the sequences $(\lambda_k)_{k=1,2,\dots}$ and $(c_k)_{k=1,2,\dots}$ in the definition of the recursive scheme (2) are the following (see [Reference Kiefer and Wolfowitz14]):
In the sequel we stick to a more concrete choice which clearly fulfills the conditions in (4) above.
Assumption 2.5. We fix $\lambda_0,c_0>0$ , $\gamma\in (0,1/3)$ , and set
and $c_k=c_0 k^{-\gamma}$ , $k\geq 1$ . We also assume $c_0\leq 1$ .
Asymptotically $\lambda_{k}$ behaves like $\lambda_{0}/k$ . However, our choice somewhat simplifies the otherwise already involved theoretical analysis.
The ordinary differential equation (ODE) associated with the problem is
The idea of using an associated deterministic ODE to study the asymptotic properties of recursive schemes was introduced by Ljung in [Reference Ljung17]. The intuition behind this association is that in the long run the noise effects average out and the asymptotic behavior is determined by this ‘mean’ differential equation. A heuristic connection between the dynamics of the recursive scheme and the ODE can be seen if one looks at the Euler discretization of the latter.
The solution of (5) with initial condition $y_s=\xi$ will be denoted by $y(t,s,\xi)$ for $0< s\leq t$ .
Assumption 2.6. The ODE (5) fulfills the stability assumption formulated below: there exist $C^{*}>0$ and $\alpha>0$ such that
for all $0<s<t$ .
Our main result comes next.
Theorem 2.1. Let Assumptions 2.1, 2.2, 2.3, 2.4, 2.5, and 2.6 hold. Then $\mathbb{E}|\theta_n-\theta^{*}|=O\big(n^{-\chi}+n^{-\alpha}\big)$ , $n\geq 1$ , where $\chi =\min\!\big\{\frac{1}{2} - \frac{3}{2}\gamma,\gamma\big\}$ and the constant in $O({\cdot})$ depends only on $\theta_0$ .
To get the best result, set $\gamma=\frac{1}{5}$ . In this case the convergence rate is $\chi=\frac{1}{5}$ (provided that $\alpha\geq 1/5$ ). For Kiefer–Wolfowitz procedures, [Reference Sacks20] establishes a convergence rate $n^{-1/3}$ under fairly restrictive conditions (e.g. J is assumed smooth and X is i.i.d.). Our approach is entirely different from that of [Reference Sacks20] and relies on the ODE method (see e.g. [Reference Kushner and Clark15]) in the spirit of [Reference Gerencsér9, Reference Gerencsér11, Reference Gerencsér10], where so-called SPSA procedures were analyzed.
Theoretical analysis in the present case is much more involved for two reasons: the discontinuities of J and the state-dependent setting (hardly analyzed in the literature at all). Our results are closest to those of [Reference Gerencsér10], where a rate of $n^{-2/7}$ is obtained for the SPSA algorithm (a close relative of Kiefer–Wolfowitz), with strong smoothness assumptions imposed on J. As already remarked, in the absence of smoothness ours is the first study providing a convergence rate. Further strengthening of our result seems to be difficult and will be the object of further investigations.
We also point to two related papers where stochastic gradient procedures are analyzed: [Reference Fort, Moulines, Schreck and Vihola7] treats the Markovian case while [Reference Chau, Kumar, Rásonyi and Sabanis4] is about possibly non-Markovian settings. In these studies, the gradient of J is assumed to exist, but it may be discontinuous. Some of the ideas of [Reference Chau, Kumar, Rásonyi and Sabanis4] apply in the present, more difficult case where even the continuity of J fails.
2.2. Fixed-gain stochastic approximation
Let us also consider a modified recursive scheme
where a and c are fixed (small) positive reals, independent of k. In contrast with the previous scheme (2), which is meant to converge to the maximum of the function, this method is expected to track the maximum.
The ODEs associated with the problem are
for each $\lambda>0$ .
Note that, by an exponential time change, one can show that Assumption 2.6 on the ODE (5) implies (7) being exponentially stable, i.e. satisfying
for some $\alpha>0$ (possibly different from the one in (5)).
Theorem 2.2. Let Assumptions 2.1, 2.2, 2.3, 2.4, and 2.6 hold. Then $\mathbb{E}|\theta_n-\theta^{*}|=O\!\left(\max\!\left(c^2,\sqrt{\frac{a}{c}}\right)+ e^{-a\alpha n}\right)$ holds for all $n\geq 1$ , where the constant in $O({\cdot})$ depends only on $\theta_{0}$ .
Note that, similarly to the decreasing-gain setting, this leads to the best choice being $c=a^{\frac{1}{5}}$ . We know of no other papers where the fixed-gain case has been treated. In the case of stochastic gradients there are many such studies obtaining a rate of $\sqrt{a}$ for step size a; see e.g. [Reference Chau, Kumar, Rásonyi and Sabanis4] and the references therein.
3. Proofs
The following lemma will play a pivotal role in our estimates: it establishes the conditional Lipschitz-continuity of the difference function obtained from J.
Lemma 3.1. Under Assumptions 2.2 and 2.3, there is $C_{\flat}>0$ such that, for each $i=1,\ldots,d$ and $c\leq 1$ ,
holds for all $k\in\mathbb{N}$ and for all pairs of $\mathcal{F}_{k}$ -measurable $\mathbb{R}^{d}$ -valued random variables $\bar{\theta}_{1}$ , $\bar{\theta}_{2}$ .
Proof. We assume that $m_{s}=1$ , $m_{p}=1$ , $m^{\prime}_p=0$ . We will briefly refer to the general case later. We thus assume that $J(\theta,x)=l_{1}(\theta,x)\unicode{x1D7D9}_{\{x\leq g(\theta)\}}+l_{0}(\theta) \unicode{x1D7D9}_{A_{0}(x)}$ with some Lipschitz-continuous $g,l_{1}$ with Lipschitz constant $L_{1}$ (for both). Let $K_{1}$ be an upper bound for $l_{1}$ in $B_{D+2}$ .
Consider first the event $A_{1}\,:\!=\,\big\{\bar{\theta}_{1},\bar{\theta}_{2}\in B_{D+1}\big\}$ and the corresponding indicator $I_1\,:\!=\,\unicode{x1D7D9}_{A_1}$ . Note that on $I_1$ we have $\bar{\theta}_j\pm c\textbf{e}_i\in B_{D+2}$ , $j=1,2$ . Now estimate
In the same way, we also get
As $l_0$ is clearly Lipschitz on $B_{D+2}$ , we also have
Let $L^{\prime\prime}_2$ be an upper bound for the second derivative $\nabla\nabla l_{0}$ ; recall Assumption 2.2. Now let $A_{2}$ be the event that the line from $\bar{\theta}_{1}$ to $\bar{\theta}_{2}$ does not intersect $B_{D+1}$ at all; let $I_2\,:\!=\,\unicode{x1D7D9}_{A_2}$ . It follows in particular that neither $\bar{\theta}_1\pm c\textbf{e}_i$ nor $\bar{\theta}_2\pm c\textbf{e}_i$ fall into $B_D$ . Since $J=l_{0}$ outside $B_{D}$ we can write, by the Lagrange mean value theorem,
with some random variables $\xi_{j}\in \big[\bar{\theta}_{j}-c\textbf{e}_{i},\bar{\theta}_{j}+c\textbf{e}_{i}\big]$ , $j=1,2$ , remembering our assumptions on $l_{0}$ and $c\leq 1$ .
Turning to the event $\Omega\setminus (A_1\cup A_2)$ , let us consider the directed straight line from $\bar{\theta}_{1}(\omega)$ to $\bar{\theta}_{2}(\omega)$ ; let its first intersection point with the boundary of $B_{D+1}$ be denoted by $\kappa_{1}(\omega)$ and its second intersection point by $\kappa_{2}(\omega)$ . In the case where there is only one intersection point, it is denoted by $\kappa_{1}(\omega)$ . Let $I_{3}$ be the indicator of the event that there is only one intersection point ( $\kappa_{1}$ ) with $B_{D+1}$ and that $\bar{\theta}_{1}$ is inside $B_{D+1}$ . The arguments of the previous two cases guarantee that
Similarly, if $I_{4}$ is the indicator of the event where there is one intersection point and $\bar{\theta}_{2}$ is inside $B_{D+1}$ , then we also get
Let $I_{5}$ denote the indicator of the case where both $\bar{\theta}_{1}$ , $\bar{\theta}_{2}$ are outside $B_{D+1}$ and there are two intersection points $\kappa_{1}$ , $\kappa_{2}$ . We get, as above,
Finally, in the remaining case (where there is only one intersection point with $B_{D+1}$ though both $\bar{\theta}_{1}$ , $\bar{\theta}_{2}$ are outside $B_{D+1}$ ), we similarly get an estimate of the order $O\big(|\bar{\theta}_1-\bar{\theta}_2|+c^2\big)$ , and hence we eventually obtain the statement of the lemma.
When $m_p=0$ and $m^{\prime}_p=1$ , the same ideas work. When $m_p+m^{\prime}_p>1$ we can rely on the elementary observation that
and on its counterpart for the $h^j$ . Estimates can be repeated for each summand in the definition of J, so the case $m_s>1$ follows, too.
The arguments of the previous lemma, (8) in particular, also give us the following.
Lemma 3.2. Under Assumptions 2.2 and 2.3, there is such that, for each $i=1,\ldots,d$ ,
holds for all $k\in\mathbb{N}$ and for all $\mathcal{F}_{k}$ -measurable $B_{D+1}$ -valued random variables $\bar{\theta}$ .
3.1. Moment estimates
In this subsection, we will prove that the first moments of our iteration scheme remain bounded. This will be followed by other moment estimates. We start with a preliminary lemma on deterministic sequences.
Lemma 3.3. Let $x_{k}\geq 0$ , $k\in\mathbb{N}$ , be a sequence; let $\zeta_{k}>0$ , $k\geq 1$ , be another sequence. If they satisfy $u\zeta_k<1$ , $k\geq 1$ , and
with some $\underline{c},u>0$ , then
Proof. Following the argument of Lemma 1 in [Reference Durmus and Moulines6], we notice that
where an empty product is meant to be 1. We can write
This shows the claim.
Certain calculations are easier to carry out if we consider the continuous-time embedding of the discrete-time processes. Consider the following extension ${\theta}_t$ , $t\in\mathbb{R}_+$ , of $\theta_k$ , $k\in\mathbb{N}$ : let
for all $k\in\mathbb{N}$ and for all $k\leq t<k+1$ , where $H(u,\theta)=H(\theta,X_{k+1},c_k)$ for all $k\in\mathbb{N}$ , and for all $k\leq u<k+1$ , $c_u=c_k$ and $a_{u}=\lambda_{0}/\max\{u,1\}$ , $u\geq 0$ . Extend the filtration to continuous time by $\mathcal{F}_t\,:\!=\,\mathcal{F}_{\lceil t\rceil}$ , $t\in\mathbb{R}_+$ . Now fix $\mu>1$ . We introduce an auxiliary process that will play a crucial role in later estimates. For each $n\geq 1$ and for $\lceil n^{\mu}\rceil\leq t<\lceil (n+1)^{\mu}\rceil$ , define $\overline{y}_t\,:\!=\,y\big(t,\lceil n^{\mu}\rceil,\theta_{\lceil n^{\mu}\rceil}\big)$ , i.e. the solution of (5) starting at $\lceil n^{\mu}\rceil$ with initial condition $\overline{y}_{\lceil n^{\mu}\rceil}=\theta_{\lceil n^{\mu}\rceil}$ .
We introduce the $L^1$ -norm
for each $\mathbb{R}^d$ -valued random variable Z.
Lemma 3.4. Under Assumptions 2.2 and 2.3, we have
Proof. Note that $2c_{k}H^{j}(\theta,x,c_{k})=l_{0}\big(\theta+c_{k}\textbf{e}_{j}\big)-l_{0}\big(\theta-c_{k}\textbf{e}_{j}\big)$ , for all x, $j=1,\ldots,d$ , when . Furthermore, the function $l_{0}\big(\theta+c_{k}\textbf{e}_{j}\big)-l_{0}\big(\theta-c_{k}\textbf{e}_{j}\big)$ is Lipschitz on $B_{D+1}$ , which together with Lemma 3.2 implies
for a fixed constant $\,\bar{\!C}$ . Clearly,
Note that, by Assumption 2.2, $l_0$ is strongly convex; in particular,
for all $\theta$ , with some $A_0>0$ . Hence also
for suitable $A,B>0$ . But then for all $a>0$ small enough,
for suitable $A^{\prime},B^{\prime}>0$ . By the mean value theorem,
for some random variable $\xi_j\in\big[\theta_{k}-c_{k}\textbf{e}_{j},\theta_{k}+c_{k}\textbf{e}_{j}\big]$ . Since $\nabla l_0$ is Lipschitz,
for some $L^{\prime}>0$ . It then follows easily that, for $k\geq k_0$ large enough so that $\lambda_k$ is small enough,
holds. By (9),
Apply Lemma 3.3 with the choice $x_{k}\,:\!=\,||\theta_{k}||_{1}$ , $\underline{c}\,:\!=\,d (L^{\prime}+\,\bar{\!C})+B^{\prime}$ and $\zeta_{k}\,:\!=\,\lambda_{k}$ , $u\,:\!=\,A^{\prime}$ to obtain that $\sup_{k\geq k_0}\|\theta_{k}\|_{1}<\infty$ . Then trivially also $\sup_{n\in\mathbb{N}}\|\theta_{n}\|_{1}<\infty$ holds, which easily implies $\sup_{t\geq 1}\|\theta_{t}\|_{1}<\infty$ as well.
Now turning to $\overline{y}_{t}$ we see that, for $n\geq 1$ and $\lceil n^{\mu}\rceil\leq t<\lceil (n+1)^{\mu}\rceil$ ,
finishing the proof.
Lemma 3.5. Let Assumptions 2.2 and 2.3 hold. Then there exists $C_l>0$ such that $\sup_{k\geq 1}|J(\theta,X_k)|\leq C_l\big(1+\vert\theta\vert^2\big)$ .
Proof. Recall that
where the functions $l_i$ are bounded on the bounded sets $\cup_{x\in\mathbb{R}^d} A_i(x)$ for $i=1,\ldots,d$ , and $l_0$ grows quadratically.
The difficulty of the following lemma consists in handling the discontinuities and the dependence of the sequence $X_{k}$ at the same time.
Lemma 3.6. Let Assumptions 2.2 and 2.3 hold. Then for each $R>0$ the random field $J(\theta,X_n)$ , $\theta\in B_R$ , $n\in\mathbb{N}$ , satisfies
for some $L>0$ , where $C_l$ is as in Lemma 3.5.
Proof. The first statement is clear from Lemma 3.5. Let $n\geq 0$ , $\tau\geq 1$ be fixed. For $k\geq\tau$ , define $X_k^+ = \mathbb{E}\big[X_{n+k}\vert \mathcal{F}_{n+k-\tau}^+\vee\mathcal{F}_{n}\big]$ . For the sake of simplicity, we assume that $m_s=1$ in the definition of J, $m_p=0$ , but the same argument would work for several summands, too. We also take the process X unidimensional ( $m\,:\!=\,1$ ), noting that the same arguments easily carry over to a general m.
We now perform an auxiliary estimate. Let $\epsilon_{\tau}>0$ be a parameter to be chosen later and let $1\leq j\leq m^{\prime}_{p}$ . We will write h below instead of $h_{1}$ . Define $Z_{k}=X_{n+k}-X_{k}^{+}$ and estimate
where the last inequality follows from Assumption 2.3 and the Markov inequality. Now estimate
for some $C_{1}$ , where we used the Lipschitz-continuity of the function $l_1$ , as well as the observation that
A similar estimate works for $l_0$ , but we get the upper bound
instead. For the second inequality of the present lemma, note first that Lemma 4.1 below implies
hence it suffices to estimate the latter quantity. From our previous estimates it follows that, for some $C>0$ ,
Choose $\epsilon_{\tau}\,:\!=\,(\tau+1)^{-3-\epsilon/2}$ . Summing up the right-hand side for $\tau\geq 1$ we see that, by Assumption 2.4, the sum has an upper bound independent of k. The statement follows as the case $\tau=0$ is easy.
3.2. Decreasing-gain case
The following lemma contains the core estimates of the present paper.
Lemma 3.7. Let $n\geq 1$ . Let $\lceil n^{\mu}\rceil\leq t<\lceil (n+1)^{\mu}\rceil$ for $\mu\,:\!=\,1/\gamma$ and let Assumptions 2.1, 2.2, 2.3, 2.4, 2.5, and 2.6 hold. Then $\mathbb{E}|\theta_t-\overline{y}_t|=O\big(n^{-\beta }\big)$ , where $\beta =\min\!\left(\frac{1}{2\gamma}-\frac{1}{2}, 2\right)$ .
Proof. For $\lceil n^{\mu}\rceil\leq t< \lceil (n+1)^{\mu}\rceil$ ,
Estimation of $\Sigma_0$ . Since G has at most linear growth, Lemma 3.4 guarantees that
Estimation of $\Sigma_1$ . Recall that, by the tower property for conditional expectations,
for all $k\in\mathbb{N}$ . Applying this observation to $k=\lfloor u\rfloor$ , Lemma 3.1 implies that
Henceforth we will write
Notice that
Estimation of $\Sigma_2$ . Notice that $H(u,\bar{\theta})=\mathbb{E}\big[H(u,\bar{\theta})| \mathcal{F}_{\lceil n^{\mu}\rceil} \big]$ for all $\mathcal{F}_{\lceil n^{\mu}\rceil}$ -measurable $\bar{\theta}$ such that, almost surely, , since $J(\theta,x)$ does not depend on x outside $B_{D}$ by Assumption 2.2. Thus
We will use the inequality of Theorem 4.1 below with $r=3$ , with $\mathcal{R}_{t}\,:\!=\,\mathcal{F}_{t+\lceil n^{\mu}\rceil}$ , $t\in\mathbb{R}_{+}$ , $\mathcal{R}_t^+\,:\!=\,\mathcal{F}_{t+\lceil n^{\mu}\rceil}^+$ , with the process defined by
and with the function $f_t={a_{t+\lceil n^{\mu}\rceil}}/{c_{t+\lceil n^{\mu}\rceil }}$ . Note that $\{\overline{y}_{t}\in B_{D}\}\in\mathcal{F}_{\lceil n^{\mu}\rceil}$ for all $\lceil n^\mu\rceil\leq t<\lceil (n+1)^\mu\rceil$ . We get from Lemma 4.2 below and from the cited inequality that
We thus get
Estimation of $\Sigma_3$ . We have
To handle the second sum, note that, for each $i=1,\ldots,d$ ,
for some $\xi^{i}_{k}\in [\vartheta - c_k\textbf{e}_{i},\vartheta + c_k\textbf{e}_{i}]$ . The Lipschitz-continuity of G implies that $|G^{i}\big(\xi^{i}_k\big)- G^{i}(\vartheta)|\leq L_G c_k$ , so
Now we turn to the first sum in (13). Define $X_k^+ = \mathbb{E}\big[X_{k}\vert \mathcal{F}_{\lceil n^{\mu}\rceil}^+\big]$ , $k\geq \lceil n^{\mu}\rceil$ . First let us estimate
Fix $\epsilon_{k}>0$ to be chosen later. By an argument similar to that of Lemma 3.6 (using the first instead of the third moment in Markov’s inequality) we get that, for some constant $C_{1}$ ,
Choose $\epsilon_k=(1+k)^{-1-\epsilon/2}$ . Then using Assumption 2.4 we get
which also implies
Since
for $k\geq \lceil n^{\mu}\rceil$ by independence of $\mathcal{F}_{\lceil n^{\mu}\rceil}$ and $\mathcal{F}_{\lceil n^{\mu}\rceil}^{+}$ , we have
with some $C_{2}$ so
Combining the estimates we have so far, we get
Notice that $\mathbb{E}\vert \theta_t - \overline{y}_t\vert$ is always finite; see Lemma 3.4 above. Use Gronwall’s lemma and (11) to obtain the inequality
with some constant $C_{3}$ . From Lemma 3.2 it is also easy to check that $\mathbb{E}\vert \theta_t - \theta_{\lfloor t\rfloor}\vert=O(n^{-\mu})$ . Note furthermore that the terms $n^{-\mu}$ and $n^{\mu(\gamma-1)}$ are always negligible in (14). These observations lead to
with some $C_{4}$ , finishing the proof.
Proof of Theorem 2.1. Let
By Fatou’s lemma, we also have
where $\overline{y}_{s-}$ denotes the left limit of $\overline{y}$ at s.
It follows from Lemma 3.7 that $d_i=O\big(i^{-\beta}\big)$ . Combining this with Assumption 2.6 and using telescoping sums we get, for each integer $N\geq 1$ ,
noting that $y(\lceil i^{\mu}\rceil,\lceil(i-1)^{\mu}\rceil,\theta_{ \lceil (i-1)^{\mu}\rceil})$ equals the left limit $\overline{y}_{\lceil i^{\mu}\rceil -}$ . A similar argument provides, for all $t\in(\lceil N^{\mu}\rceil ,\lceil (N+1)^{\mu}\rceil)$ ,
Taking the $\mu$ th root we obtain
To conclude, note that by the stability Assumption 2.6, $|y(t,1,\theta_{1})-\theta^{*}|\leq C^{*}|\theta_1-\theta^{*}|t^{-\alpha}$ and that $\mathbb{E}|\theta_1|<\infty$ , as easily seen using Lemma 3.2.
3.3. Fixed-gain stochastic approximation
Define $T=\frac{c}{a}$ . For $nT\leq t<(n+1)T$ , define $\overline{y}_t=y(t,nT,\theta_{nT})$ , i.e. the solution of (5) with the initial condition $y_{nT}=\theta_{nT}$ . We use the piecewise linear extension $\overline{\theta}_t$ of $\theta_t$ and the piecewise constant extension $H(t,\theta)$ of $H(\theta,X_{k+1},c)$ as defined in the decreasing-gain setting, but a and c are now constants.
Lemma 3.8. Let Assumptions 2.1, 2.2, 2.3, 2.4 and 2.6 hold. Then for $t\in[nT,(n+1)T]$ there is $\overline{C}>0$ such that $\mathbb{E}|\theta_t-\overline{y}_{t}|\leq \overline{C} \max\!\Big(c^2,\sqrt{\frac{a}{c}}\Big)$ .
Proof. Using essentially the same estimates we derived in the decreasing-gain setting, for fixed a and c we get
with suitable constants $C_{0},C_{1},C_{2},C_{3}$ . Combine these estimates and use Gronwall’s lemma to get the statement. To choose optimally, set $c^2=\sqrt{\frac{a}{c}}$ , that is, $c=a^{\frac{1}{5}}$ . In this case $\mathbb{E}\vert\theta_t-\overline{y}_t\vert\leq C_4 a^{\frac{2}{5}}$ for some $C_4$ .
Proof of Theorem 2.2. Let
It follows from Lemma 3.8 that $d_i\leq \overline{C} \max\!\Big(c^2,\sqrt{\frac{a}{c}}\Big)$ . Combining this with Assumption 2.6 and using telescoping sums we get
with some $\hat{C}$ , since $\sum_{i=2}^N e^{-a\alpha(NT-iT)}$ has an upper bound independent of N. We similarly get
with some $\check{C}$ . To conclude, note that by the stability Assumption 2.6, $|y(t,1,\theta_{1})-\theta^{*}|\leq C^{*}|\theta_1-\theta^{*}|e^{-a\alpha t}$ and therefore
4. Auxiliary results
We define continuous-time analogues of the key quantities M and $\Gamma$ from Assumption 2.4 and establish a pivotal maximal inequality for them.
Consider a continuous-time filtration $(\mathcal{R}_t)_{t\in\mathbb{R}_+}$ as well as a decreasing family of sigma-fields $\big(\mathcal{R}_t^+\big)_{t\in\mathbb{R}_+}$ . We assume that $\mathcal{R}_t$ is independent of $\mathcal{R}_t^+$ , for all $t\in\mathbb{R}_+$ .
We consider an $\mathbb{R}^{d}$ -valued continuous-time stochastic process $(W_{t})_{t\in\mathbb{R}_+}$ which is progressively measurable (i.e. $W\,:\,[0,t]\times\Omega\to\mathbb{R}^d$ is $\mathcal{B}([0,t])\otimes\mathcal{R}_t$ -measurable for all $t\in\mathbb{R}_+$ ).
From now on we assume that $W_{t}\in L^{1}$ , $t\in\mathbb{R}_{+}$ . Fix $r\geq 1$ . We define the quantities
and set $\tilde{\Gamma}_r \,:\!=\, \sum_{\tau=0}^{\infty}\tilde{\gamma}_r(\tau)$ .
Now we recall a powerful maximal inequality, Theorem B.3 of [Reference Barkhagen1].
Theorem 4.1. Let $(W_{t})_{t \in \mathbb{R}_{+}}$ be $L^{r}$ -bounded for some $r> 2$ and let $\tilde{M}_{r}+\tilde{\Gamma}_{r}<\infty$ almost surely. Assume $\mathbb{E}[{W_{t}}\vert{\mathcal{R}_{0}}]=0$ almost surely for $t\in\mathbb{R}_+$ . Let $f\,:\,[0,T]\to\mathbb{R}$ be $\mathcal{B}([0,T])$ -measurable with $\int_{0}^{T}f_{t}^{2}\, d t<\infty$ . Then there is a constant C′(r) such that
almost surely.
We also recall Lemma A.1 of [Reference Chau, Kumar, Rásonyi and Sabanis4].
Lemma 4.1. Let $\mathcal{G},\mathcal{H}\subset\mathcal{F}$ be sigma-algebras. Let $X,Y\in\mathbb{R}^{d}$ be random variables in $L^p$ such that Y is measurable with respect to $\mathcal{H}\vee\mathcal{G}$ . Then for any $p\ge 1$ ,
Lemma 4.2. Let the process $W_{t}$ be defined by (12). Taking the filtration $\mathcal{R}_{t}\,:\!=\,\mathcal{F}_{t+\lceil n^{\mu}\rceil}$ and $\mathcal{R}^+_t\,:\!=\,\mathcal{F}_{t+\lceil n^{\mu}\rceil}^+$ , we get $\tilde{M}_{r}+\tilde{\Gamma}_{r}\leq C\big(1+D^{2}\big)$ for some $C>0$ .
Proof. Estimations of Lemma 3.6 with $R=D$ and with $\overline{y}_{t+\lceil n^{\mu}\rceil}$ instead of $\theta$ imply the statement.
5. Numerical experiments
In what follows we present numerical results to check the convergence of the algorithm for a simple discontinuous function J, defined as
where X is a square-integrable, absolutely continuous random variable. Clearly, this function is not continuous in the parameter, but its expectation is continuous:
where $f({\cdot})$ and $F({\cdot})$ are respectively the density function and the distribution function of X. Assuming that F is differentiable, we need to solve
in order to find $\textrm{arg}\,\textrm{min}\ \mathbb{E} J(X,\theta)$ .
For the numerical examples we will use the recursion
To compute the expected error, Monte Carlo simulations were used with 10000 sample paths and the number of steps k ranging from $2^{8}$ to $2^{20}$ . We fit regression on the log–log plot to get the convergence rate only on $\big[2^{13},2^{20}\big]$ and set $k_0=10000$ to avoid the initial fluctuations of the algorithm.
5.1. Independent innovations
In this section we assume that the consecutive ‘measurement noises’ $X_{n}$ are i.i.d. We consider three different choices for the distribution of the noise: the normal, uniform, and beta distributions. Note that the normal distribution violates boundedness, and for the uniform distribution the differentiability of F fails; however, convergence is achieved even in these cases. We also distinguish between the case where the observations $X_{k+1}$ and $X^{\prime}_{k+1}$ are the same and the case where they are independent. Here we refer back to Remark 2.1, where we point out that this choice does not influence our theoretical results, although it may make a visible difference numerically. This phenomenon has already been observed; see [Reference Glasserman and Yao12] for more about the variance reduction technique called common random numbers (CRN). The values in Table 1 below represent the slope of the linear regression we fit on the log–log plot of the average absolute error versus the number of steps, together with the R-squared value measuring the goodness of the fit.
The lower limit that we theoretically achieved for the convergence rate in Theorem 2.1 was ${-0.2}$ ; however, the numerical experiments we present show that the practical convergence rate can outperform this.
5.1.1. Standard normal distribution.
Assume that $X\sim N(0, 1)$ . Then the function whose minimum we aim to find is $U_1(\theta) = 1+\theta^2+\Phi(\theta),$ where $\Phi$ denotes the cumulative distribution function of the standard normal distribution. We get the solution $\theta^{*} =-\sqrt{W\left(\frac{1}{8\pi}\right)}\approx -0.19569,$ where W is the Lambert W function.
Figure 1 illustrates the convergence of two variations of the algorithm (20) for $U_1$ , starting the iteration from $\theta_0=-0.1$ . In Figure (1a) we present the case where $X_{k+1}$ and $X^{\prime}_{k+1}$ are independent on a log–log plot; we observe a convergence rate of $k^{-0.299}$ . Figure (1b) shows the case where $X_{k+1}=X^{\prime}_{k+1}$ , which yields a convergence rate of $k^{-0.459}$ .
5.1.2. Uniform distribution on $[0, 1]$ .
Let $X\sim U([0,1])$ . Then the function whose minimum we aim to find is $U_2(\theta) = 1/3-\theta+\theta^2+F_{uni}(\theta),$ where $F_{uni}$ denotes the cumulative distribution function of the Uniform([0, 1]) distribution. We get the solution $\theta^{*} =0.$
Figure 2 illustrates the convergence of two variations of the algorithm (20) for $U_2$ , starting the iteration from $\theta_0=1$ . In Figure (2a) we present the case where $X_{k+1}$ and $X^{\prime}_{k+1}$ are independent on a log–log plot, while (2b) shows the case where $X_{k+1}=X^{\prime}_{k+1}$ . Both cases yield a convergence rate of $k^{-0.14}$ , which is worse than the theoretical rate $k^{-0.2}$ .
5.1.3. Beta(2, 2) distribution.
Let $X\sim Beta(2,2)$ . Then the function whose minimum we aim to find is $U_3(\theta) = 0.3-\theta+\theta^2+F_{\beta}(\theta),$ where $F_{\beta}$ denotes the cumulative distribution function of the Beta(2,2) distribution. We get the solution $\theta^{*} =\frac{2-\sqrt{2.5}}{3}\approx 0.13962.$
Figure 3 illustrates the convergence of two variations of the algorithm (20) for $U_3$ , starting the iteration from $\theta_0=1$ . In Figure (3a) we present the case where $X_{k+1}$ and $X^{\prime}_{k+1}$ are independent on a log–log plot; we observe a convergence rate of $k^{-0.374}$ . Figure (3b) shows the case where $X_{k+1}=X^{\prime}_{k+1}$ , which yields a convergence rate of $k^{-0.393}$ .
5.2. AR(1) innovations
For an example with non-i.i.d. $X_{t}$ , assume that the ‘noise’ is an AR(1) process defined as
where $\varepsilon_t$ is standard normal for $t\in\mathbb{Z}$ and $\vert\kappa\vert<1$ . Clearly, $Y_{t}= \sum_{k=0}^\infty \kappa ^k\varepsilon_{t-k},$ and therefore $Y_t\sim N\!\left(0, \frac{1}{1-\kappa^2}\right)$ . For the sequences $X_t$ and $X^{\prime}_t$ we have two options: either we take consecutive measurements, i.e. $X_{k}=Y_{2k-1}$ and $X^{\prime}_{k}=Y_{2k}$ , or we use identical values, i.e. $X_{k}=X^{\prime}_{k}=Y_k$ . In both cases,
Solving this for $\kappa=0.75$ , we get the optimal value $\theta^{*}\approx -0.13144.$
Figure 4 and Table 2 illustrate the convergence rate of the algorithm (20) for the function $U_4$ , starting from $\theta_0=0$ . In Figure (4a) we present the rate in the case where we take consecutive measurements of the AR(1) process ( $X_{k}=Y_{2k-1}$ and $X^{\prime}_{k}=Y_{2k}$ ); the convergence rate of $k^{-0.333}$ is observed. Figure (4b) shows the case where the two measurements are the same $\big(X_{k}=X^{\prime}_{k}=Y_k\big)$ , with the rate $k^{-0.487}$ .
6. Application to mathematical finance
The price of a financial asset either follows a trend during a given period of time or just rambles around its ‘fair’ price value—at least, so it seems to many actual traders. This ‘rambling’, in more mathematical terms, means that the price is reverting to its long-term average. Such a mean-reversion phenomenon can be exploited by ‘buying low, selling high’-type strategies. Discussions on this topic involve plenty of common-sense advice and benevolent concrete suggestions; see e.g. [Reference TraderGav.com22–Reference Trading24]. There also exist theoretical studies about optimal trading with such prices; see e.g. [Reference Guasoni, Tolomeo and Wang13]. However, a rigorous approach to adaptive trading algorithms of this type is lacking.
The results of the present paper provide theoretical convergence guarantees for such algorithms which cannot be deduced from the existing literature on stochastic approximation. The most conspicuous feature of mean-reversion strategies is that they are triggered when the price reaches a certain level. This means that their payoffs are discontinuous with respect to the parameters; gradients do not exist, and only finite-difference approximations can be used (the Kiefer–Wolfowitz method). Their convergence in the given discontinuous case cannot be shown based on available results; hence we fill an important and practically relevant gap here.
Below, we describe a trading model in some detail and explain how it fits into the framework used in the previous sections. Let the price of the observed financial asset be described by a real-valued stochastic process $S_{t}$ , $t\in\mathbb{Z}$ , adapted to a given filtration $\mathcal{F}_{t}$ , $t\in\mathbb{Z}$ , representing the flow of information. (Alternatively, $S_{t}$ may be the increment of the price at t, which can safely be assumed to follow a stationary process.)
Our algorithm will be based on several dynamically updated estimators which are assumed to be functionals of the trajectories of $S_{t}$ and possibly of another adapted process $F_{t}$ describing important economic factors. The estimate for the long-term average of the process is denoted by $A_{t}(\theta)$ at time t. The upper and lower bandwith processes will be denoted by $B_{t}^{+}(\theta)$ and $B_{t}^{-}(\theta)$ ; they are non-negative. All these estimates depend on a parameter $\theta$ to be tuned, where $\theta$ ranges over a subset Q of $\mathbb{R}^{d}$ .
In practice, $A_{t}(\theta)$ is some moving average (or exponential moving average) of previous values of S, but it may depend on the other indicators F (market indices, etc.). Here $\theta$ determines, for instance, the weights of the moving average estimate. The quantities $B_{t}^{\pm}(\theta)$ are normally based on standard deviation estimates for S but, again, may be more complex, with $\theta$ describing weighting of past information. If we peek from time t back to time $t-p$ with some $p\in\mathbb{N}$ , then $A_{t}(\theta),B^{\pm}_{t}(\theta)$ are functionals of $(S_{t-p},F_{t-p},\ldots,S_{t},F_{t})$ .
The price range $\big[A_{t}-B^{-}_{t},A_{t}+B^{+}_{t}\big]$ is considered to be ‘normal’ by the algorithm, while quitting that interval suggests ‘extremal’ behavior that the market should soon correct. For example, reaching the level $A_{t}-B^{-}_{t}$ means that the price is abnormally low for the present circumstances; hence it is worth buying a quantity $b(\theta)$ of the asset, where, again, the parameter $\theta$ should be optimally found. When the price returns to $A_{t^{\prime}}$ at some later time t ′, the asset will be sold and a profit realized. Similarly, when the price reaches $A_{t}+B^{+}_{t}$ , a quantity $s(\theta)$ of the asset is sold (the price being abnormally high), and it will be repurchased once the ‘normal’ level $A_{t^{\prime}}$ is reached at some future $t^{\prime}>t$ , with the aim of realizing a profit.
The value of the parameter $\theta$ will be updated at times tN, $t\in\mathbb{N}$ , where $N\geq 1$ is fixed. The (random) profit (or loss) resulting from trading on the interval $[N(t-1),Nt]$ is denoted by $u(\theta,X_{t})$ with $X_{t}=\big(S_{N(t-1)-p},F_{N(t-1)-p},\ldots,S_{Nt},F_{Nt}\big)$ . We could even write an explicit expression for u based on the description of the trading mechanism in the previous paragraph, but it would be very cumbersome without providing additional insight, so we omit it. We also add that, in many cases, a fee must also be paid at every transaction. Such strategies being ‘threshold-type’, the function u is generically a discontinuous function of $\theta$ .
We furthermore argue that one cannot smooth out u and make it continuous without losing essential features of the problem. At first sight it may look reasonable to approximate the indicator function of the interval $[0,\infty)$ by a function f which is 1 on $[0,\infty)$ , 0 on $({-}\infty,-\epsilon]$ for some small $\epsilon>0$ , and linear on $({-}\epsilon,0)$ , but in this way we get a Lipschitz approximation with a huge Lipschitz constant, hence with a poor convergence rate! This is just to stress that although such simple tricks might work in certain practical situations, they only obscure the real issues in the theoretical analysis (namely, there is a discontinuity to be handled).
The algorithm described above is very close to what actual investors do; see [Reference TraderGav.com22, Reference Trading Strategy Guides23, Reference Trading24]. We also mention the related theoretical studies [Reference Cartea, Jaimungal and Penalva3, Reference Leung and Li18], which, however, do not take an adaptive view and calculate optimal strategies for concrete models.
Taking a more realistic, adaptive approach, the investor may seek to maximize $\mathbb{E}u(\theta,X_{0})$ by dynamically updating $\theta$ at every instant tN, $t\in\mathbb{N}$ . Our versions of the Kiefer–Wolfowitz algorithm, presented in the previous sections, are tailor-made for such online optimization, both the decreasing- and the fixed-gain version, depending on the circumstances. Theorems 2.1 and 2.2 provide solid theoretical convergence guarantees for such procedures.
Acknowledgements
Part of this research was performed while the authors were participating in the Simons Semester on Stochastic Modeling and Control at Banach Center, Warsaw, in 2019. During the preparation of this paper the second author attended the PhD school of Central European University, Budapest, and was also affiliated with Eötvös Loránd University, Budapest. We thank all these institutions for their hospitality.
Funding information
Both authors were supported by the Lendület grant LP 2015-6. The second author was also supported by Project No. ED 18-1-2019-0030 (Application-specific highly reliable IT solutions), which was implemented with support provided from the National Research, Development and Innovation Fund of Hungary, financed under the Thematic Excellence Programme funding scheme.
Competing interests
There were no competing interests to declare which arose during the preparation or publication process for this article.