
Optimal scaling of MCMC beyond Metropolis

Published online by Cambridge University Press:  16 December 2022

Sanket Agrawal*
Affiliation:
University of Warwick
Dootika Vats*
Affiliation:
Indian Institute of Technology Kanpur
Krzysztof Łatuszyński*
Affiliation:
University of Warwick
Gareth O. Roberts*
Affiliation:
University of Warwick
*
*Postal address: Coventry CV4 7AL, U.K.
**Email address: [email protected]

Abstract

The problem of optimally scaling the proposal distribution in a Markov chain Monte Carlo algorithm is critical to the quality of the generated samples. Much work has gone into obtaining such results for various Metropolis–Hastings (MH) algorithms. Recently, acceptance probabilities other than MH have been employed in problems with intractable target distributions, and few resources are available on tuning Gaussian proposal distributions in this setting. We obtain optimal scaling results for a general class of acceptance functions, which includes Barker’s and lazy MH. In particular, optimal values for Barker’s algorithm are derived and found to be significantly different from those obtained for the MH algorithm. Our theoretical conclusions are supported by numerical simulations indicating that when the optimal proposal variance is unknown, tuning to the optimal acceptance probability remains an effective strategy.

Type
Original Article
Copyright
© The Author(s), 2022. Published by Cambridge University Press on behalf of Applied Probability Trust

1. Introduction

Over the past few decades, Markov chain Monte Carlo (MCMC) methods have become an abundantly popular computational tool, enabling practitioners to conveniently sample from complicated target distributions [Reference Brooks, Gelman, Jones and Meng5, Reference Meyn and Tweedie21, Reference Robert and Casella26]. This popularity can be attributed to easy-to-implement accept–reject-based MCMC algorithms for target densities available only up to a proportionality constant. Here, draws from a proposal kernel are accepted with a certain acceptance probability. The choice of the acceptance probability and the proposal kernel can yield varying performances of the MCMC samplers.

Unarguably, the most popular acceptance probability is that of Metropolis–Hastings (MH) [Reference Hastings14, Reference Metropolis20], owing to its acknowledged optimality [Reference Billera and Diaconis4, Reference Peskun25]. Efficient implementation of the MH algorithm requires tuning within the chosen family of proposal kernels. For the MH acceptance function, various optimal scaling results have been obtained under assumptions on the proposal and the target distribution. These include the works of [Reference Bédard3, Reference Neal and Roberts24, Reference Roberts, Gelman and Gilks27, Reference Roberts and Rosenthal28, Reference Roberts and Rosenthal29, Reference Sherlock and Roberts34, Reference Yang, Roberts and Rosenthal40, Reference Zanella, Bédard and Kendall41], among others.

Despite the popularity of the MH acceptance function, other acceptance probabilities remain practically and theoretically relevant. Recently, Barker’s acceptance rule [Reference Barker2] and lazy MH [Reference Łatuszyński and Roberts18] have found use in Bernoulli-factory-based MCMC algorithms for intractable posteriors [Reference Gonçalves, Łatuszyński and Roberts12, Reference Gonçalves, Łatuszyński and Roberts13, Reference Herbei and Berliner15, Reference Smith37, Reference Vats, Gonçalves, Łatuszyński and Roberts39]. Barker’s acceptance function has also proven to be optimal with respect to search efficiency [Reference Menezes and Kabamba19], and it guarantees variance improvements for waste-recycled Monte Carlo estimators [Reference Delmas and Jourdain7]. Further, a class of acceptance probabilities from [Reference Bédard3] has been of independent theoretical interest. We also introduce a new family of generalized Barker’s acceptance probabilities and present a Bernoulli factory for use in problems with intractable posteriors.

To the best of our knowledge, there are no theoretical and practical guidelines concerning optimal scaling outside of MH and its variants (although see [Reference Sherlock, Thiery and Golightly35] for a discussion on delayed-acceptance MH and [Reference Doucet, Pitt, Deligiannidis and Kohn8, Reference Schmon, Deligiannidis, Doucet and Pitt32, Reference Sherlock, Thiery, Roberts and Rosenthal36] for analyses pertaining to pseudo-marginal MCMC). We obtain optimal scaling results for a large class of acceptance functions; Barker’s, lazy MH, and MH are members of this class.

We restrict our attention to the framework of [Reference Roberts, Gelman and Gilks27] with a random-walk Gaussian proposal kernel and a d-dimensional decomposable target distribution. As for MH, our general class of acceptance functions requires the proposal variance to be scaled by $1/d$. We find that, typically, for acceptance functions that are pointwise smaller than MH, the optimal proposal variance is larger than that for MH, implying the need for larger jumps. For Barker’s acceptance rule, the asymptotically optimal acceptance rate (AOAR) is approximately $0.158$, in comparison to the $0.234$ rate for MH [Reference Roberts, Gelman and Gilks27]. Similar AOARs are presented for other acceptance functions.

In Section 2 we describe our class of acceptance probabilities, with the main results presented in Section 3. AOARs for Barker’s and other acceptance functions are obtained in Section 3.1. In Section 4 we present numerical results in some settings that comply with our assumptions and others that do not. A closing discussion of the scaling factor for different acceptance functions and of generalizations of our results is provided in the final section. All proofs are in the appendices.

2. Class of acceptance functions

Let $\boldsymbol{\pi}$ be the target distribution, with corresponding Lebesgue density $\pi$ and support $\mathcal{X}$ , so that an MCMC algorithm aims to generate a $\boldsymbol{\pi}$ -ergodic Markov chain, $\{X_n\}$ . Let Q be a Markov kernel with an associated Lebesgue density $q(x, \cdot)$ for each $x \in \mathcal{X}$ . We assume throughout that q is symmetric. Furthermore, let the acceptance probability function be $\alpha(x, y)\,:\, \mathcal{X} \times \mathcal{X} \to [0, 1]$ . Starting from an $X_0 \in \mathcal{X}$ , at the nth step, a typical accept–reject MCMC algorithm proposes $y \sim q(X_{n-1}, \cdot)$ . The proposed value is accepted with probability $\alpha(X_{n-1}, y)$ , in which case $X_n = y$ , and rejected otherwise, in which case $X_n = X_{n-1}$ . The acceptance function $\alpha$ is responsible for guaranteeing $\boldsymbol{\pi}$ -reversibility and thus $\boldsymbol{\pi}$ -invariance of the Markov chain. Let $a \wedge b$ denote $\min\!(a, b)$ and $s(x,y) = \pi(y) / \pi(x)$ . We define $\mathcal{A}$ , the class of acceptance functions for which our optimal scaling results will hold, as follows.

Definition 1. Each $\alpha \in \mathcal{A}$ is a map $\alpha(x,y)\,:\, \mathcal{X} \times \mathcal{X} \to [0, 1]$ , and for every $\alpha \in \mathcal{A}$ there exists a balancing function $g_{\alpha}\,:\, [0, \infty) \to [0, 1]$ such that

(1) \begin{align}& \alpha(x, y) = g_{\alpha}(s(x, y)), \ \ x, y \in \mathcal{X}, \end{align}
(2) \begin{align}& g_{\alpha}(z) = zg_{\alpha}\!\left( \dfrac{1}{z}\right), \ \ 0 \le z < \infty, \end{align}
(3) \begin{align}& g_{\alpha}(e^z), \ z \in \mathbb{R}, \text{ is Lipschitz continuous.}\end{align}

Properties (1) and (2) are standard and easy to verify, with (1) ensuring that intractable constants in $\pi$ cancel away and (2) ensuring $\boldsymbol{\pi}$ -reversibility. Property (3) is not required for $\alpha$ to be a valid acceptance function; however, we need it for our optimal scaling results (to establish Lemma 4), and it holds true for all common acceptance probabilities. Moreover, each $\alpha \in \mathcal{A}$ can be identified by the corresponding $g_{\alpha}$ , and we will use $\alpha$ and $g_{\alpha}$ interchangeably. If $g_{\text{MH}}$ denotes the balancing function for the MH acceptance function $(\alpha_{\text{MH}})$ , then

(4) \begin{equation}g_{\text{MH}}(z) = 1 \wedge z, \ \ \ z \ge 0.\end{equation}

It is easy to see that $\alpha_{\text{MH}} \in \mathcal{A}$ . The lazy MH $(\alpha_{\text{L}})$ acceptance of [Reference Herbei and Berliner15, Reference Łatuszyński and Roberts18] also belongs to $\mathcal{A}$ . For a fixed $\varepsilon \in [0, 1]$ , it is defined using

(5) \begin{equation}g_{\text{L}}(z) = (1 - \varepsilon)(1 \wedge z), \ \ \ z \ge 0\,.\end{equation}

Barker’s acceptance function is $\alpha_{\text{B}}(x, y) = g_{\text{B}}(s(x, y))$ for all $x, y \in \mathcal{X}$ , where

(6) \begin{equation}g_{\text{B}}(z) = \dfrac{z}{1+z}, \ \ \ z \ge 0.\end{equation}

Then (2) follows immediately. For differentiable functions, Property (3), i.e. Lipschitz continuity of $g_{\alpha}(e^z)$ , can be verified by bounding the first derivative. In particular, the derivative of $g_{\text{B}}(e^z)$ , given by $e^z/(1 + e^z)^2$ , is bounded by $1/4$ for all $z \in \mathbb{R}$ , and hence $\alpha_{\text{B}} \in \mathcal{A}$ . From [Reference Peskun25], it is well known that in the context of Monte Carlo variability of ergodic averages, MH is superior to Barker’s. Even so, Barker’s acceptance function has had a recent resurgence, aided by its use in Bernoulli-factory MCMC algorithms for Bayesian intractable posteriors where MH algorithms are not implementable.
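For concreteness, the following minimal sketch (in Python; not part of the original article) shows how a symmetric random-walk sampler takes the balancing function $g_{\alpha}$ as a plug-in argument, with the ratio $s(x,y)$ handled on the log scale for numerical stability; the target and step size used in the example are illustrative placeholders.

```python
import numpy as np
from scipy.special import expit

def g_mh(log_s):
    # g_MH(z) = min(1, z), evaluated at z = exp(log_s) without overflow
    return np.exp(np.minimum(log_s, 0.0))

def g_barker(log_s):
    # g_B(z) = z / (1 + z) = 1 / (1 + exp(-log_s))
    return expit(log_s)

def random_walk(log_pi, g, x0, sigma, n_iter, rng=None):
    """Symmetric Gaussian random walk; a proposal y is accepted with
    probability g(log pi(y) - log pi(x))."""
    rng = np.random.default_rng() if rng is None else rng
    x = np.asarray(x0, dtype=float)
    chain = np.empty((n_iter, x.size))
    for n in range(n_iter):
        y = x + sigma * rng.standard_normal(x.size)
        log_s = log_pi(y) - log_pi(x)          # log of the target ratio s(x, y)
        if rng.uniform() < g(log_s):
            x = y
        chain[n] = x
    return chain

# Illustrative use: standard normal target in 10 dimensions with Barker's rule.
chain = random_walk(lambda x: -0.5 * np.sum(x**2), g_barker,
                    np.zeros(10), sigma=0.8, n_iter=5000)
```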

We present a generalization of (6): for $r \geq 1$ define

\begin{equation*}g_r^{\text{R}}(z) = \begin{cases}\displaystyle \frac{z(z^r - 1)}{z^{r+1} - 1}, & z \neq 1, \\[9pt]\displaystyle \frac{r}{r + 1}, & z = 1\,.\end{cases}\end{equation*}

For $r \in \mathbb{N}$ , the above can be rewritten as

(7) \begin{equation}g_r^{\text{R}}(z) = \frac{z + \dots + z^r}{1 + z + \dots + z^r}, \ \quad z \ge 0, \ r \in \mathbb{N}.\end{equation}

If $\alpha_r^{\text{R}}$ is the associated acceptance function, then $\alpha_r^{\text{R}} \in \mathcal{A}$ for all $r \geq 1$ . Moreover, $g_1^{\text{R}} \equiv g_{\text{B}}$ and $g_r^{\text{R}} \uparrow g_{\text{MH}}$ as $r \to \infty$ . For $r \in \mathbb{N}$ , we present a natural Bernoulli factory in the spirit of [Reference Gonçalves, Łatuszyński and Roberts13] that generates events of probability $\alpha_r^{\text{R}}$ without explicitly evaluating it; see Appendix D. An alternative approach would be to follow the general sampling algorithm of [Reference Morina, Łatuszyński, Nayar and Wendland23] for rational functions.
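As an illustration (a sketch, not from the article), $g_r^{\text{R}}$ can be evaluated stably on the log scale; numerically one sees that $r = 1$ recovers Barker’s rule and that the values increase towards $g_{\text{MH}}$ as r grows.

```python
import numpy as np

def g_r(log_s, r):
    """Generalized Barker balancing function g_r^R(z) at z = exp(log_s),
    computed as (z + ... + z^r) / (1 + z + ... + z^r) via log-sum-exp."""
    powers = np.arange(r + 1) * log_s            # log of 1, z, ..., z^r
    num = np.logaddexp.reduce(powers[1:])        # log(z + ... + z^r)
    den = np.logaddexp.reduce(powers)            # log(1 + z + ... + z^r)
    return np.exp(num - den)

# At z = 2.5: r = 1 gives Barker's value 2.5/3.5, and g_r^R increases towards g_MH(2.5) = 1.
for r in (1, 2, 5, 50):
    print(r, g_r(np.log(2.5), r))
```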

Let $\boldsymbol{\Phi}({\cdot})$ be the standard normal distribution function. For a theoretical exposition, [Reference Bédard3] defines the following acceptance probability for some $h > 0$ :

(8) \begin{equation}g_h^{\text{H}}(z) = \boldsymbol{\Phi} \!\left( \frac{\log z - h/2}{\sqrt{h}}\right) + z \cdot \boldsymbol{\Phi} \!\left( \frac{-\log z - h/2}{\sqrt{h}}\right), \ \ \ z \ge 0.\end{equation}

For each $h > 0$ , $\alpha_h^{\text{H}} \in \mathcal{A}$ , and observe that as $h \to 0$ , $g_h^{\text{H}} \to g_{\text{MH}}$ , while as $h \to \infty$ , $g_h^{\text{H}} \to 0$ ; i.e. the chain never moves. Similar examples can be constructed by considering other well-behaved distribution functions in place of $\boldsymbol{\Phi}$ . Lastly, it is easy to see that $\mathcal{A}$ is convex. Thus, it also includes situations when each update of the algorithm randomly chooses an acceptance probability. Moreover, as evidenced in (5), $\mathcal{A}$ is also closed under scalar multiplication as long as the resulting function lies in [0, 1].
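The rule in (8) is a direct transcription of the normal distribution function; the short sketch below (assuming scipy; not part of the article) also checks the reversibility property (2), $g(z) = z\,g(1/z)$, at a few points.

```python
import numpy as np
from scipy.stats import norm

def g_h(z, h):
    """Balancing function (8); tends to min(1, z) as h -> 0 and to 0 as h -> infinity."""
    return (norm.cdf((np.log(z) - h / 2) / np.sqrt(h))
            + z * norm.cdf((-np.log(z) - h / 2) / np.sqrt(h)))

# Numerical check of property (2): g(z) = z * g(1/z).
for z in (0.3, 1.0, 4.0):
    assert np.isclose(g_h(z, h=1.913), z * g_h(1.0 / z, h=1.913))
```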

3. Main theorem

Let f be a 1-dimensional density function and consider a sequence of target distributions $\{ \boldsymbol{\pi}_d \}$ such that for each d, the joint density is

\begin{equation*}\pi_d\big(\boldsymbol{{x}}^d\big) = \prod_{i=1}^df\big(x_i^d\big), \ \ \ \ \ \boldsymbol{{x}}^d = \big(x_1^d, \dots, x_d^d\big)^T\in \mathbb{R}^d.\end{equation*}

Assumption 1. The density f is positive and in $C^2$ —the class of all real-valued functions with continuous second-order derivatives. Furthermore, $f^{\prime}/f$ is Lipschitz, and the following moment conditions hold:

(9) \begin{equation}\mathbb{E}_f\!\left[ \left(\frac{f^{\prime}(X)}{f(X)}\right)^8\right] < \infty, \qquad \mathbb{E}_f\!\left[ \left(\frac{f^{\prime\prime}(X)}{f(X)}\right)^4\right] < \infty.\end{equation}

Consider the sequence of Gaussian proposal kernels $\big\{ Q_d\big(\boldsymbol{{x}}^d, \cdot\big) \big\}$ with associated density sequence $\{q_d\}$ , so that $Q_d\big(\boldsymbol{{x}}^d, \cdot\big) = N\big(\boldsymbol{{x}}^d, \sigma^2_d\textbf{I}_d\big)$ , where for some constant $l \in \mathbb{R}^{+}$ ,

\begin{equation*}\sigma^2_d = l^2/(d-1)\,.\end{equation*}

The proposal $Q_d$ is used to generate a d-dimensional Markov chain, $\boldsymbol{{X}}^d = \big\{\boldsymbol{{X}}^d_n, n \ge 0\big\}$ , following the accept–reject mechanism with acceptance function $\alpha$ . Under these conditions and with $\alpha = \alpha_{\text{MH}}$ , [Reference Roberts, Gelman and Gilks27] established weak convergence to an appropriate Langevin diffusion for the sequence of 1-dimensional stochastic processes constructed from the first component of these Markov chains. Since the coordinates are independent and identically distributed (i.i.d.), this limit informs the limiting behaviour of the full Markov chain in high dimensions. In what follows, we extend the results of [Reference Roberts, Gelman and Gilks27] to the class of acceptance functions $\mathcal{A}$ as defined in Definition 1.

Let $\big\{\boldsymbol{{Z}}^d, d > 1\big\}$ be a sequence of processes constructed by speeding up the Markov chains by a factor of d as follows:

\begin{equation*}\boldsymbol{{Z}}^d_t = \boldsymbol{{X}}^d_{[dt]} = \Big(X^d_{[dt],1}, X^d_{[dt],2}, \dots, X^d_{[dt],d}\Big)^T; \ \ \ t > 0.\end{equation*}

Suppose $\big\{\eta_d\,:\, \mathbb{R}^d \to \mathbb{R}\big\}$ is a sequence of projection maps such that $\eta_d\big(\boldsymbol{{x}}^d\big) = x_1^d$ . Define a new sequence of 1-dimensional processes $\big\{U^d, d > 1\big\}$ as follows:

\begin{equation*}U^d_t \,:\!=\, \eta_d \circ \boldsymbol{{Z}}^d_t = {X}^d_{[dt], 1}; \ \ \ t > 0.\end{equation*}

Under stationarity, we show that $\big\{U^d, d > 1\big\}$ weakly converges in the Skorokhod topology [Reference Ethier and Kurtz10] to a Markovian limit U. We denote weak convergence of processes in the Skorokhod topology by ‘ $\Rightarrow$ ’ and standard Brownian motion at time t by $B_t$ . The proofs are in the appendices.

Theorem 1. Let $\big\{\boldsymbol{{X}}^d, d \ge 1\big\}$ be the sequence of $\boldsymbol{\pi}_d$ -invariant Markov chains constructed using the acceptance function $\alpha$ and proposal $Q_d$ such that $\boldsymbol{{X}}^d_0 \sim \boldsymbol{\pi}_d$ . Further, suppose $\alpha \in \mathcal{A}$ and $\boldsymbol{\pi}_d$ satisfies Assumption 1. Then $U^d \Rightarrow U$ , where U is a diffusion process that satisfies the Langevin stochastic differential equation,

\begin{equation*}dU_t = (h_{\alpha}(l))^{1/2}dB_t + h_{\alpha}(l)\frac{f^{\prime}(U_t)}{2f(U_t)}dt,\end{equation*}

with $h_{\alpha}(l) = l^2 M_{\alpha}(l)$ , where

(10) \begin{equation}M_{\alpha}(l) = \int_{\mathbb{R}} g_{\alpha}\big(e^b\big) \frac{1}{\sqrt{2\pi l^2I}} \exp\!\left \{ \frac{-(b + l^2I/2)^2}{2l^2I}\right\} db\end{equation}

and

\begin{equation*}I = \mathbb{E}_{f}\!\left[ \left(\frac{f^{\prime}(X)}{f(X)}\right)^2 \right].\end{equation*}

Remark 1. Since $\alpha_{\text{MH}} \in \mathcal{A}$ , our result aligns with [Reference Roberts, Gelman and Gilks27], because

\begin{equation*}M_{\text{MH}} (l) = \int_{\mathbb{R}} g_{\text{MH}}\big(e^b\big) \frac{1}{\sqrt{2\pi l^2I}} \exp\!\left \{ \frac{-(b + l^2I/2)^2}{2l^2I}\right\} db = 2 \Phi\!\left({-}\frac{l\sqrt{I}}{2} \right)\,.\end{equation*}
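The integral in (10) is one-dimensional and straightforward to evaluate numerically; the sketch below (assuming scipy, with I treated as a user-supplied constant) computes $M_{\alpha}(l)$ by quadrature and checks the closed form displayed above for MH.

```python
import numpy as np
from scipy import integrate
from scipy.stats import norm

def M(g, l, I=1.0):
    """Asymptotic acceptance rate M_alpha(l) from (10): E[g(exp(B))] with
    B ~ N(-l^2 I / 2, l^2 I), computed by one-dimensional quadrature."""
    mean, sd = -l**2 * I / 2.0, l * np.sqrt(I)
    integrand = lambda b: g(np.exp(b)) * norm.pdf(b, loc=mean, scale=sd)
    return integrate.quad(integrand, mean - 10 * sd, mean + 10 * sd)[0]

g_mh = lambda z: np.minimum(1.0, z)
l = 2.38
print(M(g_mh, l), 2 * norm.cdf(-l / 2))   # the two values agree (about 0.234)
```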

Remark 2. For symmetric proposals, Definition 1 requires $\alpha$ to be a function of only the ratio of the target densities at the two contested points. Thus, the result is not applicable to acceptances in [Reference Banterle, Grazian, Lee and Robert1, Reference Mira22, Reference Vats, Gonçalves, Łatuszyński and Roberts39].

In Theorem 1, $h_{\alpha}(l)$ is the speed measure of the limiting diffusion process and so the optimal choice of l is $l^*$ such that

\begin{equation*}l^* = \underset{l}{\text{arg} \, \text{max}}\, h_{\alpha}(l).\end{equation*}

Denote the average acceptance probability by

\begin{equation*}\alpha_{d}(l) \,:\!=\, \mathbb{E}_{\boldsymbol{\pi}_d, Q_d}\! \left[ \alpha\big(\boldsymbol{{X}}^d, \boldsymbol{{Y}}^d\big) \right] = \int \int \pi_d\big(\boldsymbol{{x}}^d\big) \ \alpha\big(\boldsymbol{{x}}^d, \boldsymbol{{y}}^d\big) \ q_d\big(\boldsymbol{{x}}^d, \boldsymbol{{y}}^d\big) \ d\boldsymbol{{x}}^d \ d\boldsymbol{{y}}^d\,,\end{equation*}

and the asymptotic acceptance probability by $\alpha(l) \,:\!=\, \lim_{d \to \infty} \alpha_{d}(l)$ . The dependence on l is through the variance of the proposal kernel. We then have the following corollary.

Corollary 1. Under the setting of Theorem 1, we obtain $\alpha(l) = M_{\alpha}(l)$ , and the asymptotically optimal acceptance probability is $M_{\alpha}(l^*)$ .

Corollary 1 is of considerable practical relevance, since for different acceptance functions it yields the optimal target acceptance probability to tune to.
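In practice this means adjusting the proposal scale until the observed acceptance rate is close to $M_{\alpha}(l^*)$. A minimal sketch of such a tuning loop is given below, using a simple stochastic-approximation update of $\log \sigma$ during an adaptation phase; the update rule and step-size schedule are illustrative assumptions, not part of the article.

```python
import numpy as np

def tune_scale(log_pi, g, x0, target_rate, n_adapt=20000, rng=None):
    """Adapt the proposal scale so that the empirical acceptance rate approaches
    target_rate (e.g. 0.234 for MH, 0.158 for Barker's rule)."""
    rng = np.random.default_rng() if rng is None else rng
    x = np.asarray(x0, dtype=float)
    log_sigma = 0.0
    for n in range(1, n_adapt + 1):
        y = x + np.exp(log_sigma) * rng.standard_normal(x.size)
        acc = g(log_pi(y) - log_pi(x))          # acceptance probability g(s(x, y))
        if rng.uniform() < acc:
            x = y
        # Robbins-Monro step: increase sigma if accepting too often, decrease otherwise.
        log_sigma += (acc - target_rate) / n**0.6
    return np.exp(log_sigma)

# Illustrative use with Barker's rule on a 20-dimensional standard normal target.
sigma = tune_scale(lambda x: -0.5 * np.sum(x**2),
                   lambda ls: 1.0 / (1.0 + np.exp(-ls)),
                   np.zeros(20), target_rate=0.158)
```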

3.1. Optimal results for some acceptance functions

In Section 2, we discussed some important members of the class $\mathcal{A}$ . Corollary 1 can then be used to obtain the AOAR for them by maximizing the speed measure of the limiting diffusion process. For Barker’s algorithm, from Theorem 1 and (6), the speed measure $h_{\text{B}}(l)$ of the corresponding limiting process is $h_{\text{B}}(l) = l^2M_{\text{B}}(l)$ , where

\begin{equation*}M_{\text{B}}(l) = \int_{\mathbb{R}} \frac{{}1}{1 + e^{-b}} \frac{1}{\sqrt{2\pi l^2I}} \exp\!\left \{ \frac{-(b + l^2I/2)^2}{2l^2I}\right\} db.\end{equation*}

Maximizing $h_{\text{B}}(l)$ , the optimal value $l^*$ is approximately (see Appendix C)

\begin{equation*}l^* = \frac{2.46}{\sqrt{I}}\,.\end{equation*}

By Corollary 1, using this $l^*$ yields an asymptotic acceptance rate of approximately $0.158$ . Hence, when the optimal variance is not analytically tractable in high dimensions, one may consider tuning the algorithm so as to achieve an acceptance probability of approximately $0.158$ . Additionally, the right plot in Figure 1 verifies that the relative efficiency of Barker’s versus MH, as measured by the ratio of their respective speed measures for a fixed l, remains above $0.5$ [Reference Łatuszyński and Roberts18, Theorem 4]; this relative efficiency increases as l increases. The ratio of the speed measures of Barker’s versus MH at their respective optimal scaling is $0.72$ . This quantifies the loss in efficiency in running the best version of Barker’s compared to the best version of the MH algorithm. We can also study the respective speed measures as a function of the acceptance rate; this is given in the left plot in Figure 1. We find that as the asymptotic acceptance rate increases, the speed measure for Barker’s decreases more rapidly than that of MH. This suggests that there is much to gain by appropriately tuning Barker’s algorithm.

Figure 1. Efficiency (h(l)) versus acceptance rate $(\alpha(l))$ with $I = 1$ (left). Relative efficiency of Barker’s versus MH $(h_{\text{B}}(l)/h_{\text{MH}}(l) )$ , plotted against l (right).

For lower dimensions, the optimal acceptance rate is higher than the AOAR. Figure 2 shows optimal values for MH and Barker’s algorithms on isotropic Gaussian targets in dimensions 1 to 10, the proposal kernel being the same as in the setting of Theorem 1. This plot is produced using the criterion of minimizing first-order auto-correlations in each component [Reference Gelman, Roberts and Gilks11, Reference Roberts and Rosenthal28, Reference Roberts and Rosenthal29]. For $\alpha_{\text{MH}}$ and $\alpha_{\text{B}}$ , the optimal acceptance rates in one dimension are $0.43$ and $0.27$ respectively.

Figure 2. Optimal acceptance rate against number of dimensions.

For lazy MH with $\varepsilon \in [0, 1]$ , Corollary 1 implies that the AOAR of the algorithm is $(1 - \varepsilon)0.234$ with the same optimal $l^*$ as MH. For the acceptance functions $\alpha_h^{\text{H}}$ in (8),

\begin{equation*}M_h(l) = 2\boldsymbol{\Phi} \!\left({-}\frac{\sqrt{h + l^2I}}{2} \right)\,.\end{equation*}

With $h = 0$ , we obtain the result of [Reference Roberts, Gelman and Gilks27] for MH. Further, the left panel of Figure 3 highlights that as $h \to 0$ , the AOAR increases to $0.234$ and the algorithm worsens as h increases. Moreover, for $h \approx 1.913$ , the AOAR is roughly $0.158$ , i.e. equivalent to Barker’s acceptance function.

Figure 3. Optimal acceptance rates for $\alpha^{\text{H}}_h$ against h (left) and $\alpha^{\text{R}}_r$ against r (right).

Lastly, the AOARs for $\alpha_r^{\text{R}}$ in (7) are available. For $r = 1, \dots, 10$ , the results have been plotted in the right plot of Figure 3. As anticipated, the AOAR approaches $0.234$ as r increases. Notice that $\alpha_2^{\text{R}}$ yields an AOAR of $0.197$ , which is a considerable increase from $\alpha_{\text{B}} = \alpha_1^{\text{R}}$ . Table 1 below summarizes the results of this section. (Code for all plots and tables is available at https://github.com/Sanket-Ag/BarkerScaling.)

Table 1. Optimal proposal variance and asymptotic acceptance rates.

4. Numerical results

We study the estimation quality for different expectations as a function of the proposal variance (acceptance rate) for the generalized Barker acceptance function, $\alpha_r^{\text{R}}$ . We focus on $r = 1$ (Barker’s algorithm) and $r = 2$ . Suppose $f\,:\,\mathbb{R}^d \to \mathbb{R}$ is the function whose expectation with respect to $\boldsymbol{\pi}_d$ is of interest. Let $\{f(\boldsymbol{{X}}_n)\}$ be the mapped process. Similarly to [Reference Roberts and Rosenthal29], we assess the choice of proposal variance by the convergence time:

\begin{equation*}\text{convergence time } \,:\!=\, \frac{-k}{\log\! (\rho_k)}\,,\end{equation*}

where $\rho_k$ is the lag-k autocorrelation of $\{f(\boldsymbol{{X}}_n)\}$. In each of the following simulations, the convergence time is estimated with $k = 1$ by averaging over $10^3$ replications of Markov chains, each of length $10^6$. We consider a range of values of l, where l is such that $\sigma^2_d = l^2/d$ in a Gaussian proposal kernel $Q_d\big(\boldsymbol{{x}}^d, \cdot\big) = N\big(\boldsymbol{{x}}^d, \sigma^2_d \textbf{I}_d\big)$.

Consider first the case of an isotropic target, $\boldsymbol{\pi}_{d} = N_{d}(\boldsymbol{0}, \textbf{I}_{d})$, with isotropic Gaussian proposals; the conditions of Theorem 1 are satisfied. The estimated convergence times for $f(\boldsymbol{{x}}) = x_1$ and $f(\boldsymbol{{x}}) = \bar{{x}}$, where $\bar{{x}}$ is the mean of the components $x_1, \dots, x_d$, are plotted in Figure 4 (top row); here $d = 50$. For both functions of interest, the optimal performance, i.e. the minimum convergence time, corresponds to an acceptance rate of approximately $0.158$ for $\alpha_{\text{B}}$ and $0.197$ for $\alpha_2^{\text{R}}$; the slight overestimation is due to the finite-dimensional setting.

Next, we consider $\boldsymbol{\pi}_d = N_d(\boldsymbol{0}, \boldsymbol{\Sigma}_d)$, where $\boldsymbol{\Sigma}_d$ is a $d \times d$ matrix with 1 on its diagonal and all off-diagonal elements equal to some non-zero $\rho$. Here the assumptions of Theorem 1 are not satisfied. For such a target and for $\alpha_{\text{MH}}$, [Reference Roberts and Rosenthal29] showed that the rate of convergence of the algorithm is governed by the eigenvalues of $\boldsymbol{\Sigma}_d$. In particular, the eigenvalues of $\boldsymbol{\Sigma}_d$ are $d\rho + 1 - \rho$ and $1 - \rho$, with associated eigenvectors $\boldsymbol{{y}}$ such that $\boldsymbol{{y}}^T\boldsymbol{{x}}$ yields $\bar{{x}}$ and $x_i - \bar{{x}}$ (for $i = 1,\dots,d$), respectively. It was then shown that the algorithm converges quickly for functions orthogonal to $\bar{{x}}$, but much more slowly for $\bar{{x}}$ itself. Despite the differing rates of convergence, the optimal acceptance rate, corresponding to the minimum convergence time, remains the same. We find this also to be true for $\alpha_{\text{B}}$ and $\alpha^{\text{R}}_2$, as illustrated in Figure 4 (bottom row), where we present convergence times for $x_1 - \bar{{x}}$ and $\bar{{x}}$; once again, $d = 50$. The large difference between the two convergence times is evident from the y-axes of the two plots, and the minimum again lies in a region around the asymptotic optimum. We note that, because of the slow convergence of $\bar{{x}}$, the chain mixes slowly, yielding more variable estimates of the convergence time. In both simulation settings, we see the expected improvement in convergence time for $\alpha_2^{\text{R}}$ compared to $\alpha_{\text{B}}$.
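For reference, the convergence-time criterion above can be estimated from a single simulated trace as in the following sketch (Python, not from the article); the chain itself could be generated, for instance, with the random-walk sketch given in Section 2.

```python
import numpy as np

def convergence_time(fx, k=1):
    """Estimate -k / log(rho_k), where rho_k is the lag-k autocorrelation of the trace fx."""
    fx = np.asarray(fx, dtype=float)
    fc = fx - fx.mean()
    rho_k = np.dot(fc[:-k], fc[k:]) / np.dot(fc, fc)
    return -k / np.log(rho_k)                  # assumes rho_k > 0, as in these examples

# Illustrative use (with random_walk and g_barker from the Section 2 sketch):
# chain = random_walk(lambda x: -0.5 * np.sum(x**2), g_barker,
#                     np.zeros(50), sigma=2.46 / np.sqrt(50), n_iter=10**5)
# print(convergence_time(chain[:, 0]))
```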

Figure 4. Convergence times for $\alpha_{\text{B}}$ against acceptance rate in the isotropic setting (top row) and the correlated target setting (bottom row).

4.1. A Bayesian logistic regression example

We consider fitting a Bayesian logistic regression model to the famous Titanic dataset, which contains information on crew and passengers aboard the 1912 RMS Titanic. Let $\boldsymbol{{y}}$ denote the response vector (indicating whether each person survived or not), and let $\boldsymbol{{X}}$ denote the $n \times d$ model matrix; here $d = 10$ . We assume a multivariate zero-mean Gaussian prior on $\boldsymbol{\beta}$ with covariance $100\textbf{I}_{10}$ . The resulting target density is

\begin{equation*}\pi(\boldsymbol{\beta} \mid \boldsymbol{{y}}) \propto \exp \!\left\{ - \frac{\boldsymbol{\beta}^T\boldsymbol{\beta}}{200} \right\} \prod_{i = 1}^n \frac{\exp\big({-}\boldsymbol{{x}}_i^T \boldsymbol{\beta}\big)^{1 - y_i}}{1 + \exp\big({-}\boldsymbol{{x}}_i^T \boldsymbol{\beta}\big)}\,.\end{equation*}

For the Titanic dataset, the resulting posterior has a complicated covariance structure, with many pairs of components exhibiting an absolute correlation above $0.50$. The posterior is also ill-conditioned, with the condition number of the estimated target covariance matrix being $\approx 10^5$. As seen in the bottom row of Figure 4, in such situations an isotropic proposal kernel may perform poorly for most functions. We instead consider a Gaussian proposal scheme in which the proposal covariance matrix is taken to be proportional to the target covariance matrix. This is a common strategy for dealing with targets with correlated components and forms the basis for many adaptive MCMC kernels [Reference Roberts and Rosenthal30]. We implement Barker’s algorithm to sample from the posterior. Let $\boldsymbol{\Sigma}_d$ denote the covariance matrix associated with the posterior distribution of $\boldsymbol{\beta}$; then the proposal kernel is $Q_d\big(\boldsymbol{{x}}^d, \cdot\big) = N\big(\boldsymbol{{x}}^d, \sigma^2_d \boldsymbol{\Sigma}_d\big)$. Since $\boldsymbol{\Sigma}_d$ is unavailable, we estimate it from a pilot MCMC run of length $10^7$. We then consider various values of $\sigma^2_d = l^2/d$.
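A hedged sketch of this proposal scheme is given below: a Gaussian random walk with covariance $\sigma^2_d \widehat{\boldsymbol{\Sigma}}_d$, where $\widehat{\boldsymbol{\Sigma}}_d$ is the pilot estimate, combined with Barker’s acceptance rule. The log-posterior implements the logistic-regression target above with the $N(\boldsymbol{0}, 100\textbf{I})$ prior, and the design matrix `X`, response `y`, and estimate `Sigma_hat` are assumed to be supplied by the user.

```python
import numpy as np

def log_post(beta, X, y):
    """Log of the unnormalized logistic-regression posterior with a N(0, 100 I) prior."""
    eta = X @ beta
    loglik = np.sum(-(1.0 - y) * eta - np.logaddexp(0.0, -eta))
    return -np.dot(beta, beta) / 200.0 + loglik

def barker_preconditioned(X, y, Sigma_hat, l, n_iter, rng=None):
    """Barker's random walk with proposal covariance (l^2 / d) * Sigma_hat."""
    rng = np.random.default_rng() if rng is None else rng
    d = X.shape[1]
    L = np.linalg.cholesky((l**2 / d) * Sigma_hat)
    beta = np.zeros(d)
    chain = np.empty((n_iter, d))
    for n in range(n_iter):
        prop = beta + L @ rng.standard_normal(d)
        log_s = log_post(prop, X, y) - log_post(beta, X, y)
        if rng.uniform() < 1.0 / (1.0 + np.exp(-log_s)):   # Barker's acceptance
            beta = prop
        chain[n] = beta
    return chain
```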

The performance of the algorithm for different functions of interest is plotted in Figure 5. Since this is a 10-dimensional problem, the optimal acceptance rate from Figure 2 is approximately $0.18$. The convergence times for both $\beta_1 - \bar{\beta}$ and $\bar{\beta}$ are similar, and both are minimized at approximately the same acceptance rate of $0.18$. It is natural here to be interested in estimating the posterior mean vector, so we also study estimation of the vector $\boldsymbol{\beta}$, with efficiency measured via the multivariate effective sample size (ESS) [Reference Vats, Flegal and Jones38]. The ESS returns the equivalent number of i.i.d. samples from $\boldsymbol{\pi}$ that would yield the same variability in estimating the posterior mean as the given set of MCMC samples. In Figure 5, we see that the optimal acceptance rate, corresponding to the highest ESS values, is again achieved around $0.18$.

Figure 5. Convergence times for $\alpha_{\text{B}}$ (left and middle) and multivariate ESS for the posterior mean vector (right) against acceptance rate.

5. Conclusions

We have obtained optimal scaling and acceptance rates for a large class of acceptance functions. In doing so, we have found that the scaling factor of $1/d$ for the proposal variance holds for all acceptance functions, indicating that the acceptance functions are not likely to affect the rate of convergence, just the constants associated with that rate. Thus, practitioners need not hesitate in switching to other acceptance functions when the MH acceptance probability is not tractable, as long as Corollary 1 is used to tune their algorithm accordingly. There is also an inverse relationship between optimal variance and AOAR (see Table 1), implying that when dealing with sub-optimal acceptance functions, the algorithm seeks larger jumps.

The computational cost of the Bernoulli factory that we present for $\alpha_r^{\text{R}}$ in Appendix D increases with r. Given the large jump in the optimal acceptance probability from $r = 1$ to $r = 2$, the development of more efficient Bernoulli factories is an important problem for future work.

The assumption of starting from stationarity is a restrictive one. For MH with Gaussian proposals, the scaling factor of $1/d$ is still optimal when the algorithm is in the transient phase [Reference Christensen, Roberts and Rosenthal6, Reference Jourdain, Lelièvre and Miasojedow16, Reference Kuntz, Ottobre and Stuart17]. The optimal acceptance probability may vary depending on the starting distribution. We envision that similar results are viable for the general class of acceptance functions, and this is important future work.

Our results are limited to Gaussian proposals and trivially decomposable target densities. Other proposal distributions may make use of the gradient of the target, e.g. the Metropolis-adjusted Langevin algorithm [Reference Roberts and Tweedie31] and Hamiltonian Monte Carlo [Reference Duane, Kennedy, Pendleton and Roweth9]. In problems where $\alpha_{\text{MH}}$ cannot be used, the gradient of the target density is likely unavailable; thus it is reasonable to limit our attention to a Gaussian proposal. On the other hand, generalizations to other target distributions are important. For MH algorithms, [Reference Bédard3, Reference Sherlock and Roberts34] relax the independence assumption, while [Reference Roberts and Rosenthal29] relax the identically distributed assumption. Additionally, [Reference Yang, Roberts and Rosenthal40] present a proof of weak convergence for MH for more general targets, and [Reference Schmon and Gagnon33] provide optimal scaling results for general Bayesian targets using large-sample asymptotics. In these situations, extensions to other acceptance probabilities are similarly possible. Finally, we encourage future work in optimal scaling to leverage our proof technique to demonstrate results for the wider class of acceptance probabilities.

Appendix A. Proof of Theorem 1

The proof is structurally similar to the seminal work of [Reference Roberts, Gelman and Gilks27], in that we will show that the generator of the sped-up process, $\textbf{Z}^d$ , converges to the generator of an appropriate Langevin diffusion. Define the discrete-time generator of $\boldsymbol{{Z}}^d$ as

(11) \begin{equation}G_d V\big(\boldsymbol{{x}}^d\big) = d \cdot \mathbb{E}_{\boldsymbol{{Y}}^d}\left[\big(V\big(\boldsymbol{{Y}}^d\big) - V\big(\boldsymbol{{x}}^d\big)\big)\alpha\big(\boldsymbol{{x}}^d, \boldsymbol{{Y}}^d\big)\right],\end{equation}

for all those V for which the limit exists. Since our interest is in the first component of $\boldsymbol{{Z}}^d$ , we consider only those V which are functions of the first component only. Now, define the generator of the limiting Langevin diffusion process with speed measure $h_{\alpha}(l)$ as

(12) \begin{equation}GV(x) = h_{\alpha}(l)\!\left[ \frac{1}{2}V^{\prime\prime}(x) + \frac{1}{2}\frac{d}{dx}(\!\log f)(x)V^{\prime}(x) \right].\end{equation}

The unique challenge in our result is identifying the speed measure $h_{\alpha}(l)$ for a general acceptance function $\alpha \in \mathcal{A}$ . Proposition 1 is a key result that helps us obtain a form of $h_{\alpha}(l)$ without resorting to approximations.

To prove Theorem 1, we will show that there are events $F_d \subseteq \mathbb{R}^d$ such that for all t,

\begin{equation*}\mathbb{P}\big[\boldsymbol{{Z}}_s^d \in F_d,\ 0 \le s \le t\big] \to 1 \text{ as } d \to \infty \quad \text{ and}\end{equation*}
\begin{equation*}\lim_{d \to \infty}\sup_{\boldsymbol{{x}}^d \in F_d}|G_dV\big(\boldsymbol{{x}}^d\big) - GV\big(x_1^d\big)| = 0\,,\end{equation*}

for a suitably large class of real-valued functions V. Moreover, because of the conditions of Lipschitz continuity on $f^{\prime}/f$ , a core for the generator G has domain $C_c^{\infty}$ , the class of infinitely differentiable functions with compact support [Reference Ethier and Kurtz10, Theorem 2.1, Chapter 8]. Thus, we can limit our attention to only those $V \in C_c^{\infty}$ that are a function of the first component.

Consider now the setup of Theorem 1. Let $w = \log f$ and $\alpha \in \mathcal{A}$ with balancing function $g_{\alpha}$ . Let $w^{\prime}$ and $w^{\prime\prime}$ be the first and second derivatives of w, respectively. Define the sequence of sets $\{F_d \subseteq \mathbb{R}^d, d > 1 \}$ by

\begin{align*}F_d = \big\{|R_d(x_2, \dots, x_d) - I| &< d^{-1/8} \big\} \cap \big\{|S_d(x_2, \dots, x_d) - I| < d^{-1/8} \big\}, \quad \text{where} \\R_d(x_2, \dots, x_d) &= \frac{1}{d-1}\sum_{i=2}^d \ [\!\log\! (f(x_i))^{\prime}]^2 = \frac{1}{d-1}\sum_{i=2}^d \ [w^{\prime}(x_i)]^2 \quad \text{ and} \\S_d(x_2, \dots, x_d) &= \frac{-1}{d-1} \sum_{i=2}^d \ [\!\log\! (f(x_i))^{\prime\prime}] = \frac{-1}{d-1}\sum_{i=2}^d \ [w^{\prime\prime}(x_i)] \,.\end{align*}

The following results from [Reference Roberts, Gelman and Gilks27] will be needed.

Lemma 1. ([Reference Roberts, Gelman and Gilks27].) Let Assumption 1 hold. If $\boldsymbol{{X}}^d_0 \sim \boldsymbol{\pi}_d$ for all d, then, for a fixed t, $\mathbb{P}[\boldsymbol{{Z}}_s^d \in F_d,\ 0 \le s \le t] \to 1 \text{ as } d \to \infty\,.$

Lemma 2. ([Reference Roberts, Gelman and Gilks27].) Let Assumption 1 hold. Also, let

\begin{equation*}W_d(x_1, \dots, x_d) = \sum_{i=2}^d\left( \frac{1}{2}w^{\prime\prime}(x_i)(Y_i - x_i)^2 +\frac{l^2}{2(d-1)}w^{\prime}(x_i)^2 \right),\end{equation*}

where $Y_i \overset{\text{ind}}{\sim} N\big(x_i, \sigma^2_d\big)$ , $i = 2, \dots, d$ . Then $ \sup_{\boldsymbol{{x}}^d \in F_d}\mathbb{E}\!\left[\left|W_d\big(\boldsymbol{{x}}^d\big)\right|\right] \to 0\,$ .

Lemma 3. ([Reference Roberts, Gelman and Gilks27].) For $Y \sim N\big(x, \sigma^2_d\big)$ and $V \in C_c^{\infty}$ ,

\begin{equation*}\limsup_{d \to \infty}\, \sup_{x \in \mathbb{R}} d|\mathbb{E}[V(Y) - V(x)]| < \infty\,.\end{equation*}

For the following proposition, we will utilize the property (2) imposed on $\mathcal{A}$ . This proposition is the key to obtaining our main result in such generality.

Proposition 1. Let $X \sim N({-}\theta/2, \theta)$ for some $\theta > 0$ . Let $\alpha \in \mathcal{A}$ with the corresponding balancing function $g_\alpha$ . Then $\mathbb{E}\!\left[ Xg_{\alpha}\big(e^X\big)\right] = 0$ .

Proof. We have

\begin{equation*}\big|\mathbb{E}\!\left[Xg_{\alpha}\big(e^X\big) \right]\big| \le \mathbb{E}\!\left[|Xg_{\alpha}\big(e^X\big)| \right] \le \mathbb{E}\!\left[|X| \right] < \infty;\end{equation*}

the second inequality follows from the assumption that $g_{\alpha}$ lies in [0, 1]. Hence, the expectation exists and is equal to the integral

\begin{equation*}\int_{\mathbb{R}} x\,g_{\alpha}\!\left( e^{x}\right) \frac{1}{\sqrt{2\pi\theta}} \exp\!\left \{ \frac{-(x+\theta/2)^2}{2\theta}\right\} dx \,=\!:\, \int_{\mathbb{R}} h(x) dx \,.\end{equation*}

Observe that, using (2),

\begin{align*}h({-}x) &= -x\,g_{\alpha}\!\left( e^{-x}\right) \frac{1}{\sqrt{2\pi \theta}} \exp\!\left \{ \frac{-({-}x + \theta/2)^2}{2\theta}\right\} \\&= -x\,g_{\alpha}\!\left( e^{-x}\right) \frac{1}{\sqrt{2\pi\theta}} \exp\!\left \{ \frac{-1}{2\theta}\left( x^2 + \frac{\theta^2}{4} - x\theta \right)\right\} \\&= -x e^{-x} g_{\alpha}\left(e^x\right) \frac{1}{\sqrt{2\pi\theta}} \exp\!\left \{ \frac{-1}{2\theta}\left( x^2 + \frac{\theta^2}{4} - x\theta \right)\right\} \\&= -x\,g_{\alpha}\!\left( e^{x}\right)\frac{1}{\sqrt{2\pi\theta}} \exp\!\left \{\frac{-1}{2\theta}\!\left( x^2 + \frac{\theta^2}{4} + x\theta \right)\right\} \\&= -x\,g_{\alpha}\!\left( e^{x}\right) \frac{1}{\sqrt{2\pi\theta}} \exp\!\left \{ \frac{-(x+\theta/2)^2}{2\theta}\right\} \\&= - h(x).\end{align*}

Hence, the result follows.

Lemma 4. Suppose $V \in C_c^{\infty}$ is restricted to only the first component of $\boldsymbol{{Z}}^d$ . Then

\begin{equation*}\sup_{\boldsymbol{{x}}^{\boldsymbol{{d}}} \in F_d} |G_dV\big(\boldsymbol{{x}}^d\big) - GV\big(x_1^d\big)| \to 0 \quad as \ d \to \infty.\end{equation*}

Proof. In the expression for $G_dV\big(\boldsymbol{{x}}^d\big)$ given in (11), we can decompose the proposal $\boldsymbol{{Y}}^d$ into $(Y_1^d, \boldsymbol{{Y}}^{d-})$ and thus rewrite the expectation as follows:

(13) \begin{equation}G_dV\big(\boldsymbol{{x}}^d\big) = d\mathbb{E}_{Y_1^d}\!\left[\left(V\big(Y^d_1\big) - V\big(x^d_1\big)\right)\mathbb{E}_{\boldsymbol{{Y}}^{d-}}\left[ \alpha\big(\boldsymbol{{x}}^d, \boldsymbol{{Y}}^d\big) \mid Y_1^d \right]\right].\end{equation}

Let $E^{d, \alpha}$ denote the inner expectation in (13) and define $E_{lim}^{d, \alpha}$ as

(14) \begin{equation}\begin{aligned}E_{lim}^{d, \alpha} = \mathbb{E}_{\boldsymbol{{Y}}^{d-}} \left[ g_{\alpha}\!\left( \exp\!\left\{ \log\dfrac{f\big(Y_1^d\big)}{f\big(x_1^d\big)} + \displaystyle \sum_{i=2}^d\left( w^{\prime}\big(x_i^d\big)\big(Y_i^d - x_i^d\big) - \frac{l^2w^{\prime}\big(x_i^d\big)^2}{2(d-1)}\right) \right\} \right) \bigg| Y_1^d \right].\end{aligned}\end{equation}

Also, a Taylor series expansion of w about $x_i^d$ for $i = 2, \dots, d$ gives

\begin{align*}E^{d, \alpha} &= \mathbb{E}_{\boldsymbol{{Y}}^{d-}} \left[ g_{\alpha}\!\left( \exp\!\left\{ \log\frac{f\big(Y_1^d\big)}{f\big(x_1^d\big)} + \sum_{i=2}^d w^{\prime}\big(x_i^d\big)\big(Y_i^d - x_i^d\big) \right. \right. \right. \\&\quad \quad \quad \quad \left. \left. \left. + \frac{1}{2}w^{\prime\prime}\big(x_i^d\big)\big(Y_i^d - x_i^d\big)^2 + \frac{1}{6}w^{\prime\prime\prime}(Z_i)\big(Y_i^d - x_i^d\big)^3 \right\} \right) \bigg| Y_1^d \right]\end{align*}

for $Z_i$ lying between $x_i^d$ and $Y_i^d$ . Hence, the triangle inequality and Lipschitz continuity of $g_{\alpha}(e^z)$ give, for some Lipschitz constant $K < \infty$ ,

(15) \begin{align}\big|E^{d, \alpha} - E^{d,\alpha}_{lim}\big| &\le K\mathbb{E}_{\boldsymbol{{Y}}^{d-}}\left[ \left| \sum_{i=2}^d \frac{1}{2}w^{\prime\prime}\big(x_i^d\big)\big(Y_i^d - x_i^d\big)^2 + \frac{1}{6}w^{\prime\prime\prime}(Z_i)\big(Y_i^d - x_i^d\big)^3 + \frac{l^2w^{\prime}\big(x_i^d\big)^2}{2(d-1)} \right| \right] \notag \\&\le K\mathbb{E}_{\boldsymbol{{Y}}^{d-}}\!\left[\left|W_d\big(\boldsymbol{{x}}^d\big)\right|\right] + K\sup_{z\in \mathbb{R}}|w^{\prime\prime\prime}(z)|\frac{l^3}{(d-1)^{1/2}},\end{align}

where $W_d\big(\boldsymbol{{x}}^d\big)$ is as defined in Lemma 2. From Lemma 2, Lemma 3, and (15),

(16) \begin{equation}\sup_{\boldsymbol{{x}}^d \in F_d} \left|G_dV\big(\boldsymbol{{x}}^d\big) - d\mathbb{E}_{Y_1^d}\!\left[\left(V\big(Y^d_1\big) - V\big(x^d_1\big)\right)E_{lim}^{d, \alpha}\right]\right| \to 0 \text{ as } d \to \infty.\end{equation}

Now let $\varepsilon(y) = \log f(y) - \log f\big(x_1^d\big)$ . Also, from (14), it is clear that given $\boldsymbol{{x}}^d$ , $E^{d,\alpha}_{lim}$ is a function of $Y_1^d$ alone, to wit,

(17) \begin{equation}(M_{d, \alpha}\circ\varepsilon)\big(Y_1^d\big) \,:\!=\, E^{d,\alpha}_{lim} = \mathbb{E}\!\left[ g_{\alpha}\big(e^{B_d}\big)\right],\end{equation}

where $B_d \sim N(\mu_d, \Sigma_d)$ with $ \mu_d = \varepsilon\big(Y_1^d\big) - l^2R_d/2$ and $\Sigma_d = l^2R_d$ . Thus, by (16), it is enough to consider the asymptotic behaviour of

\begin{equation*}d\mathbb{E}_{Y_1^d}\!\left[\left(V\big(Y^d_1\big) - V\big(x^d_1\big)\right)M_{d, \alpha}\big(\varepsilon\big(Y_1^d\big)\big)\right].\end{equation*}

Let $N_{d, \alpha} = M_{d, \alpha} \circ \varepsilon$ and apply Taylor series expansion on the inner term to obtain

\begin{align*}&\left(V\big(Y^d_1\big) - V\big(x^d_1\big)\right)M_{d, \alpha}\big(\varepsilon\big(Y_1^d\big)\big) \\&\quad = \left( V^{\prime}\big(x_1^d\big)\big(Y_1^d - x_1^d\big) + \frac{1}{2}V^{\prime\prime}\big(x_1^d\big)\big(Y_1^d - x_1^d\big)^2 + \frac{1}{6}V^{\prime\prime\prime}(K_d)\big(Y_1^d - x_1^d\big)^3\right) \\&\quad \quad \times \left( N_{d, \alpha}\big(x_1^d\big) + N^{\prime}_{d, \alpha}\big(x_1^d\big)\big(Y_1^d - x_1^d\big) + \frac{1}{2}N^{\prime\prime}_{d, \alpha}(L_d)\big(Y_1^d - x_1^d\big)^2\right),\end{align*}

where $K_d, L_d \in \big[Y_1^d, x_1^d\big]$ or $\big[x_1^d, Y_1^d\big]$ , and

(18) \begin{align}N_{d, \alpha}\big(x_1^d\big) &= M_{d, \alpha}\big(\varepsilon\big(x_1^d\big)\big) = M_{d, \alpha}\!\left(\!\log \frac{f\big(x_1^d\big)}{f\big(x_1^d\big)}\right) = M_{d, \alpha}(0), \nonumber\\N^{\prime}_{d, \alpha}\big(x_1^d\big) &= M^{\prime}_{d, \alpha}\big(\varepsilon\big(x_1^d\big)\big)\varepsilon^{\prime}\big(x_1^d\big) = M^{\prime}_{d, \alpha}(0)w^{\prime}\big(x_1^d\big) \,.\end{align}

Now, for all d,

\begin{align*}M_{d, \alpha}(\varepsilon) &= \mathbb{E}\!\left[ g_{\alpha}\big(e^{B_d}\big)\right]= \int_{\mathbb{R}} g_{\alpha}\big(e^b\big) \frac{1}{\sqrt{2\pi l^2R_d}} \exp\!\left \{ \frac{-\big(b - \varepsilon + l^2R_d/2\big)^2}{2l^2R_d}\right\} db. \\\text{So } M_{d, \alpha}(0) &= \int_{\mathbb{R}} g_{\alpha}\big(e^b\big) \frac{1}{\sqrt{2\pi l^2R_d}} \exp\!\left \{ \frac{-\big(b + l^2R_d/2\big)^2}{2l^2R_d}\right\} db. \\\text{Also, }M_{d, \alpha}^{\prime}(\varepsilon) &= \frac{d}{d\varepsilon} \left( \int_{\mathbb{R}} g_{\alpha}\big(e^b\big) \frac{1}{\sqrt{2\pi l^2R_d}} \exp\!\left \{ \frac{-\big(b - \varepsilon + l^2R_d/2\big)^2}{2l^2R_d}\right\} db \right).\end{align*}

The derivatives and integral here can be exchanged thanks to the dominated convergence theorem, which yields

\begin{align*}M^{\prime}_{d, \alpha}(\varepsilon)&= \int_{\mathbb{R}} g_{\alpha}\big(e^b\big) \frac{1}{\sqrt{2\pi l^2R_d}} \left( \frac{2\big(b - \varepsilon + l^2R_d/2\big)}{2l^2R_d} \right) \exp\!\left \{ \frac{-\big(b - \varepsilon + l^2R_d/2\big)^2}{2l^2R_d}\right\} db. \\\text{So } M^{\prime}_{d, \alpha}(0) &= \int_{\mathbb{R}} g_{\alpha}\big(e^b\big) \frac{1}{\sqrt{2\pi l^2R_d}} \left( \frac{\big(b + l^2R_d/2\big)}{l^2R_d} \right) \exp\!\left \{ \frac{-\big(b + l^2R_d/2\big)^2}{2l^2R_d}\right\} db \\&= \frac{1}{l^2R_d}\int_{\mathbb{R}} b\,g_{\alpha}\big(e^b\big) \frac{1}{\sqrt{2\pi l^2R_d}} \exp\!\left \{ \frac{-\big(b + l^2R_d/2\big)^2}{2l^2R_d}\right\} db \\& \qquad\qquad + \frac{1}{2}\int_{\mathbb{R}} g_{\alpha}\big(e^b\big) \frac{1}{\sqrt{2\pi l^2R_d}}\exp\!\left \{ \frac{-\big(b + l^2R_d/2\big)^2}{2l^2R_d}\right\} db \\&= \frac{1}{2} M_{d, \alpha}(0)\,,\end{align*}

where the first term vanishes by Proposition 1. Hence, for all d,

(19) \begin{equation}2M^{\prime}_{d, \alpha}(0) = M_{d, \alpha}(0) = \int_{\mathbb{R}} g_{\alpha}\big(e^b\big) \frac{1}{\sqrt{2\pi l^2R_d}} \exp\!\left \{ \frac{-\big(b + l^2R_d/2\big)^2}{2l^2R_d}\right\} db.\end{equation}

Now, we plug the expressions obtained above into the Taylor series expansion of $\left(V\big(Y^d_1\big) - V\big(x^d_1\big)\right)M_{d, \alpha}(\varepsilon\big(Y_1^d\big))$ . The rest of the proof, with the help of Assumption 1, follows as in [Reference Roberts, Gelman and Gilks27, Lemma 2.6].

Proof of Theorem 1. From Lemma 4, we have uniform convergence of generators on the sequence of sets with limiting probability 1. Thus, by Corollary 8.7 in [Reference Ethier and Kurtz10, Chapter 4], we have the required result of weak convergence (the condition that $C_c^{\infty}$ separates points was verified by [Reference Roberts, Gelman and Gilks27]).

Appendix B. Proof of Corollary 1

Lemma 5. Let $E^{d, \alpha}$ be the inner expectation in (13), and let $E_{lim}^{d, \alpha}$ be from (14). Then

\begin{equation*}\mathbb{E}_{\boldsymbol{\pi}_d} \!\left[ \mathbb{E}_{Y_1} \!\left[ E^{d, \alpha} - E_{lim}^{d, \alpha}\bigg| \, \boldsymbol{{x}}^d \right]\right] \to 0 \ \ \ \ \ as \ \ d \to \infty.\end{equation*}

Proof. Consider

\begin{align*}\left| \mathbb{E}_{\boldsymbol{\pi}_d} \!\left[ \mathbb{E}_{Y_1^d} \!\left[ E^{d, \alpha} - E_{lim}^{d, \alpha}\bigg| \, \boldsymbol{{x}}^d \right]\right] \right|&\le \left| \mathbb{E}_{\boldsymbol{\pi}_d} \!\left[ \mathbb{E}_{Y_1^d} \!\left[ E^{d, \alpha} - E_{lim}^{d, \alpha}\bigg| \, \boldsymbol{{x}}^d \in F_d \right]\right] P\big(\boldsymbol{{x}}^d \in F_d\big) \right| \\& \quad + \left| \mathbb{E}_{\boldsymbol{\pi}_d} \!\left[ \mathbb{E}_{Y_1^d} \!\left[ E^{d, \alpha} - E_{lim}^{d, \alpha}\bigg| \, \boldsymbol{{x}}^d \in F^C_d \right]\right] P\big(\boldsymbol{{x}}^d \in F^C_d\big) \right|.\end{align*}

The second term goes to 0, since the expectation is bounded and by construction $P\big(\boldsymbol{{x}}^d \in F_d^C\big)\to 0$ as $d \to \infty$ . Also, following [Reference Roberts, Gelman and Gilks27],

\begin{equation*}\sup_{\boldsymbol{{x}}^d \in F_d}\big|E^{d, \alpha} - E_{lim}^{d, \alpha}\big| \to 0 \text{ as } d \to \infty.\end{equation*}

Then

\begin{align*}&\left| \mathbb{E}_{\boldsymbol{\pi}_d} \!\left[ \mathbb{E}_{Y_1^d} \!\left[ E^{d, \alpha} - E_{lim}^{d, \alpha}\bigg| \, \boldsymbol{{x}}^d \in F_d \right]\right] P\big(\boldsymbol{{x}}^d \in F_d\big) \right| \\& \quad\le \mathbb{E}_{\boldsymbol{\pi}_d} \!\left[ \mathbb{E}_{Y_1^d} \!\left[\sup_{\boldsymbol{{x}}^{\boldsymbol{{d}}} \in F_d}\left| E^{d, \alpha} - E_{lim}^{d, \alpha}\right|\bigg| \, \boldsymbol{{x}}^d \in F_d \right]\right] \to 0\,.\end{align*}

Proof of Corollary 1. Consider Equation (17). Using the Taylor series approximation of second order around $x_1$ ,

\begin{equation*}\mathbb{E}_{Y_1^d}\big[E_{lim}^{d, \alpha}\big] = \mathbb{E}[N_{d, \alpha}\big(Y_1^d\big)] = N_{d, \alpha}\big(x_1^d\big) + \frac{1}{2}N^{\prime\prime}_{d, \alpha}(W_{d,1})\frac{l^2}{d-1}\,,\end{equation*}

where $W_{d,1} \in \big[x_1^d, Y_1^d\big]$ or $\big[Y_1^d, x_1^d\big]$ . Since $N^{\prime\prime}_{d, \alpha}$ is bounded [Reference Roberts, Gelman and Gilks27],

\begin{align*}\alpha(l)&= \lim_{d \to \infty} \mathbb{E}_{\boldsymbol{\pi}_d} \!\left[ \mathbb{E}_{Y_1^d} \!\left[ \mathbb{E}_{\boldsymbol{{Y}}^{d-}}\left[\alpha\big(\boldsymbol{{X}}^d, \boldsymbol{{Y}}^d\big) \bigg| Y_1^d, \boldsymbol{{x}}^d\right] \bigg| \, \boldsymbol{{x}}^d \right]\right] \\&= \lim_{d \to \infty} \mathbb{E}_{\boldsymbol{\pi}_d} \!\left[ \mathbb{E}_{Y_1^d} \!\left[E_{lim}^{d, \alpha} + E^{d, \alpha} - E_{lim}^{d, \alpha}\bigg| \boldsymbol{{x}}^d \right]\right].\end{align*}

As all expectations exist, we can split the inner expectation and use Lemma 5, so that

\begin{align*}\alpha(l) &= \lim_{d \to \infty} \mathbb{E}_{\boldsymbol{\pi}_d} \!\left[ \mathbb{E}_{Y_1^d} \!\left[E_{lim}^{d, \alpha} \bigg| \boldsymbol{{x}}^d \right]\right] + \lim_{d \to \infty} \mathbb{E}_{\boldsymbol{\pi}_d} \!\left[ \mathbb{E}_{Y_1^d} \!\left[ E^{d, \alpha} - E_{lim}^{d, \alpha}\bigg| \, \boldsymbol{{x}}^d \right]\right]\\&= \lim_{d \to \infty} \mathbb{E}_{\boldsymbol{\pi}_d} \!\left[ M_{d, \alpha}(0) + \frac{1}{2}N^{\prime\prime}_{d, \alpha}(W_{d,1})\frac{l^2}{d-1}\right] \\&= \lim_{d \to \infty} \mathbb{E}_{\boldsymbol{\pi}_d} \!\left[\int_{\mathbb{R}} g_{\alpha}\big(e^b\big) \frac{1}{\sqrt{2\pi l^2R_d}} \exp\!\left \{ \frac{-\big(b + l^2R_d/2\big)^2}{2l^2R_d}\right\} db \right] \\&= \int_{\mathbb{R}} g_{\alpha}\big(e^b\big) \frac{1}{\sqrt{2\pi l^2I}} \exp\!\left \{ \frac{-(b + l^2I/2)^2}{2l^2I}\right\} db = M_{\alpha}(l)\,.\end{align*}

The last equality is by the law of large numbers and the continuous mapping theorem.

Appendix C. Optimizing speed for Barker’s acceptance

We need to maximize $h_{\text{B}}(l) = l^2M_{\text{B}}(l)$ . Let I be fixed arbitrarily. Then

\begin{equation*}h_{\text{B}}(l) = \frac{1}{I}\cdot l^2I\cdot \int_{\mathbb{R}} \frac{1}{1 + e^{-b}} \frac{1}{\sqrt{2\pi l^2I}} \exp\!\left \{ \frac{-(b + l^2I/2)^2}{2l^2I}\right\} db.\end{equation*}

For a fixed I, we can reparametrize the function by taking $\theta = l^2I$ , and so maximizing $h_{\text{B}}(l)$ over positive l will be equivalent to maximizing $h^1_{\text{B}}(\theta)$ over positive $\theta$ , where

\begin{equation*}h^1_{\text{B}}(\theta) = \int_{\mathbb{R}} \frac{\theta}{1 + e^{-b}} \frac{1}{\sqrt{2\pi \theta}} \exp\!\left \{ \frac{-(b + \theta/2)^2}{2\theta}\right\} db.\end{equation*}

We make the substitution $z = (b + \theta/2)/\sqrt{\theta}$ in the integrand to obtain

\begin{equation*}h_{\text{B}}^1(\theta) = \int_{\mathbb{R}} \frac{\theta}{1 + \exp\{-z\sqrt{\theta} + \theta/2\}} \frac{1}{\sqrt{2\pi}} e^{-z^2/2} dz = \mathbb{E}\!\left[ \frac{\theta}{1 + \exp\{-Z\sqrt{\theta} + \theta/2\}} \right],\end{equation*}

where the expectation is taken with respect to $Z \sim N(0,1)$ . This expectation is not available in closed form. However, standard numerical integration routines yield the optimal value of $\theta$ to be $6.028$ . This implies that the optimal value of l, say $l^*$ , is approximately equal to

\begin{equation*}l^* \approx \frac{2.46}{\sqrt{I}} \ \ (\text{up to 2 decimal places}).\end{equation*}

Using this $l^*$ yields an AOAR of approximately $0.158$ .
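The figures quoted in this appendix can be reproduced with a few lines of quadrature and one-dimensional optimization; the following sketch (assuming scipy) is one way to do so.

```python
import numpy as np
from scipy import integrate, optimize
from scipy.stats import norm

def h1_barker(theta):
    """Reparametrized speed measure h_B^1(theta) = E[theta / (1 + exp(-Z sqrt(theta) + theta / 2))]."""
    integrand = lambda z: theta / (1.0 + np.exp(-z * np.sqrt(theta) + theta / 2.0)) * norm.pdf(z)
    return integrate.quad(integrand, -10.0, 10.0)[0]

res = optimize.minimize_scalar(lambda t: -h1_barker(t), bounds=(0.1, 30.0), method="bounded")
theta_star = res.x                              # approximately 6.03, so l* = sqrt(theta*/I) ~ 2.46 / sqrt(I)
aoar = h1_barker(theta_star) / theta_star       # M_B(l*), approximately 0.158
print(theta_star, np.sqrt(theta_star), aoar)
```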

Appendix D. Bernoulli factory

To sample events of probability $\alpha_{\text{B}}$ , the two-coin algorithm, an efficient Bernoulli factory, was presented in [Reference Gonçalves, Łatuszyński and Roberts13]. Generalizing this to a die-coin algorithm, we present a Bernoulli factory for $\alpha_r^{\text{R}}$ with $r = 2$ ; extensions to other r can be done similarly. Let $\pi(x) = c_x p_x$ with $p_x \in [0, 1]$ and $c_x > 0$ . Then

\begin{equation*}\alpha_2^{\text{R}}(x,y) = \dfrac{\pi(y)^2 + \pi(x) \pi(y)}{\pi(y)^2 + \pi(x) \pi(y) + \pi(x)^2} = \dfrac{c_y^2p_y^2 + c_xp_x c_yp_y}{c_y^2p_y^2 + c_xp_x c_yp_y + c_x^2p_x^2}\,.\end{equation*}
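As a hedged illustration of how such a die-coin scheme can work, the sketch below produces events of probability $\alpha_2^{\text{R}}(x, y)$ using only coin flips of probabilities $p_x$ and $p_y$: a three-sided die with face probabilities proportional to $(c_y^2, c_xc_y, c_x^2)$ selects which pair of coins to flip, and a round in which both coins show heads terminates the loop. This is one scheme consistent with the displayed ratio, not a transcription of the article’s algorithm.

```python
import numpy as np

def die_coin_alpha2(c_x, p_x, c_y, p_y, rng=None):
    """Return 1 with probability alpha_2^R(x, y) when pi(x) = c_x p_x and pi(y) = c_y p_y."""
    rng = np.random.default_rng() if rng is None else rng
    weights = np.array([c_y**2, c_x * c_y, c_x**2])
    probs = weights / weights.sum()
    while True:
        face = rng.choice(3, p=probs)                      # roll the three-sided die
        coins = [(p_y, p_y), (p_x, p_y), (p_x, p_x)][face]
        if all(rng.uniform() < p for p in coins):          # both coins show heads
            return 1 if face < 2 else 0                    # faces 0, 1 -> accept; face 2 -> reject
        # otherwise discard the round and roll again
```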

Acknowledgements

The authors thank the referees and the editor for comments that helped improve the work.

Funding information

D. Vats is supported by SERB grant SPG/2021/001322; K. Łatuszyński is supported by the Royal Society through the Royal Society University Research Fellowship; and G. Roberts is supported by the EPSRC grants CoSInES (EP/R034710/1) and Bayes for Health (EP/R018561/1).

Competing interests

There were no competing interests to declare which arose during the preparation or publication process of this article.

References

Banterle, M., Grazian, C., Lee, A. and Robert, C. P. (2019). Accelerating Metropolis–Hastings algorithms by delayed acceptance. Found. Data Sci. 1, 103–128.
Barker, A. A. (1965). Monte Carlo calculations of the radial distribution functions for a proton–electron plasma. Austral. J. Phys. 18, 119–134.
Bédard, M. (2008). Optimal acceptance rates for Metropolis algorithms: moving beyond 0.234. Stoch. Process. Appl. 118, 2198–2222.
Billera, L. J. and Diaconis, P. (2001). A geometric interpretation of the Metropolis–Hastings algorithm. Statist. Sci. 16, 335–339.
Brooks, S., Gelman, A., Jones, G. and Meng, X.-L. (2011). Handbook of Markov Chain Monte Carlo. CRC Press, Boca Raton.
Christensen, O. F., Roberts, G. O. and Rosenthal, J. S. (2005). Scaling limits for the transient phase of local Metropolis–Hastings algorithms. J. R. Statist. Soc. B [Statist. Methodology] 67, 253–268.
Delmas, J.-F. and Jourdain, B. (2009). Does waste recycling really improve the multi-proposal Metropolis–Hastings algorithm? An analysis based on control variates. J. Appl. Prob. 46, 938–959.
Doucet, A., Pitt, M. K., Deligiannidis, G. and Kohn, R. (2015). Efficient implementation of Markov chain Monte Carlo when using an unbiased likelihood estimator. Biometrika 102, 295–313.
Duane, S., Kennedy, A. D., Pendleton, B. J. and Roweth, D. (1987). Hybrid Monte Carlo. Phys. Lett. B 195, 216–222.
Ethier, S. N. and Kurtz, T. G. (1986). Markov Processes: Characterization and Convergence. John Wiley, New York.
Gelman, A., Roberts, G. O. and Gilks, W. R. (1996). Efficient Metropolis jumping rules. Bayesian Statist. 5, 599–608.
Gonçalves, F. B., Łatuszyński, K. and Roberts, G. O. (2017). Barker’s algorithm for Bayesian inference with intractable likelihoods. Brazilian J. Prob. Statist. 31, 732–745.
Gonçalves, F. B., Łatuszyński, K. and Roberts, G. O. (2017). Exact Monte Carlo likelihood-based inference for jump-diffusion processes. Preprint. Available at https://arxiv.org/abs/1707.00332.
Hastings, W. K. (1970). Monte Carlo sampling methods using Markov chains and their applications. Biometrika 57, 97–109.
Herbei, R. and Berliner, L. M. (2014). Estimating ocean circulation: an MCMC approach with approximated likelihoods via the Bernoulli factory. J. Amer. Statist. Assoc. 109, 944–954.
Jourdain, B., Lelièvre, T. and Miasojedow, B. (2014). Optimal scaling for the transient phase of Metropolis Hastings algorithms: the longtime behavior. Bernoulli 20, 1930–1978.
Kuntz, J., Ottobre, M. and Stuart, A. M. (2019). Diffusion limit for the random walk Metropolis algorithm out of stationarity. Ann. Inst. H. Poincaré Prob. Statist. 55, 1599–1648.
Łatuszyński, K. and Roberts, G. O. (2013). CLTs and asymptotic variance of time-sampled Markov chains. Methodology Comput. Appl. Prob. 15, 237–247.
Menezes, A. A. and Kabamba, P. T. (2014). Optimal search efficiency of Barker’s algorithm with an exponential fitness function. Optimization Lett. 8, 691–703.
Metropolis, N. et al. (1953). Equation of state calculations by fast computing machines. J. Chem. Phys. 21, 1087–1092.
Meyn, S. P. and Tweedie, R. L. (2012). Markov Chains and Stochastic Stability. Cambridge University Press.
Mira, A. (2001). On Metropolis–Hastings algorithms with delayed rejection. Metron 59, 231–241.
Morina, G., Łatuszyński, K., Nayar, P. and Wendland, A. (2021). From the Bernoulli factory to a dice enterprise via perfect sampling of Markov chains. To appear in Ann. Appl. Prob.
Neal, P. and Roberts, G. O. (2006). Optimal scaling for partially updating MCMC algorithms. Ann. Appl. Prob. 16, 475–515.
Peskun, P. H. (1973). Optimum Monte-Carlo sampling using Markov chains. Biometrika 60, 607–612.
Robert, C. and Casella, G. (2013). Monte Carlo Statistical Methods. Springer, New York.
Roberts, G. O., Gelman, A. and Gilks, W. R. (1997). Weak convergence and optimal scaling of random walk Metropolis algorithms. Ann. Appl. Prob. 7, 110–120.
Roberts, G. O. and Rosenthal, J. S. (1998). Optimal scaling of discrete approximations to Langevin diffusions. J. R. Statist. Soc. B [Statist. Methodology] 60, 255–268.
Roberts, G. O. and Rosenthal, J. S. (2001). Optimal scaling for various Metropolis–Hastings algorithms. Statist. Sci. 16, 351–367.
Roberts, G. O. and Rosenthal, J. S. (2009). Examples of adaptive MCMC. J. Comput. Graph. Statist. 18, 349–367.
Roberts, G. O. and Tweedie, R. L. (1996). Exponential convergence of Langevin distributions and their discrete approximations. Bernoulli 2, 341–363.
Schmon, S. M., Deligiannidis, G., Doucet, A. and Pitt, M. K. (2021). Large-sample asymptotics of the pseudo-marginal method. Biometrika 108, 37–51.
Schmon, S. M. and Gagnon, P. (2022). Optimal scaling of random walk Metropolis algorithms using Bayesian large-sample asymptotics. Statist. Comput. 32, 1–16.
Sherlock, C. and Roberts, G. O. (2009). Optimal scaling of the random walk Metropolis on elliptically symmetric unimodal targets. Bernoulli 15, 774–798.
Sherlock, C., Thiery, A. H. and Golightly, A. (2021). Efficiency of delayed-acceptance random walk Metropolis algorithms. Ann. Statist. 49, 2972–2990.
Sherlock, C., Thiery, A. H., Roberts, G. O. and Rosenthal, J. S. (2015). On the efficiency of pseudo-marginal random walk Metropolis algorithms. Ann. Statist. 43, 238–275.
Smith, C. J. (2018). Exact Markov chain Monte Carlo with likelihood approximations for functional linear models. Doctoral Thesis, Ohio State University.
Vats, D., Flegal, J. M. and Jones, G. L. (2019). Multivariate output analysis for Markov chain Monte Carlo. Biometrika 106, 321–337.
Vats, D., Gonçalves, F. B., Łatuszyński, K. and Roberts, G. O. (2022). Efficient Bernoulli factory Markov chain Monte Carlo for intractable posteriors. Biometrika 109, 369–385.
Yang, J., Roberts, G. O. and Rosenthal, J. S. (2020). Optimal scaling of random-walk Metropolis algorithms on general target distributions. Stoch. Process. Appl. 130, 6094–6132.
Zanella, G., Bédard, M. and Kendall, W. S. (2017). A Dirichlet form approach to MCMC optimal scaling. Stoch. Process. Appl. 127, 4053–4082.