1. Introduction
Over the past few decades, Markov chain Monte Carlo (MCMC) methods have become an enormously popular computational tool, enabling practitioners to conveniently sample from complicated target distributions [Reference Brooks, Gelman, Jones and Meng5, Reference Meyn and Tweedie21, Reference Robert and Casella26]. This popularity can be attributed to easy-to-implement accept–reject-based MCMC algorithms, which require the target density only up to a proportionality constant. Here, draws from a proposal kernel are accepted with a certain acceptance probability. The choices of acceptance probability and proposal kernel can markedly affect the performance of the resulting MCMC samplers.
Unarguably, the most popular acceptance probability is Metropolis–Hastings (MH), of [Reference Hastings14, Reference Metropolis20], owing to its acknowledged optimality [Reference Billera and Diaconis4, Reference Peskun25]. Efficient implementation of the MH algorithm requires tuning within the chosen family of proposal kernels. For the MH acceptance function, various optimal scaling results have been obtained under assumptions on the proposal and the target distribution. This includes the works of [Reference Bédard3, Reference Neal and Roberts24, Reference Roberts, Gelman and Gilks27, Reference Roberts and Rosenthal28, Reference Roberts and Rosenthal29, Reference Sherlock and Roberts34, Reference Yang, Roberts and Rosenthal40, Reference Zanella, Bédard and Kendall41], among others.
Despite the popularity of the MH acceptance function, other acceptance probabilities remain practically and theoretically relevant. Recently, Barker’s acceptance rule [Reference Barker2] and lazy MH [Reference Łatuszyński and Roberts18] have found use in Bernoulli-factory-based MCMC algorithms for intractable posteriors [Reference Gonçalves, Łatuszyński and Roberts12, Reference Gonçalves, Łatuszyński and Roberts13, Reference Herbei and Berliner15, Reference Smith37, Reference Vats, Gonçalves, Łatuszyński and Roberts39]. Barker’s acceptance function has also proven to be optimal with respect to search efficiency [Reference Menezes and Kabamba19], and it guarantees variance improvements for waste-recycled Monte Carlo estimators [Reference Delmas and Jourdain7]. Further, a class of acceptance probabilities from [Reference Bédard3] has been of independent theoretical interest. We also introduce a new family of generalized Barker’s acceptance probabilities and present a Bernoulli factory for use in problems with intractable posteriors.
To the best of our knowledge, there are no theoretical and practical guidelines concerning optimal scaling outside of MH and its variants (although see [Reference Sherlock, Thiery and Golightly35] for a discussion on delayed-acceptance MH and [Reference Doucet, Pitt, Deligiannidis and Kohn8, Reference Schmon, Deligiannidis, Doucet and Pitt32, Reference Sherlock, Thiery, Roberts and Rosenthal36] for analyses pertaining to pseudo-marginal MCMC). We obtain optimal scaling results for a large class of acceptance functions; Barker’s, lazy MH, and MH are members of this class.
We restrict our attention to the framework of [Reference Roberts, Gelman and Gilks27] with a random-walk Gaussian proposal kernel and a d-dimensional decomposable target distribution. Similar to MH, our general class of acceptance functions requires the proposal variance to be scaled by $1/d$ . We find that, typically, for acceptance functions that are pointwise smaller than MH, the optimal proposal variance is larger than that for MH, implying the need for larger jumps. For Barker’s acceptance rule, the asymptotically optimal acceptance rate (AOAR) is approximately $0.158$ , in comparison to the $0.234$ rate for MH [Reference Roberts, Gelman and Gilks27]. Similar AOARs are presented for other acceptance functions.
In Section 2 we describe our class of acceptance probabilities, with the main results presented in Section 3. AOARs for Barker’s and other functions are obtained in Section 3.1. In Section 4 we present numerical results in some settings that comply with our assumptions and others that do not. A trailing discussion on the scaling factor for different acceptance functions and generalizations of our results is provided in the last section. All proofs are in the appendices.
2. Class of acceptance functions
Let $\boldsymbol{\pi}$ be the target distribution, with corresponding Lebesgue density $\pi$ and support $\mathcal{X}$ , so that an MCMC algorithm aims to generate a $\boldsymbol{\pi}$ -ergodic Markov chain, $\{X_n\}$ . Let Q be a Markov kernel with an associated Lebesgue density $q(x, \cdot)$ for each $x \in \mathcal{X}$ . We assume throughout that q is symmetric. Furthermore, let the acceptance probability function be $\alpha(x, y)\,:\, \mathcal{X} \times \mathcal{X} \to [0, 1]$ . Starting from an $X_0 \in \mathcal{X}$ , at the nth step, a typical accept–reject MCMC algorithm proposes $y \sim q(X_{n-1}, \cdot)$ . The proposed value is accepted with probability $\alpha(X_{n-1}, y)$ and rejected otherwise, implying that $X_n = X_{n-1}$ . The acceptance function $\alpha$ is responsible for guaranteeing $\boldsymbol{\pi}$ -reversibility and thus $\boldsymbol{\pi}$ -invariance of the Markov chain. Let $a \wedge b$ denote $\min\!(a, b)$ and $s(x,y) = \pi(y) / \pi(x)$ . We define $\mathcal{A}$ , the class of acceptance functions for which our optimal scaling results will hold, as follows.
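The accept–reject mechanism just described can be sketched generically, with the balancing function $g_{\alpha}$ left as a pluggable argument. This is our own minimal illustration (not the authors' code); the target, proposal scale, and function names are chosen for the example, and only the density ratio $s(x,y)$ is ever evaluated, so normalizing constants cancel:

```python
import numpy as np

def accept_reject_mcmc(log_pi, g, x0, sigma, n_iter, rng):
    """Random-walk accept-reject MCMC with a generic balancing function g.

    A proposed move x -> y is accepted with probability g(s(x, y)),
    where s(x, y) = pi(y) / pi(x); only the ratio is ever needed.
    """
    x = np.atleast_1d(np.asarray(x0, dtype=float))
    chain = np.empty((n_iter, x.size))
    n_accept = 0
    for n in range(n_iter):
        y = x + sigma * rng.standard_normal(x.size)   # symmetric Gaussian proposal
        s = np.exp(log_pi(y) - log_pi(x))             # density ratio pi(y) / pi(x)
        if rng.uniform() < g(s):                      # accept with prob. g(s)
            x, n_accept = y, n_accept + 1
        chain[n] = x                                  # on rejection, X_n = X_{n-1}
    return chain, n_accept / n_iter

# Metropolis-Hastings balancing function g_MH(s) = min(1, s)
g_mh = lambda s: min(1.0, s)

rng = np.random.default_rng(1)
chain, acc = accept_reject_mcmc(lambda x: -0.5 * np.sum(x**2), g_mh,
                                x0=0.0, sigma=2.4, n_iter=50_000, rng=rng)
```

Swapping `g_mh` for any other balancing function in $\mathcal{A}$ changes only the acceptance step, not the structure of the sampler.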
Definition 1. Each $\alpha \in \mathcal{A}$ is a map $\alpha(x,y)\,:\, \mathcal{X} \times \mathcal{X} \to [0, 1]$ , and for every $\alpha \in \mathcal{A}$ there exists a balancing function $g_{\alpha}\,:\, [0, \infty) \to [0, 1]$ such that
Properties (1) and (2) are standard and easy to verify, with (1) ensuring that intractable constants in $\pi$ cancel away and (2) ensuring $\boldsymbol{\pi}$ -reversibility. Property (3) is not required for $\alpha$ to be a valid acceptance function; however, we need it for our optimal scaling results (to establish Lemma 4), and it holds true for all common acceptance probabilities. Moreover, each $\alpha \in \mathcal{A}$ can be identified by the corresponding $g_{\alpha}$ , and we will use $\alpha$ and $g_{\alpha}$ interchangeably. If $g_{\text{MH}}$ denotes the balancing function for the MH acceptance function $(\alpha_{\text{MH}})$ , then
It is easy to see that $\alpha_{\text{MH}} \in \mathcal{A}$ . The lazy MH $(\alpha_{\text{L}})$ acceptance of [Reference Herbei and Berliner15, Reference Łatuszyński and Roberts18] also belongs to $\mathcal{A}$ . For a fixed $\varepsilon \in [0, 1]$ , it is defined using
Barker’s acceptance function is $\alpha_{\text{B}}(x, y) = g_{\text{B}}(s(x, y))$ for all $x, y \in \mathcal{X}$ , where
Then (2) follows immediately. For differentiable balancing functions, Property (3), i.e. Lipschitz continuity of $g_{\alpha}(e^z)$ , can be verified by bounding the first derivative. In particular, the derivative of $g_{\text{B}}(e^z)$ , given by $e^z/(1 + e^z)^2$ , is bounded by $1/4$ for all $z \in \mathbb{R}$ , and hence $\alpha_{\text{B}} \in \mathcal{A}$ . From [Reference Peskun25], it is well known that, in terms of the Monte Carlo variability of ergodic averages, MH is superior to Barker’s. Even so, Barker’s acceptance function has seen a recent resurgence, aided by its use in Bernoulli-factory MCMC algorithms for intractable Bayesian posteriors where MH algorithms cannot be implemented.
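These properties are easy to check numerically. The following sketch (our own, using $g_{\text{MH}}(s) = \min(1, s)$ and $g_{\text{B}}(s) = s/(1+s)$) verifies the pointwise Peskun ordering $g_{\text{B}} \le g_{\text{MH}}$, the derivative bound $1/4$ on $g_{\text{B}}(e^z)$ that delivers the Lipschitz property, and the balance condition $g(s) = s\,g(1/s)$ underlying Property (2):

```python
import numpy as np

g_mh = lambda s: np.minimum(1.0, s)          # g_MH(s) = min(1, s)
g_b  = lambda s: s / (1.0 + s)               # g_B(s)  = s / (1 + s)

s = np.exp(np.linspace(-8.0, 8.0, 2001))     # grid of density ratios s = e^z
peskun_gap = np.max(g_b(s) - g_mh(s))        # should be <= 0: g_B <= g_MH pointwise

# derivative of g_B(e^z) is e^z / (1 + e^z)^2, maximized at z = 0 with value 1/4
z = np.linspace(-8.0, 8.0, 2001)
max_deriv = np.max(np.exp(z) / (1.0 + np.exp(z))**2)

# balance condition g(s) = s * g(1/s), which guarantees pi-reversibility
balance_gap = max(np.max(np.abs(g_b(s) - s * g_b(1.0 / s))),
                  np.max(np.abs(g_mh(s) - s * g_mh(1.0 / s))))
```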
We present a generalization of (6): for $r \geq 1$ define
For $r \in \mathbb{N}$ , the above can be rewritten as
If $\alpha_r^{\text{R}}$ is the associated acceptance function, then $\alpha_r^{\text{R}} \in \mathcal{A}$ for all $r \geq 1$ . Moreover, $g_1^{\text{R}} \equiv g_{\text{B}}$ and $g_r^{\text{R}} \uparrow g_{\text{MH}}$ as $r \to \infty$ . For $r \in \mathbb{N}$ , we present a natural Bernoulli factory in the spirit of [Reference Gonçalves, Łatuszyński and Roberts13] that generates events of probability $\alpha_r^{\text{R}}$ without explicitly evaluating it; see Appendix D. An alternative approach would be to follow the general sampling algorithm of [Reference Morina, Łatuszyński, Nayar and Wendland23] for rational functions.
Let $\boldsymbol{\Phi}({\cdot})$ be the standard normal distribution function. For a theoretical exposition, [Reference Bédard3] defines the following acceptance probability for some $h > 0$ :
For each $h > 0$ , $\alpha_h^{\text{H}} \in \mathcal{A}$ , and observe that as $h \to 0$ , $g_h^{\text{H}} \to g_{\text{MH}}$ , while as $h \to \infty$ , $g_h^{\text{H}} \to 0$ ; i.e. the chain never moves. Similar examples can be constructed by considering other well-behaved distribution functions in place of $\boldsymbol{\Phi}$ . Lastly, it is easy to see that $\mathcal{A}$ is convex. Thus, it also includes situations when each update of the algorithm randomly chooses an acceptance probability. Moreover, as evidenced in (5), $\mathcal{A}$ is also closed under scalar multiplication as long as the resulting function lies in [0, 1].
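Convexity of $\mathcal{A}$ can also be checked directly: any mixture $\lambda g_{\text{MH}} + (1-\lambda) g_{\text{B}}$ is again a valid balancing function taking values in $[0,1]$ and satisfying the balance condition $g(s) = s\,g(1/s)$. A small self-contained check (the mixture weight below is arbitrary):

```python
import numpy as np

g_mh = lambda s: np.minimum(1.0, s)              # g_MH(s) = min(1, s)
g_b  = lambda s: s / (1.0 + s)                   # g_B(s)  = s / (1 + s)
lam  = 0.3                                       # arbitrary mixture weight in [0, 1]
g_mix = lambda s: lam * g_mh(s) + (1 - lam) * g_b(s)

s = np.exp(np.linspace(-6.0, 6.0, 1001))         # grid of density ratios
# balance condition g(s) = s * g(1/s): preserved under convex combination
gap = np.max(np.abs(g_mix(s) - s * g_mix(1.0 / s)))
in_range = np.all((g_mix(s) >= 0.0) & (g_mix(s) <= 1.0))
```

This corresponds to an algorithm that, at each update, uses $g_{\text{MH}}$ with probability $\lambda$ and $g_{\text{B}}$ otherwise.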
3. Main theorem
Let f be a 1-dimensional density function and consider a sequence of target distributions $\{ \boldsymbol{\pi}_d \}$ such that for each d, the joint density is
Assumption 1. The density f is positive and in $C^2$ —the class of all real-valued functions with continuous second-order derivatives. Furthermore, $f^{\prime}/f$ is Lipschitz, and the following moment conditions hold:
Consider the sequence of Gaussian proposal kernels $\big\{ Q_d\big(\boldsymbol{{x}}^d, \cdot\big) \big\}$ with associated density sequence $\{q_d\}$ , so that $Q_d\big(\boldsymbol{{x}}^d, \cdot\big) = N\big(\boldsymbol{{x}}^d, \sigma^2_d\textbf{I}_d\big)$ , where for some constant $l \in \mathbb{R}^{+}$ ,
The proposal $Q_d$ is used to generate a d-dimensional Markov chain, $\boldsymbol{{X}}^d = \big\{\boldsymbol{{X}}^d_n, n \ge 0\big\}$ , following the accept–reject mechanism with acceptance function $\alpha$ . Under these conditions and with $\alpha = \alpha_{\text{MH}}$ , [Reference Roberts, Gelman and Gilks27] established weak convergence to an appropriate Langevin diffusion for the sequence of 1-dimensional stochastic processes constructed from the first component of these Markov chains. Since the coordinates are independent and identically distributed (i.i.d.), this limit informs the limiting behaviour of the full Markov chain in high dimensions. In what follows, we extend the results of [Reference Roberts, Gelman and Gilks27] to the class of acceptance functions $\mathcal{A}$ as defined in Definition 1.
Let $\big\{\boldsymbol{{Z}}^d, d > 1\big\}$ be a sequence of processes constructed by speeding up the Markov chains by a factor of d as follows:
Suppose $\big\{\eta_d\,:\, \mathbb{R}^d \to \mathbb{R}\big\}$ is a sequence of projection maps such that $\eta_d\big(\boldsymbol{{x}}^d\big) = x_1^d$ . Define a new sequence of 1-dimensional processes $\big\{U^d, d > 1\big\}$ as follows:
Under stationarity, we show that $\big\{U^d, d > 1\big\}$ converges weakly in the Skorokhod topology [Reference Ethier and Kurtz10] to a Markovian limit U. We denote weak convergence of processes in the Skorokhod topology by ‘ $\Rightarrow$ ’ and standard Brownian motion at time t by $B_t$ . The proofs are in the appendices.
Theorem 1. Let $\big\{\boldsymbol{{X}}^d, d \ge 1\big\}$ be the sequence of $\boldsymbol{\pi}_d$ -invariant Markov chains constructed using the acceptance function $\alpha$ and proposal $Q_d$ such that $\boldsymbol{{X}}^d_0 \sim \boldsymbol{\pi}_d$ . Further, suppose $\alpha \in \mathcal{A}$ and $\boldsymbol{\pi}_d$ satisfies Assumption 1. Then $U^d \Rightarrow U$ , where U is a diffusion process that satisfies the Langevin stochastic differential equation,
with $h_{\alpha}(l) = l^2 M_{\alpha}(l)$ , where
and
Remark 1. Since $\alpha_{\text{MH}} \in \mathcal{A}$ , our result aligns with [Reference Roberts, Gelman and Gilks27], because
Remark 2. For symmetric proposals, Definition 1 requires $\alpha$ to be a function of only the ratio of the target densities at the two contested points. Thus, the result is not applicable to acceptances in [Reference Banterle, Grazian, Lee and Robert1, Reference Mira22, Reference Vats, Gonçalves, Łatuszyński and Roberts39].
In Theorem 1, $h_{\alpha}(l)$ is the speed measure of the limiting diffusion process and so the optimal choice of l is $l^*$ such that
Denote the average acceptance probability by
and the asymptotic acceptance probability by $\alpha(l) \,:\!=\, \lim_{d \to \infty} \alpha_{d}(l)$ . The dependence on l is through the variance of the proposal kernel. We then have the following corollary.
Corollary 1. Under the setting of Theorem 1, we obtain $\alpha(l) = M_{\alpha}(l)$ , and the asymptotically optimal acceptance probability is $M_{\alpha}(l^*)$ .
Corollary 1 is of considerable practical relevance, since for different acceptance functions it yields the optimal target acceptance probability to tune to.
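Corollary 1 is easy to probe empirically. For a standard Gaussian target, the roughness constant $I$ of [Reference Roberts, Gelman and Gilks27] equals 1, and random-walk MH with $\sigma_d^2 = l^2/d$ at the well-known optimal value $l \approx 2.38$ should accept at close to the asymptotic $0.234$ rate even at moderate d. A simulation sketch (our own code, under these stated assumptions):

```python
import numpy as np

def rw_chain(g, d, l, n_iter, rng):
    """Average acceptance rate of a random-walk chain with balancing
    function g on a standard Gaussian target in dimension d."""
    sigma = l / np.sqrt(d)                    # sigma_d^2 = l^2 / d
    x = rng.standard_normal(d)                # start in stationarity
    log_pi = lambda v: -0.5 * np.dot(v, v)    # log-density up to a constant
    acc = 0
    for _ in range(n_iter):
        y = x + sigma * rng.standard_normal(d)
        s = np.exp(log_pi(y) - log_pi(x))
        if rng.uniform() < g(s):
            x = y
            acc += 1
    return acc / n_iter

rng = np.random.default_rng(7)
# empirical average acceptance rate; should sit near 0.234 for d = 50
acc_mh = rw_chain(lambda s: min(1.0, s), d=50, l=2.38, n_iter=20_000, rng=rng)
```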
3.1. Optimal results for some acceptance functions
In Section 2, we discussed some important members of the class $\mathcal{A}$ . Corollary 1 can then be used to obtain the AOAR for them by maximizing the speed measure of the limiting diffusion process. For Barker’s algorithm, from Theorem 1 and (6), the speed measure $h_{\text{B}}(l)$ of the corresponding limiting process is $h_{\text{B}}(l) = l^2M_{\text{B}}(l)$ , where
Maximizing $h_{\text{B}}(l)$ , the optimal value $l^*$ is approximately (see Appendix C)
By Corollary 1, using this $l^*$ yields an asymptotic acceptance rate of approximately $0.158$ . Hence, when the optimal variance is not analytically tractable in high dimensions, one may consider tuning the algorithm so as to achieve an acceptance probability of approximately $0.158$ . Additionally, the right plot in Figure 1 verifies that the relative efficiency of Barker’s versus MH, as measured by the ratio of their respective speed measures for a fixed l, remains above $0.5$ [Reference Łatuszyński and Roberts18, Theorem 4]; this relative efficiency increases as l increases. The ratio of the speed measures of Barker’s versus MH at their respective optimal scaling is $0.72$ . This quantifies the loss in efficiency in running the best version of Barker’s compared to the best version of the MH algorithm. We can also study the respective speed measures as a function of the acceptance rate; this is given in the left plot in Figure 1. We find that as the asymptotic acceptance rate increases, the speed measure for Barker’s decreases more rapidly than that of MH. This suggests that there is much to gain by appropriately tuning Barker’s algorithm.
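The $0.72$ figure can be reproduced numerically. Writing $h(l) = l^2 M(l)$ with $M(l) = \mathbb{E}[g(e^X)]$ for $X \sim N(-\theta/2, \theta)$ and $\theta = l^2 I$ (as in Theorem 1 and Proposition 1), and normalizing $I = 1$, the sketch below (our own, using standard quadrature and bounded scalar minimization) maximizes the speed measures of MH and Barker's and compares them at their respective optima:

```python
import numpy as np
from scipy.integrate import quad
from scipy.optimize import minimize_scalar
from scipy.stats import norm

def M(g, theta):
    """E[g(e^X)] for X ~ N(-theta/2, theta), via X = sqrt(theta) Z - theta/2."""
    f = lambda z: g(np.exp(np.sqrt(theta) * z - theta / 2.0)) * norm.pdf(z)
    return quad(f, -10.0, 10.0)[0]

def optimal_speed(g):
    # h(l) = l^2 M(l) = theta * M(theta) when I = 1; maximize over theta > 0
    res = minimize_scalar(lambda t: -t * M(g, t), bounds=(0.1, 30.0),
                          method="bounded")
    return res.x, -res.fun                    # (theta*, optimal speed)

theta_mh, h_mh = optimal_speed(lambda s: min(1.0, s))     # MH
theta_b,  h_b  = optimal_speed(lambda s: s / (1.0 + s))   # Barker's
rel_eff = h_b / h_mh                          # approx. 0.72 at the optima
```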
For lower dimensions, the optimal acceptance rate is higher than the AOAR. Figure 2 shows optimal values for the MH and Barker’s algorithms on isotropic Gaussian targets in dimensions 1 to 10, the proposal kernel being the same as in the setting of Theorem 1. This plot is produced using the criterion of minimizing first-order autocorrelations in each component [Reference Gelman, Roberts and Gilks11, Reference Roberts and Rosenthal28, Reference Roberts and Rosenthal29]. For $\alpha_{\text{MH}}$ and $\alpha_{\text{B}}$ , the optimal acceptance rates in one dimension are $0.43$ and $0.27$ respectively.
For lazy MH with $\varepsilon \in [0, 1]$ , Corollary 1 implies that the AOAR of the algorithm is $(1 - \varepsilon)0.234$ with the same optimal $l^*$ as MH. For the acceptance functions $\alpha_h^{\text{H}}$ in (8),
With $h = 0$ , we obtain the result of [Reference Roberts, Gelman and Gilks27] for MH. Further, the left panel of Figure 3 highlights that as $h \to 0$ , the AOAR increases to $0.234$ and the algorithm worsens as h increases. Moreover, for $h \approx 1.913$ , the AOAR is roughly $0.158$ , i.e. equivalent to Barker’s acceptance function.
Lastly, the AOARs for $\alpha_r^{\text{R}}$ in (7) are available. For $r = 1, \dots, 10$ , these are plotted in the right plot of Figure 3. As anticipated, the AOAR approaches $0.234$ as r increases. Notice that $\alpha_2^{\text{R}}$ yields an AOAR of $0.197$ , a considerable increase over $\alpha_{\text{B}} = \alpha_1^{\text{R}}$ . Table 1 below summarizes the results of this section. (Code for all plots and tables is available at https://github.com/Sanket-Ag/BarkerScaling.)
4. Numerical results
We study the estimation quality for different expectations as a function of the proposal variance (acceptance rate) for the generalized Barker acceptance function, $\alpha_r^{\text{R}}$ . We focus on $r = 1$ (Barker’s algorithm) and $r = 2$ . Suppose $f\,:\,\mathbb{R}^d \to \mathbb{R}$ is the function whose expectation with respect to $\boldsymbol{\pi}_d$ is of interest. Let $\{f(\boldsymbol{{X}}_n)\}$ be the mapped process. Similarly to [Reference Roberts and Rosenthal29], we assess the choice of proposal variance by the convergence time:
where $\rho_k$ is the lag-k autocorrelation in $\{f(\boldsymbol{{X}}_n)\}$ . In each of the following simulations, the convergence time is estimated by averaging over $10^3$ replications of Markov chains, each of length $10^6$ , with $k = 1$ . We consider a range of values of l, where $\sigma^2_d = l^2/d$ in the Gaussian proposal kernel $Q_d\big(\boldsymbol{{x}}^d, \cdot\big) = N\big(\boldsymbol{{x}}^d, \sigma^2_d \textbf{I}_d\big)$ .

Consider first the case of an isotropic target, $\boldsymbol{\pi}_{d} = N_{d}(\boldsymbol{0}, \textbf{I}_{d})$ , with isotropic Gaussian proposals; the conditions of Theorem 1 are satisfied. The estimated convergence time for $f(\boldsymbol{{x}}) = x_1$ and $f(\boldsymbol{{x}}) = \bar{{x}}$ , where $\bar{{x}}$ is the mean of all components $x_1, \dots, x_d$ , is plotted in Figure 4 (top row). Here, $d = 50$ . For both functions of interest, the optimal performance, i.e. the minimum convergence time, corresponds to an acceptance rate of approximately $0.158$ for $\alpha_{\text{B}}$ and $0.197$ for $\alpha_2^{\text{R}}$ ; the slight overestimation is due to the finite-dimensional setting.

Next, we consider $\boldsymbol{\pi}_d = N_d(\boldsymbol{0}, \boldsymbol{\Sigma}_d)$ , where $\boldsymbol{\Sigma}_d$ is a $d \times d$ matrix with 1 on its diagonal and all off-diagonal elements equal to some non-zero $\rho$ . Here, the assumptions of Theorem 1 are not satisfied. For such a target and for $\alpha_{\text{MH}}$ , [Reference Roberts and Rosenthal29] showed that the rate of convergence of the algorithm is governed by the eigenvalues of $\boldsymbol{\Sigma}_d$ . In particular, the eigenvalues of $\boldsymbol{\Sigma}_d$ are $d\rho + 1 - \rho$ and $1 - \rho$ , with associated eigenvectors $\boldsymbol{{y}}$ such that $\boldsymbol{{y}}^T\boldsymbol{{x}}$ yields $\bar{{x}}$ and $x_i - \bar{{x}}$ (for $i = 1,\dots,d$ ), respectively.
It was further shown that the algorithm converges quickly for functions orthogonal to $\bar{{x}}$ , but much more slowly for $\bar{{x}}$ itself. Despite the differing rates of convergence, the optimal acceptance rate, corresponding to the minimum convergence time, remains the same. We find this to be true also for $\alpha_{\text{B}}$ and $\alpha^{\text{R}}_2$ , as illustrated in Figure 4 (bottom row), where we present convergence times for $x_1 - \bar{{x}}$ and $\bar{{x}}$ . Once again, $d = 50$ . The large difference between the convergence times of the two functions is evident from the y-axes of the two plots. The minimum again lies in a region around the asymptotic optimum. We note that, because of the slow convergence for $\bar{{x}}$ , the process mixes slowly, yielding more variable estimates of the convergence time. In both simulation settings, we see the expected improvement in convergence time for $\alpha_2^{\text{R}}$ compared to $\alpha_{\text{B}}$ .
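A stripped-down version of the first experiment can be run in a few lines (our own code, with a single short chain rather than the $10^3$ replications of length $10^6$ used above): a Barker chain on $N_{50}(\boldsymbol{0}, \textbf{I}_{50})$ with $l^2 \approx 6.03$ should accept at close to the asymptotic $0.158$ rate, and its first component should be heavily autocorrelated at lag 1:

```python
import numpy as np

def barker_chain(d, l, n_iter, rng):
    """Barker random-walk chain on a standard Gaussian target in dimension d."""
    sigma = l / np.sqrt(d)                      # sigma_d^2 = l^2 / d
    x = rng.standard_normal(d)                  # start in stationarity
    first, acc = np.empty(n_iter), 0
    for n in range(n_iter):
        y = x + sigma * rng.standard_normal(d)
        log_s = -0.5 * (np.dot(y, y) - np.dot(x, x))       # log pi(y) - log pi(x)
        if rng.uniform() < 1.0 / (1.0 + np.exp(-log_s)):   # g_B(s) = s / (1 + s)
            x = y
            acc += 1
        first[n] = x[0]                         # track the first component
    return first, acc / n_iter

rng = np.random.default_rng(3)
first, acc_b = barker_chain(d=50, l=np.sqrt(6.03), n_iter=20_000, rng=rng)
rho1 = np.corrcoef(first[:-1], first[1:])[0, 1]     # lag-1 autocorrelation
```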
4.1. A Bayesian logistic regression example
We consider fitting a Bayesian logistic regression model to the famous Titanic dataset, which contains information on crew and passengers aboard the 1912 RMS Titanic. Let $\boldsymbol{{y}}$ denote the response vector (indicating whether each person survived), and let $\boldsymbol{{X}}$ denote the $n \times d$ model matrix; here $d = 10$ . We assume a multivariate zero-mean Gaussian prior on the regression coefficient vector $\boldsymbol{\beta}$ , with covariance $100\textbf{I}_{10}$ . The resulting target density is
For the Titanic dataset, the resulting posterior has a complicated covariance structure, with many pairs of components exhibiting absolute correlations beyond $0.50$ . The posterior is also ill-conditioned, the condition number of the estimated target covariance matrix being $\approx 10^5$ . As seen in the bottom row of Figure 4, in such situations an isotropic proposal kernel may perform poorly for most functions. We instead consider a Gaussian proposal scheme in which the proposal covariance matrix is proportional to the target covariance matrix. This is a common strategy for dealing with targets with correlated components and forms the basis of many adaptive MCMC kernels [Reference Roberts and Rosenthal30]. We implement Barker’s algorithm to sample from the posterior. Let $\boldsymbol{\Sigma}_d$ denote the covariance matrix of the posterior distribution of $\boldsymbol{\beta}$ ; the proposal kernel is then $Q_d\big(\boldsymbol{{x}}^d, \cdot\big) = N\big(\boldsymbol{{x}}^d, \sigma^2_d \boldsymbol{\Sigma}_d\big)$ . Since $\boldsymbol{\Sigma}_d$ is unavailable, we estimate it from a pilot MCMC run of size $10^7$ . We then consider various values of $\sigma^2_d = l^2/d$ .
The performance of the algorithm for different functions of interest is plotted in Figure 5. Since this is a 10-dimensional problem, the optimal acceptance rate from Figure 2 is approximately $0.18$ . The convergence times for $\beta_1 - \bar{\beta}$ and $\bar{\beta}$ are similar, and both are minimized at approximately the same acceptance rate of $0.18$ . It is natural here to be interested in estimating the posterior mean vector, so we also study estimation of the full vector $\boldsymbol{\beta}$ , with efficiency measured via the multivariate effective sample size (ESS) [Reference Vats, Flegal and Jones38]. The ESS returns the equivalent number of i.i.d. samples from $\boldsymbol{\pi}$ that would yield the same variability in estimating the posterior mean as the given set of MCMC samples. In Figure 5, we see that the optimal acceptance rate, corresponding to the highest ESS values, is again achieved around $0.18$ .
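The preconditioned proposal used here amounts to adding $\sigma_d L Z$ to the current state, with L the Cholesky factor of the (estimated) covariance and Z standard normal. The sketch below is our own illustration with a synthetic 2-dimensional correlated Gaussian standing in for the posterior; the pilot-run covariance estimate is replaced by the true covariance, and all numbers are for the example only:

```python
import numpy as np

rho = 0.9
Sigma = np.array([[1.0, rho], [rho, 1.0]])     # stand-in for the estimated covariance
L = np.linalg.cholesky(Sigma)
Sigma_inv = np.linalg.inv(Sigma)
log_pi = lambda v: -0.5 * v @ Sigma_inv @ v    # N(0, Sigma) target, up to a constant

d, l = 2, 2.4
sigma = l / np.sqrt(d)                          # sigma_d^2 = l^2 / d
rng = np.random.default_rng(11)
x = L @ rng.standard_normal(d)                  # start in stationarity
draws, acc = np.empty((20_000, d)), 0
for n in range(draws.shape[0]):
    y = x + sigma * (L @ rng.standard_normal(d))        # proposal N(x, sigma^2 Sigma)
    log_s = log_pi(y) - log_pi(x)
    if rng.uniform() < 1.0 / (1.0 + np.exp(-log_s)):    # Barker's rule
        x = y
        acc += 1
    draws[n] = x
acc_rate = acc / draws.shape[0]
corr_hat = np.corrcoef(draws[:, 0], draws[:, 1])[0, 1]  # recovers the target correlation
```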
5. Conclusions
We have obtained optimal scaling and acceptance rates for a large class of acceptance functions. In doing so, we have found that the scaling factor of $1/d$ for the proposal variance holds for all acceptance functions, indicating that the acceptance functions are not likely to affect the rate of convergence, just the constants associated with that rate. Thus, practitioners need not hesitate in switching to other acceptance functions when the MH acceptance probability is not tractable, as long as Corollary 1 is used to tune their algorithm accordingly. There is also an inverse relationship between the optimal variance and the AOAR (see Table 1), implying that when dealing with sub-optimal acceptance functions, the algorithm seeks larger jumps. The computational cost of the Bernoulli factory that we present for $\alpha_r^{\text{R}}$ in Appendix D increases with r. Given the large jump in the optimal acceptance probability from $r = 1$ to $r = 2$ , the development of more efficient Bernoulli factories is an important problem for future work.

The assumption of starting from stationarity is a restrictive one. For MH with Gaussian proposals, the scaling factor of $1/d$ is still optimal when the algorithm is in the transient phase [Reference Christensen, Roberts and Rosenthal6, Reference Jourdain, Lelièvre and Miasojedow16, Reference Kuntz, Ottobre and Stuart17]. The optimal acceptance probability may, however, vary depending on the starting distribution. We envision that similar results are viable for the general class of acceptance functions, and this is important future work.

Our results are limited to Gaussian proposals and trivially decomposable target densities. Other proposal distributions may make use of the gradient of the target, e.g. the Metropolis-adjusted Langevin algorithm [Reference Roberts and Tweedie31] and Hamiltonian Monte Carlo [Reference Duane, Kennedy, Pendleton and Roweth9].
In problems where $\alpha_{\text{MH}}$ cannot be used, the gradient of the target density is likely unavailable; thus it is reasonable to limit our attention to a Gaussian proposal. On the other hand, generalizations to other target distributions are important. For MH algorithms, [Reference Bédard3, Reference Sherlock and Roberts34] relax the independence assumption, while [Reference Roberts and Rosenthal29] relax the identically distributed assumption. Additionally, [Reference Yang, Roberts and Rosenthal40] present a proof of weak convergence for MH for more general targets, and [Reference Schmon and Gagnon33] provide optimal scaling results for general Bayesian targets using large-sample asymptotics. In these situations, extensions to other acceptance probabilities are similarly possible. Additionally, we encourage future work in optimal scaling to leverage our proof technique to demonstrate results for the wider class of acceptance probabilities.
Appendix A. Proof of Theorem 1
The proof is structurally similar to the seminal work of [Reference Roberts, Gelman and Gilks27], in that we will show that the generator of the sped-up process, $\textbf{Z}^d$ , converges to the generator of an appropriate Langevin diffusion. Define the discrete-time generator of $\boldsymbol{{Z}}^d$ as
for all those V for which the limit exists. Since our interest is in the first component of $\boldsymbol{{Z}}^d$ , we consider only those V which are functions of the first component only. Now, define the generator of the limiting Langevin diffusion process with speed measure $h_{\alpha}(l)$ as
The unique challenge in our result is identifying the speed measure $h_{\alpha}(l)$ for a general acceptance function $\alpha \in \mathcal{A}$ . Proposition 1 is a key result that helps us obtain a form of $h_{\alpha}(l)$ without resorting to approximations.
To prove Theorem 1, we will show that there are events $F_d \subseteq \mathbb{R}^d$ such that for all t,
for a suitably large class of real-valued functions V. Moreover, since $f^{\prime}/f$ is Lipschitz, the class $C_c^{\infty}$ of infinitely differentiable functions with compact support is a core for the generator G [Reference Ethier and Kurtz10, Theorem 2.1, Chapter 8]. Thus, we can limit our attention to only those $V \in C_c^{\infty}$ that are functions of the first component.
Consider now the setup of Theorem 1. Let $w = \log f$ and $\alpha \in \mathcal{A}$ with the balancing function $g_{\alpha}$ . Let w ′ and w ′′ be the first and second derivatives of w, respectively. Define the sequence of sets $\{F_d \subseteq \mathbb{R}^d, d > 1 \}$ by
The following results from [Reference Roberts, Gelman and Gilks27] will be needed.
Lemma 1. ([Reference Roberts, Gelman and Gilks27].) Let Assumption 1 hold. If $\boldsymbol{{X}}^d_0 \sim \boldsymbol{\pi}_d$ for all d, then, for a fixed t, $\mathbb{P}[\boldsymbol{{Z}}_s^d \in F_d,\ 0 \le s \le t] \to 1 \text{ as } d \to \infty\,.$
Lemma 2. ([Reference Roberts, Gelman and Gilks27].) Let Assumption 1 hold. Also, let
where $Y_i \overset{\text{ind}}{\sim} N\big(x_i, \sigma^2_d\big)$ , $i = 2, \dots, d$ . Then $ \sup_{\boldsymbol{{x}}^d \in F_d}\mathbb{E}\!\left[\left|W_d\big(\boldsymbol{{x}}^d\big)\right|\right] \to 0\,$ .
Lemma 3. ([Reference Roberts, Gelman and Gilks27].) For $Y \sim N\big(x, \sigma^2_d\big)$ and $V \in C_c^{\infty}$ ,
For the following proposition, we will utilize the property (2) imposed on $\mathcal{A}$ . This proposition is the key to obtaining our main result in such generality.
Proposition 1. Let $X \sim N({-}\theta/2, \theta)$ for some $\theta > 0$ . Let $\alpha \in \mathcal{A}$ with the corresponding balancing function $g_\alpha$ . Then $\mathbb{E}\!\left[ Xg_{\alpha}\big(e^X\big)\right] = 0$ .
Proof. We have
the second inequality follows from the assumption that $g_{\alpha}$ lies in [0, 1]. Hence, the expectation exists and is equal to the integral
Observe that, using (2),
Hence, the result follows.
Lemma 4. Suppose $V \in C_c^{\infty}$ is restricted to only the first component of $\boldsymbol{{Z}}^d$ . Then
Proof. In the expression for $G_dV\big(\boldsymbol{{x}}^d\big)$ given in (11), we can decompose the proposal $\boldsymbol{{Y}}^d$ into $(Y_1^d, \boldsymbol{{Y}}^{d-})$ and thus rewrite the expectation as follows:
Let $E^{d, \alpha}$ denote the inner expectation in (13) and define $E_{lim}^{d, \alpha}$ as
Also, a Taylor series expansion of w about $x_i^d$ for $i = 2, \dots, d$ gives
for $Z_i$ lying between $x_i^d$ and $Y_i^d$ . Hence, the triangle inequality and the Lipschitz continuity of $g_{\alpha}(e^z)$ give, for some Lipschitz constant $K < \infty$ ,
where $W_d\big(\boldsymbol{{x}}^d\big)$ is as defined in Lemma 2. From Lemma 2, Lemma 3, and (15),
Now let $\varepsilon(y) = \log f(y) - \log f\big(x_1^d\big)$ . Also, from (14), it is clear that given $\boldsymbol{{x}}^d$ , $E^{d,\alpha}_{lim}$ is a function of $Y_1^d$ alone, to wit,
where $B_d \sim N(\mu_d, \Sigma_d)$ with $ \mu_d = \varepsilon\big(Y_1^d\big) - l^2R_d/2$ and $\Sigma_d = l^2R_d$ . Thus, by (15), it is enough to consider the asymptotic behaviour of
Let $N_{d, \alpha} = M_{d, \alpha} \circ \varepsilon$ and apply Taylor series expansion on the inner term to obtain
where $K_d, L_d \in \big[Y_1^d, x_1^d\big]$ or $\big[x_1^d, Y_1^d\big]$ , and
Now, for all d,
The derivatives and integral here can be exchanged thanks to the dominated convergence theorem, which yields
where the first term vanishes by Proposition 1. Hence, for all d,
Now, we plug the expressions obtained above into the Taylor series expansion of $\left(V\big(Y^d_1\big) - V\big(x^d_1\big)\right)M_{d, \alpha}(\varepsilon\big(Y_1^d\big))$ . The rest of the proof, with the help of Assumption 1, follows similarly as in [Reference Roberts, Gelman and Gilks27, Lemma 2.6].
Proof of Theorem 1. From Lemma 4, we have uniform convergence of generators on the sequence of sets with limiting probability 1. Thus, by Corollary 8.7 in [Reference Ethier and Kurtz10, Chapter 4], we have the required result of weak convergence (the condition that $C_c^{\infty}$ separates points was verified by [Reference Roberts, Gelman and Gilks27]).
Appendix B. Proof of Corollary 1
Lemma 5. Let $E^{d, \alpha}$ be the inner expectation in (13), and let $E_{lim}^{d, \alpha}$ be from (14). Then
Proof. Consider
The second term goes to 0, since the expectation is bounded and by construction $P\big(\boldsymbol{{x}}^d \in F_d^C\big)\to 0$ as $d \to \infty$ . Also, following [Reference Roberts, Gelman and Gilks27],
Then
Proof of Corollary 1. Consider Equation (17). Using the Taylor series approximation of second order around $x_1$ ,
where $W_{d,1} \in \big[x_1^d, Y_1^d\big]$ or $\big[Y_1^d, x_1^d\big]$ . Since N ′′ is bounded [Reference Roberts, Gelman and Gilks27],
As all expectations exist, we can split the inner expectation and use Lemma 5, so that
The last equality is by the law of large numbers and the continuous mapping theorem.
Appendix C. Optimizing speed for Barker’s acceptance
We need to maximize $h_{\text{B}}(l) = l^2M_{\text{B}}(l)$ . Let I be fixed arbitrarily. Then
For a fixed I, we can reparametrize the function by taking $\theta = l^2I$ , and so maximizing $h_{\text{B}}(l)$ over positive l will be equivalent to maximizing $h^1_{\text{B}}(\theta)$ over positive $\theta$ , where
We make the substitution $z = (b + \theta/2)/\sqrt{\theta}$ in the integrand to obtain
where the expectation is taken with respect to $Z \sim N(0,1)$ . This expectation is not available in closed form. However, standard numerical integration routines yield the optimal value of $\theta$ to be $6.028$ . This implies that the optimal value of l, say $l^*$ , is approximately equal to
Using this $l^*$ yields an AOAR of approximately $0.158$ .
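The numerical routine referred to above can be reproduced in a few lines. This is our own sketch (not the authors' code), using standard quadrature and bounded scalar minimization; any constant factor in front of $\theta\,\mathbb{E}[\,\cdot\,]$ is irrelevant to the argmax, so it is dropped:

```python
import numpy as np
from scipy.integrate import quad
from scipy.optimize import minimize_scalar
from scipy.stats import norm

def M_barker(theta):
    """E[g_B(e^X)] for X ~ N(-theta/2, theta), via X = sqrt(theta) Z - theta/2."""
    f = lambda z: norm.pdf(z) / (1.0 + np.exp(-(np.sqrt(theta) * z - theta / 2.0)))
    return quad(f, -10.0, 10.0)[0]

# maximize theta * M_barker(theta) over theta > 0 (h_B^1 up to a constant)
res = minimize_scalar(lambda t: -t * M_barker(t), bounds=(0.1, 30.0),
                      method="bounded")
theta_star = res.x                 # approx. 6.028
aoar = M_barker(theta_star)        # approx. 0.158, the AOAR of Corollary 1
```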
Appendix D. Bernoulli factory
To sample events of probability $\alpha_{\text{B}}$ , the two-coin algorithm, an efficient Bernoulli factory, was presented in [Reference Gonçalves, Łatuszyński and Roberts13]. Generalizing this to a die-coin algorithm, we present a Bernoulli factory for $\alpha_r^{\text{R}}$ with $r = 2$ ; extensions to other r can be done similarly. Let $\pi(x) = c_x p_x$ with $p_x \in [0, 1]$ and $c_x > 0$ . Then
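The two-coin algorithm itself can be sketched as follows (our own minimal implementation of the $r = 1$ case; `coin_x` and `coin_y` are assumed samplers of Bernoulli( $p_x$ ) and Bernoulli( $p_y$ ) events, and only the constants $c_x$ and $c_y$ need be known explicitly). It returns an event of probability $c_y p_y / (c_x p_x + c_y p_y) = \alpha_{\text{B}}(x, y)$ without ever evaluating $p_x$ or $p_y$:

```python
import numpy as np

def two_coin(c_x, c_y, coin_x, coin_y, rng):
    """Event of probability c_y p_y / (c_x p_x + c_y p_y), i.e. Barker's rule."""
    while True:
        if rng.uniform() < c_y / (c_x + c_y):   # propose the "y" side
            if coin_y():                        # p_y-coin heads: accept
                return True
        else:                                   # propose the "x" side
            if coin_x():                        # p_x-coin heads: reject
                return False
        # tails on either coin: discard and start over

# toy check: p_x, p_y are known here only to build the coins themselves
rng = np.random.default_rng(5)
c_x, p_x, c_y, p_y = 1.5, 0.4, 2.0, 0.7
coin_x = lambda: rng.uniform() < p_x
coin_y = lambda: rng.uniform() < p_y
est = np.mean([two_coin(c_x, c_y, coin_x, coin_y, rng) for _ in range(20_000)])
target = c_y * p_y / (c_x * p_x + c_y * p_y)    # the exact acceptance probability
```

The die-coin generalization replaces the initial binary choice by a multi-sided one; its expected number of coin flips grows with r.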
Acknowledgements
The authors thank the referees and the editor for comments that helped improve the work.
Funding information
D. Vats is supported by SERB grant SPG/2021/001322; K. Łatuszyński is supported by the Royal Society through the Royal Society University Research Fellowship; and G. Roberts is supported by the EPSRC grants CoSInES (EP/R034710/1) and Bayes for Health (EP/R018561/1).
Competing interests
There were no competing interests to declare which arose during the preparation or publication process of this article.