
An investigation of the power for separating closely linked QTL in experimental populations

Published online by Cambridge University Press:  14 October 2010

CHEN-HUNG KAO*
Affiliation:
Institute of Statistical Science, Academia Sinica, Taipei 11529, Taiwan, Republic of China
MIAO-HUI ZENG
Affiliation:
Institute of Statistical Science, Academia Sinica, Taipei 11529, Taiwan, Republic of China
*Corresponding author: Institute of Statistical Science, Academia Sinica, Taipei 11529, Taiwan, Republic of China. Tel: (02) 2783-5611 ext 418. Fax: (02) 2783-1523. e-mail: [email protected]

Summary

Hu & Xu (2008) developed a statistical method for computing the statistical power for detecting a quantitative trait locus (QTL) located in a marker interval. Their method is based on the regression interval mapping method and allows experimenters to effectively investigate the power for detecting a QTL in a population. This paper continues to work on the power analysis of separating multiple-linked QTLs. We propose simple formulae to calculate the power of separating closely linked QTLs located in marker intervals. The proposed formulae are simple functions of information numbers, variance inflation factors and genetic parameters of a statistical model in a population. Both regression and maximum likelihood interval mappings suitable for detecting QTL in the marker intervals are considered. In addition, the issue of separating linked QTLs in the progeny populations from an F2 subject to further self and/or random mating is also touched upon. One of the primary keys to our approach is to derive the genotypic distributions of three and four loci for evaluating the correlation structures between pairwise unobservable QTLs in the model across populations. The proposed formulae allow us to predict the power of separation when several factors, such as sample sizes, sizes and directions of QTL effects, distances between QTLs, interval sizes and relative QTL positions in the intervals, are considered together at a time in different experimental populations. Numerical justifications and Monte Carlo simulations were provided for confirmation and illustration.

Type
Research Papers
Copyright
Copyright © Cambridge University Press 2010

Introduction

The calculation of the statistical power of quantitative trait locus (QTL) detection has been an important problem in QTL mapping. Soller et al. (1976) and Lander & Botstein (1989) discussed the power of QTL detection when a QTL is coincident with a genetic marker. Hu & Xu (2008) developed a simple method to calculate the statistical power of detecting a QTL located in the interval flanked by two markers in a population. On the basis of the regression (REG) interval mapping model (Haley & Knott, 1992), their method can predict the power of QTL detection given factors such as the size and position of the QTL, sample size and interval size, by evaluating a non-central F-distribution function. It has been noticed that closely linked QTLs might be mistakenly estimated as a single (ghost) QTL with a larger effect at the wrong position if their effects are in the same direction, or they might escape detection if their effects are in opposite directions (Lander & Botstein, 1989; Kao & Zeng, 1997; Ronin et al., 1999). Therefore, the study of separating linked QTLs to improve the QTL resolution remains an important issue. Ronin et al. (1999) derived the asymptotic expected LOD values based on the non-central chi-squared distribution for the study of two linked QTLs coincident with markers. Mayer (2005) compared REG interval mapping and maximum likelihood (ML) interval mapping in detecting two linked QTLs using Monte Carlo simulations. So far, analytical methods for the power analysis of detecting linked QTLs situated in the marker intervals have not been fully developed. We propose statistical methods for calculating the power for separating closely linked QTLs located in the intervals. The proposed formulae are based on the information numbers (the inverse of the variance of the best estimated QTL effects), variance inflation factors (caused by the correlations between linked QTLs) and genetic parameters of statistical models in a population. Both REG and ML interval mapping are considered in the formulation. Further, the power analyses of the above-mentioned papers mainly focus on the genome structures of the backcross (BC) and F2 populations (Hu & Xu also discussed the power calculation in doubled haploid (DH) and recombinant inbred line (RIL) populations). We also discuss the separation of closely linked QTLs in other experimental populations, such as advanced intercross (AI) and recombinant inbred (RI) populations, subject to more meiosis cycles. One important key to the proposed method is to derive the genotypic distributions of three and four loci to characterize the correlation structures between pairwise QTL variables in the model for different populations. In general, we found that, given a distance between QTLs, separation is more powerful for QTLs that are of similar size, have effects in opposite directions, are located closer to markers and in narrow intervals, and contribute a high proportion of the trait variation. More advanced populations may facilitate the separation of linked QTLs by providing more recombinants and changing genome structures. Numerical and simulated results are presented for confirmation.

Methods

Test statistic for detecting a QTL

Consider an F2 population or its progeny populations produced by further selfing and/or intercrossing the F2 individuals for different numbers of generations. There are three possible genotypes, P1 homozygote, heterozygote and P2 homozygote, for any gene. Let Q_jQ_j, Q_jq_j and q_jq_j be the three possible genotypes of a QTL, say Q_j, under consideration in a population. For an individual i in a random sample of size n, let x_ij* represent the coded variable of the QTL genotype as

$$x_{ij}^{*}=\begin{cases}1 & \text{if the QTL genotype is } Q_jQ_j,\\ 0 & \text{if the QTL genotype is } Q_jq_j,\\ -1 & \text{if the QTL genotype is } q_jq_j,\end{cases}\tag{1}$$

and a_j denote its additive effect. Similarly, it is straightforward to construct the coded variable x_ik* for another QTL, say Q_k, with additive effect a_k for a model taking multiple QTLs into account. When Q_j, flanked by the left marker M_j and right marker N_j with alleles (M_j, m_j) and (N_j, n_j), is considered, the conditional expectation of x_ij* given M_j and N_j, w_ij = E(x_ij* | M_j, N_j), is used as the predictor variable in the REG interval mapping model (Haley & Knott, 1992). For a single QTL model, Hu & Xu (2008) have shown that the test statistic

(2)

(where σ² is the residual error variance) follows a central F-distribution under the null hypothesis (H0: a_j = 0). Under the alternative hypothesis (H1: a_j ≠ 0), this test statistic follows a non-central F-distribution with the non-centrality parameter δ = n × var(w_ij) × a_j²/σ². The non-centrality parameter is a function of several important factors: sample size, variance of the predictor variable, QTL effect and residual error variance. By analysing these factors, the power for detecting a QTL can be predicted for different situations. For example, Hu & Xu (1998) analysed one of the key factors, the variance of the coded variable, var(w_ij), by deriving its different formulations for the BC, F2, RIL and DH populations. When n is sufficiently large, Σ(w_ij − w̄_j)² ≈ n × var(w_ij), and the variance of the estimated effect is var(â_j) = σ²/[n × var(w_ij)]. When var(w_ij) is small, var(â_j) is large and δ is small, leading to lower power in QTL detection.
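As a rough numerical illustration of how the non-centrality parameter drives the power, the following Python sketch replaces the non-central F-distribution by a large-sample normal approximation. That substitution, the function names and the conversion of "5% of the trait variation" into a_j²/σ² (taking var(x_ij) = 1/2 in the F2 and σ² = 1) are assumptions of the sketch, not the computation used by Hu & Xu.

```python
from math import sqrt
from statistics import NormalDist

STD_NORMAL = NormalDist()

def noncentrality(n, var_w, a, sigma2):
    """delta = n * var(w_ij) * a_j^2 / sigma^2."""
    return n * var_w * a * a / sigma2

def approx_power(delta, alpha):
    """Two-sided power when the F test with 1 numerator df is approximated
    by a squared normal deviate shifted by sqrt(delta) (large n only)."""
    c = STD_NORMAL.inv_cdf(1.0 - alpha / 2.0)
    shift = sqrt(delta)
    return (1.0 - STD_NORMAL.cdf(c - shift)) + STD_NORMAL.cdf(-c - shift)

# Hypothetical F2 setting: the QTL explains 5% of the trait variance,
# sigma^2 = 1 and var(x_ij) = 1/2, so a^2 = 0.05 / (0.95 * 0.5).
a = sqrt(0.05 / (0.95 * 0.5))
delta = noncentrality(n=252, var_w=0.450, a=a, sigma2=1.0)
power = approx_power(delta, alpha=0.01)
print(round(delta, 2), round(power, 3))
```

With δ = 0 the function returns α, as expected under the null hypothesis; a smaller var(w_ij) lowers δ and therefore the approximate power.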

Variances of predictor variables

The aim of this study is to calculate the power for detecting two or more closely linked QTLs and to extend the power analysis to populations beyond the F2 using both REG and ML interval mapping. When analysing the power for detecting one QTL, we only need to understand the asymptotic behaviour of the variances of predictor variables to construct the test statistic for power analysis, as has been done by Hu & Xu (2008). For dissecting linked QTLs, we should further derive the covariances between different QTL predictor variables to obtain the asymptotic variance–covariance matrix of QTL parameters for power analysis. An important step in obtaining the variances and covariances of the predictor variables is to characterize the genotypic distributions of multiple genes in the populations. For example, evaluating E(w_ij) and var(w_ij) in the BC of the F1 to a population M_jN_j/M_jN_j requires considering the four flanking marker genotypes of two genes, M_jN_j/M_jN_j, M_jN_j/M_jn_j, M_jN_j/m_jN_j and M_jN_j/m_jn_j, with frequencies (1−r)/2, r/2, r/2 and (1−r)/2, where r is the recombination fraction between M_j and N_j (Xu, 1995). Evaluating them in the F2 between two populations, M_jN_j/M_jN_j and m_jn_j/m_jn_j, requires taking into account ten marker genotypes of two genes, M_jN_j/M_jN_j, m_jn_j/m_jn_j, M_jn_j/M_jn_j, m_jN_j/m_jN_j, M_jN_j/M_jn_j, M_jN_j/m_jN_j, M_jn_j/m_jn_j, m_jN_j/m_jn_j, M_jN_j/m_jn_j and M_jn_j/m_jN_j, with frequencies (1−r)²/4, (1−r)²/4, r²/4, r²/4, r(1−r)/2, r(1−r)/2, r(1−r)/2, r(1−r)/2, (1−r)²/2 and r²/2 (Hu & Xu, 2008). In the progeny populations from the F2, these ten genotypic frequencies change over populations. For AI populations subject to more cycles of random mating, the well-known formula P′(M_jN_j) = (1−r) × P(M_jN_j) + r × P(M_j) × P(N_j) can be used to obtain the genotypic frequencies, where P′(M_jN_j) is the frequency of M_jN_j in the next generation.
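The random-mating recurrence can be checked numerically. The sketch below iterates it for the M_jN_j haplotype and compares the accumulated recombinant fraction with the closed form r_t = [1 − (1−r)^(t−2)(1−2r)]/2 that the paper uses for AI F_t populations. It assumes allele frequencies stay at 1/2 (no selection or drift) and the Haldane map function; the function names are illustrative.

```python
from math import exp

def haldane_r(d_cm):
    """Recombination fraction for a map distance in cM (Haldane map function)."""
    return 0.5 * (1.0 - exp(-2.0 * d_cm / 100.0))

def ai_haplotype_freq(r, t):
    """Frequency of the M_jN_j haplotype among the gametes forming AI F_t (t >= 2),
    iterating P'(MN) = (1 - r) P(MN) + r P(M) P(N) with P(M) = P(N) = 1/2."""
    p_mn = (1.0 - r) / 2.0          # gametes produced by the F1
    for _ in range(t - 2):
        p_mn = (1.0 - r) * p_mn + r * 0.25
    return p_mn

def r_t_closed_form(r, t):
    """Accumulated recombination fraction in AI F_t."""
    return (1.0 - (1.0 - r) ** (t - 2) * (1.0 - 2.0 * r)) / 2.0

r = haldane_r(10.0)
for t in (2, 3, 5, 10):
    # Recombinant haplotypes Mn and mN together make up 1 - 2 P(MN).
    iterated = 1.0 - 2.0 * ai_haplotype_freq(r, t)
    print(t, round(iterated, 5), round(r_t_closed_form(r, t), 5))
```

The two columns agree for every generation, and for t = 2 both reduce to r.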
For RI populations subject to further selfing, Haldane & Waddington's (1931) transition equations can be applied to obtain the ten frequencies. Using the same notation as in Haldane & Waddington's paper, we denote the frequency of the M_jN_j/M_jN_j (m_jn_j/m_jn_j) genotype as C, the frequency of the M_jn_j/M_jn_j (m_jN_j/m_jN_j) genotype as D, the frequency of the M_jN_j/M_jn_j (M_jN_j/m_jN_j, M_jn_j/m_jn_j or m_jN_j/m_jn_j) genotype as E, the frequency of the M_jN_j/m_jn_j genotype as F and the frequency of the M_jn_j/m_jN_j genotype as G in the populations. With such settings, it is straightforward to show that E(w_ij) = 0 and to formulate the variance of w_ij in a population as

(3)

where p_k1 and p_k3, k = 1, 2, …, 10, are the conditional probabilities of the Q_jQ_j and q_jq_j genotypes given the ten flanking marker genotypes, and f_k are the frequencies of the ten marker genotypes for two flanking markers (C, D, E, F and G). Note that the derivation of p_k1 and p_k3 is not as straightforward as in the BC and F2 populations, and it involves using the genotypic distributions of three genes (Kao & Zeng, 2009). If the event of double recombinations within a marker interval is ignored, equation (3) can be explicitly formulated as

$$\operatorname{var}(w_{ij})\approx 2(C+D+E)-4p(1-p)(E+2D),\tag{4}$$

where p = r_1/r (r and r_1 are the recombination fractions between (M_j, N_j) and (M_j, Q_j)). It is interesting to analyse equation (4) to gain some insight into var(w_ij). In equation (4), var(w_ij) is bounded by 2(C+D+E), which is the variance of a fully observed QTL-coded variable. The term p(1−p) measures the relative QTL position in a marker interval, and E+2D measures the interval size. As the marker interval becomes wider or the QTL gets closer to the centre of the interval, E+2D or p(1−p) becomes larger, and the value of var(w_ij) becomes smaller. In the F2 population, 2(C+D+E) = 1/2, E+2D = r/2 and var(w_ij) = 1/2 − 2rp(1−p), which is bounded by 1/2. In AI F_t populations, 2(C+D+E) = 1/2 and E+2D = r_t/2, where r_t = [1 − (1−r)^(t−2)(1−2r)]/2. The variance var(w_ij) = 1/2 − 2r_t p(1−p), with p = r_1t/r_t, is also bounded by 1/2 and decreases in the later populations. In RI populations, 2(C+D+E) is between 1/2 and 1, and E+2D is between r/2 and 2r/(1+2r). The value of var(w_ij) increases as the population advances. In the RIL, 2(C+D+E) = 1, E+2D = 2r/(1+2r) and var(w_ij) = 1 − (8r/(1+2r))p(1−p), which is bounded by 1. Similarly, the variance of the predictor variable for the dominance effect is approximately 1/4 − r/2 × {1 − r[(1−2p(1−p))² + 2p(1−p)]}, which is bounded by 1/4 in the F2 population. The variance of the predictor variable is var(w_ij) = 1/4 − rp(1−p), bounded by 1/4, in the BC population (see also Xu, 1995). Hu & Xu (2008) formulated var(w_ij) in the F2, RIL and DH populations when double recombinations in the intervals are considered. In general, the larger the variance of a predictor variable, the greater the power in QTL detection.
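The special cases listed above for the F2, AI F_t and RIL populations share the form 2(C+D+E) − 4p(1−p)(E+2D). The sketch below evaluates them for a QTL in the middle of a 10 cM interval. It is a minimal illustration that ignores double recombination, uses the Haldane map function, and for the RIL takes p = r_1/r from the single-meiosis recombination fractions; these choices and all function names are assumptions of the sketch, and the results may differ from values computed from the full genotypic distributions.

```python
from math import exp

def haldane_r(d_cm):
    """Recombination fraction for a map distance in cM (Haldane map function)."""
    return 0.5 * (1.0 - exp(-2.0 * d_cm / 100.0))

def var_w(two_cde, e_plus_2d, p):
    """2(C+D+E) - 4p(1-p)(E+2D), ignoring double recombination in the interval."""
    return two_cde - 4.0 * p * (1.0 - p) * e_plus_2d

def r_ai(r, t):
    """Accumulated recombination fraction r_t in AI F_t."""
    return (1.0 - (1.0 - r) ** (t - 2) * (1.0 - 2.0 * r)) / 2.0

def var_w_f2(r, r1):
    return var_w(0.5, r / 2.0, r1 / r)

def var_w_ai(r, r1, t):
    r_t, r1_t = r_ai(r, t), r_ai(r1, t)
    return var_w(0.5, r_t / 2.0, r1_t / r_t)

def var_w_ril(r, r1):
    return var_w(1.0, 2.0 * r / (1.0 + 2.0 * r), r1 / r)

r, r1 = haldane_r(10.0), haldane_r(5.0)   # interval size and marker-QTL distance
print("F2    ", round(var_w_f2(r, r1), 4))
print("AI F3 ", round(var_w_ai(r, r1, 3), 4))
print("AI F10", round(var_w_ai(r, r1, 10), 4))
print("RIL   ", round(var_w_ril(r, r1), 4))
```

The ordering of the outputs follows the qualitative pattern described in the text: the variance shrinks with further intercrossing and grows under selfing.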

Power for detecting a QTL

When only one QTL is considered in the model, Hu & Xu (2008) gave an example in which var(w_ij) is 0·450 for a QTL located in the middle of a 10-cM marker interval (r_1 = r_2 = 0·04758 and r = 0·09063), and 252 individuals are required to detect this QTL with 80% power under α = 0·01 when the QTL explains 5% of the trait variation in the F2 population. Our formulae in equation (3) allow us to calculate the values of var(w_ij) and the sample sizes required in different populations under the same conditions. The values of var(w_ij) derived using our formulae in the different AI and RI populations are presented in Table 1. The table shows that the variance behaves differently under selfing and random mating: it increases under further selfing and tends to decrease under successive intercrossing. For example, the values are 0·651 and 0·806 in the RI F3 and RIL (generation 10 of the RI population is called RIL), respectively, and they are 0·426 and 0·271 in the AI F3 and AI F10, respectively. The different values of var(w_ij) cause the non-centrality parameter to differ, thus affecting the power of detection. To guarantee 80% power to detect this QTL under α = 0·01, about 175, 155, 148 and 143 individuals would be required in the RI F3, F4, F5 and RIL populations, and 262, 284, 302 and 426 individuals in the AI F3, F4, F5 and F10 populations. This shows that a smaller sample suffices in the more advanced RI populations, whereas a larger one is needed in the later AI populations, when mapping a single QTL located in the interval.
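Because the non-centrality parameter is proportional to n × var(w_ij), the required sample size scales roughly as 1/var(w_ij) when the QTL effect and residual variance are held fixed. The sketch below illustrates this with a normal approximation to the power calculation. The approximation, the fixed value of a_j²/σ² (taken from the F2 setting with 5% of the variance explained and var(x_ij) = 1/2) and the function name are assumptions, so the outputs are of the same order as, but not identical to, the sample sizes quoted above.

```python
from math import ceil
from statistics import NormalDist

def required_n(var_w, a2_over_sigma2, alpha=0.01, power=0.80):
    """Smallest n with n * var(w) * a^2 / sigma^2 >= (z_{1-alpha/2} + z_{power})^2,
    i.e. a large-sample normal approximation to the F-based calculation."""
    z = NormalDist()
    delta_needed = (z.inv_cdf(1.0 - alpha / 2.0) + z.inv_cdf(power)) ** 2
    return ceil(delta_needed / (var_w * a2_over_sigma2))

# Assumed effect size, kept the same in every population.
effect = 0.05 / (0.95 * 0.5)

# var(w_ij) values quoted in the text for a QTL in the middle of a 10 cM interval.
reported_var_w = {"F2": 0.450, "RI F3": 0.651, "RIL": 0.806,
                  "AI F3": 0.426, "AI F10": 0.271}

sizes = {pop: required_n(v, effect) for pop, v in reported_var_w.items()}
print(sizes)
```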

Table 1. The values of variances, covariances and correlations of the predictor variables in the AI and RI Ft populations. The case considered is M j-Q j-N j-Q k-N k with , , and

V(x_ij): variance of x_ij. C(x_ij, x_ik) and ρ(x_ij, x_ik): covariance and correlation between x_ij and x_ik. x_ij and x_ik denote the predictor variables when Q_j and Q_k are fully observed, and w_ij and w_ik (x_ij* and x_ik*) denote the predictor variables when Q_j and Q_k are not observed and are constructed from their flanking markers in the REG (ML) interval mapping model.

Covariances between predictor variables

To obtain the covariances between the predictor variables, cov(w_ij, w_ik), we need to understand the genotypic distributions of three and four genes in a population. Two linked QTLs, Q_j and Q_k, flanked by the marker pairs (M_j, N_j) and (M_k, N_k), can be located in neighbouring or non-neighbouring marker intervals. For the neighbouring case, the order is M_j-Q_j-N_j-Q_k-N_k (N_j and M_k are the same marker). For the non-neighbouring case, the order is M_j-Q_j-N_j-M_k-Q_k-N_k. Note that the case of QTLs located in non-neighbouring intervals may include additional markers between N_j and M_k. For the M_j-Q_j-N_j-M_k-Q_k-N_k order, the two predictor variables, w_ij and w_ik, are constructed using the marker pairs (M_j, N_j) and (M_k, N_k). Therefore, computing their covariance, cov(w_ij, w_ik), requires considering all 136 possible genotypes of the M_j, N_j, M_k and N_k markers (see the Appendix). For the M_j-Q_j-N_j-Q_k-N_k order, obtaining the covariance only requires evaluating the 36 marker genotypes of the M_j, N_j and N_k markers. In the latter case, it is more difficult to detect Q_j and Q_k simultaneously, as they share the same flanking marker N_j. The covariance between w_ij and w_ik can be generally expressed as

(5)

where n_g = 36 or 136 and f_k are the genotypic frequencies of flanking markers from trigenic and tetragenic distributions. In the F2 population, the genotypic distributions of three and four markers can be obtained from the product of probability distributions of adjacent pairwise genes, i.e. P(M_jN_jN_k) = P(M_jN_j) × P(N_jN_k) and P(M_jN_jM_kN_k) = P(M_jN_j) × P(N_jM_k) × P(M_kN_k), under the Haldane map function. For example, the gamete frequency P(M_jN_jM_kN_k) = (1−r_1)(1−r_2)(1−r_3)/2 in the F2 population, where r_1, r_2 and r_3 are the recombination fractions between (M_j, N_j), (N_j, M_k) and (M_k, N_k). For the advanced populations beyond the F2, trigenic and tetragenic genotypic distributions cannot be obtained from the direct product of pairwise gene distributions. We use the special devices outlined in Kao & Zeng (2009) and in the Appendix to obtain the genotypic distributions of three and four genes. Although the covariance in equation (5) does not have a simple form like that of the variance in equation (4), it can easily be written into a computer programme to obtain the covariances under different situations in different populations. For example, in the case of the M_j-Q_j-N_j-M_k-Q_k-N_k order with , , and , the values of cov(w_ij, w_ik) are 0·7445, 0·6736, 0·6095, 0·5515 and 0·4991 for , 15, 20, 25 and 30 cM, respectively, in the F2 population. In the case of the M_j-Q_j-N_j-Q_k-N_k order with , , and , the covariances in different populations are presented in Table 1. Table 1 shows that the covariance increases under further selfing and decreases when subjected to more intercrossing. For example, the covariance is 0·409 in the F2 population. The values are 0·577 and 0·688 in the RI F3 and RIL, respectively, and they are 0·372 and 0·189 in the AI F3 and AI F10, respectively. Although the covariance can become larger or smaller, the correlations between the coded variables, ρ(w_ij, w_ik), all decrease in the advanced populations (Table 1). The correlation is 0·909 in the F2 population. It becomes 0·886 and 0·854 in the RI F3 and RIL, and it is 0·874 and 0·696 in the AI F3 and F10 populations. As will be discussed later, the detection of linked QTLs can benefit from the diminishing correlation between predictor variables in the advanced populations.
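The counts of 10, 36 and 136 marker genotypes, and the F2 frequencies built from products over adjacent marker pairs, can be reproduced by direct enumeration. The sketch below does so under the assumption of no crossover interference (independent recombination in adjacent intervals); the recombination fractions in the usage line are hypothetical, and the function names are illustrative. It covers only the F2, not the advanced populations, for which the paper states that the product rule does not apply.

```python
from itertools import combinations_with_replacement, product

def haplotypes(n_loci):
    """All haplotypes of n biallelic loci; 1 = P1 allele, 0 = P2 allele."""
    return list(product((1, 0), repeat=n_loci))

def unordered_genotypes(n_loci):
    """All unordered pairs of haplotypes (phase-known genotypes)."""
    return list(combinations_with_replacement(haplotypes(n_loci), 2))

def f1_gamete_freq(haplotype, rec_fracs):
    """Frequency of a gamete produced by the F1: 1/2 for the first allele,
    then (1 - r) or r for each adjacent pair of loci."""
    freq = 0.5
    for left, right, r in zip(haplotype, haplotype[1:], rec_fracs):
        freq *= (1.0 - r) if left == right else r
    return freq

def f2_genotype_freqs(rec_fracs):
    """F2 genotype frequencies from random union of F1 gametes."""
    n_loci = len(rec_fracs) + 1
    freqs = {}
    for h1, h2 in unordered_genotypes(n_loci):
        f = f1_gamete_freq(h1, rec_fracs) * f1_gamete_freq(h2, rec_fracs)
        freqs[(h1, h2)] = f if h1 == h2 else 2.0 * f
    return freqs

print([len(unordered_genotypes(k)) for k in (2, 3, 4)])
four_marker = f2_genotype_freqs((0.09, 0.05, 0.09))   # hypothetical r1, r2, r3
print(len(four_marker), round(sum(four_marker.values()), 12))
```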

Variances of the estimated QTL effects

For a single QTL model, we only need the variance of the coded variable, var(w_ij), to construct a test statistic in power analysis (equation (2)). As the variance of the estimated effect is the inverse of the information number of the QTL effect, i.e. var(â_j) = I^−1(a_j), for large n we have

(6)

and var^−1(â_j)/n = I(a_j)/n ~ var(w_ij)/σ² in a single QTL model. For multiple, say p, QTLs in the model, the variance–covariance matrix of the predictor variables is required in constructing the test statistics. Similarly, for large n, we have I(a)/n = [(W′W)/σ²]/n → V(W)/σ², where W denotes the matrix whose (i, j)th entry is w_ij and V(W) is the variance–covariance matrix with diagonal elements var(w_ij), j = 1, 2, …, p, and off-diagonal elements cov(w_ij, w_ik). Under the normality assumption, n^(1/2)(â − a) → N_p(0, V^−1(W) × σ²) (Fuller, 1976). Without loss of generality, we present the case of p = 2 with Q_j and Q_k in the model for a better illustration. For p = 2, the V^−1(W) matrix is

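$$V^{-1}(W)=\frac{1}{1-\rho^{2}(w_{ij},w_{ik})}\begin{pmatrix}\dfrac{1}{\operatorname{var}(w_{ij})} & \dfrac{-\rho(w_{ij},w_{ik})}{\sqrt{\operatorname{var}(w_{ij})\operatorname{var}(w_{ik})}}\\[2ex] \dfrac{-\rho(w_{ij},w_{ik})}{\sqrt{\operatorname{var}(w_{ij})\operatorname{var}(w_{ik})}} & \dfrac{1}{\operatorname{var}(w_{ik})}\end{pmatrix},\tag{7}$$

where ρ(w_ij, w_ik) = cov(w_ij, w_ik)/[var(w_ij) × var(w_ik)]^(1/2). Therefore, the variances of the estimated a_j and a_k are

$$\operatorname{var}(\hat a_j)=\frac{1}{1-\rho^{2}(w_{ij},w_{ik})}\times\frac{\sigma^{2}}{n\operatorname{var}(w_{ij})}\quad\text{and}\quad\operatorname{var}(\hat a_k)=\frac{1}{1-\rho^{2}(w_{ij},w_{ik})}\times\frac{\sigma^{2}}{n\operatorname{var}(w_{ik})},\tag{8}$$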

respectively. By comparing equation (6) with equation (8), it can be seen that the variances of the estimated QTL effects are affected not only by var(w_ij) and var(w_ik) but also by cov(w_ij, w_ik) through ρ(w_ij, w_ik). The first term on the right-hand side of equation (8) is usually called the variance inflation factor (VIF), which can also be expressed in terms of the information numbers, I(a_j), I(a_k) and I(a_j, a_k), as

(9)

The VIF measures the inflation level of the variance of an estimate (Marquardt, 1970). When ρ(w_ij, w_ik) = 0, VIF = 1 and there is no variance inflation. If ρ(w_ij, w_ik) ≠ 0, VIF > 1, indicating that inflation of the variances occurs. In general, a large VIF indicates seriously inflated variances and a severe collinearity problem, and the linked QTLs are not likely to be detected statistically. For the same M_j-Q_j-N_j-Q_k-N_k order considered in Table 1, the value of VIF in var(â_j) or var(â_k) is 5·750 ((1 − 0·909²)^−1), implying that the variance is inflated 5·750-fold as compared to when the QTLs are unlinked. The values of VIF are 4·651 and 3·694 in the RI F3 and RIL, respectively, and they are 4·650 and 1·940 in the AI F3 and AI F10, respectively. The values of VIF become smaller in the more advanced populations. Therefore, advanced populations have the ability to provide smaller VIF values for more powerful QTL detection (more explanation is given below). Also, the VIF is generally larger when the intervals become wider or the putative QTLs move towards the centres of the intervals (not shown). With the VIF, V^−1(W) in equation (7) can be simplified in expression as V^−1(W) = VIF × A_0, where A_0 = [a_ij]_2×2 denotes the 2×2 matrix in the equation.
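The relation VIF = 1/(1 − ρ²) and its link to the inverse of the 2×2 variance–covariance matrix can be verified in a few lines. The sketch below recomputes the VIF from some of the correlations quoted in the text and checks that the diagonal of the inverse matrix equals VIF/var(w). The variances and covariance used for the matrix check are illustrative inputs (equal variances for the two predictors are an assumption), and the function names are not from the paper.

```python
from math import sqrt

def correlation(cov_jk, var_j, var_k):
    """rho = cov / sqrt(var_j * var_k)."""
    return cov_jk / sqrt(var_j * var_k)

def vif(rho):
    """Variance inflation factor for two correlated predictors."""
    return 1.0 / (1.0 - rho * rho)

def invert_2x2(var_j, var_k, cov_jk):
    """Inverse of [[var_j, cov_jk], [cov_jk, var_k]]."""
    det = var_j * var_k - cov_jk ** 2
    return [[var_k / det, -cov_jk / det],
            [-cov_jk / det, var_j / det]]

# Correlations quoted in the text for the M_j-Q_j-N_j-Q_k-N_k order.
quoted_rho = {"F2": 0.909, "RI F3": 0.886, "RIL": 0.854, "AI F10": 0.696}
for pop, rho in quoted_rho.items():
    print(pop, round(vif(rho), 3))

# Illustrative check of V^-1(W) = VIF x A_0 with assumed equal variances.
var_j = var_k = 0.45
cov_jk = 0.409
inverse = invert_2x2(var_j, var_k, cov_jk)
rho = correlation(cov_jk, var_j, var_k)
print(round(inverse[0][0], 4), round(vif(rho) / var_j, 4))
```

Small differences from the quoted VIF values are expected because the correlations are rounded to three decimals.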

Test statistics for detecting linked QTL

We now derive the test statistics for analysing the separation of linked QTL and calculating the separating power. Let

(10)

be the standardized estimated QTL effects, where σ_j² = VIF × a_11 × σ²/n and σ_k² = VIF × a_22 × σ²/n are the variances of the estimated effects (a_11 = var^−1(w_ij) and a_22 = var^−1(w_ik)). As I^−1(a_j) = (a_11 × σ²)/n and I^−1(a_k) = (a_22 × σ²)/n, it is more convenient and succinct to express σ_j² and σ_k² as

$$\sigma_j^{2}=\mathrm{VIF}\times I^{-1}(a_j)\quad\text{and}\quad\sigma_k^{2}=\mathrm{VIF}\times I^{-1}(a_k)\tag{11}$$

in a population. Accordingly, the joint distribution of t j and t k follows a bivariate normal distribution with mean zero and covariance matrix with diagonal elements, one, and off-diagonal elements, ρ(w ij, w ik), as

(12)

Given a pre-specified critical value c at the significance level α, the power of separation is the sum of probabilities that t j and t k are simultaneously different from zeros:

$$\text{Power}=P(t_j>c,\,t_k>c)+P(t_j>c,\,t_k<-c)+P(t_j<-c,\,t_k>c)+P(t_j<-c,\,t_k<-c)\tag{13}$$

in the bivariate normal distribution. Note that the sum of four probabilities is equivalent to Type I error α under the null hypothesis (H 0: a j=0 and a k=0). Under the alternative hypothesis (H 1: a j≠0 and a k≠0), equation (13) is the power to reject H 0 and allows us to evaluate the power of separation for different values of a j and a k in different populations (see section 4).
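The four-quadrant probability in equation (13) can be evaluated numerically with the standard library alone by integrating one coordinate and using the conditional normal distribution of the other. The sketch below is one such evaluation under stated assumptions: the statistics are taken as â_j/σ_j and â_k/σ_k with unit variances and means a_j/σ_j and a_k/σ_k under H1, the critical value c is taken as the two-sided normal quantile, and the correlation passed in the usage example is −ρ(w_ij, w_ik), the sign obtained when a 2×2 variance–covariance matrix is inverted. The standardized mean 3.86 is a hypothetical input.

```python
from math import sqrt
from statistics import NormalDist

Z = NormalDist()

def _midpoint(f, a, b, steps=4000):
    """Midpoint-rule integral of f over [a, b]; returns 0 for an empty range."""
    if b <= a:
        return 0.0
    h = (b - a) / steps
    return h * sum(f(a + (i + 0.5) * h) for i in range(steps))

def separation_power(mean_j, mean_k, corr, c):
    """P(|t_j| > c and |t_k| > c) for a unit-variance bivariate normal (t_j, t_k)
    with means (mean_j, mean_k) and correlation corr, |corr| < 1."""
    s = sqrt(1.0 - corr * corr)

    def integrand(t):
        # Given t_j = t, t_k is normal with mean cond_mean and standard deviation s.
        cond_mean = mean_k + corr * (t - mean_j)
        tail_hi = 1.0 - Z.cdf((c - cond_mean) / s)
        tail_lo = Z.cdf((-c - cond_mean) / s)
        return Z.pdf(t - mean_j) * (tail_hi + tail_lo)

    lo, hi = mean_j - 10.0, mean_j + 10.0   # effectively the whole real line
    return _midpoint(integrand, lo, -c) + _midpoint(integrand, c, hi)

c = Z.inv_cdf(1.0 - 0.005 / 2.0)            # assumed two-sided critical value
power = separation_power(3.86, 3.86, -0.7445, c)
print(round(power, 3))
```

With zero means and zero correlation the function returns the product of the two marginal two-sided tail probabilities, which gives a quick sanity check of the integration.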

When an ML interval mapping is implemented in separating linked QTLs, the model is a normal mixture model under the assumption of normal errors. We use x ij*'s to denote the predictor variables in the ML interval mapping models. By treating x ij*'s as missing data and y i as observed data, we can apply the EM algorithm to obtain the MLE and information matrix by operating on the complete-data likelihood

(14)

For p QTLs, there are 3^p QTL genotypes; let μ_j, j = 1, 2, …, 3^p, denote their genotypic values. In the complete-data likelihood, the conditional distribution of the observed data given the missing data, f(y_i | θ, x_i1*, …, x_ip*), is a normal distribution N(μ_j, σ²), and g(x_i1*, …, x_ip*) is a 3^p-nomial distribution depending on the values of the x_ij* (QTL genotypes). Let q_ij be the 3^p-nomial probabilities derived from the conditional probabilities of the QTL genotypes given the flanking marker genotypes. Both the MLE and the observed information matrix involve the posterior probabilities of the QTL genotypes, π_ij (see Kao & Zeng (1997) for more details about the derivations). Therefore, for p = 2, evaluating the (expected) information numbers, I(a_j), I(a_k) and I(a_j, a_k), requires integrating over the distribution of markers and traits, and is thus more challenging. Here, we suggest a Monte Carlo simulation approach to evaluate the expected π_ij by simulating, say, 10 000 individuals to approximate the expected π_ij as , where denotes the value of π_ij of each individual. In turn, the information numbers can be obtained. Similarly to what was outlined for REG interval mapping, we can write I(a_j)/n = var(x_ij*)/σ² and I(a_j, a_k)/n = cov(x_ij*, x_ik*)/σ² for sufficiently large n in ML interval mapping. Table 1 presents the values of I(a_j) and I(a_j, a_k) for the same case of the M_j-Q_j-N_j-Q_k-N_k order. The values are obtained by simulating trait values governed by two QTLs with equal effects, and the heritability is h² = 0·05 with σ² = 1. As σ² = 1, I(a_j) = var(x_ij*) and I(a_j, a_k) = cov(x_ij*, x_ik*). The values of var(x_ij*) are 0·437, 0·640 and 0·815 in the F2, RI F3 and RIL, respectively, and are 0·428 and 0·310 in the AI F3 and F10, respectively (the values of var(x_ik*) are of very similar size and are not presented). As compared to var(w_ij) in REG interval mapping, these variances are of similar sizes. The values of cov(x_ij*, x_ik*) are 0·380, 0·563, 0·691, 0·370 and 0·192 in the F2, RI F3, RIL, AI F3 and AI F10 populations. Except for the value in generation 10, these values are smaller than the values of cov(w_ij, w_ik) in REG interval mapping (the values of cov(w_ij, w_ik) are 0·409, 0·577, 0·688, 0·392 and 0·189, respectively). The correlations between the QTL-coded variables can also be obtained (Table 1). In general, the predictor variables in ML interval mapping have smaller covariances (correlations). Therefore, the ML method will have smaller VIF values when fitting closely linked QTLs together. The values of VIF are 4·084, 4·299 and 3·232 in the F2, RI F3 and RIL, respectively, and are 3·999 and 1·655 in the AI F3 and AI F10, respectively. These results indicate that ML interval mapping suffers less from the collinearity problem, and that it can be more efficient and powerful in detecting linked QTLs, as will be further validated in sections 3 and 4. By obtaining the information numbers of the QTL effects for ML interval mapping, the components in equation (12) can be updated to construct the test statistics t_j = (â_j − a_j)/σ_j and t_k = (â_k − a_k)/σ_k for ML interval mapping. Then, using the bivariate normal distribution, the hypothesis H0: a_j = 0 and a_k = 0 can be tested for calculating the power of ML interval mapping.

When more, say p, QTLs are considered in the REG interval mapping model, the information matrix of the parameters is I(a) = (W′W)/σ². It can be shown that I(a)/n ~ V(W)/σ². As V(W) is invertible, we can express V^−1(W) = VIF × A_0, where A_0 = [a_ij]_p×p. For ML interval mapping, the information matrix can be obtained by using the general formulae of Kao & Zeng (1997). Similarly, when the sample size grows large, I(a)/n can be expressed as V(X*)/σ² (X* denotes the matrix whose (i, j)th entry is x_ij*), whose diagonal elements are the expected I(a_j), j = 1, 2, …, p, and whose off-diagonal elements are the expected I(a_j, a_k). The V(X*) matrix is also invertible and can be formulated as V^−1(X*) = VIF × A_0. For both REG and ML interval mapping, we can define σ_j² = VIF × a_jj × σ²/n, where a_jj, j = 1, 2, …, p, denote the diagonal elements of A_0. Then, we can construct the standardized estimated effects as t_j = (â_j − a_j)/σ_j, j = 1, 2, …, p, and (t_1, t_2, …, t_p)′ follows a p-variate normal distribution. Given specified critical values, the probability of significance can be calculated (Genz & Bretz, 2009) to evaluate the power of separating more linked QTLs.

Genetic parameters and residual variances

Further, we know that the relationship between environmental variance, σ², and genetic variance, V_G, can be formulated as , where h² is the heritability of quantitative trait variation. The genetic variance can be decomposed into components of genotypic frequencies and QTL effects. For two QTLs with additive effects only, V_G = var(x_ij) × a_j² + var(x_ik) × a_k² + 2 × cov(x_ij, x_ik) × a_j × a_k, where x_ij and x_ik denote the coded variables of the two fully observed QTLs (see Kao & Zeng, 2009, for the components of V_G with complete effects and contributed by more QTLs). As var(x_ij) = var(x_ik) = 2(C+D+E) and cov(x_ij, x_ik) = 2(C−D) depend on the genotypic distribution of the experimental population, V_G is population dependent for given QTL effects. For example, V(x_ij) = 1/2 and cov(x_ij, x_ik) = (1−2r_t)/2 in AI F_t populations, and 1/2 < V(x_ij) < 1 and (1−2r)/2 < cov(x_ij, x_ik) < (1−2r)/(1+2r) in RI F_t populations. Therefore, a more detailed formulation of V_G can also be expressed as

(15)

Xu (1995) pointed out that the residual variance in REG interval mapping inflates, owing to the uncertainty of the QTL genotype, and that the amount of inflation is about [var(x_ij) − var(w_ij)] × a_j² in a single QTL model. For a multiple QTL model, the amount is about ignoring covariance parts. If the event of double recombinations in the interval is negligible, this amount can be expressed as 4p(1−p)(E+2D) × a_j² in a single QTL model. For p QTLs, Q_j, in p distinct intervals, (M_j, N_j), j = 1, 2, …, p, the amount of inflation is about , where p_j = r_1j/r_j (r_j and r_1j are the recombination fractions between (M_j, N_j) and between (M_j, Q_j)), and E_j (D_j) is the frequency of M_jN_j/M_jn_j (M_jn_j/M_jn_j) in the population. There is no inflation if the QTLs are completely observed (coincident with markers). Therefore, when QTLs are located in intervals and inferred from flanking markers, the inflation of residual variance reduces the QTL detection power as compared to the power of detecting completely observed QTLs (see also section 5).
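The quantities in this subsection can be combined in a short calculation. The sketch below computes V_G for two additive QTLs in the F2, derives σ² from the heritability, and evaluates the single-QTL inflation term 4p(1−p)(E+2D) × a_j². The definition h² = V_G/(V_G + σ²) is an assumption of the sketch (the relationship is not legible in the text above), as are the Haldane map function, the example distances and the function names.

```python
from math import exp

def haldane_r(d_cm):
    """Recombination fraction for a map distance in cM (Haldane map function)."""
    return 0.5 * (1.0 - exp(-2.0 * d_cm / 100.0))

def genetic_variance(var_x, cov_x, a_j, a_k):
    """V_G = var(x_j) a_j^2 + var(x_k) a_k^2 + 2 cov(x_j, x_k) a_j a_k,
    with the same variance var_x for both fully observed QTLs."""
    return var_x * (a_j ** 2 + a_k ** 2) + 2.0 * cov_x * a_j * a_k

def residual_variance(v_g, h2):
    """sigma^2 implied by the assumed definition h^2 = V_G / (V_G + sigma^2)."""
    return v_g * (1.0 - h2) / h2

def inflation_single(p, e_plus_2d, a):
    """Residual-variance inflation 4p(1-p)(E+2D) a^2 for one QTL in an interval."""
    return 4.0 * p * (1.0 - p) * e_plus_2d * a * a

# F2 values: var(x) = 1/2 and cov(x_j, x_k) = (1 - 2r)/2 for QTLs 20 cM apart.
r_qq = haldane_r(20.0)
var_x, cov_x = 0.5, (1.0 - 2.0 * r_qq) / 2.0

vg_same = genetic_variance(var_x, cov_x, 1.0, 1.0)       # effects in the same direction
vg_opposite = genetic_variance(var_x, cov_x, 1.0, -1.0)  # effects in opposite directions
print(round(vg_same, 4), round(residual_variance(vg_same, 0.2), 4))
print(round(vg_opposite, 4), round(residual_variance(vg_opposite, 0.2), 4))

# QTL in the middle of a 10 cM interval in the F2, where E + 2D = r/2.
r, r1 = haldane_r(10.0), haldane_r(5.0)
print(round(inflation_single(r1 / r, r / 2.0, 1.0), 4))
```

In this sketch, for a fixed h², effects in opposite directions give a smaller V_G and hence a smaller σ² than effects in the same direction.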

The above analyses decompose equation (12) into components related to sample size, QTL effects, distance between genes, interval size and the genotypic distribution of a population. They pave the way to predict and analyse the power of separation under these factors, across populations and using different methods, and to conduct the QTL analysis when QTLs are completely observed (coincident with markers) or not observable (located in the marker intervals). The validity of the proposed formulae in predicting the power of separating linked QTLs is first checked by Monte Carlo simulations, and then the formulae are applied to the power analysis under several mapping factors in different populations.

Simulation

We consider the case of the M_j-Q_j-N_j-M_k-Q_k-N_k order in the F2 population. We assume that all markers are 10 cM apart, and the two QTLs are located in the middle of their intervals. We set h² = 0·2, a_j = 1 and a_k = 1. With such a setting, var(w_ij) = var(w_ik) = 0·4522 and ρ(w_ij, w_ik) = 0·7445 in REG interval mapping. The predicted powers by REG interval mapping are 2·34, 8·70, 20·09, 34·27, 48·35, 60·63 and 70·61% for n = 200, 250, 300, 350, 400, 450 and 500 at α = 0·005. For ML interval mapping, var(x_ij*) = var(x_ik*) = 0·4515 and ρ(x_ij*, x_ik*) = 0·7272. The predicted powers by ML interval mapping are 3·90, 12·59, 26·18, 41·45, 55·54, 67·19 and 76·27% for the seven sample sizes at the same α level. Under each case, 200 simulated replicates were generated to obtain the observed powers. The observed power is the proportion of replicates with both test statistics larger than the critical value. For both methods, the observed powers are compared with the predicted powers for each case and plotted in Fig. 1. The figure indicates that the observed and predicted powers by REG and ML interval mapping are reasonably close to each other under the given sample sizes. Thus, the simulation results validate our proposed formulae.

Fig. 1. The predicted and observed powers obtained by ML and REG interval mapping under different sample sizes in the F2 population. The order considered is M_j-Q_j-N_j-M_k-Q_k-N_k. The two QTLs have equal effects and are located in the middle of the 10 cM spaced intervals. The distance between QTLs is 20 cM and h² = 0·2.

Numerical analysis

On the basis of our proposed formulae, numerical analyses of the power of dissecting closely linked QTLs under various mapping factors and in different experimental populations are shown in Figs 2(a–d). The factors considered are sample size, QTL effect, interval size and distance between QTLs, and the populations considered include the F 2, AI and RI. Also, both REG and ML interval mapping are applied to the power analysis. In all the cases, we assume h 2=0·2. Figure 2(a) shows the power curves of separating two QTLs located in 10 or 20 cM spaced marker intervals under different distances. The order considered is M j-Q j-N jM k-Q k-N k, and both QTLs are located right in the middle of their intervals (d MjQj=5 cM, d QjNj=5 cM, d MkQk=5 cM and d QkNk=5 cM in the case of the 10 cM intervals, and d MjQj=10 cM, d QjNj=10 cM, d MkQk=10 cM and d QkNk=10 cM in the case of the 20 cM intervals). The distances between QTLs are 20, 25, 30, 35, 40, 45 and 50 cM, respectively (d NjMk=10, 15, 20, 25 and 30 cM in the 10 cM intervals, and d NjMk=0, 5, 10, 15 and 20 cM in the 20 cM intervals). The two QTLs have equal effects and the sample size is 200. The figure shows that, given a distance between QTLs, the powers of separation are larger when the QTLs are in the narrow intervals. Also, the powers by ML interval mapping are higher than those by REG interval mapping. As mentioned earlier, separating linked QTLs is the most difficult for the case of M j-Q j-N j-Q k-N k (M j-Q j-N k-Q k-N k) order, because they share a common flanking marker. Figures 2 b–d present the powers of separating two 10-cM-apart QTLs in the F 2, AI and RI populations for this order. Assume that d MjQj=5 cM, d QjNj=5 cM, d NjQk=5 cM and d QkNk=5 cM, and that the QTLs have equal effects. In Fig. 2 b, with a sample size of 500, the powers of REG and ML interval mapping are very low (close to zero) in the F 2 and F 3 populations. However, the powers increase in the more advanced populations. 
The powers increase to 0·238 and 0·670 using REG interval mapping in AI F 6 and RI F 6 populations, and they increase to 0·367 and 0·741, respectively, using the ML method. Figure 2 b also presents the powers of separation when Q j and Q k are completely observed (and fitted into the model). As expected, the powers are greater when they are completely observed (the curves with solid and empty triangles). For example, the power is 0·427 in F 2, and it becomes 0·732 and 0·925 in AI F 3 and RI F 3 populations, respectively. The powers gradually attain more than 0·99 for more advanced populations. Figure 2 c shows the powers of separating two fully observed linked QTLs under different sample sizes. The QTLs have equal effects. The powers are about 0, 0·001, 0·037, 0·198 and 0·427 for n=100, 200, 300, 400 and 500, respectively, in the F 2 population, and are 0·059 (0·032), 0·194 (0·518), 0·565 (0·862), 0·815 (0·968) and 0·930 (0·994), respectively, in the AI F 5 (RI F 5) populations. This shows that advanced populations can be much more efficient, and that the RI populations can be more powerful than the AI populations in separation. Figure 2 d illustrates the relations between power and sample size when separating 10-cM-apart QTLs with different sizes of effects in the F 2 population. The QTLs are assumed to be completely observed. The powers of separating QTLs of similar size (e.g. a j:a k=1:1) are higher than those of separating QTLs of different size (e.g. a j:a k=2:1), and the powers of separating QTLs with opposite directions of effects (e.g. a j:a k=1:−1) are much higher than those with the same direction of effects (e.g. a j:a k=1:1). For example, the powers are 0·236, 0·344, 0·427, 0·981 and 1·000 (0·298) for the effect ratio 1:2, 1:1·5, 1:1, 1:−1·5 and 1:−1 with n=500, respectively. In general, an effective separation of closely linked QTLs requires large n, high h 2, small ρ and more QTL information in a population.

Fig. 2. (a) Power curves of separating two linked QTLs located in the middle of the 10- or 20-cM-spaced marker intervals under various distances in the F 2 population. The order considered is M j-Q j-N jM k-Q k-N k. The distances between QTLs are 20, 25, 30, 35, 40, 45 and 50 cM, respectively. The two QTLs have equal effects, and n=200. (b) Power curves of separating two 10-cM-apart QTLs when QTLs are coincident with markers (MR) or located in the intervals (REG and ML) in the AI and RI populations. QTLs have equal effects and n=500. The order considered is M j-Q j-N j-Q k-N k. (c) Power curves of separating two 10-cM-apart QTLs under different sample sizes in the AI and RI populations. QTLs have equal effects and are located at markers. (d) Power curves of separating two 10-cM-apart QTLs with different sizes of effects under different sample sizes in the F 2 population. QTLs are assumed to be located at markers. In all cases, h 2=0·2, and α=0·005 is chosen as the significance level.

Discussion

QTL mapping is a key approach to the understanding and estimation of the genetic architectures of quantitative traits in quantitative genetics (Zeng et al., 1999). In QTL mapping, when QTLs are tightly linked, the estimation of QTL parameters can easily be biased, and the power of detection can be reduced. Therefore, the study of detecting and separating the linked QTLs correctly and efficiently remains an important issue in QTL mapping (Lander & Botstein, 1989; Ronin et al., 1999; Hu & Xu, 2008). We tackle this issue by developing test statistics to test the effects of QTLs located at the markers or in the intervals. Both the REG and ML interval mapping models are considered. By characterizing the genotypic distributions of three and four genes, we are able to evaluate the variances and covariances of the predictor variables of QTLs in the models, and then to construct test statistics for detecting linked QTLs under more wide-ranging situations. Our proposed test statistics are simple functions of information numbers, VIF and genetic parameters in the models in the populations. They allow us to predict the power of separating linked QTLs under different mapping factors and across different populations. The direct application of our approach to QTL mapping requires that the intervals potentially containing QTLs be known for testing. However, those potential intervals are not known before a preliminary analysis is carried out. To identify the potential intervals, a multi-dimensional search along the whole genome, such as screening all pairs of close intervals, may not be appropriate, as it can impose a substantial computational burden. In practice, one suggestion is to first use a one-QTL model analysis (one-dimensional search) to identify the regions containing potential intervals. 
In the likelihood profiles of the one-dimensional search, the regions showing significant sign changes in the estimated QTL effects or showing wide and significant peaks (ghost QTL) may contain potential intervals (Haley & Knott, 1992; Kao et al., 1999; Zeng et al., 1999). Then our approach can be applied to these potential intervals for further analysis of closely linked QTLs.

The different advanced populations have different population structures, such as homozygosities, linkage disequilibria (correlations between genes) and genotypic frequencies (Weir, 1996). Therefore, they will show different properties in the resolution of closely linked QTLs. When QTLs are linked, their correlation can be generally formulated as 1−2R, where R is the proportion of recombinants in a population. In a population, the closer they are linked, the fewer recombinants are produced and the stronger the correlation is. Fitting linked QTLs is equivalent to fitting correlated variables into the model, which causes the problem of collinearity in statistical estimation. Consequently, the separation becomes more difficult for closer QTLs as the collinearity problem becomes more severe. The obvious way to relieve the collinearity problem is to increase the proportion of recombinants in a population. In the BC or F 2 populations, the proportion of recombinants is equivalent to the recombination fraction between QTLs (R=r). In the AI and RI populations, more recombinants can be produced and accumulated, so that R>r as generations proceed. These advanced populations would then provide smaller VIF and reduced correlations for the QTL parameters, which facilitates QTL detection. Nevertheless, we should be aware that the sizes of marker intervals containing QTLs may expand (relative to those in the backcross or F 2 population) in the more advanced AI populations (Lynch & Walsh, 1998; Kao & Zeng, 2009), so that the benefit may be offset. Greatly improving separation in the AI populations requires denser markers around the detected QTL (QTL located in narrow intervals), and the improvement would be limited if QTLs are in a sparse marker region (wide intervals). The more powerful separation in the later RI population is also due to the increase of additive variances (accumulation of homozygotes). 
For example, the additive variance of a QTL in the RI population can be twice that in the F 2 population, and the power of separation can be much higher in the RI population (see Fig. 2 b, c). By making good use of the properties of genome structures in the later advanced populations, it is possible to improve the resolution of closely linked QTLs in QTL detection.
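The relations described in this paragraph can be illustrated numerically. The sketch below is only a hedged illustration under stated assumptions, not the formulae of this paper: it assumes the Haldane map function, the two-predictor form VIF=1/(1−ρ^2) with ρ=1−2R, the standard random-mating recursion for the recombinant proportion in an AI F t population, the limiting value 2r/(1+2r) under continued selfing (Haldane & Waddington, 1931), and an additive coding of 1, 0 and −1 for the three genotypes with heterozygosity (1/2)^(t−1) in a selfed F t. All function names are hypothetical.

```python
import math

def haldane_r(d_cm):
    """Recombination fraction for a map distance in cM (Haldane map function)."""
    return 0.5 * (1.0 - math.exp(-2.0 * d_cm / 100.0))

def ai_recombinant_proportion(r, t):
    """Assumed recombinant proportion in a random-mated (AI) F_t, t >= 2."""
    return 0.5 * (1.0 - (1.0 - r) ** (t - 2) * (1.0 - 2.0 * r))

def ri_selfing_limit(r):
    """Limiting recombinant proportion under continued selfing."""
    return 2.0 * r / (1.0 + 2.0 * r)

def vif(rho):
    """Variance inflation factor for two predictors with correlation rho."""
    return 1.0 / (1.0 - rho ** 2)

def additive_variance_selfing(t):
    """Variance of the additive code (1, 0, -1) in a selfed F_t population."""
    heterozygosity = 0.5 ** (t - 1)
    return 1.0 - heterozygosity

r = haldane_r(10.0)  # two QTLs 10 cM apart
for t in (2, 3, 6, 10):
    big_r = ai_recombinant_proportion(r, t)
    rho = 1.0 - 2.0 * big_r
    print(f"AI F{t}: R={big_r:.4f}, rho={rho:.4f}, VIF={vif(rho):.3f}")

rho_ri = 1.0 - 2.0 * ri_selfing_limit(r)
print(f"RI limit: rho={rho_ri:.4f}, VIF={vif(rho_ri):.3f}")
print("additive variance, F2 vs late selfed generation:",
      additive_variance_selfing(2), additive_variance_selfing(20))
```

In this sketch the F 2 case is recovered at t=2 (R=r), and the additive variance rises from 1/2 in the F 2 towards 1 as selfing proceeds, in line with the factor of two mentioned above.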

Given a distance between QTLs, the powers of separating QTLs at the markers are greater than those of separating QTLs in the intervals (Fig. 2 b). To detect QTLs located in the intervals, the REG and ML interval mapping models have been very popular and are used in the separation. In either of the two statistical models, when the flanking marker intervals become wider or the locations of QTLs are closer to the middle of the intervals, the variances of predictor variables become smaller and their correlations become larger (not shown). Consequently, their detection would be more difficult (Fig. 2 a). Our proposed formulae can take the positions and effects of the QTLs and the population structures into account together to predict the power of separation. In general, given a distance between QTLs, separation can be more effective for QTLs of similar size, located closer to markers and in narrow intervals, with effects in opposite directions, and contributing to a high proportion of trait variation. Also, it is possible to gain more power in QTL detection by utilizing more advanced populations. The results may facilitate the analysis of QTL resolution in the genetic study of quantitative traits.

The authors are grateful to the editor and one anonymous reviewer for helpful comments. We are also grateful to Hsiang-An Ho for writing the computer programme. This work was supported by a grant NSC98-2118-M-001-018 from the National Science Council, Taiwan, Republic of China.

Appendix. Genotypic distribution in advanced populations

Consider an F 2 or advanced population derived from two inbred lines P 1 and P 2. For m genes, there are 2^m different gametic genotypes and 2^(2m−1)+2^m/2 zygotic genotypes. For example, there are 4, 8 and 16 gametic genotypes and 10, 36 and 136 zygotic genotypes for m=2, 3 and 4. As different populations undergo different numbers of meiotic cycles, the distributions of gametic and zygotic genotypes vary. For selfing, Haldane & Waddington (1931) formulated the transition equations of the ten genotypic frequencies for m=2. Kao & Zeng (2009) obtained the transition equations of the 36 genotypic frequencies for m=3. The procedures of obtaining transition equations for m=4 are given below. Let 1 and 0 represent the capital and small-letter alleles, respectively, from P 1 and P 2, so that the configurations of the 16 gametes can be represented as 1111, 0000, 1110, 0001, 1101, 0010, 1011, 0100, 0111, 1000, 1100, 0011, 1010, 0101, 1001 and 0110. In the F 2 population, these 16 gamete frequencies can be obtained under the Haldane map function (using the Markov property), and they are P(1111)=P(0000)=(1−r 1)(1−r 2)(1−r 3)/2, where r 1, r 2 and r 3 are the recombination rates between the first and second genes, between the second and third genes and between the third and fourth genes, respectively. The other frequencies are P(1110)=P(0001)=(1−r 1)(1−r 2)r 3/2, P(1101)=P(0010)=(1−r 1)r 2r 3/2, P(1011)=P(0100)=r 1r 2(1−r 3)/2, P(0111)=P(1000)=r 1(1−r 2)(1−r 3)/2, P(1100)=P(0011)=(1−r 1)r 2(1−r 3)/2, P(1010)=P(0101)=r 1r 2r 3/2 and P(1001)=P(0110)=r 1(1−r 2)r 3/2, respectively. The random union of these 16 gametes will produce the 136 different zygotes in a population. If selfing persists after F 2 to produce RI populations, the transition equation for the frequency of genotype is

The above equation contains 41 terms and is derived below. Among all 136 zygotes, 41 of them are capable of producing the 1111 gamete with different proportions. Therefore, the frequency of zygote in t+1 generation is equivalent to the sum of frequencies of progeny of these 41 parental zygotes in t generation. The proportion of progeny from parents is 100%. The proportion of progeny from , and parents are 1/4. The proportions from , , and parents are [(1−r 2)(1−r 3)+r 2r 3]^2/4, [r 2(1−r 3)+r 3(1−r 2)]^2/4, [r 1(1−r 2)r 3]^2/4 and [(1−r 1)r 2r 3]^2/4, respectively. Likewise, the proportions of progeny from the other genotypes can also be obtained. Because of symmetry, there are 72 transition equations in total, and the remaining 71 equations can be formulated in a similar way. The complete 72 equations and the computer programme (written in R language) are provided in Supplementary materials. The computer programme is also placed at http://www.stat.sinica.edu.tw/~chkao for download. If mating is random, the transition equations for obtaining the frequencies of gametic genotypes for any m can be derived by using Geiringer's (1944) approach, and then the frequencies of zygotic genotypes can be obtained.
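The F 2 gamete frequencies and genotype counts given at the start of this Appendix can be checked with a short script. This is a minimal sketch, not the authors' R programme: it assumes no interference (so that recombination events in adjacent intervals are independent, as under the Haldane map function), and the function names and the recombination rates used in the example are hypothetical.

```python
from itertools import combinations_with_replacement, product

def gamete_frequencies(r1, r2, r3):
    """F2 frequencies of the 16 four-locus gametes, assuming no interference."""
    rec = (r1, r2, r3)
    freqs = {}
    for alleles in product("10", repeat=4):
        p = 0.5  # the first allele comes from P1 or P2 with equal probability
        for i, r in enumerate(rec):
            # recombination between loci i and i+1 switches the parental origin
            p *= r if alleles[i] != alleles[i + 1] else 1.0 - r
        freqs["".join(alleles)] = p
    return freqs

def f2_zygote_frequencies(gametes):
    """Random union of gametes; unordered pairs define the zygotic genotypes."""
    zygotes = {}
    for g1, g2 in combinations_with_replacement(sorted(gametes), 2):
        weight = 1.0 if g1 == g2 else 2.0
        zygotes[(g1, g2)] = weight * gametes[g1] * gametes[g2]
    return zygotes

def n_zygotic_genotypes(m):
    """Number of zygotic genotypes for m genes: 2^(2m-1) + 2^m/2."""
    return 2 ** (2 * m - 1) + 2 ** m // 2

# Illustrative recombination rates (hypothetical values).
gametes = gamete_frequencies(0.05, 0.05, 0.05)
zygotes = f2_zygote_frequencies(gametes)

print(len(gametes), len(zygotes))               # numbers of gametes and zygotes
print(sum(gametes.values()), sum(zygotes.values()))  # both sums should be 1
print([n_zygotic_genotypes(m) for m in (2, 3, 4)])
```

Counting unordered pairs of the 16 gametes gives the same number of zygotic genotypes as the closed-form expression, which is a quick consistency check on both.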

References

Fuller, W. A. (1976). Introduction to Statistical Time Series. New York: Wiley.
Genz, A. & Bretz, F. (2009). Computation of Multivariate Normal and t Probabilities. New York: Springer.
Geiringer, H. (1944). On the probability theory of linkage in Mendelian heredity. The Annals of Mathematical Statistics 15, 25–57.
Haldane, J. B. S. & Waddington, C. H. (1931). Inbreeding and linkage. Genetics 16, 357–374.
Haley, C. S. & Knott, S. A. (1992). A simple regression method for mapping quantitative trait loci in line crosses using flanking markers. Heredity 69, 315–324.
Hu, Z. & Xu, S. (2008). A simple method for calculating the statistical power for detecting a QTL located in a marker interval. Heredity 101, 48–52.
Kao, C. H. & Zeng, Z. B. (1997). General formulas for obtaining the MLE and the asymptotic variance-covariance matrix in mapping quantitative trait loci when using the EM algorithm. Biometrics 53, 653–655.
Kao, C.-H., Zeng, Z.-B. & Teasdale, R. (1999). Multiple interval mapping for quantitative trait loci. Genetics 152, 1203–1216.
Kao, C. H. & Zeng, M. H. (2009). A study on mapping quantitative trait loci in the advanced populations derived from two inbred lines. Genetics Research 91, 85–99.
Lander, E. S. & Botstein, D. (1989). Mapping Mendelian factors underlying quantitative traits using RFLP linkage maps. Genetics 121, 185–199.
Lynch, M. & Walsh, B. (1998). Genetics and Analysis of Quantitative Traits. Sunderland, MA: Sinauer Associates.
Marquardt, D. W. (1970). Generalized inverses, ridge regression, biased linear estimation, and nonlinear estimation. Technometrics 12, 591–612.
Mayer, M. (2005). A comparison of regression interval mapping and maximum likelihood interval mapping for linked QTL. Heredity 94, 599–605.
Ronin, Y. I., Korol, A. B. & Nevo, E. (1999). Single- and multiple-trait mapping analysis of linked quantitative trait loci: Some asymptotic analytical approximation. Genetics 151, 387–396.
Soller, M., Brody, T. & Genizi, A. (1976). On the power of experimental designs for the detection of linkage between marker loci and quantitative loci in crosses between inbred lines. Theoretical and Applied Genetics 47, 35–39.
Weir, B. S. (1996). Genetic Data Analysis II. Sunderland, MA: Sinauer Associates.
Xu, S. (1995). A comment on the simple regression method for interval mapping. Genetics 141, 1657–1659.
Zeng, Z. B., Kao, C. H. & Basten, C. (1999). Estimating the genetic architecture of quantitative traits. Genetics Research 74, 279–289.
Table 1. The values of variances, covariances and correlations of the predictor variables in the AI and RI Ft populations. The case considered is Mj-Qj-Nj-Qk-Nk with d MjQj=5 cM, d QjNj=5 cM, d NjQk=5 cM and d QkNk=5 cM.

