1. Introduction
Linkage disequilibrium (LD) refers to the non-random association of alleles at different loci in the genome. Historically, LD analysis was developed to study the genetic structure and diversity of natural populations (Lewontin, Reference Lewontin1964; Hill, Reference Hill1974; Hedrick, Reference Hedrick1987; Weir, Reference Weir1996). In recent years, there has been a dramatic increase of interest in utilizing LD to infer the evolutionary history and process of human populations (Reich et al., Reference Reich, Cargill, Bolk, Ireland, Sabeti, Richter, Lavery, Kouyoumjian, Farhadian, Ward and Lander2001; Ardlie et al., Reference Ardlie, Kruglyak and Seielstad2002; Dawson et al., Reference Dawson, Abecasis, Bumpstead, Chen, Hunt, Beare, Pabial, Dibling, Tinsley, Kirby, Carter, Papaspyridonos, Livingstone, Ganske, Lohmmussaar, Zernant, Tonisson, Remm, Magi, Puurand, Vilo, Kurg, Rice, Deloukas, Mott, Metspalu, Bentley, Cardon and Dunham2002; Gabriel et al., Reference Gabriel, Schaffner, Nguyen, Moore, Roy, Blumenstiel, Higgins, DeFelice, Lochner, Faggart, Liu-Cordero, Rotimi, Adeyemo, Cooper, Ward, Lander, Daly and Altshuler2002) and to identify genes for disease or yield traits by association analyses with DNA-based markers (Remington et al., Reference Remington, Thornsberry, Matsuoka, Wilson, Whitt, Doebley, Kresovich, Goodman and Buckler2001; Ardlie et al., Reference Ardlie, Kruglyak and Seielstad2002). The efficacy of LD analysis in population genetic studies and gene mapping depends on the level of LD in the population studied, its distribution and heterogeneity across the genome, and its relationship with genetic or geographic distances (Farnir et al., Reference Farnir, Coppieters, Arranz, Berzi, Cambisano, Grisart, Karim, Marcq, Moreau, Mni, Nezer, Simon, Vanmanshoven, Wagenaar and Georges2000; McRae et al., Reference McRae, MceWan, Dodds, Wilson, Crawford and Slate2002; Liu et al., Reference Liu, Todhunter, Lu, Schoettinger, Li, Littell, Bliss, Acland, Lust and Wu2006). The pattern of LD decay is well known in human populations, where the relationship between the LD and physical distance is graphically elucidated to infer the origin and evolution of humans (Tishkoff et al., Reference Tishkoff, Dietzsch, Speed, Pakstis, Kidd, Cheung, Bonne-Tamir, Santachiara-Benerecetti, Moral, Krings, Pääbo, Watson, Risch, Jenkins and Kidd1996, Reference Tishkoff, Varkonyi, Cahinhinan, Abbes, Argyropoulos, Destro-Bisol, Drousiotou, Dangerfield, Lefranc, Loiselet, Piro, Stoneking, Tagarelli, Tagarelli, Touma, Williams and Clark2001; Tishkoff & Williams, Reference Tishkoff and Williams2002).
The pattern of LD decay with genetic distance (measured in terms of the recombination fraction) can be used to characterize the genetic structure and dynamics of populations. This needs the simultaneous measures of the LD and recombination fraction between the same pair of loci. However, the estimation of these two parameters is usually based on different genetic designs; i.e. the estimation of the recombination fraction relies on a segregating pedigree, whereas the estimation of LD needs a random sample drawn from a natural population. More recently, several designs have been proposed to jointly measure the linkage and linkage disequilibrium for natural populations (Wu & Zeng, Reference Wu and Zeng2001) and domestic animals (Georges, Reference Georges2007). Simultaneous estimation of LD and the recombination fraction can avoid false positive results (spurious LD) when LD is used to fine-map genes for complex traits given the frequent occurrence of LD between distantly spaced loci or unlinked loci.
Wu & Zeng (Reference Wu and Zeng2001) proposed an open-pollinated (OP) design for population genetic studies of forest trees with molecular markers. For most forest tree species, seeds from a single mother tree are derived from the open pollination of unknown fathers from the pollen pool. By collecting OP seeds from a sample of individual trees in a natural population, Wu & Zeng's design did not take into account the hermaphroditical nature of a tree species in which both sexes exist on the same individual, and thus its seeds may be derived from both selfing and outcrossing pollination. Self-fertilization is thought to affect diversity by reduced effective population size and reduced genome-wide effective recombination rates, both due to increased homozygosity, elevated isolation among individuals and subpopulations induced by inbreeding (Charlesworth, Reference Charlesworth2003). Consequently, a predominantly selfing mode of reproduction may be expected to lead to low polymorphism, extensive LD and high population subdivision (Nordborg, Reference Nordborg2000; Ingvarsson, Reference Ingvarsson2002, Reference Ingvarsson2005). These predictions can be tested by simultaneous estimation of the outcrossing rate, recombination fraction and LD. Also, the OP design proposed by Wu & Zeng (Reference Wu and Zeng2001) did not incorporate a procedure for estimating the diplotype of heterozygous trees from which seeds are sampled. In this paper, we extend Wu & Zeng's OP progeny design to better understand the genetic structure of a natural population by simultaneously estimating multiple population genetic parameters with molecular markers. Simulation studies were performed to examine the statistical behaviour of the model.
Model
(i) Sampling and genotyping strategy
Suppose there is a natural population at Hardy–Weinberg equilibrium (HWE) for a dioecious plant species. Each plant in the population is OP by its own pollen (selfing) and randomly by the pollen from other individuals (outcrossing). Thus, seeds produced by each plant include a mix of offspring due to selfing and outcrossing pollination. We will randomly sample a set of maternal plants and further randomly collect a sample of seeds from each sampled plant. Because the fathers of seeds from a sampled maternal plant are unknown, this sampling strategy will generate a set of half-sib families. The collected seeds (embryos) are germinated into seedlings. DNA samples are taken from maternal plants and their offspring derived from the seeds for marker analysis.
A panel of molecular markers is typed to examine population genetic properties by estimating the recombination fraction, LD and outcrossing rate. Consider two markers, each with two alleles, 1 and 0, which are generally denoted by i for the first marker (i=1, 0) and j for the second marker (j=1, 0). Different alleles at each marker unite to form four gametes, whose frequencies in the population are expressed as
which sum to one, where p and 1−p are the frequencies of two alleles, 1 and 0, for the first marker, q and 1−q are the frequencies of two alleles for the second marker, and D is the degree of gametic LD between the two markers. It is assumed that there is no sex-specific difference in gamete frequencies, allele frequencies and LD in the population.
Among the sampled maternal plants, there are nine genotypes for the two markers considered, generally expressed as ii′jj′ (i⩾i′=1, 0; j⩾j′=1, 0). Let N ii′jj′ denote the number of maternal plants with genotype ii′jj′. Under the assumption of HWE, the frequency of a diplotype is the product of the frequencies of the gametes that form the diplotype. By collapsing those diplotypes that are observed as the same genotype, the frequencies of genotypes are generally expressed as
Table 1 gives the genotype frequencies of the maternal plants for the two markers in terms of haplotype or diplotype frequencies calculated with equation (2).
The seeds collected from each sampled maternal plant are typed for the two markers so that the genotype of each offspring can be known. The same offspring genotype from the same maternal genotype are mixed up. Let be the mixed number of offspring with genotype ll′rr′ (l⩾l′=1, 0; r⩾r′=1, 0) collected from N ii′jj′ maternal plants with genotype ii′jj′. From observed offspring genotypes, we will estimate key population genetic parameters that define population structure and organization.
(ii) Offspring structure
Each sampled maternal plant undergoes meiosis to produce male and female gametes. For those double homozygotes at the two markers considered, only one gamete type is yielded. The plants which are heterozygous only for one marker produce two types of gametes with equal frequency. For the double heterozygote plants, there are four possible types of gametes: 11, 10, 01 and 00. This type of plant has two possible diplotypes 11|00 and 01|10, which will produce different gamete frequencies expressed as a function of the recombination fraction (r) (Table 2). Of all double heterozygotes, there is a relative proportion of
for diplotype 11|00, and
for diplotype 10|01.
Each female gamete produced by a maternal plant unites at random with its own male gamete to form a selfing offspring, or with a gamete from the pollen pool to form an outcrossing offspring. Table 1 lists the frequencies of two-marker diplotypes (and therefore genotypes) in the selfing and outcrossing offspring populations produced by possible maternal genotypes. The pollen pool that contributes to the outcrossing seeds contains four male gametes, 11, 10, 01 and 00, whose frequencies are defined by p 11, p 10, p 01 and p 00, respectively. Let w be the outcrossing rate of the plant measured by the proportion of its offspring that are generated through fertilization by pollens of other plants in the population. Thus, the selfing rate of the plant that receives its own pollen to pollinate is .
Although marker genotypes of offspring sampled from a given maternal genotype can be observed, the mechanisms of genotype formation are unknown. The formation of progeny genotypes includes four mechanisms:
(1) Pollination behaviour: The same progeny genotype can be derived from the selfing or outcrossing of a maternal plant. Let denote the overall frequency of offspring genotype ll′rr′ derived from maternal genotype ii′jj′, which is generally expressed, by considering all possible mechanisms of genotype formation, as
(3)where and are the frequencies of offspring genotype ll′rr′ derived from maternal genotype ii′jj′ due to the maternal plant's selfing and outcrossing pollination, respectively (Table 1).(2) Sex origin of a gamete: The same progeny genotype may be due to reciprocal combinations between two gametes from male and female sides. For example, a maternal genotype 11/10 yields two female gametes, 11 and 10. When they unite with male gametes 10 and 11, respectively, the same progeny genotype results.
(3) The complementarity of gametes: If two female gametes of a maternal genotype are complementary to those of the pollen pool, such combinations will produce the same outcrossing progeny genotype. For example, two gametes of maternal genotype 11/10, 11 and 10 are respectively combined with complementary male gametes from the pollen pool, 00 and 01, to generate the same progeny genotype 10/10.
(4) Double heterozygote of a maternal plant: This type of plant contains two different diplotypes which produce the same arrays of gametes but with different relative proportions (see Table 2).
(iii) Likelihood and estimation
Based on the structure of offspring genotypes in Table 1, we construct a log likelihood for parameters Θ=(p 11, p 10, p 01, p 00, w, r) as
Likelihood (4) is implemented with the Expectation–Maximization (EM) algorithm to obtain the maximum likelihood estimates (MLEs) of Θ.
(a) Estimation of gamete frequencies
In the E step, calculate the expected numbers of gametes 11, 10, 01 and 00 within each observed offspring genotype derived from a maternal genotype, expressed as
Tables 3–6 provide the formulae for estimating the expected numbers of gametes 11, 10, 01 and 00, respectively. In the M step, estimate the gamete frequencies using
Estimation of the recombination fraction
In the E step, calculate the expected number of r within an offspring genotype derived from a maternal plant of double heterozygote using
In the M step, estimate the recombination fraction using
Estimation of outcrossing rate
In the E step, calculate the expected number of w within each possible offspring genotype derived from a maternal genotype using
In the M step, calculate the outcrossing rate using
Note that not all offspring genotypes contain w if they are not derived from a double heterozygote maternal plant (see Table 1), although the summation over all possible genotypes is generally given.
An iterative loop of E and M steps between equations (5), (7) and (9) and equations (6), (8) and (10) is constructed to estimate the parameters. After haplotype frequencies are estimated, allele frequencies at two markers and their LD (D) are estimated as
(iv) Hypothesis testing
The genetic parameters estimated, i.e. the LD (D), outcrossing rate (w) and recombination fraction (r), can be used to describe the genetic structure of a population. These parameters should be tested for their significance. The following hypotheses are formulated:
For each hypothesis, the likelihoods, and , are calculated, respectively, where the tilde corresponds to the MLEs for the null hypothesis and the hat corresponds to the MLEs for the alternative hypothesis. The log-likelihood ratio test statistic is then calculated by using
which is asymptotically χ2-distributed with one degree of freedom. The estimates of the parameters under each null hypothesis should be derived separately.
3. Computer simulation
(i) Design
Computer simulation was conducted to examine the statistical properties of the two-locus model for estimating the LD, outcrossing rate and recombination fraction between different molecular markers in a natural population. We consider a set of OP families randomly derived from a natural population, in which the genotype distribution of two given markers were simulated with their frequencies. The frequencies of two-locus genotypes (derived from diplotypes) are determined by gamete frequencies, specified by allele frequencies and LD in a population and the recombination fraction for a given maternal plant. We will consider the influences of different outcrossing rates (i.e. w=0·1 for low, 0·5 for medium and 0·9 for high), different recombination fractions (i.e. r=0·05 for a strong linkage and 0·25 for a weak linkage) and different linkage disequilibria (i.e. D=0·02 for strong independence and 0·10 for weak independence) on parameter estimation. We will consider all these possible combinations of parameter values.
To provide practical guidance on the use of this model, we simulate marker data with three different sampling strategies. A fixed number of samples (say 1000) can be allocated among and within OP families. We will use three sampling strategies: (1) small family number×large family size’ (10×100), (2) moderate family number×moderate family size (32×32), and large family number×small family size (100×10). Results under each of these strategies will be given.
(ii) Results
Tables 7–9 summarize the simulation results with different parameters and sampling strategies. In general, the model provides reasonable estimates of all parameters, although the accuracy and precision of parameter estimates depend on the values of these parameters, sampling strategies and interactions among all the factors. The estimation of the recombination fraction tends to prefer the ‘small family number×large family size’ sampling strategy (Table 7). It appears that the `large family number×small family size’ sampling strategy is favourable for the estimation of population genetic parameters including allele frequencies, LD and outcrossing rate (Table 9). The ‘moderate family number×moderate family size’ sampling strategy is somewhat in between (Table 8). In all the strategies, the estimation precision of allele frequencies and LD increases with increasing outcrossing rate. The recombination fraction can be estimated more precisely when the two markers are strongly linked. It is interesting to see that increasing LD leads to better estimation of the recombination fraction. As expected, increasing the outcrossing rate reduces the estimation precision of the recombination fraction. Especially, when outcrossing rate is very high (w=0·9), the recombination fraction will be poorly estimated for the two markers that are strongly independent.
There is reasonably good power for detecting a significant LD and linkage between two markers, although such a power varies with the values of the parameters (results not shown). It seems that the power detection is not sensitive to sampling strategies. There are marked interactions in the power sensitivity between parameter values. The power of linkage detection decreases with increasing outcrossing rate, whereas the power of LD detection increases with increasing outcrossing rate.
4. Discussion
The past two decades have witnessed a dramatic increase of interest in molecular marker technologies and their applications to study the genetic structure of a natural population and map QTLs (quantitative trait loci) responsible for a quantitative trait (Reich et al., Reference Reich, Cargill, Bolk, Ireland, Sabeti, Richter, Lavery, Kouyoumjian, Farhadian, Ward and Lander2001; Ardlie et al., Reference Ardlie, Kruglyak and Seielstad2002; Dawson et al., Reference Dawson, Abecasis, Bumpstead, Chen, Hunt, Beare, Pabial, Dibling, Tinsley, Kirby, Carter, Papaspyridonos, Livingstone, Ganske, Lohmmussaar, Zernant, Tonisson, Remm, Magi, Puurand, Vilo, Kurg, Rice, Deloukas, Mott, Metspalu, Bentley, Cardon and Dunham2002; Gabriel et al., Reference Gabriel, Schaffner, Nguyen, Moore, Roy, Blumenstiel, Higgins, DeFelice, Lochner, Faggart, Liu-Cordero, Rotimi, Adeyemo, Cooper, Ward, Lander, Daly and Altshuler2002; reviewed in Georges Reference Georges2007). In this paper, we have proposed an algorithmic approach for constructing the linkage–linkage disequilibrium map of a genome by genotyping a set of OP seeds sampled from a natural population. By estimating several key population genetic parameters, i.e. the relative occur-rence of selfing and outcrossing, LD, heterozygosity (estimated from allele frequencies) and recombination fraction, this approach will provide a tool for better understanding the pattern and organization of genetic variation in outcrossing populations. Furthermore, by elucidating the relationship between the linkage and LD in terms of the so-called LD map, the new algorithm can be used to infer the evolutionary history and process of natural populations and to identify genes for disease or yield traits (Remington et al., Reference Remington, Thornsberry, Matsuoka, Wilson, Whitt, Doebley, Kresovich, Goodman and Buckler2001; Ardlie et al., Reference Ardlie, Kruglyak and Seielstad2002; Farnir et al., Reference Farnir, Coppieters, Arranz, Berzi, Cambisano, Grisart, Karim, Marcq, Moreau, Mni, Nezer, Simon, Vanmanshoven, Wagenaar and Georges2000; McRae et al., Reference McRae, MceWan, Dodds, Wilson, Crawford and Slate2002; Rafalski & Morgante, Reference Rafalski and Morgante2004).
The new approach capitalizes on the outcrossing nature of plants, allowing a certain proportion of selfing. Outcrossing is a common characteristic of many plants, including economically and ecologically important species like poplar, eucalyptus, pine and spruce (Butcher & Southerton, Reference Butcher, Southerton, Guimaraes, Ruane, Scherf, Sonnino and Dargie2007; Miller & Schaal, Reference Miller and Schaal2006). This approach will find its immediate application in the genetic research of these important but understudied species. It has three significant advantages. First, it is simple and easily deployed in practice. By sampling and genotyping half-sib seeds from multiple maternal plants in a population, the approach provides the estimation of important population genetic parameters. Second, we derived a group of EM-based closed forms for parameter estimation for the OP-based sampling strategy, greatly facilitating the computing process of the parameters. The accuracy and precision of parameter estimation are affected by many factors including sample size and parameter range. A reasonable sampling strategy including the relative importance of family number and family size can be readily determined from simulation studies. Third, the approach allows the test of a number of meaningful hypotheses about the linkage, LD and outcrossing rate, providing a quantitative framework for understanding the genetic structure of a natural outcrossing population.
The approach can be extended to consider multiallelic markers and dominant markers. Unlike many annual crops, forest trees are still in wild or semi-wild conditions, in which there is a rich source of variation due to many alleles at a single gene. Multiallelic markers like microsatellites are a vital tool for the population genetic study of forest trees. On the other hand, for many underrepresented organisms, some economically cheap dominant markers are still useful although their informativeness is limited (Kuang et al., Reference Kuang, Richardson, Carson and Bongarten1998; Silbiger et al., Reference Silbiger, Christ, Leonard, Garg, Lattier, Dawes, Dimsoski, McCormick, Wessendarp, Gordon, Roth, Smith and Toth1998; Kremer et al., Reference Kremer, Caron, Cavers, Colpaert, Gheysen, Gribe, Lemes, Lowe, Margis, Navarro and Salgueiro2005). When the OP-based sampling strategy considers multiallelic or dominant markers, new, more complicated algorithms need to be derived. Li et al. (Reference Li, Li, Wu, Han, Wang, Hou, Zeng and Wu2007) derived a model for the LD between dominant markers in a diploid population. Their model can be integrated with our OP-based sampling strategy to provide a comprehensive estimation of population genetic parameters with dominant markers. The computer code of the proposed algorithm is available from the corresponding author upon request.
This work was partially supported by Joint NSF/NIH grant number DMS/NIGMS-0540745 and NNSFC grant number 30771752.