An algorithmic model for constructing a linkage and linkage disequilibrium map in outcrossing plant populations

JIAHAN LI; QIN LI; WEI HOU; KUN HAN; YAO LI; SONG WU; YANCHUN LI; RONGLING WU

doi:10.1017/S0016672308009932

Published online by Cambridge University Press: 17 February 2009

JIAHAN LI*: Department of Statistics, University of Florida, Gainesville, FL 32611, USA
QIN LI*: Department of Statistics, University of Florida, Gainesville, FL 32611, USA
WEI HOU*: Department of Epidemiology and Health Policy Research, University of Florida, Gainesville, FL 32611, USA
KUN HAN*: School of Forestry and Biotechnology, Zhejiang Forestry University, Lin'an, Zhejiang 311300, People's Republic of China
YAO LI: Department of Statistics, University of Florida, Gainesville, FL 32611, USA
SONG WU: Department of Statistics, University of Florida, Gainesville, FL 32611, USA
YANCHUN LI: School of Forestry and Biotechnology, Zhejiang Forestry University, Lin'an, Zhejiang 311300, People's Republic of China
RONGLING WU*: Department of Statistics, University of Florida, Gainesville, FL 32611, USA; School of Forestry and Biotechnology, Zhejiang Forestry University, Lin'an, Zhejiang 311300, People's Republic of China; and Department of Operations Research and Financial Engineering, Princeton University, Princeton, NJ 08544, USA
*These authors contributed equally to this work.

Summary

A linkage–linkage disequilibrium map that describes the pattern and extent of linkage disequilibrium (LD) decay with genomic distance has now emerged as a viable tool to unravel the genetic structure of population differentiation and fine-map genes for complex traits. The prerequisite for constructing such a map is the simultaneous estimation of the linkage and LD between different loci. Here, we develop a computational algorithm for simultaneously estimating the recombination fraction and LD, two parameters that are estimated separately in most molecular genetic studies, in a natural outcrossing population with multilocus marker data. The algorithm is founded on a commonly used progeny test with open-pollinated offspring sampled from a natural population. The information about LD is reflected in the co-segregation of alleles at different loci among parents in the population. Open mating of parents will reveal the genetic linkage of alleles during meiosis. The algorithm was constructed within the polynomial-based mixture framework and implemented with the Expectation–Maximization (EM) algorithm. A by-product of the derivation of this algorithm is the estimation of the outcrossing rate, a parameter useful for exploring the genetic diversity of the population. We performed computer simulation to investigate the influences of different sampling strategies and different values of parameters on parameter estimation. By providing a number of testable hypotheses about population genetic parameters, this algorithmic model will open a broad gateway to understanding the genetic structure and dynamics of an outcrossing population under natural selection.

Type: Paper
Information: Genetics Research, Volume 91, Issue 1, February 2009, pp. 9–21

DOI: https://doi.org/10.1017/S0016672308009932
Copyright: Copyright © 2009 Cambridge University Press

1. Introduction

The pattern of LD decay with genetic distance (measured in terms of the recombination fraction) can be used to characterize the genetic structure and dynamics of populations. This requires simultaneous measurement of the LD and the recombination fraction between the same pair of loci. However, the estimation of these two parameters is usually based on different genetic designs; i.e. the estimation of the recombination fraction relies on a segregating pedigree, whereas the estimation of LD needs a random sample drawn from a natural population. More recently, several designs have been proposed to jointly measure the linkage and LD for natural populations (Wu & Zeng, 2001) and domestic animals (Georges, 2007). Simultaneous estimation of LD and the recombination fraction can avoid false positive results (spurious LD) when LD is used to fine-map genes for complex traits, given the frequent occurrence of LD between distantly spaced or unlinked loci.

Wu & Zeng (2001) proposed an open-pollinated (OP) design for population genetic studies of forest trees with molecular markers. For most forest tree species, seeds from a single mother tree are derived from the open pollination of unknown fathers from the pollen pool. Wu & Zeng's design collects OP seeds from a sample of individual trees in a natural population, but it does not take into account the hermaphroditic nature of tree species in which both sexes exist on the same individual, so that seeds may be derived from both selfing and outcrossing pollination. Self-fertilization is thought to affect diversity by reducing the effective population size and genome-wide effective recombination rates, both due to increased homozygosity, and by elevating isolation among individuals and subpopulations through inbreeding (Charlesworth, 2003). Consequently, a predominantly selfing mode of reproduction may be expected to lead to low polymorphism, extensive LD and high population subdivision (Nordborg, 2000; Ingvarsson, 2002, 2005). These predictions can be tested by simultaneous estimation of the outcrossing rate, recombination fraction and LD. In addition, the OP design proposed by Wu & Zeng (2001) did not incorporate a procedure for estimating the diplotype of heterozygous trees from which seeds are sampled. In this paper, we extend Wu & Zeng's OP progeny design to better understand the genetic structure of a natural population by simultaneously estimating multiple population genetic parameters with molecular markers. Simulation studies were performed to examine the statistical behaviour of the model.

2. Model

(i) Sampling and genotyping strategy

Suppose there is a natural population at Hardy–Weinberg equilibrium (HWE) for a monoecious plant species, i.e. one in which each individual bears both sexes. Each plant in the population is pollinated by its own pollen (selfing) and, at random, by the pollen from other individuals (outcrossing). Thus, seeds produced by each plant include a mix of offspring due to selfing and outcrossing pollination. We will randomly sample a set of maternal plants and further randomly collect a sample of seeds from each sampled plant. Because the fathers of seeds from a sampled maternal plant are unknown, this sampling strategy will generate a set of half-sib families. The collected seeds (embryos) are germinated into seedlings. DNA samples are taken from maternal plants and their offspring derived from the seeds for marker analysis.

A panel of molecular markers is typed to examine population genetic properties by estimating the recombination fraction, LD and outcrossing rate. Consider two markers, each with two alleles, 1 and 0, which are generally denoted by i for the first marker (i=1, 0) and j for the second marker (j=1, 0). Different alleles at each marker unite to form four gametes, whose frequencies in the population are expressed as

(1)

$\begin{array}{ll}
p_{11} = pq + D & \text{for gamete } 11, \\
p_{10} = p(1 - q) - D & \text{for gamete } 10, \\
p_{01} = (1 - p)q - D & \text{for gamete } 01, \\
p_{00} = (1 - p)(1 - q) + D & \text{for gamete } 00,
\end{array}$

which sum to one, where p and 1−p are the frequencies of two alleles, 1 and 0, for the first marker, q and 1−q are the frequencies of two alleles for the second marker, and D is the degree of gametic LD between the two markers. It is assumed that there is no sex-specific difference in gamete frequencies, allele frequencies and LD in the population.
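Equation (1) can be checked numerically with a few lines of Python. This is a minimal sketch rather than the authors' code: the function name `gamete_frequencies` and the values of p, q and D are illustrative assumptions.

```python
def gamete_frequencies(p, q, D):
    """Return the four gamete frequencies of equation (1), keyed by gamete."""
    freqs = {
        "11": p * q + D,
        "10": p * (1 - q) - D,
        "01": (1 - p) * q - D,
        "00": (1 - p) * (1 - q) + D,
    }
    # D is only admissible if every gamete frequency stays non-negative.
    if min(freqs.values()) < 0:
        raise ValueError("D is outside the range allowed by p and q")
    return freqs


# Illustrative values (assumed, not taken from the paper).
freqs = gamete_frequencies(p=0.6, q=0.5, D=0.05)
total = sum(freqs.values())
```

The range check reflects the constraint that D is bounded by the allele frequencies; a value of D that drives any gamete frequency below zero is rejected.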

Among the sampled maternal plants, there are nine genotypes for the two markers considered, generally expressed as ii′jj′ (i⩾i′=1, 0; j⩾j′=1, 0). Let $N_{ii'jj'}$ denote the number of maternal plants with genotype ii′jj′. Under the assumption of HWE, the frequency of a diplotype is the product of the frequencies of the gametes that form the diplotype. By collapsing those diplotypes that are observed as the same genotype, the frequencies of genotypes are generally expressed as

(2)

$P_{ii'jj'} = \begin{cases}
p_{ij}^{2} & \text{for } i = i' \text{ and } j = j', \\
p_{ij} p_{ij'} + p_{ij'} p_{ij} & \text{for } i = i' \text{ and } j \ne j', \\
p_{ij} p_{i'j} + p_{i'j} p_{ij} & \text{for } i \ne i' \text{ and } j = j', \\
p_{ij} p_{i'j'} + p_{i'j'} p_{ij} + p_{ij'} p_{i'j} + p_{i'j} p_{ij'} & \text{for } i \ne i' \text{ and } j \ne j'.
\end{cases}$

Table 1 gives the genotype frequencies of the maternal plants for the two markers in terms of haplotype or diplotype frequencies calculated with equation (2).
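The collapsing of diplotypes into the nine observable genotypes in equation (2) can be sketched by enumerating all ordered pairs of gametes. The helper names and the string encoding of genotypes (for example `"10/10"` for the double heterozygote) are assumptions made for this illustration, as are the gamete frequencies.

```python
from itertools import product

GAMETES = ("11", "10", "01", "00")


def genotype_of(g1, g2):
    """Collapse an ordered pair of gametes into an unphased genotype label."""
    first = "".join(sorted(g1[0] + g2[0], reverse=True))   # alleles at marker 1
    second = "".join(sorted(g1[1] + g2[1], reverse=True))  # alleles at marker 2
    return f"{first}/{second}"


def genotype_frequencies(freqs):
    """Genotype frequencies under HWE: sum gamete products over diplotypes."""
    out = {}
    for g1, g2 in product(GAMETES, repeat=2):
        key = genotype_of(g1, g2)
        out[key] = out.get(key, 0.0) + freqs[g1] * freqs[g2]
    return out


# Illustrative gamete frequencies (assumed values).
freqs = {"11": 0.35, "10": 0.25, "01": 0.15, "00": 0.25}
geno = genotype_frequencies(freqs)
```

Summing over ordered pairs reproduces the doubled terms of equation (2) automatically, and the double heterozygote `"10/10"` receives contributions from both of its diplotypes.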

Table 1. Diplotype and genotype frequencies of two markers, A and B, in the offspring population through outcrossing and selfing pollination

The seeds collected from each sampled maternal plant are typed for the two markers so that the genotype of each offspring can be known. Offspring with the same genotype that derive from maternal plants of the same genotype are pooled. Let $N_{ii'jj'}^{ll'rr'}$ be the pooled number of offspring with genotype ll′rr′ (l⩾l′=1, 0; r⩾r′=1, 0) collected from the $N_{ii'jj'}$ maternal plants with genotype ii′jj′. From observed offspring genotypes, we will estimate key population genetic parameters that define population structure and organization.

(ii) Offspring structure

Each sampled maternal plant undergoes meiosis to produce male and female gametes. For those double homozygotes at the two markers considered, only one gamete type is yielded. The plants which are heterozygous only for one marker produce two types of gametes with equal frequency. For the double heterozygote plants, there are four possible types of gametes: 11, 10, 01 and 00. This type of plant has two possible diplotypes, 11|00 and 10|01, which will produce different gamete frequencies expressed as a function of the recombination fraction (r) (Table 2). Of all double heterozygotes, there is a relative proportion of

$\phi = \frac{p_{11} p_{00}}{p_{11} p_{00} + p_{10} p_{01}}$

for diplotype 11|00, and

$\bar{\phi} = 1 - \phi = \frac{p_{10} p_{01}}{p_{11} p_{00} + p_{10} p_{01}}$

for diplotype 10|01.

Table 2. Two possible diplotypes of a maternal plant of double heterozygote and the frequencies of its four gametes for two markers
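Because the body of Table 2 is not reproduced here, the sketch below fills it in with the standard two-locus meiosis expectations: each non-recombinant gamete has frequency (1 − r)/2 and each recombinant gamete r/2. It then averages over the two diplotypes with weights φ and 1 − φ. The function name and numerical values are assumptions for illustration.

```python
def double_heterozygote_gametes(freqs, r):
    """Return phi and the phase-averaged gamete frequencies of a 10/10 mother."""
    coupling = freqs["11"] * freqs["00"]    # weight of diplotype 11|00
    repulsion = freqs["10"] * freqs["01"]   # weight of diplotype 10|01
    phi = coupling / (coupling + repulsion)
    nonrec, rec = (1 - r) / 2, r / 2
    # Gamete frequencies conditional on each diplotype (standard expectations).
    table2 = {
        "11|00": {"11": nonrec, "10": rec, "01": rec, "00": nonrec},
        "10|01": {"11": rec, "10": nonrec, "01": nonrec, "00": rec},
    }
    mixed = {
        g: phi * table2["11|00"][g] + (1 - phi) * table2["10|01"][g]
        for g in ("11", "10", "01", "00")
    }
    return phi, mixed


# Illustrative gamete frequencies and recombination fraction (assumed values).
freqs = {"11": 0.35, "10": 0.25, "01": 0.15, "00": 0.25}
phi, mixed = double_heterozygote_gametes(freqs, r=0.1)
```

When D = 0 the two diplotypes are equally likely (φ = 0·5) and the averaged gamete frequencies no longer depend on r, which is one way to see why LD carries information about linkage phase in this design.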

Each female gamete produced by a maternal plant unites at random with its own male gamete to form a selfing offspring, or with a gamete from the pollen pool to form an outcrossing offspring. Table 1 lists the frequencies of two-marker diplotypes (and therefore genotypes) in the selfing and outcrossing offspring populations produced by possible maternal genotypes. The pollen pool that contributes to the outcrossing seeds contains four male gametes, 11, 10, 01 and 00, whose frequencies are $p_{11}$, $p_{10}$, $p_{01}$ and $p_{00}$, respectively. Let w be the outcrossing rate of the plant, measured by the proportion of its offspring that are generated through fertilization by pollen of other plants in the population. Thus, the selfing rate, i.e. the proportion of offspring fertilized by the plant's own pollen, is $\bar{w} = 1 - w$.

Although marker genotypes of offspring sampled from a given maternal genotype can be observed, the mechanisms of genotype formation are unknown. The formation of progeny genotypes includes four mechanisms:

(1) Pollination behaviour: The same progeny genotype can be derived from the selfing or outcrossing of a maternal plant. Let $P_{ii'jj'}^{ll'rr'}$ denote the overall frequency of offspring genotype ll′rr′ derived from maternal genotype ii′jj′, which is generally expressed, by considering all possible mechanisms of genotype formation, as
(3)
$P_{ii'jj'}^{ll'rr'} = \bar{w} S_{ii'jj'}^{ll'rr'} + w O_{ii'jj'}^{ll'rr'},$
where $S_{ii'jj'}^{ll'rr'}$ and $O_{ii'jj'}^{ll'rr'}$ are the frequencies of offspring genotype ll′rr′ derived from maternal genotype ii′jj′ due to the maternal plant's selfing and outcrossing pollination, respectively (Table 1).
(2) Sex origin of a gamete: The same progeny genotype may be due to reciprocal combinations between two gametes from male and female sides. For example, a maternal genotype 11/10 yields two female gametes, 11 and 10. When they unite with male gametes 10 and 11, respectively, the same progeny genotype results.
(3) The complementarity of gametes: If two female gametes of a maternal genotype are complementary to those of the pollen pool, such combinations will produce the same outcrossing progeny genotype. For example, two gametes of maternal genotype 11/10, 11 and 10 are respectively combined with complementary male gametes from the pollen pool, 00 and 01, to generate the same progeny genotype 10/10.
(4) Double heterozygote of a maternal plant: This type of plant contains two different diplotypes which produce the same arrays of gametes but with different relative proportions (see Table 2).
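The four mechanisms above can be made concrete by enumerating every egg and pollen combination for one maternal genotype. The sketch below is not the authors' implementation and does not reproduce Table 1; it assumes HWE weights for the maternal diplotypes, the same recombination fraction in male and female meiosis, and hypothetical helper names and parameter values.

```python
from itertools import product

GAMETES = ("11", "10", "01", "00")


def genotype_of(g1, g2):
    """Collapse an ordered pair of gametes into an unphased genotype label."""
    first = "".join(sorted(g1[0] + g2[0], reverse=True))
    second = "".join(sorted(g1[1] + g2[1], reverse=True))
    return f"{first}/{second}"


def meiosis(h1, h2, r):
    """Gamete distribution produced by diplotype h1|h2."""
    out = dict.fromkeys(GAMETES, 0.0)
    recombinants = (h1[0] + h2[1], h2[0] + h1[1])
    for g in (h1, h2):
        out[g] += (1 - r) / 2
    for g in recombinants:
        out[g] += r / 2
    return out


def offspring_frequencies(maternal, freqs, w, r):
    """Return (P, S, O) of equation (3) for one maternal genotype."""
    # Diplotypes compatible with the maternal genotype, weighted under HWE.
    phases = {
        (h1, h2): freqs[h1] * freqs[h2]
        for h1, h2 in product(GAMETES, repeat=2)
        if genotype_of(h1, h2) == maternal
    }
    total = sum(phases.values())
    S, O = {}, {}
    for (h1, h2), weight in phases.items():
        share = weight / total
        eggs = meiosis(h1, h2, r)
        for egg, pollen in product(GAMETES, repeat=2):
            key = genotype_of(egg, pollen)
            # Selfing: the pollen gamete comes from the same diplotype.
            S[key] = S.get(key, 0.0) + share * eggs[egg] * eggs[pollen]
            # Outcrossing: the pollen gamete is drawn from the pollen pool.
            O[key] = O.get(key, 0.0) + share * eggs[egg] * freqs[pollen]
    P = {k: (1 - w) * S[k] + w * O[k] for k in S}
    return P, S, O


# Illustrative parameter values (assumed).
freqs = {"11": 0.35, "10": 0.25, "01": 0.15, "00": 0.25}
P_hom, S_hom, O_hom = offspring_frequencies("11/11", freqs, w=0.5, r=0.1)
P_het, S_het, O_het = offspring_frequencies("10/10", freqs, w=0.5, r=0.1)
```

For the double homozygote `"11/11"`, selfing yields only `"11/11"` offspring, so any other offspring genotype must come from outcrossing. For the double heterozygote, the enumeration runs over both diplotypes, which is where r enters the offspring frequencies.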

(iii) Likelihood and estimation

Based on the structure of offspring genotypes in Table 1, we construct a log likelihood for parameters $\Theta = (p_{11}, p_{10}, p_{01}, p_{00}, w, r)$ as

(4)

$\log L(\Theta) = \sum_{i \ge i' = 0}^{1} \sum_{j \ge j' = 0}^{1} \sum_{l \ge l' = 0}^{1} \sum_{r \ge r' = 0}^{1} N_{ii'jj'}^{ll'rr'} \log\left( P_{ii'jj'} P_{ii'jj'}^{ll'rr'} \right).$

Likelihood (4) is implemented with the Expectation–Maximization (EM) algorithm to obtain the maximum likelihood estimates (MLEs) of Θ.

(a) Estimation of gamete frequencies

In the E step, calculate the expected numbers of gametes 11, 10, 01 and 00 within each observed offspring genotype derived from a maternal genotype, expressed as

(5)

${}_{k}\Psi_{ii'jj'}^{ll'rr'} \quad (k = 11, 10, 01, 00).$

Tables 3–6 provide the formulae for estimating the expected numbers of gametes 11, 10, 01 and 00, respectively. In the M step, estimate the gamete frequencies using

(6)

$p_{k} = \frac{\sum_{i \ge i' = 0}^{1} \sum_{j \ge j' = 0}^{1} \sum_{l \ge l' = 0}^{1} \sum_{r \ge r' = 0}^{1} {}_{k}\Psi_{ii'jj'}^{ll'rr'} N_{ii'jj'}^{ll'rr'}}{2 \sum_{i \ge i' = 0}^{1} \sum_{j \ge j' = 0}^{1} \sum_{l \ge l' = 0}^{1} \sum_{r \ge r' = 0}^{1} N_{ii'jj'}^{ll'rr'}}.$

Table 3. The expected number (${}_{11}\Psi_{ii'jj'}^{ll'rr'}$) of gamete 11 within an offspring genotype derived from a maternal genotype. Note that ${}_{11}\Psi_{ii'jj'}^{ll'rr'}$ for the double heterozygote is obtained by dividing the expression in the table by both the frequency of the maternal genotype ($P_{ii'jj'}$) and the overall frequency of the corresponding offspring genotype ($P_{ii'jj'}^{ll'rr'}$), whereas ${}_{11}\Psi_{ii'jj'}^{ll'rr'}$ for all the other genotypes is calculated by dividing the expression only by the overall frequency of the corresponding offspring genotype

Table 4. The expected number (${}_{10}\Psi_{ii'jj'}^{ll'rr'}$) of gamete 10 within an offspring genotype derived from a maternal genotype. Note that ${}_{10}\Psi_{ii'jj'}^{ll'rr'}$ for the double heterozygote is obtained by dividing the expression in the table by both the frequency of the maternal genotype ($P_{ii'jj'}$) and the overall frequency of the corresponding offspring genotype ($P_{ii'jj'}^{ll'rr'}$), whereas ${}_{10}\Psi_{ii'jj'}^{ll'rr'}$ for all the other genotypes is calculated by dividing the expression only by the overall frequency of the corresponding offspring genotype

Table 5. The expected number (${}_{01}\Psi_{ii'jj'}^{ll'rr'}$) of gamete 01 within an offspring genotype derived from a maternal genotype. Note that ${}_{01}\Psi_{ii'jj'}^{ll'rr'}$ for the double heterozygote is obtained by dividing the expression in the table by both the frequency of the maternal genotype ($P_{ii'jj'}$) and the overall frequency of the corresponding offspring genotype ($P_{ii'jj'}^{ll'rr'}$), whereas ${}_{01}\Psi_{ii'jj'}^{ll'rr'}$ for all the other genotypes is calculated by dividing the expression only by the overall frequency of the corresponding offspring genotype

Table 6. The expected number (${}_{00}\Psi_{ii'jj'}^{ll'rr'}$) of gamete 00 within an offspring genotype derived from a maternal genotype. Note that ${}_{00}\Psi_{ii'jj'}^{ll'rr'}$ for the double heterozygote is obtained by dividing the expression in the table by both the frequency of the maternal genotype ($P_{ii'jj'}$) and the overall frequency of the corresponding offspring genotype ($P_{ii'jj'}^{ll'rr'}$), whereas ${}_{00}\Psi_{ii'jj'}^{ll'rr'}$ for all the other genotypes is calculated by dividing the expression only by the overall frequency of the corresponding offspring genotype

(b) Estimation of the recombination fraction

In the E step, calculate the expected number of recombinant gametes within an offspring genotype derived from a double-heterozygous maternal plant, writing $\bar{r} = 1 - r$, using

(7)

$\begin{aligned}
R_{10/10}^{11/11} &= \frac{\bar{\phi} r (w p_{11} + \bar{w} r)}{2 P_{10/10}^{11/11}}, &
R_{10/10}^{11/10} &= \frac{r [w (\phi p_{11} + \bar{\phi} p_{10}) + \bar{w} \bar{r}]}{2 P_{10/10}^{11/10}}, \\
R_{10/10}^{11/00} &= \frac{\phi r (w p_{10} + \bar{w} r)}{2 P_{10/10}^{11/00}}, &
R_{10/10}^{10/11} &= \frac{r [w (\phi p_{11} + \bar{\phi} p_{01}) + \bar{w} \bar{r}]}{2 P_{10/10}^{10/11}}, \\
R_{10/10}^{10/10} &= \frac{r [\phi w (p_{10} + p_{01}) + \bar{\phi} w (p_{11} + p_{00}) + 2 \bar{w} r]}{2 P_{10/10}^{10/10}}, \\
R_{10/10}^{10/00} &= \frac{r [w (\phi p_{00} + \bar{\phi} p_{10}) + \bar{w} \bar{r}]}{2 P_{10/10}^{10/00}}, &
R_{10/10}^{00/11} &= \frac{\phi r (w p_{01} + \bar{w} r)}{2 P_{10/10}^{00/11}}, \\
R_{10/10}^{00/10} &= \frac{r [w (\phi p_{00} + \bar{\phi} p_{01}) + \bar{w} \bar{r}]}{2 P_{10/10}^{00/10}}, &
R_{10/10}^{00/00} &= \frac{\bar{\phi} r (w p_{00} + \bar{w} r)}{2 P_{10/10}^{00/00}}.
\end{aligned}$

In the M step, estimate the recombination fraction using

(8)

$r = \frac{\sum_{l \ge l' = 0}^{1} \sum_{r \ge r' = 0}^{1} R_{10/10}^{ll'rr'} N_{10/10}^{ll'rr'}}{\sum_{l \ge l' = 0}^{1} \sum_{r \ge r' = 0}^{1} N_{10/10}^{ll'rr'}}.$

(c) Estimation of outcrossing rate

In the E step, calculate the expected proportion of outcrossed offspring (w) within each possible offspring genotype derived from a maternal genotype using

(9)

$W_{ii'jj'}^{ll'rr'} = \frac{w O_{ii'jj'}^{ll'rr'}}{\bar{w} S_{ii'jj'}^{ll'rr'} + w O_{ii'jj'}^{ll'rr'}}.$

In the M step, calculate the outcrossing rate using

(10)

$w = \frac{\sum_{i \ge i' = 0}^{1} \sum_{j \ge j' = 0}^{1} \sum_{l \ge l' = 0}^{1} \sum_{r \ge r' = 0}^{1} W_{ii'jj'}^{ll'rr'} N_{ii'jj'}^{ll'rr'}}{\sum_{i \ge i' = 0}^{1} \sum_{j \ge j' = 0}^{1} \sum_{l \ge l' = 0}^{1} \sum_{r \ge r' = 0}^{1} N_{ii'jj'}^{ll'rr'}}.$

Note that not all offspring genotypes contain w if they are not derived from a double heterozygote maternal plant (see Table 1), although the summation over all possible genotypes is generally given.
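The E and M steps for the outcrossing rate in equations (9) and (10) amount to the usual EM update of a two-component mixture weight. The sketch below iterates that update for a single maternal genotype (the double homozygote 11/11) with S and O held fixed. The offspring counts and gamete frequencies are hypothetical, and the function name is an assumption.

```python
def update_outcrossing_rate(w, S, O, N):
    """One EM iteration for w: E step of equation (9), M step of equation (10)."""
    numerator = 0.0
    total = 0.0
    for key, count in N.items():
        mixture = (1 - w) * S[key] + w * O[key]
        if count == 0 or mixture == 0:
            continue
        # Posterior probability that an offspring of this genotype is outcrossed.
        posterior = w * O[key] / mixture
        numerator += posterior * count
        total += count
    return numerator / total


# Offspring of 11/11 mothers: selfing gives only 11/11, outcrossing follows
# the pollen-pool gamete frequencies (illustrative values).
S = {"11/11": 1.0, "11/10": 0.0, "10/11": 0.0, "10/10": 0.0}
O = {"11/11": 0.35, "11/10": 0.25, "10/11": 0.15, "10/10": 0.25}
N = {"11/11": 70, "11/10": 12, "10/11": 8, "10/10": 10}  # hypothetical counts

w = 0.5
for _ in range(100):
    w = update_outcrossing_rate(w, S, O, N)
```

In this toy case the maximum likelihood estimate has the closed form 30/65, because the 30 offspring that are not 11/11 can only arise by outcrossing and such genotypes occur with probability 0·65w. The iteration converges to that value, which gives a simple check on the update.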

An iterative loop of E and M steps between equations (5), (7) and (9) and equations (6), (8) and (10) is constructed to estimate the parameters. After haplotype frequencies are estimated, allele frequencies at two markers and their LD (D) are estimated as

$\begin{aligned}
\hat{p} &= \hat{p}_{11} + \hat{p}_{10}, \\
\hat{q} &= \hat{p}_{11} + \hat{p}_{01}, \\
\hat{D} &= \hat{p}_{11} \hat{p}_{00} - \hat{p}_{10} \hat{p}_{01}.
\end{aligned}$
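These three expressions invert equation (1): substituting the gamete frequencies of equation (1) into $p_{11}p_{00} - p_{10}p_{01}$ and expanding leaves D, because the four products of allele frequencies that multiply D sum to one. A short numerical check, with an assumed function name and illustrative gamete frequencies:

```python
def allele_frequencies_and_ld(freqs):
    """Collapse gamete frequencies into allele frequencies p, q and the LD D."""
    p = freqs["11"] + freqs["10"]
    q = freqs["11"] + freqs["01"]
    D = freqs["11"] * freqs["00"] - freqs["10"] * freqs["01"]
    return p, q, D


# Gamete frequencies built from p = 0.6, q = 0.5, D = 0.05 via equation (1).
p_hat, q_hat, d_hat = allele_frequencies_and_ld(
    {"11": 0.35, "10": 0.25, "01": 0.15, "00": 0.25}
)
```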

(iv) Hypothesis testing

The genetic parameters estimated, i.e. the LD (D), outcrossing rate (w) and recombination fraction (r), can be used to describe the genetic structure of a population. These parameters should be tested for their significance. The following hypotheses are formulated:

$\begin{array}{lll}
H_{0}\colon D = 0 & \text{vs.} & H_{1}\colon D \ne 0, \\
H_{0}\colon w = 0 & \text{vs.} & H_{1}\colon w \ne 0, \\
H_{0}\colon w = 1 & \text{vs.} & H_{1}\colon w \ne 1, \\
H_{0}\colon r = 0{\cdot}5 & \text{vs.} & H_{1}\colon r \ne 0{\cdot}5.
\end{array}$

For each hypothesis, the likelihoods $L_{0}(\tilde{\Theta})$ and $L_{1}(\hat{\Theta})$ are calculated under the null and alternative hypotheses, respectively, where the tilde denotes the MLEs under the null hypothesis and the hat denotes the MLEs under the alternative hypothesis. The log-likelihood ratio test statistic is then calculated by using

(11)

$\mathrm{LR} = -2[\ln L_{0}(\tilde{\Theta}) - \ln L_{1}(\hat{\Theta})],$

which is asymptotically χ²-distributed with one degree of freedom. The estimates of the parameters under each null hypothesis should be derived separately.
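Given the two maximized log-likelihoods, equation (11) and its p-value take only a few lines. The sketch below relies on the one-degree-of-freedom χ² approximation stated above and uses the identity that the upper tail of a χ² variable with one degree of freedom at x equals erfc(√(x/2)). The log-likelihood values are hypothetical.

```python
import math


def likelihood_ratio_test(loglik_null, loglik_alt):
    """Return the LR statistic of equation (11) and its chi-square(1) p-value."""
    lr = -2.0 * (loglik_null - loglik_alt)
    # Guard against tiny negative values caused by incomplete EM convergence.
    lr = max(lr, 0.0)
    p_value = math.erfc(math.sqrt(lr / 2.0))
    return lr, p_value


# Hypothetical maximized log-likelihoods under H0 and H1.
lr, p_value = likelihood_ratio_test(loglik_null=-1234.5, loglik_alt=-1231.0)
```

Here LR = 7·0, which exceeds the 5% critical value of 3·84, so the null hypothesis would be rejected in this made-up example.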

3. Computer simulation

(i) Design

Computer simulation was conducted to examine the statistical properties of the two-locus model for estimating the LD, outcrossing rate and recombination fraction between different molecular markers in a natural population. We consider a set of OP families randomly derived from a natural population, in which the genotype distribution of two given markers was simulated from the genotype frequencies. The frequencies of two-locus genotypes (derived from diplotypes) are determined by gamete frequencies, which are specified by allele frequencies and LD in the population, and by the recombination fraction for a given maternal plant. We consider the influences of different outcrossing rates (i.e. w=0·1 for low, 0·5 for medium and 0·9 for high), different recombination fractions (i.e. r=0·05 for a strong linkage and 0·25 for a weak linkage) and different linkage disequilibria (i.e. D=0·02 for strong independence and 0·10 for weak independence) on parameter estimation, and examine all possible combinations of these parameter values.

To provide practical guidance on the use of this model, we simulate marker data with three different sampling strategies. A fixed number of samples (say 1000) can be allocated among and within OP families. We use three sampling strategies: (1) ‘small family number×large family size’ (10×100), (2) ‘moderate family number×moderate family size’ (32×32), and (3) ‘large family number×small family size’ (100×10). Results under each of these strategies are given.
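The simulation code used in the study is not shown, so the following is only a sketch of how OP families could be generated under the model's assumptions: maternal diplotypes drawn under HWE, eggs produced by meiosis with recombination fraction r, and pollen taken from the population pool with probability w or from the mother herself otherwise. All function names, the seed and the parameter values are hypothetical.

```python
import random

GAMETES = ("11", "10", "01", "00")


def draw_gamete(rng, freqs):
    """Draw one gamete from the population pool."""
    return rng.choices(GAMETES, weights=[freqs[g] for g in GAMETES])[0]


def meiosis(rng, h1, h2, r):
    """Sample one gamete from diplotype h1|h2, recombining with probability r."""
    if rng.random() < 0.5:
        h1, h2 = h2, h1
    if rng.random() < r:
        return h1[0] + h2[1]
    return h1


def simulate_family(rng, freqs, w, r, size):
    """Return a maternal diplotype and the diplotypes of its OP offspring."""
    mother = (draw_gamete(rng, freqs), draw_gamete(rng, freqs))
    offspring = []
    for _ in range(size):
        egg = meiosis(rng, mother[0], mother[1], r)
        if rng.random() < w:
            pollen = draw_gamete(rng, freqs)                # outcrossing
        else:
            pollen = meiosis(rng, mother[0], mother[1], r)  # selfing
        offspring.append((egg, pollen))
    return mother, offspring


def simulate_design(seed, freqs, w, r, n_families, family_size):
    """Simulate one data set for a given allocation of families and seeds."""
    rng = random.Random(seed)
    return [simulate_family(rng, freqs, w, r, family_size) for _ in range(n_families)]


# Illustrative gamete frequencies; the three allocations follow the text above.
freqs = {"11": 0.35, "10": 0.25, "01": 0.15, "00": 0.25}
designs = {
    "10x100": simulate_design(1, freqs, w=0.5, r=0.05, n_families=10, family_size=100),
    "32x32": simulate_design(1, freqs, w=0.5, r=0.05, n_families=32, family_size=32),
    "100x10": simulate_design(1, freqs, w=0.5, r=0.05, n_families=100, family_size=10),
}
```

The simulated diplotypes would still have to be collapsed to unphased genotypes before fitting, since phase is not observed in real marker data.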

(ii) Results

Tables 7–9 summarize the simulation results with different parameters and sampling strategies. In general, the model provides reasonable estimates of all parameters, although the accuracy and precision of parameter estimates depend on the values of these parameters, sampling strategies and interactions among all the factors. The estimation of the recombination fraction tends to prefer the ‘small family number×large family size’ sampling strategy (Table 7). It appears that the ‘large family number×small family size’ sampling strategy is favourable for the estimation of population genetic parameters including allele frequencies, LD and outcrossing rate (Table 9). The ‘moderate family number×moderate family size’ sampling strategy is somewhat in between (Table 8). In all the strategies, the estimation precision of allele frequencies and LD increases with increasing outcrossing rate. The recombination fraction can be estimated more precisely when the two markers are strongly linked. It is interesting to see that increasing LD leads to better estimation of the recombination fraction. As expected, increasing the outcrossing rate reduces the estimation precision of the recombination fraction. In particular, when the outcrossing rate is very high (w=0·9), the recombination fraction will be poorly estimated for two markers that are strongly independent.

Table 7. MLEs of parameters and their standard errors (in parentheses) obtained from 100 simulation replicates with the (small family number×large family size) sampling strategy

Table 8. MLEs of parameters and their standard errors (in parentheses) obtained from 100 simulation replicates with the (moderate family number×moderate family size) sampling strategy

Table 9. MLEs of parameters and their standard errors (in parentheses) obtained from 100 simulation replicates with the (large family number×small family size) sampling strategy

There is reasonably good power for detecting a significant LD and linkage between two markers, although this power varies with the values of the parameters (results not shown). The detection power does not appear to be sensitive to sampling strategies. There are marked interactions between parameter values in their effects on power. The power of linkage detection decreases with increasing outcrossing rate, whereas the power of LD detection increases with increasing outcrossing rate.

4. Discussion

The past two decades have witnessed a dramatic increase of interest in molecular marker technologies and their applications to study the genetic structure of a natural population and map QTLs (quantitative trait loci) responsible for a quantitative trait (Reich et al., 2001; Ardlie et al., 2002; Dawson et al., 2002; Gabriel et al., 2002; reviewed in Georges, 2007). In this paper, we have proposed an algorithmic approach for constructing the linkage–linkage disequilibrium map of a genome by genotyping a set of OP seeds sampled from a natural population. By estimating several key population genetic parameters, i.e. the relative occurrence of selfing and outcrossing, LD, heterozygosity (estimated from allele frequencies) and recombination fraction, this approach will provide a tool for better understanding the pattern and organization of genetic variation in outcrossing populations. Furthermore, by elucidating the relationship between the linkage and LD in terms of the so-called LD map, the new algorithm can be used to infer the evolutionary history and process of natural populations and to identify genes for disease or yield traits (Remington et al., 2001; Ardlie et al., 2002; Farnir et al., 2000; McRae et al., 2002; Rafalski & Morgante, 2004).

The new approach capitalizes on the outcrossing nature of plants while allowing for a certain proportion of selfing. Outcrossing is a common characteristic of many plants, including economically and ecologically important species like poplar, eucalyptus, pine and spruce (Butcher & Southerton, 2007; Miller & Schaal, 2006). This approach will find immediate application in the genetic research of these important but understudied species. It has three significant advantages. First, it is simple and easily deployed in practice. By sampling and genotyping half-sib seeds from multiple maternal plants in a population, the approach provides estimates of important population genetic parameters. Second, we derived a group of EM-based closed-form estimators for the OP-based sampling strategy, which greatly simplifies computation of the parameter estimates. The accuracy and precision of parameter estimation are affected by many factors, including sample size and parameter range. A reasonable sampling strategy, including the relative importance of family number and family size, can be readily determined from simulation studies. Third, the approach allows a number of meaningful hypotheses about linkage, LD and the outcrossing rate to be tested, providing a quantitative framework for understanding the genetic structure of a natural outcrossing population.
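
To make the idea of closed-form EM updates and likelihood-based testing concrete, the sketch below implements the classical EM algorithm for two-marker gamete frequencies in a randomly mating population (the setting of Hill, 1974), followed by a likelihood ratio test of no LD. It is a simplified stand-in, not the OP-family algorithm of this paper, which additionally estimates the outcrossing rate and the recombination fraction from maternal and offspring genotypes. The function names, the 3 × 3 count layout `n[a][b]` and the 1-d.f. chi-square reference distribution are assumptions made for illustration.

```python
import math


def em_gamete_freqs(n, tol=1e-10, max_iter=1000):
    """EM estimates of (p11, p10, p01, p00) under random mating.

    n[a][b] = number of individuals carrying a copies of allele 1 at
    marker A and b copies at marker B (hypothetical data layout).
    """
    big_n = sum(sum(row) for row in n)
    # Gamete counts contributed by genotypes with known phase
    c11 = 2 * n[2][2] + n[2][1] + n[1][2]
    c10 = 2 * n[2][0] + n[2][1] + n[1][0]
    c01 = 2 * n[0][2] + n[1][2] + n[0][1]
    c00 = 2 * n[0][0] + n[1][0] + n[0][1]
    dh = n[1][1]  # double heterozygotes: phase 11|00 or 10|01 unknown

    freqs = (0.25, 0.25, 0.25, 0.25)
    for _ in range(max_iter):
        p11, p10, p01, p00 = freqs
        cis, trans = p11 * p00, p10 * p01
        # E-step: expected share of double heterozygotes that are 11|00
        phi = cis / (cis + trans) if cis + trans > 0 else 0.5
        # M-step: closed-form update by gamete counting
        new = (
            (c11 + phi * dh) / (2 * big_n),
            (c10 + (1 - phi) * dh) / (2 * big_n),
            (c01 + (1 - phi) * dh) / (2 * big_n),
            (c00 + phi * dh) / (2 * big_n),
        )
        delta = max(abs(x - y) for x, y in zip(new, freqs))
        freqs = new
        if delta < tol:
            break
    return freqs


def log_lik(n, freqs):
    """Log-likelihood of the genotype counts given gamete frequencies."""
    p11, p10, p01, p00 = freqs
    prob = [
        [p00 * p00, 2 * p01 * p00, p01 * p01],
        [2 * p10 * p00, 2 * (p11 * p00 + p10 * p01), 2 * p11 * p01],
        [p10 * p10, 2 * p11 * p10, p11 * p11],
    ]
    return sum(
        n[a][b] * math.log(prob[a][b])
        for a in range(3) for b in range(3) if n[a][b] > 0
    )


def lrt_no_ld(n):
    """Likelihood ratio statistic and p-value for H0: D = 0."""
    big_n = sum(sum(row) for row in n)
    full = em_gamete_freqs(n)
    # Under H0 the MLEs are products of allele frequencies
    p = (2 * sum(n[2]) + sum(n[1])) / (2 * big_n)
    q = (2 * sum(r[2] for r in n) + sum(r[1] for r in n)) / (2 * big_n)
    null = (p * q, p * (1 - q), (1 - p) * q, (1 - p) * (1 - q))
    stat = max(2 * (log_lik(n, full) - log_lik(n, null)), 0.0)
    # Chi-square survival function with 1 d.f. (asymptotic assumption)
    return stat, math.erfc(math.sqrt(stat / 2))


counts = [[10, 0, 0], [0, 0, 0], [0, 0, 10]]  # toy data, complete LD
stat, p_value = lrt_no_ld(counts)
```

Only the double heterozygote has ambiguous phase, so each iteration reduces to gamete counting with a single weight `phi`. That simplicity is what makes closed-form updates attractive here.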

The approach can be extended to consider multiallelic markers and dominant markers. Unlike many annual crops, forest trees still grow in wild or semi-wild conditions, in which many alleles at a single gene provide a rich source of variation. Multiallelic markers like microsatellites are a vital tool for the population genetic study of forest trees. On the other hand, for many understudied organisms, inexpensive dominant markers are still useful although their informativeness is limited (Kuang et al., 1998; Silbiger et al., 1998; Kremer et al., 2005). When the OP-based sampling strategy considers multiallelic or dominant markers, new and more complicated algorithms need to be derived. Li et al. (2007) derived a model for the LD between dominant markers in a diploid population. Their model can be integrated with our OP-based sampling strategy to provide a comprehensive estimation of population genetic parameters with dominant markers. The computer code of the proposed algorithm is available from the corresponding author upon request.

This work was partially supported by Joint NSF/NIH grant number DMS/NIGMS-0540745 and NNSFC grant number 30771752.

References

Ardlie, K. G., Kruglyak, L. & Seielstad, M. (2002). Patterns of linkage disequilibrium in the human genome. Nature Reviews Genetics 3, 299–309.

Butcher, P. A. & Southerton, S. (2007). Marker-assisted selection in forestry species. In Marker-assisted Selection: Current Status and Future Perspectives in Crops, Livestock, Forestry and Fish (ed. Guimaraes, E., Ruane, J., Scherf, B., Sonnino, A. and Dargie, J.), pp. 283–305, chapter 15. Rome: FAO.

Charlesworth, D. (2003). The effects of inbreeding on the genetic diversity of populations. Philosophical Transactions of the Royal Society of London, Series B, Biological Sciences 358, 1051–1070.

Dawson, E., Abecasis, G. R., Bumpstead, S., Chen, Y., Hunt, S., Beare, D. M., Pabial, J., Dibling, T., Tinsley, E., Kirby, S., Carter, D., Papaspyridonos, M., Livingstone, S., Ganske, R., Lohmmussaar, E., Zernant, J., Tonisson, N., Remm, M., Magi, R., Puurand, T., Vilo, J., Kurg, A., Rice, K., Deloukas, P., Mott, R., Metspalu, A., Bentley, D. R., Cardon, L. R. & Dunham, I. (2002). A first-generation linkage disequilibrium map of human chromosome 22. Nature 418, 544–548.

Farnir, F., Coppieters, W., Arranz, J.-J., Berzi, P., Cambisano, N., Grisart, B., Karim, L., Marcq, F., Moreau, L., Mni, M., Nezer, C., Simon, P., Vanmanshoven, P., Wagenaar, D. & Georges, M. (2000). Extensive genome-wide linkage disequilibrium in cattle. Genome Research 10, 220–227.

Gabriel, S. B., Schaffner, S. F., Nguyen, H., Moore, J. M., Roy, J., Blumenstiel, B., Higgins, J., DeFelice, M., Lochner, A., Faggart, M., Liu-Cordero, S. N., Rotimi, C., Adeyemo, A., Cooper, R., Ward, R., Lander, E. S., Daly, M. J. & Altshuler, D. (2002). The structure of haplotype blocks in the human genome. Science 296, 2225–2229.

Georges, M. (2007). Mapping, fine mapping, and molecular dissection of quantitative trait loci in domestic animals. Annual Review of Genomics and Human Genetics 8, 131–162.

Hedrick, P. W. (1987). Gametic disequilibrium measures: proceed with caution. Genetics 117, 331–341.

Hill, W. G. (1974). Estimation of linkage disequilibrium in randomly mating populations. Heredity 33, 229–239.

Ingvarsson, P. K. (2002). A metapopulation perspective of genetic diversity and differentiation in partially self-fertilizing plants. Evolution 56, 2368–2373.

Ingvarsson, P. K. (2005). Nucleotide polymorphism and linkage disequilibrium within and among natural populations of European aspen (Populus tremula L., Salicaceae). Genetics 169, 945–953.

Kremer, A., Caron, H., Cavers, S., Colpaert, N., Gheysen, G., Gribe, R., Lemes, M., Lowe, A. J., Margis, R., Navarro, C. & Salgueiro, F. (2005). Monitoring genetic diversity in tropical trees with multilocus dominant markers. Heredity 95, 274–280.

Kuang, H., Richardson, T. E., Carson, S. D. & Bongarten, B. C. (1998). An allele responsible for seedling death in Pinus radiata D. Don. Theoretical and Applied Genetics 96, 640–644.

Lewontin, R. C. (1964). The interaction of selection and linkage. I. General considerations; heterotic models. Genetics 49, 49–67.

Li, Y. C., Li, Y., Wu, S., Han, K., Wang, Z. J., Hou, W., Zeng, Y. R. & Wu, R. L. (2007). Estimation of linkage disequilibria in diploid populations with multilocus dominant markers. Genetics 176, 1879–1892.

Liu, T., Todhunter, R. J., Lu, Q., Schoettinger, L., Li, H. Y., Littell, R. C., Bliss, S., Acland, G., Lust, G. & Wu, R. L. (2006). Extent and distribution of zygotic linkage disequilibrium in canine. Genetics 174, 439–453.

McRae, A. F., McEwan, J. C., Dodds, K. G., Wilson, T., Crawford, A. M. & Slate, J. (2002). Linkage disequilibrium in domestic sheep. Genetics 160, 1113–1122.

Miller, A. J. & Schaal, B. A. (2006). Domestication and the distribution of genetic variation in wild and cultivated populations of the Mesoamerican fruit tree Spondias purpurea L. (Anacardiaceae). Molecular Ecology 15, 1467–1480.

Nordborg, M. (2000). Linkage disequilibrium, gene trees and selfing: ancestral recombination graph with partial self-fertilization. Genetics 154, 923–929.

Rafalski, A. & Morgante, M. (2004). Corn and humans: recombination and linkage disequilibrium in two genomes of similar size. Trends in Genetics 20, 103–111.

Reich, D. E., Cargill, M., Bolk, S., Ireland, J., Sabeti, P. C., Richter, D. J., Lavery, T., Kouyoumjian, R., Farhadian, S. F., Ward, R. & Lander, E. S. (2001). Linkage disequilibrium in the human genome. Nature 411, 199–204.

Remington, D. L., Thornsberry, J. M., Matsuoka, Y., Wilson, L. M., Whitt, S. R., Doebley, J., Kresovich, S., Goodman, M. M. & Buckler, E. S. IV (2001). Structure of linkage disequilibrium and phenotypic associations in the maize genome. Proceedings of the National Academy of Sciences of the USA 98, 11479–11484.

Silbiger, R. N., Christ, S. A., Leonard, A. C., Garg, M., Lattier, D. L., Dawes, S., Dimsoski, P., McCormick, F., Wessendarp, T., Gordon, D. A., Roth, A. C., Smith, M. K. & Toth, G. P. (1998). Preliminary studies on the population genetics of the central stoneroller (Campostoma anomalum) from the Great Miami River Basin, Ohio. Environmental Monitoring and Assessment 51, 481–495.

Tishkoff, S. A. & Williams, S. M. (2002). Genetic analysis of African populations: human evolution and complex disease. Nature Reviews Genetics 3, 611–621.

Tishkoff, S. A., Dietzsch, E., Speed, W., Pakstis, A. J., Kidd, J. R., Cheung, K., Bonne-Tamir, B., Santachiara-Benerecetti, A. S., Moral, P., Krings, M., Pääbo, S., Watson, E., Risch, N., Jenkins, T. & Kidd, K. K. (1996). Global patterns of linkage disequilibrium at the CD4 locus and modern human origins. Science 271, 1380–1387.

Tishkoff, S. A., Varkonyi, R., Cahinhinan, N., Abbes, S., Argyropoulos, G., Destro-Bisol, G., Drousiotou, A., Dangerfield, B., Lefranc, G., Loiselet, J., Piro, A., Stoneking, M., Tagarelli, A., Tagarelli, G., Touma, E. H., Williams, S. M. & Clark, A. G. (2001). Haplotype diversity and linkage disequilibrium at human G6PD: recent origin of alleles that confer malarial resistance. Science 293, 455–462.

Weir, B. S. (1996). Genetic Data Analysis. Sunderland, MA: Sinauer.

Wu, R. L. & Zeng, Z.-B. (2001). Joint linkage and linkage disequilibrium mapping in natural populations. Genetics 157, 899–909.

Table 1. Diplotype and genotype frequencies of two markers, A and B, in the offspring population through outcrossing and selfing pollination

Table 2. Two possible diplotypes of a maternal plant of double heterozygote and the frequencies of its four gametes for two markers

Table 3. The expected number (${}_{11}\Psi_{ii'jj'}^{ll'rr'}$) of gamete 11 within an offspring genotype derived from a maternal genotype. Note that ${}_{11}\Psi_{ii'jj'}^{ll'rr'}$ for the double heterozygote is obtained by dividing the expression in the table by both the frequency of the maternal genotype ($P_{ii'jj'}$) and the overall frequency of the corresponding offspring genotype ($P_{ii'jj'}^{ll'rr'}$), whereas ${}_{11}\Psi_{ii'jj'}^{ll'rr'}$ for all the other genotypes is calculated by dividing the expression only by the overall frequency of the corresponding offspring genotype

Table 4. The expected number (${}_{10}\Psi_{ii'jj'}^{ll'rr'}$) of gamete 10 within an offspring genotype derived from a maternal genotype. Note that ${}_{10}\Psi_{ii'jj'}^{ll'rr'}$ for the double heterozygote is obtained by dividing the expression in the table by both the frequency of the maternal genotype ($P_{ii'jj'}$) and the overall frequency of the corresponding offspring genotype ($P_{ii'jj'}^{ll'rr'}$), whereas ${}_{10}\Psi_{ii'jj'}^{ll'rr'}$ for all the other genotypes is calculated by dividing the expression only by the overall frequency of the corresponding offspring genotype

Table 5. The expected number (${}_{01}\Psi_{ii'jj'}^{ll'rr'}$) of gamete 01 within an offspring genotype derived from a maternal genotype. Note that ${}_{01}\Psi_{ii'jj'}^{ll'rr'}$ for the double heterozygote is obtained by dividing the expression in the table by both the frequency of the maternal genotype ($P_{ii'jj'}$) and the overall frequency of the corresponding offspring genotype ($P_{ii'jj'}^{ll'rr'}$), whereas ${}_{01}\Psi_{ii'jj'}^{ll'rr'}$ for all the other genotypes is calculated by dividing the expression only by the overall frequency of the corresponding offspring genotype

Table 6. The expected number (${}_{00}\Psi_{ii'jj'}^{ll'rr'}$) of gamete 00 within an offspring genotype derived from a maternal genotype. Note that ${}_{00}\Psi_{ii'jj'}^{ll'rr'}$ for the double heterozygote is obtained by dividing the expression in the table by both the frequency of the maternal genotype ($P_{ii'jj'}$) and the overall frequency of the corresponding offspring genotype ($P_{ii'jj'}^{ll'rr'}$), whereas ${}_{00}\Psi_{ii'jj'}^{ll'rr'}$ for all the other genotypes is calculated by dividing the expression only by the overall frequency of the corresponding offspring genotype

Table 7. MLEs of parameters and their standard errors (in parentheses) obtained from 100 simulation replicates with the (small family number×large family size) sampling strategy

Table 8. MLEs of parameters and their standard errors (in parentheses) obtained from 100 simulation replicates with the (moderate family number×moderate family size) sampling strategy

Table 9. MLEs of parameters and their standard errors (in parentheses) obtained from 100 simulation replicates with the (large family number×small family size) sampling strategy

