1. Introduction
The objectives of genetic improvement of livestock are usually quantitative or complex traits such as milk yield or meat quality. Traditional genetic improvement has relied on using the recorded phenotype of each animal together with knowledge of its pedigree to estimate its breeding value (BV), most often using the statistical method known as best linear unbiased prediction (BLUP) (Henderson, 1984). This technology has been very successful, leading to genetic gains in most farmed species (e.g. see Van Vleck et al., 1986; Havenstein et al., 1994). Despite this success, there has long been an interest in using simply inherited genetic markers to increase the rate of genetic gain and to identify the genes and polymorphisms controlling traits in the breeding objectives (as summarized in Dekkers & Hospital, 2002).
Ideally one would identify causal polymorphisms affecting an objective trait and incorporate these in the selection criterion (Dekkers, 2004). This has occurred for some mutations that cause genetic abnormalities and a small number of polymorphisms with large effects on quantitative traits (Dekkers, 2004). However, these known causal polymorphisms explain only a small proportion of genetic variance in the breeding objective and have contributed only a small amount to the genetic gain achieved. This approach has been limited by our inability to identify most of the causal polymorphisms affecting our objective traits.
As new categories of genetic markers were discovered, they were tested for an association with quantitative traits, even though there was no a priori reason to expect an association. For instance, bovine blood groups were sometimes found to be associated with milk production traits (Neimann-Sorensen & Robertson, 1961; Rendel, 1961). It is possible that this association was causal but more likely that it was due to linkage between the blood group loci and quantitative trait loci (QTL) that cause variation in milk production. These associations proved too weak and too unreliable to be useful in the selection of livestock.
Microsatellites were the first class of genetic markers that covered the genome and therefore had the potential to detect QTL no matter where they were located. Typically 100–200 microsatellites were used to cover the genome and they detected QTL by linkage within full-sib or half-sib families (Georges et al., 1995). The limitations of these studies were that they mapped the QTL very imprecisely (often to confidence intervals of 50 cM) and that the marker and QTL were in linkage equilibrium, so that the linkage phase varied between families. Consequently the linkage phase had to be determined within each family before the marker could be used for selection. Fernando & Grossman (1989) presented a general method for estimating BVs using markers in linkage equilibrium with QTL but, in practice, the gains were small and this method of marker assisted selection has only been used rarely (but for an exception see Boichard et al., 2006). By saturating a QTL region with additional markers, the causal mutation has occasionally been discovered (Grisart et al., 2003), but only when it explained an unusually large proportion of genetic variance.
The QTL mapping studies showed that many QTL affect a typical quantitative trait (Hayes & Goddard, 2001; Chamberlain et al., 2007). Meuwissen & Goddard (1996) showed that the gain in selection response from marker assisted selection was nearly proportional to the proportion of genetic variance explained by the markers. Therefore, a new type of marker assisted selection was needed that utilized all QTL and that did not require linkage phase to be determined for each family. Meuwissen et al. (2001) showed with simulation that using a dense panel of markers covering the whole genome and in linkage disequilibrium (LD) with the QTL could lead to large increases in response to selection. This type of marker assisted selection has become known as genomic selection. It became feasible with the availability of panels of thousands of single nucleotide polymorphisms (SNPs) that could be genotyped at reasonable cost. It is already widely used in dairy cattle breeding (Dalton, 2009) and is expected to revolutionize all livestock genetic improvement programmes. It can also be extended to plants (Bernardo & Yu, 2007; Heffner et al., 2009; Zhong et al., 2009), aquaculture (Sonesson & Meuwissen, 2009) and prediction of genetic risk in humans (Wray et al., 2007). In this review, we will describe the methodology used, the factors determining the accuracy of selection, the implementation in breeding programmes, the effect on long-term genetic gain and the use of genomic selection for QTL mapping.
2. Methodology
The BV (bv) or additive genetic value of an individual j can be written as $bv_j = \sum_{i=1}^{N_q} x_{ij} a_i$, where $a_i$ is the additive effect of the ith QTL, $x_{ij}$ is the genotype of the individual at the ith QTL coded as 0, 1 or 2 for homozygote, heterozygote and other homozygote respectively, and $N_q$ is the number of QTL. In practice the QTL positions and effects are not known. Instead we detect the QTL by their LD with markers such as SNPs. If there is sufficient LD, the genotype at a QTL, $x_i$, can be predicted from a linear combination of marker genotypes and so BVs can be estimated by a linear combination of markers, $bv_j = \sum_{i=1}^{N_m} m_{ij} b_i$, where $b_i$ is the apparent effect of the ith marker due to its LD with one or more QTL, $m_{ij}$ is the genotype of the jth individual at the ith marker and $N_m$ is the number of markers. However, $b_i$ has to be estimated from data and so the estimated breeding value (EBV) for individual j becomes $\widehat{bv}_j = \sum_{i=1}^{N_m} m_{ij} \hat{b}_i$.
Selection theory shows that an EBV is most accurate if $EBV = E(bv \mid \text{data})$, where data includes whatever information is available from which to estimate the BV. Here this means that the vector of marker effects b should be estimated as $\hat{b} = E(b \mid \text{data})$. The data (y) usually consist of a reference sample of the population that has been measured for the trait and genotyped for the markers. Assuming the data (y) have been corrected for all other effects, then, as presented in Goddard (2009),
$$\hat{b} = E(b \mid y) = \frac{\int b \, P(y \mid b) \, p(b) \, db}{\int P(y \mid b) \, p(b) \, db} \qquad (1)$$
where p(b) is the prior distribution of b, and P(y|b) is the likelihood of the data given b. This shows that the best estimate of b depends on the prior distribution of b. If b follows a normal distribution with the same variance for all markers, $b \sim N(0, I\sigma_b^2)$, then (1) reduces to a BLUP estimate of b. Since 10 000–1 000 000 SNPs may be used, this assumption implies that all SNPs have very small effects and this is akin to the traditional infinitesimal model for quantitative traits.
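As a rough illustration of the normal-prior case, the sketch below computes BLUP (ridge-type) estimates of marker effects by solving $(M'M + \lambda I)\hat{b} = M'y$ on simulated data. The simulated genotypes, the variance values, the function name `snp_blup` and the choice $\lambda = \sigma_e^2/\sigma_b^2$ (the usual mixed-model variance ratio, with $\sigma_e^2$ the residual variance) are assumptions made for the example and are not taken from the studies cited.

```python
import numpy as np


def snp_blup(M, y, lam):
    """BLUP of marker effects under b ~ N(0, I*sigma_b^2).

    Solves (M'M + lam * I) b = M'y, where lam is assumed to be
    sigma_e^2 / sigma_b^2 (residual variance over marker-effect variance).
    """
    n_markers = M.shape[1]
    lhs = M.T @ M + lam * np.eye(n_markers)
    rhs = M.T @ y
    return np.linalg.solve(lhs, rhs)


# Hypothetical reference population: all values below are illustrative.
rng = np.random.default_rng(0)
n_animals, n_markers = 200, 50
sigma_b2, sigma_e2 = 0.05, 1.0

# Genotypes coded 0/1/2, then centred so no intercept is needed.
M = rng.integers(0, 3, size=(n_animals, n_markers)).astype(float)
M -= M.mean(axis=0)

b_true = rng.normal(0.0, np.sqrt(sigma_b2), n_markers)
y = M @ b_true + rng.normal(0.0, np.sqrt(sigma_e2), n_animals)

lam = sigma_e2 / sigma_b2
b_hat = snp_blup(M, y, lam)                   # shrunken (BLUP) estimates
b_ls = np.linalg.lstsq(M, y, rcond=None)[0]   # unshrunken least squares
ebv = M @ b_hat                               # EBV = sum of marker effects

print("norm of least squares estimates:", np.linalg.norm(b_ls))
print("norm of BLUP estimates:         ", np.linalg.norm(b_hat))
```

The BLUP estimates always have a smaller norm than the least squares estimates, which is the regression towards zero implied by the normal prior.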
Other assumptions for the distribution of b do not lead to closed form solutions for $\hat{b}$, but it can be calculated by Markov Chain Monte Carlo (MCMC) methods. For instance, Meuwissen et al. (2001) considered the case where the marker effects are assumed to follow a scaled t distribution. Marker effects of large size are more probable under a t distribution with a small number of degrees of freedom than under a normal distribution (i.e. the t distribution has ‘thicker tails’ or greater kurtosis than a normal distribution). This assumption might more correctly reflect the true situation than assuming that marker effects follow a normal distribution since some polymorphisms with large effects on quantitative traits are known. Meuwissen et al. (2001) called this model of marker effects ‘Bayes A’ and showed how Gibbs sampling could be used to estimate the marker effects and hence the BV of individuals. Although it allows for some markers with large effects, the Bayes A model still assumes that all markers have a non-zero effect. If the number of QTL is much smaller than the number of markers, one might expect that many markers have no effect after those in higher LD with the QTL have already been included in the model. Therefore, Meuwissen et al. (2001) introduced a model (that they called ‘Bayes B’) in which a proportion of the marker effects follow a scaled t distribution but the remainder of markers have no effect. Bayes B was implemented using a combination of Gibbs sampling and Metropolis–Hastings steps to estimate marker effects and hence individuals' BVs.
As usual, the BLUP estimate of a marker effect can be interpreted as a least squares estimate that has been shrunk or regressed towards zero. This is also the case for estimates of marker effects under Bayes A and B models, but they shrink the estimates in a non-linear manner so that a least squares estimate that is small relative to its standard error is shrunk almost to zero, while estimates that are large are shrunk less severely. Other prior distributions of b, including the double exponential, have also been considered (Yi & Xu, 2008) and Meuwissen et al. (2009) give a closed form solution for this case.
For the case where b is normally distributed there is an equivalent model that is informative (Habier et al., 2007; VanRaden, 2008; Hayes & Goddard, 2008). Using matrix notation, if $y = Mb + e$ and $bv = Mb$, then $y = bv + e$ with $V(bv) = MM'\sigma_b^2$, where M is the matrix of marker genotypes with elements $m_{ij}$ defined above. This is a conventional animal model where y is the sum of a bv and environmental error (e) but where the relationships among the individuals are estimated as MM′. Thus estimating the BV of an individual by adding the effects of all markers carried (sometimes called a SNPBLUP model) is equivalent to estimating the BV using the realized relationship among the individuals estimated from the markers (sometimes called a GBLUP). If a set of unphenotyped individuals are all equally related to the individuals with phenotypes, they will all receive the same EBV and so the correlation between the true BV and EBV for this set of individuals is zero. This shows the importance of variation in relationships between pairs of individuals – it is this variation that provides power to estimate BV from marker genotypes.
The best method for estimating the relationships is a slight modification of MM′ set out in Yang et al. (2010). Their method has the lowest standard error for estimated relationships when the true relationships are small.
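The equivalence between the marker-effect model and the relationship-based model can be checked numerically. The sketch below is a minimal illustration on simulated data: it assumes centred genotypes, an arbitrary variance ratio `lam` (taken here as residual variance over marker-effect variance) and the unscaled relationship matrix MM′. All names and values are hypothetical. It computes EBVs both ways, for phenotyped reference animals and for unphenotyped candidates.

```python
import numpy as np

rng = np.random.default_rng(1)
n_ref, n_new, n_markers = 30, 5, 100
lam = 20.0  # assumed variance ratio sigma_e^2 / sigma_b^2

# Simulated 0/1/2 genotypes; candidates are centred with the reference means.
M_ref = rng.integers(0, 3, size=(n_ref, n_markers)).astype(float)
M_new = rng.integers(0, 3, size=(n_new, n_markers)).astype(float)
col_means = M_ref.mean(axis=0)
M_ref -= col_means
M_new -= col_means

y = rng.normal(size=n_ref)  # phenotypes, assumed already corrected

# Route 1 (SNPBLUP): estimate marker effects, then sum them per animal.
b_hat = np.linalg.solve(M_ref.T @ M_ref + lam * np.eye(n_markers), M_ref.T @ y)
ebv_ref_snp = M_ref @ b_hat
ebv_new_snp = M_new @ b_hat

# Route 2 (GBLUP): use relationships G = MM' and never form marker effects.
G_ref = M_ref @ M_ref.T       # relationships among reference animals
G_new_ref = M_new @ M_ref.T   # relationships of candidates to reference
alpha = np.linalg.solve(G_ref + lam * np.eye(n_ref), y)
ebv_ref_g = G_ref @ alpha
ebv_new_g = G_new_ref @ alpha

print("max difference, reference: ", np.abs(ebv_ref_snp - ebv_ref_g).max())
print("max difference, candidates:", np.abs(ebv_new_snp - ebv_new_g).max())
```

The two routes agree to numerical precision. In this sketch the relationship route solves a system of size equal to the number of phenotyped animals, while the marker route solves one of size equal to the number of markers, so the cheaper route depends on which is smaller.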
3. The accuracy of genomic selection
For individuals with marker genotypes but without phenotypic records we can calculate their EBV simply as $\widehat{bv}_j = \sum_{i=1}^{N_m} m_{ij} \hat{b}_i$. The accuracy of this EBV depends on two factors – the proportion of variance in the QTL explained by the markers due to LD, and the accuracy with which the b are estimated (Goddard, 2009).
(i) The proportion of variance in the QTL explained by the markers
The first of these factors can be quantified by the accuracy with which the relationships between individuals are estimated by the markers. Consider an infinitesimal model where there are an infinite number of QTL spread evenly over the chromosomes. If the markers are a random subset of these QTL they will estimate the relationship at the QTL except for a sampling error caused by the finite number of markers. The variance of the difference between the estimated relationship and the true relationship of a pair of individuals is called the prediction error variance (PEV). The PEV of the relationship between individuals i and j (Gij) caused by the finite number of markers (Nm) is PEV(Gij) = 1/Nm (Yang et al., 2010). The degree by which this error degrades the estimate of the true relationship depends on the true variation in relationship. If pairs of individuals vary widely in relationship then a small error may be unimportant but if the true variation is similar to the PEV then this error will severely affect the accuracy of the estimated relationship and hence, the EBVs. If individuals vary in pedigree relationship (e.g. some are closely related and some are not) then the variation in true relationship will be great and the markers should be able to estimate these differences relatively easily. However, in that case EBVs could be calculated from the pedigree information without genetic markers. The real power of genomic selection is to estimate BV more accurately than could be done using pedigree data. Therefore, it is the variation in Gij in excess of that due to variation in pedigree that is important.
Hayes et al. (2009b) showed how variation in realized relationship occurs among individuals with the same pedigree, such as a group of full sibs. Among pairs of full sibs, some pairs share more than 50% of their DNA and some share less than 50%. This variation around 50% only exists because genes on the same chromosome are linked and so not inherited independently – if there were an infinite number of unlinked genes, all pairs of full sibs would share 50% of their autosomes. The variation about 50%, combined with phenotypes on a group of full sibs, allows us to estimate the BV of an additional full sib from the same family, even one with no recorded phenotype (Hayes et al., 2009b). The estimation of BV of an additional full sib is possible because the new individual is more closely related to some of its full sibs than to others. Full sibs inherit large segments of chromosome from their parents without recombination, so whole segments of chromosome are either shared or not shared between a pair of full sibs. Use of the relationships in this way to estimate individuals' BVs is equivalent to estimating the effect of chromosome segments on the trait and using these estimates to predict the BV of the additional full sib.
In a random mating population there is some variation in pedigree relationship and additional variation in realized relationship. The variance of the relationship around the pedigree relationship is approximately log(2NeL)/(2NeLc), where Ne is the effective population size, L is the average length of a chromosome in Morgans and c is the number of haploid chromosomes. If two individuals share a common ancestor, there is a probability, defined by their relationship, that they both inherit an allele identical by descent (IBD) from this common ancestor. If they inherit a common allele at one locus they will also tend to inherit common alleles at neighbouring loci due to linkage. On average, the length of this IBD segment will decrease the more distant the common ancestor is. The average time to a common ancestor increases as Ne increases, so the length of chromosome segments shared IBD decreases as Ne increases. Thus, in a population of large Ne, individuals share many small chromosome segments and the realized relationship averages out to close to the relationship expected from the pedigree. This explains the occurrence of Ne in the formula for the variance of relationship about the pedigree relationship. Therefore, for large genomes and populations with large Ne the variance of true relationship is very small, so the PEV must be small if the relationships are to be estimated with precision, and this implies that a large number of markers are needed.
Using the model based on SNP effects, y=Mb+e, it is possible to estimate the total genetic variance explained by the SNPs. The same answer can be achieved by using the equivalent model based on relationships estimated from the SNPs (Yang et al., 2010). In either case the variance estimated will be less than the total genetic variance if the QTL are not in perfect LD with the SNPs or, equivalently, if the estimated relationship is not an unbiased estimate of the relationship at the QTL. In cattle breeds such as Holsteins, the recent Ne has been small (~100) so the variation in relationship is large and 50 000 SNPs can estimate the relationships well and so the genetic variance explained by the SNPs is close to the full genetic variance (VanRaden et al., 2009). This is equivalent to saying that the QTL genotypes can be predicted by the SNP genotypes due to LD between them. However, in humans recent Ne has been very large and so the variance of true relationships is small and even with 600 000 SNPs the PEV is significant and results in the genetic variance explained by the SNPs being only about half the known genetic variance (Yang et al., 2010). This is due partly to the use of a finite number of SNPs to estimate the relationship but also to systematic differences between the SNPs and QTL. If QTL and SNPs have different evolutionary histories, there may be systematic differences in the relationships at QTL and at SNPs. For instance, if QTL mutant alleles are typically eliminated by selection, they will tend to be young, and so ancient relationships estimated from the SNPs may not be relevant.
An equivalent description of this situation is that QTL will have low minor allele frequency (MAF) and so cannot be in high LD with SNPs that have higher MAF. Consequently, the SNPs will explain less of the genetic variance of a trait than expected simply by accounting for the PEV due to a finite number of SNPs. Yang et al. (2010) found that SNPs only explained about half the genetic variance for human height but they should have explained 80% if the QTL had behaved like SNPs.
Since the true variance in relationships is dependent on NeLc and the PEV with which it is estimated is 1/Nm, it is not surprising that the accuracy of EBVs depends on Nm/(NeLc) (Meuwissen, 2009; Meuwissen & Goddard, 2010).
In the argument above we assumed an infinite number of QTL. However, BLUP estimates of EBVs are insensitive to the true genetic model for the trait and give a similar accuracy regardless of the actual number of QTL governing variation in the trait (Meuwissen & Goddard, 2010).
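To give a feel for the magnitudes involved, the sketch below evaluates the approximate variance of realized relationship, log(2NeL)/(2NeLc), and the sampling error 1/Nm for two hypothetical scenarios. The parameter values (chromosome length, chromosome number, the human Ne) and the signal-to-noise style ratio `signal_fraction` are illustrative assumptions introduced here, not quantities from the cited studies, and `log` is taken to be the natural logarithm.

```python
import math


def var_relationship(ne, chrom_len_morgans, n_chrom):
    """Approximate variance of realized relationship about its pedigree
    expectation: log(2*Ne*L) / (2*Ne*L*c), natural log assumed."""
    x = 2.0 * ne * chrom_len_morgans
    return math.log(x) / (x * n_chrom)


def pev_relationship(n_markers):
    """Sampling variance of an estimated relationship: 1 / Nm."""
    return 1.0 / n_markers


def signal_fraction(ne, chrom_len_morgans, n_chrom, n_markers):
    """Illustrative ratio (an assumption of this sketch): true variance in
    relationship as a fraction of true variance plus sampling error."""
    v = var_relationship(ne, chrom_len_morgans, n_chrom)
    return v / (v + pev_relationship(n_markers))


# Hypothetical scenarios; every number here is a rough placeholder.
scenarios = {
    "small Ne, 50 000 SNPs": dict(ne=100, chrom_len_morgans=1.0,
                                  n_chrom=30, n_markers=50_000),
    "large Ne, 600 000 SNPs": dict(ne=10_000, chrom_len_morgans=1.5,
                                   n_chrom=23, n_markers=600_000),
}

for label, p in scenarios.items():
    v = var_relationship(p["ne"], p["chrom_len_morgans"], p["n_chrom"])
    print(f"{label}: var = {v:.2e}, "
          f"PEV = {pev_relationship(p['n_markers']):.2e}, "
          f"signal fraction = {signal_fraction(**p):.3f}")
```

Under these placeholder values the large-Ne scenario has a far smaller true variance in relationship, so the same sampling error matters much more and many more markers are needed.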
(ii) The accuracy with which the marker effects are estimated
If the BLUP model is used to estimate SNP effects, the standard theory provides estimates of the accuracy of the estimated SNP effects and the EBVs of individuals provided that the correct variance components are used. To the extent that the SNPs do not explain all of the genetic variance, an additional ‘polygenic’ term (u) should be included in the statistical model with $V(u) = A\sigma_u^2$, where A is the relationship matrix constructed from pedigree information and $\sigma_u^2$ is the genetic variance not explained by the SNPs (Hayes et al., 2009a). In the equivalent model based on realized relationships the G matrix should be estimated by regressing MM′ back towards A to account for the error in the relationships estimated by MM′.
If all SNPs were independent (i.e. no LD) then the accuracy of estimating any one SNP effect is approximately $\sqrt{n/(n+\lambda)}$, where n is the number of animals with genotypes and phenotypes and $\lambda = \sigma^2/\sigma_b^2$, where $\sigma^2$ is the phenotypic variance, that is, the variance of y. However, the LD between SNPs and QTL located close together on a chromosome causes a segment of chromosome to act almost as a block and the accuracy of estimating the effect of the block is given by the above formula but with $\lambda = s\sigma^2/\sigma_g^2$, where s is the effective number of chromosome segments and $\sigma_g^2$ is the genetic variance. The best value to use for s has not been fully resolved but is approximately 2NeLc/log(2NeL) (Hayes et al., 2009b). The accuracy of estimating a single SNP effect also equals the accuracy of estimating that part of the BV that is predicted by the SNPs, which is the sum of many SNP effects.
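The dependence of accuracy on the number of records and on Ne can be explored numerically. The sketch below assumes that the approximation takes the form r = sqrt(n / (n + λ)), a standard result for a single random effect estimated from n records, with λ = sσ²/σg² = s/h² and s = 2NeLc/log(2NeL). Treat that functional form, the natural logarithm, the function names and all parameter values as assumptions of the example.

```python
import math


def effective_segments(ne, chrom_len_morgans, n_chrom):
    """Approximate effective number of chromosome segments,
    s = 2*Ne*L*c / log(2*Ne*L), natural log assumed."""
    x = 2.0 * ne * chrom_len_morgans
    return x * n_chrom / math.log(x)


def genomic_accuracy(n_records, ne, chrom_len_morgans, n_chrom, h2):
    """Assumed approximation r = sqrt(n / (n + lambda)), with
    lambda = s / h2 and h2 = sigma_g^2 / sigma^2 for the records used."""
    lam = effective_segments(ne, chrom_len_morgans, n_chrom) / h2
    return math.sqrt(n_records / (n_records + lam))


# Hypothetical inputs: h2 is high here to mimic records on progeny-tested
# sires, whose "phenotypes" are themselves fairly accurate.
for n in (1_000, 4_000, 8_000):
    r_small_ne = genomic_accuracy(n, ne=100, chrom_len_morgans=1.0,
                                  n_chrom=30, h2=0.8)
    r_large_ne = genomic_accuracy(n, ne=1_000, chrom_len_morgans=1.0,
                                  n_chrom=30, h2=0.8)
    print(f"n = {n:5d}: accuracy {r_small_ne:.2f} (Ne = 100), "
          f"{r_large_ne:.2f} (Ne = 1000)")
```

Under these assumptions accuracy rises with the number of records and falls as Ne (and hence s) increases, which matches the qualitative argument in the text.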
If the SNP effects (b) do not follow a normal distribution, the accuracy achieved using the BLUP model is relatively unaffected. However, greater accuracy can be achieved by a statistical method whose assumption about the distribution of b more closely approximates the true distribution. For instance, if a trait is controlled by a small number of QTL, some of which have a moderately large effect, then Bayes B yields higher accuracy than the BLUP analysis (Verbyla et al., 2009). This is not surprising because Bayes B assumes that many of the SNP effects are zero and the remainder follow a scaled t-distribution which allows for some larger than normal effects. However, whether a BLUP or Bayes B analysis is used, λ is still a key parameter in determining the accuracy (Meuwissen & Goddard, 2010).
Unfortunately, we do not know the true distribution of apparent SNP effects but for some traits there are clearly a small number of QTL with effects that are larger than would be sampled from a normal distribution. Also it seems likely that as the number of SNPs used increases the assumption that many have zero effect is more likely to be true. Therefore, Bayes B seems to be widely useful – it seldom performs worse than BLUP and sometimes is significantly better (see experimental results discussed later in this paper).
Many empirical methods have been tried for predicting BV from SNP genotypes (e.g. Gianola et al., 2009; Moser et al., 2009). In most cases to date many methods give similar accuracy. However, it seems logical to attempt to use an explicit assumption about the distribution of b and to make this assumption as close to reality as possible. Most of these methods imply an additive model of QTL effects. This seems appropriate when the aim is to estimate BV because this is by definition a linear combination of QTL effects. However, if the aim was to estimate total genetic value a model assuming non-additive genetic effects might be better. Non-additive effects can be included in the model explicitly or a non-parametric or semi-parametric method such as kernel regression may be used (Gianola et al., 2006). Lee et al. (2008) showed that mouse colour could be predicted better by including dominance in the model but the difficulty with such non-additive models is likely to be the inability to estimate numerous small effects that typically explain a small amount of the variance (Hill et al., 2008).
The methods to estimate BV from marker genotypes presented in this paper have a natural Bayesian interpretation which includes a prior distribution of marker effects. However, very similar methods can be derived from non-Bayesian perspectives. For instance, they can also be derived as the expected value of BV in a frequentist setting where marker effects are regarded as random samples from a population of random effects. Penalized least squares and other machine learning methods can also yield similar results (Moser et al., 2009).
(iii) Experimental results
The accuracy of genomic prediction of BV has been assessed by estimating a prediction equation using one dataset and then testing the prediction in a second independent dataset. When this has been done the results are qualitatively in line with the theory above. For instance, Wiggans et al. (2010) observed the accuracy to increase from 0·80 to 0·84 as the number of records (n) used increased from 3700 to 7173.
In many respects, the conditions examined by VanRaden et al. (2009) and Wiggans et al. (2010) are the most favourable. All the animals were within one breed of cattle (Holstein) and the recent effective population size of this breed is low (~100). This means that the variation in true relationship is large or equivalently that the LD between SNPs and QTL is high, so that the approximately 40 000 SNPs explain most of the genetic variance. Also, the low Ne means that the effective number of chromosome segments (s) is small and so the accuracy of estimating their effects is high. This is further aided by the use of progeny tested sires as the experimental animals since they have relatively accurate estimates of BV and so the residual error in the data ($\sigma_e^2$) is low.
Experiments in other livestock have not yielded such high accuracy. For instance, in sheep breeds the accuracy achieved has been lower than reported in Holsteins (Daetwyler et al., 2010). This is expected because the number of animals with marker genotypes and phenotypes is smaller, these animals belong to multiple breeds and the phenotypic data consist of individual animal phenotypes instead of the progeny test used in the Holstein case. In humans, where recent Ne is very large, the accuracy of predicting phenotype has been low despite large datasets (Manolio et al., 2009). However, formal prediction methods such as Bayes B have not been attempted.
As the number of records increases in other breeds and species the accuracy of the EBVs is expected to increase as it did in Holsteins. However, for many breeds and species it may not be possible to assemble such high quality datasets as has been done for Holsteins. In these cases it would be desirable to combine data from several breeds within a species. This is only beneficial if the phase of LD between SNPs and QTL is the same in different breeds. This is not the case when 50 000 SNPs are used in cattle breeds (de Roos et al., 2008) but consistent LD phase should occur if denser SNPs are used (e.g. 500 000). Unfortunately, when multiple breeds are used, the effective number of chromosome segments (s) increases, implying that even larger datasets are needed. Therefore, we expect that the increased accuracy achieved from high density SNP panels will be greater if methods such as Bayes B, which assume many SNPs have zero effect, are used.
Methods such as Bayes B do yield higher accuracy of EBV than BLUP in traits with segregating QTL of moderate effect (Hayes et al., 2010). For instance, EBVs for fat concentration in milk and proportion of white colour in the coat of Holstein cattle were more accurate when Bayes B was used than when BLUP was used, and there are known genes segregating which affect these traits (Hayes et al., 2010).
4. Implementation of genomic selection
In most livestock breeds there are systems in place to calculate EBVs from traditional phenotypic records and pedigrees. In dairy cattle these operate at a national and international level. The use of DNA data to increase the accuracy of EBVs needs to be integrated into these existing systems. At present most systems have very large databases of traditional phenotypic records (from millions of animals) and comparatively small databases of SNP genotypes. Consequently strategies have been devised to minimize the additional computing load in the analysis of the large database. For instance, the SNP genotypes can be combined in a prediction equation to yield an estimate of BV coming only from SNPs. This has been called a direct genetic value (DGV) or marker breeding value (MBV). A selection index is then used to combine this estimate with that generated by the traditional analysis, which does not use SNPs at all, resulting in a final published EBV (Harris & Johnson, 2010). Alternatively, the DGV can be treated as an additional trait, genetically correlated with the phenotypic trait, in a multi-trait BLUP analysis of the large dataset. This has the advantage that it propagates the DGV to the relatives of an animal with SNP genotypes but at the cost of increased computing burden. A third method is to use the equivalent model based on relationships. For animals with SNP genotypes these are used to calculate the relationship and for other animals the pedigree relationship is adjusted for the knowledge contained in the relationships based on genotypes (Legarra & Misztal, 2008). This method requires raw genotypes rather than DGVs and it is most useful if the BLUP method of estimating SNP effects is to be used. However, other methods such as Bayes B could be used by calculating a relationship matrix from the SNPs but weighting the SNPs according to the variance associated with them.
If, in the future, large numbers of animals have SNP genotypes, it may be that the genetic evaluation system will be completely changed to one that uses genetic markers rather than pedigree relationships, as suggested by Goddard (1998).
If the data contain many animals each with many genotypes, then MCMC methods to estimate SNP effects can take so much computer time as to become impractical. Approximations to Bayes B that use an EM algorithm instead of sampling (Shepherd et al., 2010) may overcome this problem.
5. The design of breeding programmes that utilize genomic selection
Marker assisted selection is most useful for traits which cannot be recorded on an individual prior to the (minimum) age of breeding (Meuwissen & Goddard, 1996). For instance, traits which are only displayed in females or only observable late in life or after slaughter benefit most. Traditionally traits such as milk yield, which is not displayed by bulls, have been improved by progeny testing bulls based on their daughters’ milk yield. This leads to an accurate estimate of the bull's BV but at the expense of a long generation interval. The benefit of genomic selection is that bulls and heifers can be selected early in life and the generation interval reduced, leading to an approximate doubling of genetic gain per year (Schaeffer, 2006; König et al., 2009; Pryce et al., 2010). This radically changes the design of dairy breeding programmes which have been based on progeny testing. By using genetic markers and genomic selection we can select the best bulls when they are born and breed from them at 1 year of age instead of waiting until they have completed a progeny test at 5 years of age. Despite the large change in breeding programmes needed to capture the benefit of genomic selection it is being widely adopted. In the USA, 52 786 Holstein dairy cattle alone had been genotyped with a SNP chip up to September 2010 (George Wiggans, personal communication).
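The trade-off between accuracy and generation interval can be made concrete with the standard expression for annual genetic gain, ΔG = i·r·σg/L, where i is selection intensity, r is the accuracy of selection, σg is the genetic standard deviation and L is the generation interval in years. The sketch below is a deliberately simplified single-pathway comparison. The accuracies and generation intervals are hypothetical placeholders chosen for illustration and are not values reported in the studies cited.

```python
def annual_gain(intensity, accuracy, sigma_g, generation_interval):
    """Expected genetic gain per year: i * r * sigma_g / L."""
    return intensity * accuracy * sigma_g / generation_interval


# Hypothetical schemes with the same selection intensity and trait.
intensity, sigma_g = 2.0, 1.0

# Progeny testing: accurate EBVs, but bulls are old when widely used.
progeny_test = annual_gain(intensity, accuracy=0.9, sigma_g=sigma_g,
                           generation_interval=6.0)

# Genomic selection: somewhat lower accuracy, much shorter interval.
genomic = annual_gain(intensity, accuracy=0.7, sigma_g=sigma_g,
                      generation_interval=2.5)

gain_ratio = genomic / progeny_test
print(f"gain per year, progeny test: {progeny_test:.3f} sigma_g")
print(f"gain per year, genomic:      {genomic:.3f} sigma_g")
print(f"ratio genomic / progeny test: {gain_ratio:.2f}")
```

With these placeholder inputs the loss in accuracy is more than offset by the shorter generation interval. A realistic comparison would treat the sire and dam selection pathways separately.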
In developing countries it has been hard to implement traditional genetic improvement programmes because they are logistically complex especially if they require recording the pedigree and production of thousands to millions of animals. Genomic selection might be more practical than traditional selection in these countries. The development of a prediction equation would still require recording the performance of many animals but pedigree would not be required and implementation would require only a DNA sample from each selection candidate and the laboratory facilities to genotype SNPs and compute EBVs from them.
6. Long-term response to genomic selection
If a prediction equation is estimated in the base generation and used for selection for several subsequent generations, simulation studies show that the selection response declines rapidly (Muir, Reference Muir2007). Goddard (Reference Goddard2009) shows that this is due to two processes. First, selection drives the selected SNP allele towards fixation more quickly than the favourable QTL allele so that the LD between them, which genomic selection relies on, diminishes. Second, traditional mass selection on phenotype does not result in a rapid decline in genetic variance because increasing the frequency of initially rare favourable alleles compensates for the movement towards fixation of common, favourable alleles. However, genomic selection is unlikely to select effectively for rare alleles because they are poorly correlated with the common SNPs. The decline in rate of response to genomic selection is likely to be slower if the trait is controlled by very many genes, each with very small effects, because then the change in allele frequency will be slower.
This decline in response over time can be mitigated in a number of ways. Re-estimating the prediction equation each generation would partially prevent the decline in response (Muir, Reference Muir2007). Goddard (Reference Goddard2009) presented a method to optimize long-term response which decreases selection pressure on QTL that are initially common and of large effect, compared with selection on EBV alone. When very high density SNP genotyping and the Bayes B method of estimation of SNP effects are used, only SNPs that are in close LD to the QTL obtain estimated effects ≠0, with accuracies that persist over time, since the LD persists over time (Meuwissen & Goddard, Reference Meuwissen and Goddard2010).
Long-term response is also reduced by inbreeding, which also causes inbreeding depression. In traditional selection it is possible to balance maximizing the EBV of selected animals with minimizing long-term inbreeding by optimizing the contribution of individual animals to the next generation (Wray & Goddard, Reference Wray and Goddard1994; Meuwissen, Reference Meuwissen1997). By using the relationship matrix estimated from the SNPs, this method can be extended to genomic selection (Sonesson & Meuwissen, Reference Sonesson, Woolliams and Meuwissen2010b).
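As an illustration of how a SNP-based relationship matrix can enter such a balance, the sketch below builds a genomic relationship matrix from allele counts and evaluates a penalised merit c'EBV − λ·c'Gc/2 for a vector of contributions c. This is one common construction and is an assumption here, not necessarily the formulation used in the cited papers; the function names, the penalty weight `lam` and the simulated genotypes and EBVs are all hypothetical.

```python
import numpy as np


def genomic_relationship(M):
    """Genomic relationship matrix from allele counts.

    M has shape (n_animals, n_snps) with genotypes coded 0/1/2.
    Allele frequencies are estimated from the sample itself (an assumption).
    """
    M = np.asarray(M, dtype=float)
    p = M.mean(axis=0) / 2.0             # estimated allele frequency per SNP
    Z = M - 2.0 * p                      # centre each SNP column
    denom = 2.0 * np.sum(p * (1.0 - p))  # scales G to resemble a pedigree matrix
    return Z @ Z.T / denom


def penalised_merit(c, ebv, G, lam):
    """Mean EBV of the contributions minus a penalty on group coancestry."""
    c = np.asarray(c, dtype=float)
    return float(c @ ebv - lam * 0.5 * (c @ G @ c))


# Simulated data purely for illustration
rng = np.random.default_rng(0)
M = rng.integers(0, 3, size=(6, 50))
ebv = rng.normal(size=6)
G = genomic_relationship(M)

equal = np.full(6, 1.0 / 6.0)            # all animals contribute equally
single = np.zeros(6)
single[np.argmax(ebv)] = 1.0             # only the top animal contributes

for name, c in [("equal", equal), ("single best", single)]:
    print(name, penalised_merit(c, ebv, G, lam=1.0))
```

Concentrating contributions on the best animal raises the mean EBV term but also raises c'Gc, so the weight λ controls how strongly relatedness among the selected parents is penalised. A full optimal-contribution procedure would search over c subject to constraints, which this sketch does not attempt.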
7. Genomic selection in plants and aquaculture
In principle, genomic selection could be applied to crops and species used for aquaculture as well as to livestock. However, some practical problems are likely to occur. Some species have very large Ne in the wild and hence LD extends over only a very short distance. This means that very dense SNP genotyping would be necessary and possibly very large sample sizes as well. This may be uneconomical, especially where individual plants or fish have a small value. To overcome these problems it may be necessary to reduce Ne in a breeding programme, for instance, by using only the best families or existing varieties to breed the new strain. A novel design has been suggested (Sonesson et al., Reference Sonesson, Meuwissen and Goddard2010), where estimation of SNP effects is based on the genotyping of DNA pools and SNP density is reduced by estimating SNP effects within one or a few families.
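The dependence of LD on Ne can be made concrete with the widely used approximation E(r²) ≈ 1/(1 + 4·Ne·c), where c is the recombination distance in Morgans. The sketch below is an illustration under that approximation only; the function names, the target r² of 0.2 and the Ne values are assumptions chosen for the example.

```python
def expected_r2(ne, c_morgans):
    """Approximate expected LD (r^2) between two loci c Morgans apart."""
    return 1.0 / (1.0 + 4.0 * ne * c_morgans)


def max_spacing_for_r2(ne, target_r2):
    """Largest distance (in Morgans) at which expected r^2 reaches target_r2."""
    return (1.0 / target_r2 - 1.0) / (4.0 * ne)


# Assumed Ne values, for illustration only
for ne in (100, 1_000, 10_000):
    spacing_cm = 100.0 * max_spacing_for_r2(ne, target_r2=0.2)
    print(f"Ne = {ne:>6}: spacing for r^2 = 0.2 is about {spacing_cm:.3f} cM")
```

Under this approximation the distance over which a given level of LD is maintained is inversely proportional to Ne, so a hundredfold larger Ne calls for roughly a hundredfold denser marker panel.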
Deliberately reducing Ne could lead to faster inbreeding and inbreeding depression, but this can be avoided when the commercial animal or plant is a cross between two or more lines. Reciprocal recurrent selection is selection within pure strains based on the performance of their crossbred offspring. This is a selection method which increases crossbred performance and heterosis and thus can be described as minimizing inbreeding depression. Genomic selection would be particularly useful for reciprocal recurrent selection because it would eliminate the need for a progeny test and therefore reduce the generation interval.
8. Genomic prediction in humans
In humans the same techniques for predicting genetic value from a genome wide panel of SNPs could be used to predict the genetic risk of a particular disease that an individual faces (Wray et al., Reference Wray, Goddard and Visscher2007). This could be a more accurate prediction than that already made from family history and used in disease prevention, diagnosis, treatment and counselling. However, the high recent Ne of humans implies that many SNPs and a reference population with very many people will be needed to achieve a highly accurate prediction. It is hoped that use of methods such as Bayes B that identify the SNPs in LD with causative variants will lead to higher accuracy than methods such as BLUP.
9. Mapping and identifying QTL
The genome wide SNP genotypes that are used in genomic selection are also used in genome wide association studies (GWAS) to map genes for complex traits (Goddard & Hayes, Reference Goddard and Hayes2009). Typically in a GWAS each SNP is tested for an association with the trait, ignoring all other SNPs. Consequently the association at one SNP could reflect the action of more than one QTL and so the SNP with the largest association may not be the closest SNP to a QTL. In genomic selection all the SNPs are fitted at once, which may result in more precise mapping of the QTL. However, when the BLUP model of SNP effects is used, many SNPs are estimated to have small effects and the position of the QTL is again blurred. But when a method such as Bayes B is used, the SNPs with large effects might be good indicators of the position of QTL (Verbyla et al., Reference Verbyla, Bowman, Hayes and Goddard2009).
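The contrast between testing one SNP at a time and fitting all SNPs together can be seen in a small simulated example. In the sketch below, ordinary least squares on three SNPs stands in for the joint model; this is a deliberate simplification and an assumption, since real genomic selection data have far more SNPs than records and need BLUP or Bayes B type shrinkage. The genotypes, the noise-free phenotype and the function names are all hypothetical.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 400

# SNP 0 is the (assumed) causal locus, coded 0/1/2
qtl = rng.integers(0, 3, size=n).astype(float)
# SNP 1 copies SNP 0 in about 80% of animals, so it is in LD with the QTL
copy = rng.random(n) < 0.8
linked = np.where(copy, qtl, rng.integers(0, 3, size=n).astype(float))
# SNP 2 is independent of the QTL
unlinked = rng.integers(0, 3, size=n).astype(float)

X = np.column_stack([qtl, linked, unlinked])
y = 1.0 * qtl  # noise-free phenotype, to keep the illustration exact


def single_snp_effects(X, y):
    """Regress y on each SNP separately, ignoring all other SNPs."""
    Xc = X - X.mean(axis=0)
    yc = y - y.mean()
    return (Xc * yc[:, None]).sum(axis=0) / (Xc ** 2).sum(axis=0)


def joint_effects(X, y):
    """Fit all SNPs at once by least squares (stand-in for a joint model)."""
    Xc = X - X.mean(axis=0)
    yc = y - y.mean()
    beta, *_ = np.linalg.lstsq(Xc, yc, rcond=None)
    return beta


print("one SNP at a time:", np.round(single_snp_effects(X, y), 3))
print("all SNPs together:", np.round(joint_effects(X, y), 3))
```

Analysed singly, the linked SNP shows a sizeable association purely through LD with the causal locus, whereas the joint fit assigns the effect to the causal SNP alone. With noisy phenotypes and many correlated SNPs the separation is far less clean, which is the blurring described above.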
10. Future developments
The cost of genome sequencing is dropping rapidly so in the near future sequence data on individuals will supplement SNP genotype data. This will increase the accuracy of EBVs because it will provide very dense markers and will include the causal mutations (Meuwissen & Goddard, Reference Meuwissen and Goddard2010). Only a sample of individuals from any species of livestock will be sequenced but other animals will have sequence imputed from SNP genotypes using the database of sequenced animals as a reference. This will provide a large number of animals with phenotypic records and imputed genome sequence and this should constitute a powerful resource for discovering the causal mutations. This should lead to development of prediction equations that persist across generations and across breeds, since the LD between the SNPs and the QTL is (nearly) complete.