1. Introduction
Population genomics is a new term for a field of study that is as old as the field of genetics itself, assuming that it means the study of the amount and causes of genome-wide variability in natural populations. Indeed, the problem of characterizing natural variability greatly concerned Charles Darwin, since evolutionary change under natural selection requires the existence of heritable variation in the traits in question. By drawing on evidence from domesticated species of animals and plants, Darwin succeeded in demonstrating the existence of such variability in both quantitative and discrete traits (Darwin, Reference Darwin1859, Reference Darwin1868).
No progress in understanding the causes of this variability was made, however, until the rediscovery of Mendelian genetics at the beginning of the 20th century. The Mendelian basis for the inheritance of naturally occurring discrete polymorphisms, such as the ABO blood groups of humans (Bernstein, Reference Bernstein1925), Batesian mimics in Papilio butterflies (Punnett, Reference Punnett1915) and heterostyly in Primula (Bateson & Gregory, Reference Bateson and Gregory1905), was quickly established. However, discrete variation that is easily detectable at the phenotypic level is comparatively uncommon, except for rare deleterious mutations. In contrast, quantitative variation in meristic or metric traits is abundant, but less amenable to genetic analysis. By the 1920s, the joint control of quantitative variation by non-genetic effects and multiple Mendelian genes was firmly established (Provine, Reference Provine1971), and it could confidently be assumed that the vast bulk of heritable variation reflects the effects of underlying Mendelian variants carried on the chromosomes (Muller & Altenburg, Reference Muller and Altenburg1920).
In addition, quantitative studies of the effects of inbreeding, such as those of Sewall Wright on guinea pigs (Wright, Reference Wright1922), provided evidence for the existence of ‘concealed variability’, which is exposed when recessive alleles are made homozygous by matings between close relatives. The introduction of H. J. Muller's ‘balancer’ crossover suppressor chromosomes (Muller, Reference Muller1928) into Drosophila population genetics in the 1930s, allowing the forced homozygosity of whole chromosomes derived from natural populations, yielded the startling conclusion that wild individuals of Drosophila species carry an average of over one recessive lethal mutation in the heterozygous state (Simmons & Crow, Reference Simmons and Crow1977). Ingenious competition experiments devised by John Sved later showed that homozygosity for a lethal-free haploid D. melanogaster genome reduces fitness to effectively zero under laboratory conditions (Sved, Reference Sved1971; Latter & Sved, Reference Latter and Sved1994).
While confirming Darwin's view that there is plenty of genetic variability available for use in evolution, the evidence obtained by these methods left two important questions unanswered (Lewontin, Reference Lewontin1974). First, how much variation is there within a natural population at an average gene locus? Two radically different hypotheses had been proposed during the 1950s. Under the ‘classical’ view, most genes have a single high-frequency, wild-type allelic form, accompanied only by some rare deleterious variants caused by mutation, like the recessive lethals described above (Muller, Reference Muller1950). In contrast, the ‘balance’ hypothesis proposed that genes typically have several different allelic forms that segregate at intermediate frequencies, like the polymorphisms mentioned earlier (Dobzhansky, Reference Dobzhansky1955). The second question concerned the extent to which the frequencies of variants within populations (other than rare deleterious mutations) are controlled by natural selection, as envisaged under the balance hypothesis, v. reflecting an interaction between mutation and random genetic drift, as proposed by the ‘neo-classical’ view (Kimura & Crow, Reference Kimura and Crow1964).
The methods of classical and quantitative genetics provide no means of sampling variation randomly from the genome, so that the first question cannot be answered by them. However, as shown below, modern DNA sequencing technology allows a virtually complete answer. The data provided by this technology also provide information that should allow the second question to be answered, but we are still some way from being confident that we know the answer. In this paper, I will give a brief historical review of how the first question has been answered, and then discuss methods that have been developed to answer the second question. Due to space constraints, only some of these can be described in any detail. In particular, I will not trace the development of the theory of the coalescent process, which has played a fundamentally important role in many of the modern methods of statistical testing and inference in population genomics; excellent reviews of coalescent theory are provided by Hein et al. (Reference Hein, Schierup and Wiuf2005) and Wakeley (Reference Wakeley2008). Some questions that are raised by studies of DNA sequence variation and evolution will also be considered.
2. Molecular genetics to the rescue of population genetics
(i) The pre-DNA era – gel electrophoresis of proteins
The first estimates of genome-wide levels of variation in populations were obtained in 1966, by using the then recent discovery that most genes correspond to stretches of DNA that code for polypeptides. Detection of variation in the sequence of a polypeptide allows us to infer the existence of variation in the corresponding DNA sequence. John Hubby and Richard Lewontin applied this idea to samples from natural populations of Drosophila pseudoobscura (Hubby & Lewontin, Reference Hubby and Lewontin1966; Lewontin & Hubby, Reference Lewontin and Hubby1966), and Harry Harris independently applied it to humans (Harris, Reference Harris1966). Both groups used gel electrophoresis to screen populations for variants that affect the migration rates of proteins on a gel exposed to an electric current. Many different soluble proteins controlled by independent genes were studied, mostly enzymes with well-understood metabolic roles. The proteins were chosen purely because they could be studied easily, with no bias with respect to any prior knowledge of their level of variability.
These papers also introduced ways of summarizing the results of genotyping numerous individuals at dozens of loci, by means of measures such as the proportion of polymorphic loci, P (i.e. loci with at least one minority variant with a frequency greater than a cut-off such as 1 or 5%), and the genic diversity, H (often referred to as the ‘heterozygosity’), measured by the mean over the set of loci of the frequency with which two randomly sampled alleles at a locus differ in state. These pioneering studies resulted in hundreds of ‘find 'em and grind 'em’ surveys of natural variability (Lewontin, Reference Lewontin1974, Reference Lewontin1985). These surveys estimated that a large fraction (e.g. 43% in D. pseudoobscura, 28% in humans) of loci is usually polymorphic, and that H is of the order of a few percent (12% for D. pseudoobscura and 7% for humans). This apparently overthrew the classical view of genetic variability.
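The two summary statistics can be illustrated with a short calculation; the allele counts below are invented for illustration, and a 1% frequency cut-off is used for P:

```python
def allele_freqs(counts):
    """Allele frequencies from counts of each allele at one locus."""
    total = sum(counts)
    return [c / total for c in counts]

def gene_diversity(freqs):
    """H at one locus: chance that two randomly sampled alleles differ in state."""
    return 1.0 - sum(p * p for p in freqs)

def is_polymorphic(freqs, cutoff=0.01):
    """Polymorphic if the commonest allele is at frequency <= 1 - cutoff."""
    return max(freqs) <= 1.0 - cutoff

# Invented allele counts at three loci:
loci = [[199, 1],        # nearly monomorphic locus
        [120, 80],       # two common alleles
        [150, 30, 20]]   # three alleles
freqs = [allele_freqs(c) for c in loci]
P = sum(is_polymorphic(f) for f in freqs) / len(loci)   # proportion polymorphic
H = sum(gene_diversity(f) for f in freqs) / len(loci)   # mean genic diversity
```

Here the first locus falls below the 1% cut-off and does not count towards P, although it still contributes (very little) to the mean H.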
But these results had several limitations. Most importantly, only amino acid changes that affect the mobility of proteins in gels (mostly associated with charge changes) can be detected by electrophoresis; these probably represent only about one-third of the total possible mutational changes in the amino acid sequence of a protein (Lewontin, Reference Lewontin1985). Additionally, only changes in DNA sequences that affect protein sequences can be observed.
Furthermore, attempts to use data on electrophoretic variability to test the neo-classical hypothesis that polymorphic variants are neutral (Kimura & Crow, Reference Kimura and Crow1964; Kimura & Ohta, Reference Kimura and Ohta1971), v. the alternative that they are maintained by balancing selection (Lewontin & Hubby, Reference Lewontin and Hubby1966), proved frustratingly inconclusive. Lewontin (Reference Lewontin1974, p. 189) remarked that:
‘For many years, population genetics was an immensely rich and powerful theory with virtually no suitable facts on which to operate. …. Quite suddenly the situation has changed … and facts in profusion have been poured into the hopper of this theory machine. And from the other end has issued – nothing. It is not that the machinery does not work, for a great clashing of gears is audible, if not deafening, but it somehow cannot transform into a finished product the great volume of raw material that has been provided. The entire relationship between the theory and the facts needs to be reconsidered.’
By the end of the 1970s it was clear that studies of variation at the level of the DNA sequence itself would be needed to deal with these problems. In addition, theoretical population geneticists, starting with Warren Ewens (Ewens, Reference Ewens1972), had begun to produce theoretical models that predicted the properties of samples from a population, as opposed to modelling the population as a whole, greatly improving our ability to test hypotheses against data (for later developments, see Hein et al. Reference Hein, Schierup and Wiuf2005; Wakeley, Reference Wakeley2008; Charlesworth & Charlesworth, Reference Charlesworth and Charlesworth2010). Together, advances in both experimental techniques and theoretical methods have led to a much less pessimistic assessment of the prospects for solving the problem of the causes of variation.
(ii) The beginning of the DNA revolution – restriction mapping
It is probably hard for people brought up with the wonders of PCR amplification, automated Sanger sequencing, and now next-generation sequencing, to realize how difficult it was for geneticists to develop tools for studying variation at the DNA level, especially because relatively large amounts of cellular material were needed for molecular characterization of an individual before PCR amplification was available.
The first studies of DNA sequence variation in organelle and nuclear genes were done in the late 1970s, using restriction enzymes to detect variation at sites that could be cut by them (see reviews by Avise, Reference Avise, Nei and Koehn1983; Kazazian et al., Reference Kazazian, Chakravarti, Orkin, Antonarakis, Nei and Koehn1983; Nei, Reference Nei, Nei and Koehn1983). With nuclear genes, Southern blotting with probes derived from cloned genes was needed to pin down the region of interest, making it harder to obtain data.
D. melanogaster was especially amenable to this approach, because stocks of flies homozygous for any of the three major chromosomes could be made using balancer chromosomes (see Introduction section), providing plenty of material for isolating DNA from a single haploid genome. During the 1980s, Chuck Langley's laboratory was especially active in generating surveys of natural variability at different locations in the genome by this method, providing the first overview of genome-wide variation (Langley et al., Reference Langley, Montgomery and Quattlebaum1982, Reference Langley, Shrimpton, Yamazaki, Miyashita, Matsuo and Aquadro1988; Aquadro et al., Reference Aquadro, Deese, Bland, Langley and Laurie-Ahlberg1986; Langley & Aquadro Reference Langley and Aquadro1987; Schaeffer et al., Reference Schaeffer, Aquadro and Langley1988; Miyashita & Langley, Reference Miyashita and Langley1988; Aguadé, et al., Reference Aguadé, Miyashita and Langley1989b; Stephan & Langley, Reference Stephan and Langley1989; Aguadé et al., Reference Aguadé, Miyashita and Langley1992). Parallel studies were done in humans using cultured cells to provide material, starting with work on the human β-globin gene cluster (Kan & Dozy, Reference Kan and Dozy1978; Orkin et al., Reference Orkin, Kazazian, Antonorakis, Goff, Boehm, Sextor, Waber and Giardina1982).
The analysis of data from restriction sites is not straightforward, since restriction maps need to be constructed for each haploid genome region to be characterized. Furthermore, since the only information provided is the presence or absence of the short sequences recognized by the enzymes, algorithms had to be developed to translate the restriction site variation observed in a sample into estimates of nucleotide site diversity (π), the per nucleotide site equivalent of H, which was introduced by Masatoshi Nei and Wen-Hsiung Li (Nei & Li, Reference Nei and Li1979) when analysing restriction site data; these are reviewed by Nei (Reference Nei1987, chapter 10). A considerable advantage of this approach was that a fairly large genomic region could be surveyed (13 kb in the case of the D. melanogaster Adh region and 50 kb in the case of the human β-globin gene cluster); in addition to changes between alternative nucleotides at a site, transposable element (TE) insertions, insertion/deletion (indel) polymorphisms and chromosome rearrangements could also be identified.
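The quantity π itself is simply the mean proportion of nucleotide sites at which two randomly chosen sequences from the sample differ; a minimal sketch of its calculation from a set of aligned haplotypes (invented here for illustration) is:

```python
from itertools import combinations

def nucleotide_diversity(seqs):
    """pi: mean pairwise differences per site over all pairs of sequences
    (in the sense of Nei & Li, 1979)."""
    n_sites = len(seqs[0])
    pairs = list(combinations(seqs, 2))
    diffs = sum(sum(a != b for a, b in zip(s1, s2)) for s1, s2 in pairs)
    return diffs / (len(pairs) * n_sites)

# Four invented 10 bp haplotypes sampled from a population:
sample = ["ACGTACGTAA",
          "ACGTACGTAA",
          "ACGAACGTAA",
          "ACGTACGTTA"]
pi = nucleotide_diversity(sample)
```

In practice, estimates from restriction data had to infer this quantity indirectly from the presence or absence of cut sites, rather than from full sequences as above.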
It is impressive how much information was gleaned from these early studies; for example, the results from both Drosophila and humans showed that what are now known as single nucleotide polymorphisms (SNPs) contributed the most to variability, in terms of events per nucleotide site, while TE insertions contributed low-frequency polymorphisms in Drosophila but almost nothing to human variability. It is interesting to note that Alan Robertson and Bill Hill (Robertson & Hill, Reference Robertson and Hill1983) used the results of Orkin et al. (Reference Orkin, Kazazian, Antonorakis, Goff, Boehm, Sextor, Waber and Giardina1982) to infer that the mean nucleotide site diversity in humans implies an effective population size of 20 000, and that the disagreement between the observed level of linkage disequilibrium and the theoretical formula for its magnitude under drift and recombination suggested that ‘crossing over is not homogeneous along the DNA sequence’. These findings were essentially confirmed by the results of much larger, more recent, studies of DNA sequence variability (see below). Furthermore, the Drosophila studies showed that regions of the genome with low recombination had unusually low levels of genetic variability (Aguadé et al., Reference Aguadé, Miyashita and Langley1989a; Stephan & Langley, Reference Stephan and Langley1989); more generally, a positive correlation was later found between the local rate of recombination experienced by a gene, in terms of map units per unit of physical distance, and the level of genetic variability (Begun & Aquadro, Reference Begun and Aquadro1992), a relationship that has also stood the test of time (see the Discussion).
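The logic of the Robertson–Hill inference can be illustrated by inverting the standard neutral equilibrium formula π = 4N_e u; the diversity and mutation-rate values below are illustrative assumptions, chosen only to reproduce the order of magnitude quoted above:

```python
# Back-of-envelope inference of effective population size from neutral
# diversity, using pi = 4 * N_e * u. Both input values are assumptions
# for illustration, not measured quantities.
def effective_size(pi, u):
    return pi / (4 * u)

# ~0.08% silent site diversity, u = 1e-8 per site per generation (assumed):
Ne = effective_size(pi=8e-4, u=1e-8)
```

With these inputs the estimate is 20 000, matching the figure inferred by Robertson & Hill; the result scales linearly with the assumed mutation rate.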
(iii) The rise of DNA sequencing
With the invention of PCR amplification for isolating specific regions of DNA with the aid of sequence-specific primers, and with the introduction of sequencing machines, DNA sequencing of multiple copies of the same region or set of regions of the genome (resequencing, as it has come to be called) became the method of choice for surveying DNA sequence variation. Before these methods became available, the first thorough study of variability by the use of sequencing was that of the Adh gene of D. melanogaster by Marty Kreitman (Kreitman, Reference Kreitman1983), who applied the very laborious procedure of manual Maxam–Gilbert sequencing to stocks of 11 independently isolated chromosomes made homozygous by a balancer. (This was not a truly random sample from the population, as there were approximately equal numbers of the fast and slow electrophoretic alleles.)
A significant finding of this study, confirmed by later resequencing work, was that most of the variability involved silent changes that do not affect the protein sequence: the sequence differences were either in non-coding regions, or were coding sequence changes that did not affect the amino acid sequence (synonymous variants). Indeed, in Kreitman's D. melanogaster Adh gene survey, only one amino acid polymorphism was detected – the one previously known to cause the difference between the fast and slow alleles. About 39 amino acid variants would have been found if the same level of variability applied to both silent and non-synonymous variants (Kreitman, Reference Kreitman1983). This agreed with the results of contemporary analyses of the molecular evolution of DNA sequences, which had shown a pattern of much slower rate of evolution per nucleotide site for non-synonymous compared to silent changes, e.g. Kimura (Reference Kimura1983, chapter 4). These results led to the by-now familiar conclusion that the majority of new mutations that change the amino acid sequence have such large deleterious effects on fitness that they contribute little to either within-population variation or divergence between species, compared with silent changes to the DNA sequence.
The methods used for characterizing variability at the DNA sequence level have improved steadily due to technical advances and continual reductions in costs; whereas surveys of more than a dozen or so genes were prohibitively expensive as recently as the year 2000, except for well-funded investigators of humans and medically important microbes, even financially hard-pressed Drosophila population geneticists have been able to characterize samples of individuals for hundreds of roughly 500 bp sequences (the sizes of reads from an automated Sanger sequencer), from different sites around the genomes (e.g. Andolfatto, Reference Andolfatto2007; Hutter et al., Reference Hutter, Li, Beisswanger, De Lorenzo and Stephan2007). While it is very expensive to scale this up to whole genomes, analyses of resequenced whole genomes of microbes such as yeast (Liti et al., Reference Liti, Carter, Moses, Warringer, Parts, James, Davey, Roberts, Burt, Durbin and Louis2009; Schacherer et al., Reference Schacherer, Shapiro, Ruderfer and Kruglyak2009) and even D. simulans (Begun et al., Reference Begun, Holloway, Stevens, Hillier, Poh, Hahn, Nista, Jones, Kern, Dewey, Pachter, Myers and Langley2007) have now been published.
These studies show that the overall levels of diversity at silent nucleotide sites within a species, which are the least likely to be influenced by differences in selective constraints, vary enormously among different taxa, from a low of about 0·1% for humans to a high of over 8% for the sea squirt Ciona intestinalis (estimated by comparing the two haploid genomes of a single individual: Small et al., Reference Small, Brudno, Hill and Sidow2007): see Figure 1.10 of Charlesworth & Charlesworth (Reference Charlesworth and Charlesworth2010). Although levels of diversity per nucleotide site are usually small, even the small, gene-dense genome of the bacterium Escherichia coli has 4·2 million nucleotide sites, including around 900 000 silent or synonymous sites (Blattner et al., Reference Blattner, Plunkett, Bloch, Perna, Burland, Riley, ColladoVides, Glasner, Rode, Mayhew, Gregor, Davis, Kirkpatrick, Goeden, Rose, Mau and Shao1997). Given that the mean silent site diversity in E. coli is about 2% (Charlesworth & Eyre-Walker, Reference Charlesworth and Eyre-Walker2006), it can be estimated that there are likely to be about 82 900 silent SNPs among 100 E. coli genomes (Charlesworth & Charlesworth, Reference Charlesworth and Charlesworth2010, p. 31). Several million SNPs (non-coding, synonymous and non-synonymous) are present genome-wide in the populations of most multicellular organisms, with their much larger genomes. There is thus a wealth of genetic variation in natural populations that was undreamt of in the days of electrophoresis, even disregarding the contribution of insertions, deletions and copy number variants, whose importance is becoming increasingly recognized (e.g. Iafrate et al., Reference Iafrate, Feuk, Rivera, Listwenik, Donahoe, Qi, Scherer and Lee2004; Emerson et al., Reference Emerson, Cardoso-Moreira, Borevitz and Long2008; Kidd et al., Reference Kidd, Cooper, Donahue, Hayden, Sampas, Graves and Eichler2008).
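A calculation of this kind can be sketched with Watterson's neutral formula for the expected number of segregating sites in a sample of n sequences, E[S] = θLa, where a is the sum of 1/i for i from 1 to n−1 and θ per site is taken equal to the silent diversity. The inputs follow the figures quoted in the text; the result is of the same order as the ~82 900 silent SNPs cited, with the exact value depending on the assumptions made:

```python
# Expected number of variable (segregating) silent sites in a sample of
# n genomes under the neutral model (Watterson's formula). Treating the
# observed silent diversity as theta per site is itself an assumption.
def expected_segregating_sites(theta_per_site, n_sites, n_sample):
    a = sum(1.0 / i for i in range(1, n_sample))  # harmonic sum over n-1 terms
    return theta_per_site * n_sites * a

# ~2% silent diversity, ~900 000 silent sites, 100 E. coli genomes:
S = expected_segregating_sites(theta_per_site=0.02, n_sites=900_000, n_sample=100)
```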
With the introduction of high-throughput sequencing methods there will shortly be large-scale studies of both human populations and model organisms like D. melanogaster and Arabidopsis thaliana, where hundreds or thousands of independent genomes are being resequenced (see http://www.1000genomes.org; http://www.dpgp.org; http://www.hgsc.bcm.tmc.edu/project-species-i-Drosophila_genRefPanel.hgsc; http://1001genomes.org); a preliminary study of D. melanogaster has already been published (Sackton et al., Reference Sackton, Kulathinal, Bergman, Quinlan, Dopman, Carneiro, Marth, Hartl and Clark2009). We will soon have the finest-scale resolution that is possible. This throws the ball firmly into the court of theoretical population geneticists and statistical geneticists, to provide both a theoretical framework for interpreting any patterns discerned in the data, and quantitative tools for testing hypotheses against the data. If Lewontin's machinery does not work with this material, then we should give up! But the approaches reviewed below suggest that there are reasons to hope that the retooled machinery will perform rather well.
3. Inferring the causes of genome-wide variation
Simply estimating mean levels of sequence diversity across the genome tells us nothing about the forces involved in creating and maintaining it. We would like to know to what extent variability for different classes of variants (non-synonymous, synonymous and non-coding) can be accounted for by the neutral model, according to which genetic drift acts on selectively equivalent or nearly equivalent types of variant (Kimura, Reference Kimura1983). If selection needs to be invoked, what kind of selection typically operates, and what is its intensity? Do other evolutionary forces, such as biased gene conversion (BGC) or meiotic drive play a significant part?
(i) Mutation
On almost any view of evolution other than a neo-Lamarckian one (currently, but in my opinion unconvincingly, being advocated by a vocal minority: e.g. Jablonka & Raz, Reference Jablonka and Raz2009), both neutral and adaptive evolution depend on new mutations, genetic or epigenetic, that arise during the transmission of the genetic material from parent to offspring. The ability to characterize the sequences of whole genomes, or large portions of genomes, is revolutionizing our knowledge of the mutational process, both in terms of the rates of occurrence of mutations and the relative frequencies of different types of mutational change (reviewed by Lynch, Reference Lynch2010). These studies show that, despite low mutation rates per base pair per generation in the nuclear genomes of multicellular organisms (of the order of 10^−9–10^−8), the per-genome mutation rate per generation in short-lived species with relatively small genomes, such as Drosophila, is substantially higher than one new mutation per zygote per generation, and probably about 100 in humans, with their longer generation time and much larger genome. In addition, there is an almost universal bias in favour of mutational changes from GC base pairs to AT base pairs, compared with AT to GC, and for transitions over transversions. By far the most frequent type of mutation is the single nucleotide substitution, followed by small insertions/deletions (indels).
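The per-genome arithmetic behind these figures is straightforward, multiplying a per-site rate by the number of sites in a diploid zygote; the rates and genome sizes below are illustrative round numbers, not definitive estimates:

```python
# Back-of-envelope per-generation mutation numbers per zygote.
# All input values are illustrative assumptions of the right order
# of magnitude, not measured quantities.
def mutations_per_zygote(u_per_site, diploid_sites):
    return u_per_site * diploid_sites

# Human: ~1.2e-8 per site per generation, ~3.2 Gb haploid genome:
human = mutations_per_zygote(1.2e-8, 2 * 3.2e9)
# Drosophila: ~8.4e-9 per site per generation, ~140 Mb haploid genome:
fly = mutations_per_zygote(8.4e-9, 2 * 1.4e8)
```

These inputs give several tens of new mutations per human zygote and a few per Drosophila zygote, consistent with the orders of magnitude quoted above.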
While these conclusions are not radically new, they are now based solidly on direct evidence. This provides an essential underpinning for interpreting the results of population-level studies of variation. In addition, the results can be combined with estimates of the levels of selective constraints on amino-acid sequences and non-coding sequences, to yield estimates of the mean number of new deleterious mutations arising per individual per generation, U (Kondrashov & Crow, Reference Kondrashov and Crow1993; Haag-Liautard et al., Reference Haag-Liautard, Dorris, Maside, Macaskill, Halligan, Houle, Charlesworth and Keightley2007; Eory et al., Reference Eory, Halligan and Keightley2010; Lynch Reference Lynch2010). This quantity plays a major role in theories of the evolution of genetic recombination, and the causes of inbreeding depression and ageing. It seems now well established from these studies that U is much larger than one for humans and about one for Drosophila. No doubt these estimates will be revised in the future with more data, but it seems as though we are close to settling a long-running dispute about the magnitude of U in higher organisms (for the background to this controversy, see Keightley & Eyre-Walker, Reference Keightley and Eyre-Walker1999; Lynch et al., Reference Lynch, Blanchard, Houle, Kibota, Schultz, Vassilieva and Willis1999).
(ii) Selection on non-synonymous mutations
(a) The neutral null model
Partly reflecting the fact that evolutionary and population studies of protein sequence differences have a much longer history than those of DNA sequences, a perhaps disproportionate amount of attention has been devoted to non-synonymous variation. It should be emphasized that the neutral theory as developed by Motoo Kimura and Tomoko Ohta (Kimura & Ohta, Reference Kimura and Ohta1971; Kimura, Reference Kimura1983; Ohta, Reference Ohta1992) included the fact that most non-synonymous variants are sufficiently selectively deleterious that they have essentially no chance of fixation by genetic drift in opposition to selection, i.e. their N_e s is usually substantially greater than 1, where N_e is the effective population size, and s is the selection coefficient against a deleterious non-synonymous mutation (measured in the heterozygous state with wild type, in the case of a diploid, randomly mating population).
The neutral theory proposes that the majority of non-synonymous variants that become fixed during evolution are the result of drift acting on neutral or nearly neutral mutations (i.e. those with N_e s < 1). Variants within populations are thus mainly either neutral or slightly deleterious, and are destined to ultimate fixation or loss by drift. Classical theoretical results on the neutral theory are as follows. The rate of sequence divergence between species per nucleotide site is equal to the corresponding mutation rate u. If N_e u ≪ 1, the equilibrium level of nucleotide site diversity, π, under mutation and drift is equal to 4N_e u. The probability that a polymorphic mutation is found at frequency q in a sample (the ‘site frequency spectrum’ or SFS) is proportional to 1/q.
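These classical expectations are easily computed: in a sample of n sequences, the expected number of neutral sites at which the derived allele is present in i copies is θ/i, where θ = 4N_e u summed over the sites surveyed (so the spectrum falls off as 1/q). The value of θ below is illustrative:

```python
# Expected neutral site frequency spectrum (SFS) for a sample of n
# sequences: E[number of sites with i derived copies] = theta / i,
# for i = 1, ..., n-1. theta is an illustrative value here.
def expected_sfs(theta, n):
    """theta = 4*N_e*u, summed over the sites surveyed."""
    return [theta / i for i in range(1, n)]

sfs = expected_sfs(theta=10.0, n=6)   # sample of 6 -> frequency classes i = 1..5
```

Singletons are expected to be the most numerous class, which is why an excess or deficit of rare variants is such a sensitive signal of departures from the standard neutral model.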
These classical findings provide the basis for many different tests of the neutral theory. A serious limitation, however, is that they assume that neutral sites are in statistical equilibrium under mutation and genetic drift in a panmictic population. While the assumption of panmixis is probably reasonably accurate for many species of Drosophila, it certainly does not apply to humans or predominantly self-fertilizing model organisms such as Caenorhabditis elegans and Arabidopsis thaliana. Furthermore, several intensively studied Drosophila populations show evidence for recent changes in population size, with bottlenecks and subsequent expansion (Haddrill et al., Reference Haddrill, Charlesworth, Halligan and Andolfatto2005a, Reference Haddrill, Thornton, Charlesworth and Andolfatto2005b; Ometto et al., Reference Ometto, Glinka, De Lorenzo and Stephan2005), as is also the case for non-African human populations (Boyko et al., Reference Boyko, Williamson, Indap, Degenhardt, Hernandez, Lohmueller, Adams, Schmidt, Sninsky, Sunyaev, White, Nielsen, Clark and Bustamante2008). There has, therefore, been considerable effort to develop statistical tests for selection that include departures from the assumptions of the standard model, or even avoid them completely (see below).
(b) Positive selection and the McDonald–Kreitman (MK) test
There have been several approaches to testing the neutral model against the alternative hypothesis that many protein sequence variants are under positive selection, causing them to become fixed by selection on a time scale that is much faster than that of genetic drift. Perhaps the most successful has been the MK test (McDonald & Kreitman, Reference McDonald and Kreitman1991) and its extensions. Assume that the same length of sequence is used for both polymorphism and divergence estimates, so that the total numbers of mutable sites are the same in both cases. Under the null hypothesis that all types of sequence variants are neutral, the results quoted just above imply that the expected numbers of synonymous and non-synonymous differences in a coding sequence are proportional to the mutation rates for the two classes of variants, for both between-species differences and within-species polymorphisms.
Selection on the protein sequence produces a departure from this proportionality, which can be tested for by a simple 2×2 contingency table (but see Andolfatto, Reference Andolfatto2008). If some amino acid substitutions that distinguish a pair of species have been fixed by relatively strong directional (‘positive’) selection, they make little or no contribution to variation within a species, but increase the between-species divergence relative to its neutral expectation. The ratio of non-synonymous to synonymous differences will then be elevated for the between-species comparison relative to the within-species comparison.
The MK test also provides a way of estimating the proportion of fixed differences between species at non-synonymous sites that were caused by positive selection, as opposed to genetic drift (commonly denoted by α) – it makes intuitive sense that the greater the excess of the between-species non-synonymous to synonymous ratio over the within-species ratio, the larger this proportion. This intuition can be put into a more precise mathematical framework, and various methods for estimating α from MK tables for sets of loci have been developed (Fay et al., Reference Fay, Wykhoff and Wu2002; Smith & Eyre-Walker, Reference Smith and Eyre-Walker2002; Welch, Reference Welch2006).
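As an illustration, a minimal sketch of the 2×2 test and of the simple estimator α = 1 − (D_s P_n)/(D_n P_s) (in the spirit of Smith & Eyre-Walker, 2002) is given below; the counts are invented for illustration, not taken from any real data set:

```python
import math

def mk_chi2(Dn, Ds, Pn, Ps):
    """Pearson chi-square (1 df) for the 2x2 MK table of fixed differences
    (Dn, Ds) v. polymorphisms (Pn, Ps), non-synonymous v. synonymous."""
    table = [[Dn, Ds], [Pn, Ps]]
    total = Dn + Ds + Pn + Ps
    chi2 = 0.0
    for r in range(2):
        for c in range(2):
            expected = (sum(table[r]) * sum(row[c] for row in table)) / total
            chi2 += (table[r][c] - expected) ** 2 / expected
    p_value = math.erfc(math.sqrt(chi2 / 2))  # upper tail of chi-square, 1 df
    return chi2, p_value

def alpha(Dn, Ds, Pn, Ps):
    """Simple point estimate of the fraction of non-synonymous fixations
    driven by positive selection."""
    return 1.0 - (Ds * Pn) / (Dn * Ps)

# Invented counts showing an excess of non-synonymous fixed differences:
chi2, p = mk_chi2(Dn=20, Ds=30, Pn=5, Ps=45)
a = alpha(Dn=20, Ds=30, Pn=5, Ps=45)
```

With these counts the table departs strongly from proportionality and the point estimate of α is large; in practice, multi-locus methods pool information across genes and correct for slightly deleterious polymorphisms, which bias this simple estimator.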
Several different multi-locus surveys of DNA sequence variability in Drosophila, combined with estimates of divergence between two closely related species, have consistently suggested α values between 0·25 and 0·70, implying that a sizeable proportion of non-synonymous fixed differences are the result of positive selection (Eyre-Walker, Reference Eyre-Walker2006; Haddrill et al., Reference Haddrill, Loewe and Charlesworth2010). Similarly, data from multiple genome sequences of Escherichia coli and Salmonella typhimurium/enterica suggested an α of about 50% (Charlesworth & Eyre-Walker, Reference Charlesworth and Eyre-Walker2006). In contrast, human polymorphism data and divergence data have yielded little evidence for positive selection by MK-based methods (e.g. Zhang & Li, Reference Zhang and Li2005). In Drosophila, there is evidence that certain categories of genes, especially those involved in male reproductive functions and immunity, may have unusually high rates of protein sequence evolution and high α values (Baines et al., Reference Baines, Sawyer, Hartl and Parsch2008; Obbard et al., Reference Obbard, Welch, Kim and Jiggins2009). Future large-scale studies of genome-wide variability should lead to more estimates of α for different categories of genes in a wide variety of species (for a recent application to a flowering plant species, see Slotte et al., Reference Slotte, Foxe, Hazzouri and Wright2010).
(c) Testing for selective sweeps
Another widely used approach applies the principle of the hitchhiking effect (Maynard Smith & Haigh, Reference Maynard Smith and Haigh1974), in order to detect evolutionarily recent selective events. A selectively favourable mutation that arises as a unique event, and then spreads to fixation, will cause closely linked variants (present on the chromosome in which it arose) to become fixed along with it, resulting in a ‘selective sweep’ (Berry et al., Reference Berry, Ajioka and Kreitman1991). This leads to a signature of reduced variation at linked neutral sites in a region surrounding the target of selection, provided that the ratio of their frequency of recombination with the site under selection, r, to the selective advantage of the mutation, s, is such that r/s≪1 (Maynard Smith & Haigh, Reference Maynard Smith and Haigh1974), and the sweep is sufficiently recent that variability has not been restored to its equilibrium level under drift and mutation. A sweep also causes a distortion of the SFS at sites near the target of selection, in favour of variants at extreme frequencies.
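The dependence on r/s can be illustrated with a crude ‘star phylogeny’ approximation, under which a neutral lineage at recombination fraction r from the selected site escapes a hard sweep with probability of roughly 1 − exp(−(r/s) ln 2N), and diversity just after fixation is reduced roughly in proportion to that probability. This is only a heuristic sketch, not the full hitchhiking theory, and the parameter values are illustrative:

```python
import math

# Rough star-phylogeny approximation for a hard selective sweep:
# the chance that a linked neutral lineage recombines off the sweeping
# background before fixation. Heuristic only; parameter values assumed.
def escape_probability(r, s, N):
    return 1.0 - math.exp(-(r / s) * math.log(2 * N))

N = 1_000_000   # population size (assumed)
s = 0.01        # selective advantage of the favoured mutation (assumed)
reduction = {ratio: escape_probability(ratio * s, s, N)
             for ratio in (0.001, 0.01, 0.1, 1.0)}
# r/s << 1: escape probability near 0, so almost all diversity is swept away;
# r/s ~ 1: escape probability near 1, so the sweep has little local effect.
```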
To detect such effects, a variety of tests have been proposed for evaluating the statistical significance of an observed reduction in variability, and the departure from the expected distribution of neutral variant frequencies in a sample from the population (the so-called site SFS), as well as other statistics such as levels of linkage disequilibrium (e.g. Hudson et al., 1987; Braverman et al., 1995; Simonsen et al., 1995; Fay & Wu, 2000; Harr et al., 2002; Kim & Stephan, 2002; Jensen et al., 2005; Zeng et al., 2007; Boitard et al., 2009), or by comparing the typical genome-wide SFS with the SFS for candidates for selective sweeps (e.g. Nielsen et al., 2005). Several of these methods also include means of jointly estimating demographic changes and the locations of selective sweeps. These approaches have been successfully applied to the detection of sweeps, and to determining more or less precisely the location of the target of selection, especially in humans and Drosophila (for overviews, see Williamson et al., 2007; Stephan, 2010).
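Among the SFS-based statistics cited above, Tajima's D is perhaps the most familiar; it contrasts two estimators of the scaled mutation rate, the mean pairwise diversity π and Watterson's S/a1, with negative values indicating the excess of rare variants expected after a sweep. A minimal sketch of the statistic (following Tajima, 1989), with hypothetical input values:

```python
import math

def tajimas_d(n, S, pi):
    """n: number of sequences sampled; S: number of segregating sites;
    pi: mean number of pairwise differences. Returns Tajima's D."""
    a1 = sum(1.0 / i for i in range(1, n))
    a2 = sum(1.0 / i ** 2 for i in range(1, n))
    b1 = (n + 1) / (3.0 * (n - 1))
    b2 = 2.0 * (n ** 2 + n + 3) / (9.0 * n * (n - 1))
    c1 = b1 - 1.0 / a1
    c2 = b2 - (n + 2) / (a1 * n) + a2 / a1 ** 2
    e1 = c1 / a1
    e2 = c2 / (a1 ** 2 + a2)
    return (pi - S / a1) / math.sqrt(e1 * S + e2 * S * (S - 1))

# Hypothetical sample: 10 sequences, 16 segregating sites, pi = 3.0;
# pi well below S/a1 gives a negative D, as expected after a recent sweep.
print(tajimas_d(10, 16, 3.0))
```

In practice significance is assessed against neutral coalescent simulations, since D is sensitive to demography as well as selection.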
(d) Testing for balancing selection
If a new, selectively favourable mutation does not spread to fixation, but instead is subject to balancing selection, or its selective advantage is restricted to certain local populations, then the haplotype associated with the new variant will initially be present at an intermediate frequency, with a high level of linkage disequilibrium with respect to variants at surrounding sites. Before recombination has had the opportunity to introduce variants from haplotypes that lack the favoured variant, those that carry it will therefore show a low level of genetic diversity at linked sites. As time goes on, recombination whittles this effect away, so the magnitude of reduction in diversity among haplotypes carrying the new variant depends jointly on the time since it originated and the rate of recombination. Several methods have been developed for assessing the statistical significance of a region of ‘extended homozygosity’, associated with the relatively recent spread of a selectively favoured variant to an intermediate frequency; these methods have also been successfully used to locate the targets of selection (e.g. Hudson et al., 1994; Voight et al., 2005; Sabeti et al., 2007; Coop et al., 2009).
If two variants at a site, A1 and A2, are maintained by balancing selection for a long period of time, substantially greater than their expected coalescence time under neutrality (2Ne generations), then a rather different pattern of variability at linked neutral sites is expected to develop. The flow of variants at a neutral site by recombination between chromosomes carrying A1 and A2 is similar to migration between different demes, and takes place at a rate proportional to r, the recombination frequency between the neutral and selected sites. Eventually, drift, mutation and recombination will come into equilibrium, in a way similar to that for a spatially rather than genetically subdivided population. High equilibrium levels of differentiation between the A1 and A2 haplotypes would thus be expected only at closely linked neutral sites (i.e. in the situation equivalent to low migration). This produces a local peak in neutral diversity around the target of selection, which declines over a genetic distance of the order of r = 1/Ne; the SFS for variants close to the target of selection is distorted in favour of intermediate-frequency variants (Hudson & Kaplan, 1988; Hudson, 1990; Nordborg, 1997; Navarro & Barton, 2002; Barton & Etheridge, 2004).
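The scale r = 1/Ne implies that such diversity peaks are expected to be very narrow. A back-of-envelope sketch with illustrative (assumed) values for Ne and the per-base-pair recombination rate:

```python
# Illustrative assumptions, not values from the text:
Ne = 1_000_000       # effective population size
r_per_bp = 1e-8      # recombination rate per base pair per generation

# Distance at which r reaches 1/Ne, the scale over which the peak decays:
peak_bp = 1.0 / (Ne * r_per_bp)
print(peak_bp)  # 100.0 -> only sites within ~100 bp retain the signature
```

Under these assumptions the footprint of long-term balancing selection spans only on the order of a hundred base pairs, one reason such signatures are easy to miss in genome scans.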
Such signatures of long-term balancing selection can be seen in the classic cases of the mammalian MHC genes (Shiina et al., 2006) and the self-incompatibility (SI) genes of plants (Kamau et al., 2007). They are sometimes incorrectly referred to as ‘hitchhiking effects’ (e.g. Shiina et al., 2006), but it should be clear from the above description that no hitchhiking is involved, in the sense of changes in frequencies of neutral variants associated with the selectively driven change in frequency of a linked variant. However, genome-wide scans of human populations suggest that there are few cases of long-term balancing selection of this kind, although some exceptions have been detected that represent less than 1% of genes surveyed (Bubb et al., 2006; Andres et al., 2009).
(e) Testing for local selection
Somewhat similar principles apply to the detection of differences between populations. Both the recent spread of a mutation that has failed to go to fixation in all local populations of a species (Slatkin & Wiehe, 1998), and the long-term maintenance of variants that are subject to opposing selection pressures in different locations (Charlesworth et al., 1997), will produce excess divergence between populations at linked neutral sites, compared with the genome-wide average. Systematic surveys for effects of this kind have been carried out in human populations, for example, and a large number of candidate genes involved in such selective differentiation between populations have been detected (Akey, 2009; Coop et al., 2009; Hancock et al., 2010).
(f) Estimation of the distribution of mutational effects on fitness
A different question about selection involves the nature of the probability distribution of the selection coefficients against new amino-acid mutations. This problem is of considerable importance for numerous issues in evolutionary genetics, including the extent to which selection against deleterious mutations affects evolution at closely linked neutral or weakly selected sites (the process of ‘background selection’: Charlesworth et al., 1993), and the evolutionary significance of genetic recombination (Barton, 2010).
Two main methods have been developed for estimating the parameters of this distribution. One involves the comparison of the SFSs for non-synonymous variants and putatively neutral variants, such as synonymous or intron variants, fitting these to models that allow for a distribution, ɸ(s), of selection coefficients against heterozygous non-synonymous new mutations. These methods assume distributions with two parameters, such as the normal, log-normal or gamma distributions, and some of them correct for the effects of demographic changes on the SFSs (Piganeau & Eyre-Walker, 2003; Eyre-Walker et al., 2006; Sawyer et al., 2007; Keightley & Eyre-Walker, 2007; Boyko et al., 2008). The other method relies on a comparison between two species with very different effective population sizes, as indicated by their levels of synonymous variability, which are assumed to be close to neutrality. The extent to which they also differ in their levels of non-synonymous variability reflects the nature of ɸ(s) (Loewe et al., 2006; Haddrill et al., 2010).
The results of these studies of both human and Drosophila populations suggest a wide and highly skewed distribution of the selection coefficient s, with most mutations being very weakly selected but with a long tail of much more strongly selected mutations, some even being effectively lethal. The mean of s is hard to estimate, but seems likely to be of the order of a few percent in the case of humans, and possibly an order of magnitude less for Drosophila. The mean selection coefficient against segregating mutations is, however, relatively small, so that their mean Nes is of the order of 10 or less. A relatively small proportion of new mutations seem to be nearly neutral, with less than 10% having Nes < 0·5. Mutations more strongly selected than this behave essentially deterministically, as far as their level of nucleotide site diversity is concerned (McVean & Charlesworth, 1999).
Knowledge of ɸ(s) also allows estimation of the overall probability of fixation of a new non-synonymous variant, enabling predictions to be made of the expected amount of non-synonymous divergence between species (Loewe et al., 2006; Boyko et al., 2008; Eyre-Walker & Keightley, 2009), and hence of the value of α, by a different approach from the MK test (see section 3(ii)(b)). Implementation of this method in Drosophila and humans has yielded similar estimates to those from the MK test, with α in the region of 50% for Drosophila, but only 10% in humans (Boyko et al., 2008; Eyre-Walker & Keightley, 2009; Haddrill et al., 2010).
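The diffusion-theory result underlying these fixation-based predictions can be sketched directly. For a semidominant mutation, the probability of fixation relative to a neutral variant is approximately S/(1 − e^(−S)), where S = 4Nes (Kimura, 1962); evaluating it at a few scaled selection coefficients shows why mutations with |Nes| below about 0·5 behave almost neutrally, while strongly selected ones behave nearly deterministically:

```python
import math

def relative_fixation_rate(S):
    """Fixation probability of a new semidominant mutation, relative to a
    neutral one, as a function of the scaled selection coefficient S = 4*Ne*s
    (negative S = deleterious). Diffusion approximation (Kimura, 1962)."""
    if S == 0:
        return 1.0  # neutral limit
    return S / (1.0 - math.exp(-S))

for S in (-10.0, -1.0, -0.5, 0.5, 10.0):
    print(S, relative_fixation_rate(S))
# |S| around 1 barely differs from neutrality, whereas S = -10 makes
# fixation roughly 2000-fold rarer than for a neutral mutation.
```

Integrating this rate over an estimated ɸ(s) gives the expected deleterious contribution to divergence, and the shortfall relative to observed divergence is what yields α in this approach.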
(iii) Selection on non-coding variants
It was assumed for some time by molecular population geneticists that most of the conveniently studied non-coding sequences, such as introns and untranslated transcribed regions (UTRs), would be under weaker selection than even synonymous sites, so that introns, for example, would provide a useful neutral proxy for use in MK tests, and in estimating demographic parameters.
A number of surveys of Drosophila populations using sets of approximately 500 bp intron sequences were conducted for the latter purpose (Haddrill et al., 2005a, b; Ometto et al., 2005; Hutter et al., 2007). It quickly became apparent, by comparing their interspecies sequence divergence with that of synonymous sites, that such large (for Drosophila) intron sequences are under substantially higher levels of selective constraint than synonymous sites (Haddrill et al., 2005a, b; Halligan & Keightley, 2006).
This has shifted attention to estimating the extent of positive and purifying selection on non-coding sequences. Many of the tests for selection on non-synonymous mutations can be applied to different types of non-coding sequence, and no major new principles are needed for this purpose. Overall, it seems that sites in short introns (length <100 bp or so) in Drosophila are nearly neutral (Parsch et al., 2010), whereas sites in long introns are subject to predominantly purifying selection, and UTRs are subject to both purifying and positive selection (Andolfatto, 2005; Haddrill et al., 2008). Similar studies of human non-coding sequences also provide evidence for both purifying and positive selection (e.g. The ENCODE Project Consortium, 2007).
However, when studying evolution and variation in non-coding sequences, it is important to take into account the phenomenon of biased gene conversion (BGC). This refers to an excess, over 50%, of one of the two variants at a heterozygous nucleotide site among the products of meiosis, associated with the formation of heteroduplex DNA (Marais, 2003). It is often associated with heterozygosity for a GC base pair and an AT base pair, with a bias in the direction of an excess of the GC variant, in which case the process is referred to as GC-biased gene conversion (gBGC). The net effect is similar to positive selection (Gutz & Leslie, 1976), so that gBGC causes a higher probability of fixation of the GC variant relative to neutrality, and a lower probability of fixation of the AT variant. Selection on GC versus AT variants is thus confounded with the effect of gBGC (Galtier & Duret, 2007).
(iv) Selection on synonymous variants
The fact that synonymous mutations are under weaker selective constraints than many types of non-coding sequences does not imply that they are completely neutral. Since the 1980s, evidence has accumulated that codon usage bias in many species reflects the action of natural selection, most probably involving translational efficiency or accuracy (Ikemura, 1982; Sharp & Li, 1986; Drummond & Wilke, 2008; Sharp et al., 2010). A variety of methods have been developed for using data on the population frequencies of alternative synonymous variants, which correspond to codons that have been identified from codon usage studies as ‘preferred’ versus ‘non-preferred’ (i.e. codons that are over- versus under-represented in genes with biased codon usage). If large numbers of polymorphic synonymous sites are available, model fits can be applied to estimate the intensity of selection, usually expressed in terms of the product of Ne and the selection coefficient in favour of a preferred synonymous variant at a site (Hartl et al., 1994; Akashi, 1995, 1999; Maside et al., 2004; Comeron & Guthrie, 2005; Cutter & Charlesworth, 2006).
The most recently developed methods also include fits to models of population size changes, which can significantly bias estimates of selection if they are ignored (Zeng & Charlesworth, 2009, 2010; Zeng, 2010). Conversely, apparent evidence for population expansion that is obtained under the assumption of neutrality at synonymous sites may disappear if selection is taken into account, as in the case of the Zimbabwe population of D. melanogaster (Zeng & Charlesworth, 2009). In several Drosophila species, evidence for Nes values for preferred versus unpreferred codons in the region of 0·5 has been obtained by these methods. Given that the effective sizes of the species concerned are in the millions, selection coefficients of the order of 10^−7 to 10^−6 are being detected, which of course is far below the resolution of any experimental approach.
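The selection coefficients quoted here follow from simple arithmetic, and the population-frequency modelling itself is usually based on the Li–Bulmer fixed-site model of codon usage. A sketch under explicit assumptions (an illustrative Ne of two million, no mutational bias, and the diploid scaling γ = 4Nes):

```python
import math

def preferred_codon_freq(gamma, kappa=1.0):
    """Li-Bulmer equilibrium frequency of the preferred codon under
    mutation-selection-drift balance; gamma = 4*Ne*s is the scaled selection
    coefficient and kappa the mutational bias towards the unpreferred codon
    (kappa = 1, i.e. no bias, is an assumption made here for simplicity)."""
    return 1.0 / (1.0 + kappa * math.exp(-gamma))

Ne_s = 0.5                 # typical estimate quoted in the text
Ne = 2_000_000             # assumed effective size "in the millions"
s = Ne_s / Ne
print(s)                   # 2.5e-07, inside the 10^-7 to 10^-6 range
print(preferred_codon_freq(4 * Ne_s))  # ~0.88 usage of the preferred codon
```

With γ = 0 the preferred codon sits at 50% usage, so even this very weak scaled selection produces a clearly detectable bias.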
With sufficiently large data sets, it has been possible to examine the relationships between Nes estimates obtained in this way and factors such as coding sequence length and gene expression level, which are known to be related to codon usage. In Drosophila, these studies have shown that Nes is significantly lower in longer genes than in shorter ones, in genes with low rather than high expression, in the middle of genes compared with their ends, and on the X chromosome compared with the autosomes (Comeron & Guthrie, 2005; Zeng & Charlesworth, 2009, 2010). Evidence is accumulating that selection also acts on synonymous sites in human populations (Comeron, 2006; Kondrashov et al., 2006), although factors other than translational efficiency are likely to be involved, notably selection on mutations affecting exon splice sites and nucleosome positioning (Parmley & Hurst, 2007). With the advent of genome-wide resequencing data, it should shortly prove possible to greatly extend these types of analyses.
The possible role of BGC (see section 3(iii) above) in creating apparent selection on non-coding and synonymous sites can also be investigated by population genetics methods applied to large data sets. Since preferred codons in Drosophila and E. coli mostly end in G or C, and most synonymous changes involve third coding positions, the effects of gBGC and of selection on codon usage may be confounded to a considerable extent, so that estimates of Nes will reflect both forces. By estimating the intensity of apparent selection in favour of GC over AT base pairs at intron sites in the genes that are also used for estimating Nes for synonymous sites, the true intensity of selection on synonymous sites can in principle be estimated. Studies of this kind in Drosophila suggest that the equivalent of Nes for gBGC (Neω) is on average about one-quarter of the typical values for synonymous sites (Zeng & Charlesworth, 2010), so that most of the apparent selection on the latter probably reflects selection in favour of preferred codons.
Nevertheless, gBGC appears to be a significant factor affecting the base composition of the genome, especially at non-coding sites, and may even affect evolution at non-synonymous sites in ‘hotspots’ of unusually high recombination frequency (Ratnakumar et al., 2010). In particular, regions of the genome with high GC content seem to have higher Neω values than regions with lower GC content, in both humans (Duret & Arndt, 2008) and Drosophila (Galtier et al., 2006; Haddrill & Charlesworth, 2008). There is, however, a paradox: if Neω is much smaller than Nes for preferred codons, why do non-coding sequences in Drosophila often show much larger selective constraints than synonymous sites (see section 3(iii) above)? The answer must lie in selective pressures on non-coding sequences that are unrelated to base composition or gBGC; a recent analysis of polymorphism data in D. melanogaster has provided evidence for such an effect (Zeng & Charlesworth, 2010).
4. Conclusions and broader implications
The advent of resequencing studies of whole genomes will greatly increase the amount of data available, and the power of methods of inference concerning the forces involved in DNA sequence variation and evolution. Almost certainly, it will stimulate the development of ever-more sophisticated and computationally demanding methods of data interpretation. It is thus likely that many details of the results described above will be substantially revised in the not-too-distant future. Nonetheless, I am optimistic that their broad outlines will turn out to be roughly correct.
What, then, have we discovered as a result of these efforts, compared, say, with the state of the field in 1983, when Kimura summed up his views on molecular evolution and variation (Kimura, 1983)? One feature that stands out is that the neutral theory is now regarded mainly as a null model, against which alternatives such as selection and BGC can be tested. We now have fairly solid evidence that a substantial fraction of non-synonymous differences between species, and of certain types of non-coding differences, have been caused by positive selection, at least for species with large effective population sizes such as bacteria, Drosophila and some plants, but less so for humans. This may reflect a greater contribution in humans from the fixation by drift of slightly deleterious mutations, due to their low effective population size (Eyre-Walker et al., 2002), and does not necessarily imply an overall lower rate of fixation per gene of favourable mutations.
We also have some confidence that most new non-synonymous mutations are sufficiently strongly selected against that they have little chance of fixation by drift; nevertheless, the mean selection coefficient against an amino-acid mutation that is segregating in a population is very small, of the order of 10^−3 or less in the case of humans. This generates the perhaps startling conclusion that individuals in populations of outbred organisms typically carry large numbers of deleterious amino-acid variants that are effectively maintained by a balance between mutation and selection, of the order of 800 for humans (Eyre-Walker et al., 2006; Kryukov et al., 2007) and 4500 for Drosophila (Haddrill et al., 2010). As several authors have pointed out (Pritchard, 2001; Wright et al., 2003; Eyre-Walker, 2006, 2010; Kryukov et al., 2007), the presence of such a large number of low-frequency deleterious mutations in populations creates a substantial variance in fitness and in the traits that they influence, even if the fitness effects of individual mutations are small. The existence of such a variance has long been inferred from studies of mutational effects, concealed variation and the genetic variance in fitness components of Drosophila (Simmons & Crow, 1977; Lynch et al., 1999; Charlesworth & Hughes, 2000), but this evidence was largely overlooked by the human genetics community.
This suggests that much human genetic susceptibility to disease may reflect the effects of rare mutations with minor phenotypic effects. It follows that even very large-scale searches for associations between genetic markers and diseases may uncover only the (possibly small) portion of the total variability that is caused by common variants of major effect. There is indeed increasing evidence that many common diseases with a strong genetic component are caused by large numbers of low-frequency variants, both non-coding and non-synonymous (Kryukov et al., 2007; Eyre-Walker, 2010). Ironically, therefore, the classical view of the maintenance of genetic variation affecting fitness-related traits (Muller, 1950) has been partially vindicated.
The prevalence of so much selection across the genome raises many questions. I will only discuss two. The first is the classic one of how a species with a large genome and a relatively low maximal reproductive rate, such as humans, withstands the resulting very high genetic load arising from the constant input of deleterious mutations, at a rate substantially greater than one new mutation per generation (Muller, 1950; Crow, 1997). As Alexey Kondrashov once put it, why have we not died 100 times over (Kondrashov, 1995)? It remains to be determined how serious this problem actually is, once we get a better idea of how much of the non-coding DNA is under selection. If it is as serious as seems likely to be the case, then Kondrashov's proposal that it can only be resolved by some form of quasi-truncation selection needs to be thoroughly examined. The related question of the long-term consequences of relaxing selection against deleterious mutations by medical intervention against human genetic disease also needs attention (Crow, 1997; Lynch, 2010).
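The arithmetic behind Kondrashov's question is stark. Under multiplicative fitness effects, mean fitness at mutation-selection balance is e^(−U), independent of the individual selection coefficients (Haldane's principle); the deleterious mutation rates U tried below are illustrative only:

```python
import math

def mean_fitness(U):
    """Mean population fitness at mutation-selection balance under
    multiplicative selection across loci; U = deleterious mutation rate
    per diploid genome per generation (Haldane's principle)."""
    return math.exp(-U)

for U in (0.5, 2.0, 5.0):
    print(U, mean_fitness(U))
# U well above 1 implies a load that seems incompatible with low-fecundity
# species unless selection is quasi-truncating, as Kondrashov argued.
```

A value of U = 2, for example, implies mean fitness of roughly 0·14 relative to a mutation-free genotype, which is hard to reconcile with human reproductive capacity without epistatic (quasi-truncation) selection.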
The second question concerns the extent to which selection, both purifying and positive, at a multiplicity of sites across the genome has effects on variation and adaptation at nearby sites, as a result of Hill–Robertson (HR) interference (Hill & Robertson, 1966). Selection creates heritable variance in fitness among individuals, which reduces Ne. A site that is linked to a selected variant experiences an especially marked reduction in its Ne, because close linkage maintains the effects for many generations (Hill & Robertson, 1966; Comeron et al., 2008; Barton, 2010). In addition to reducing levels of variability, this reduction in Ne impairs the efficacy of selection, since the chance of fixation of a mutation depends on the product of Ne and its selection coefficient. Both selective sweeps and background selection constitute forms of HR interference, which almost certainly accounts for the correlation between the local recombination rate and the level of silent sequence diversity in Drosophila noted in section 2(ii) (see also Presgraves, 2005; Shapiro et al., 2007), since interference is less likely when recombination rates are high. There is increasing evidence for such an effect in other taxa, including humans and Caenorhabditis (Cai et al., 2009; Cutter & Choi, 2010; Rockman et al., 2010).
Similarly, the sharply reduced level of codon usage bias and the accelerated rate of protein sequence evolution due to relaxed purifying selection, which are observed in low-recombination regions of the Drosophila genome, are consistent with HR interference (Arguello et al., 2010; Charlesworth et al., 2010).
Almost certainly, patterns of evolution and variation in organisms with low effective recombination rates, such as bacteria with their limited rates of genetic exchange among individuals, or highly homozygous selfing species such as budding yeasts and C. elegans, will be subject to strong HR effects, yet this has scarcely been explored in studies of their population genomics. There is also an important question about the effect of HR interference in regions of the genome with ‘normal’ rates of recombination in outbred species. There are empirical indications that such effects exist, such as the negative relation between the rate of protein sequence evolution of a gene and its level of silent-site diversity in Drosophila (Sella et al., 2009), reduced diversity near coding sequences in humans (Cai et al., 2009; McVicker et al., 2009; Hammer et al., 2010), and the fact that codon usage bias in D. melanogaster is lower in the middle of genes and in genes that lack introns (Comeron & Kreitman, 2002). The first of these observations is most easily explained by selective sweeps; the others may well involve background selection effects (Loewe & Charlesworth, 2007; McVicker et al., 2009). In either case, it seems clear that a full understanding of the patterns revealed by genome-level studies will require inclusion of the joint effects of selection and linkage, which have so far largely been ignored in the modelling machinery used for inference.