eQTL mapping

doi:10.1017/CBO9781107337459.016

14 - eQTL mapping

from Part III - Single nucleotide polymorphisms, copy number variants, haplotypes and eQTLs

Published online by Cambridge University Press: 18 December 2015

Mengjie Chen, Can Yang, Cong Li and Hongyu Zhao

Edited by Krishnarao Appasani

Foreword by Stephen W. Scherer and Peter M. Visscher

Mengjie Chen: Yale University
Can Yang: Hong Kong Baptist University
Cong Li: Yale University
Hongyu Zhao: Yale University
Krishnarao Appasani: GeneExpression Systems, Inc., Massachusetts
Stephen W. Scherer: University of Toronto
Peter M. Visscher: University of Queensland

Summary

Introduction

With an influx of successful genome-wide association studies identifying genetic variants associated with complex diseases, an unprecedented wealth of knowledge has accumulated on SNP–phenotype associations (McCarthy et al., 2008; Witte 2010; Manolio 2013). However, many SNP–disease associations do not lend themselves to molecular interpretation, because many of the identified loci lie outside coding regions. Even when a gene can be inferred to be causal, there is often a significant gap in understanding the underlying molecular mechanisms (Schadt et al., 2005; McCarthy et al., 2008). Genome-wide expression quantitative trait locus (eQTL) mapping has been one effective approach to bridging this gap (Mackay et al., 2009). In eQTL studies, gene expression levels measured by high-throughput technologies, such as microarrays and RNA-Seq, are treated as quantitative traits. Marker genotypes are collected from the same set of individuals, and statistical analyses are performed to detect associations between markers and expression traits. By simultaneously capturing many regulatory interactions, eQTLs offer valuable insights into the genetic architecture of expression regulation (Rockman and Kruglyak 2006). The ultimate goal of eQTL studies is to elucidate how genetic variation affects phenotypes by using gene expression levels as intermediate molecular phenotypes (Nica and Dermitzakis 2008). In this chapter, we provide an overview of the eQTL analysis workflow (Figure 14.1), introduce publicly available tools for analysis, and discuss challenges and issues.
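As a minimal sketch of the association step, the simplest analysis regresses one expression trait on the allele dosage (0, 1 or 2 copies of one allele) at one marker, and repeats this over all marker–trait pairs. The function name, the additive dosage coding and the choice of ordinary least squares are illustrative assumptions, not the method of any particular tool discussed in this chapter.

```python
import math


def marker_trait_association(dosages, expression):
    """Regress one expression trait on one marker by ordinary least squares.

    dosages:    allele counts per individual (0, 1 or 2), assumed additive coding
    expression: expression level of one gene for the same individuals
    Returns (slope, t_statistic) for the marker effect.
    """
    n = len(dosages)
    if n != len(expression) or n < 3:
        raise ValueError("need matched samples from at least three individuals")

    mean_g = sum(dosages) / n
    mean_e = sum(expression) / n
    sxx = sum((g - mean_g) ** 2 for g in dosages)
    if sxx == 0:
        raise ValueError("marker is monomorphic in this sample")
    sxy = sum((g - mean_g) * (e - mean_e) for g, e in zip(dosages, expression))

    slope = sxy / sxx                      # estimated expression change per allele
    intercept = mean_e - slope * mean_g
    rss = sum((e - intercept - slope * g) ** 2
              for g, e in zip(dosages, expression))
    std_err = math.sqrt(rss / (n - 2) / sxx)  # standard error of the slope
    return slope, slope / std_err


# Toy data: expression rises by about one unit per allele copy.
slope, t_stat = marker_trait_association(
    [0, 0, 1, 1, 2, 2], [1.0, 1.2, 2.0, 2.2, 3.0, 3.2])
```

A large absolute t-statistic suggests an association between the marker and the trait. In a genome-wide scan the resulting P-values would still need correction for the very large number of tests.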

Data pre-processing

Genome-wide eQTL mapping considers high-density SNP genotype data and gene expression data from the same individuals in a segregating population. Both require appropriate pre-processing, as described below, before subsequent analysis.

Genotype data

Three quality control (QC) criteria are often used in pre-processing genotype data. (1) Missing rate: individuals with a large proportion of missing SNP genotypes (e.g., more than 10%) should be excluded because their DNA samples may be of poor quality. SNPs with a high missing rate (e.g., more than 5%) should also be filtered out. (2) Hardy–Weinberg equilibrium (HWE): statistically significant deviations from HWE often result from genotyping errors. Therefore, SNPs that fail an exact HWE test (e.g., a P-value less than 0.001) should be filtered out. This criterion does not apply to haploid organisms, such as yeast. (3) Minor allele frequency (MAF): SNPs with low MAF (e.g., below 0.05) are sometimes filtered out because studies with a relatively small sample size have insufficient statistical power for such SNPs and because their genotype calls are potentially more error-prone.
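The three criteria can be sketched as a small filter over a genotype matrix. This is an illustration under stated assumptions rather than a definitive implementation: genotypes are assumed diploid and coded as allele dosages 0, 1 or 2 with `None` for a missing call, individuals are filtered before SNPs, and a 1-df chi-square goodness-of-fit test stands in for the exact HWE test mentioned above. All function and parameter names are hypothetical.

```python
import math

MISSING = None  # assumed code for a missing genotype call


def missing_rate(calls):
    """Fraction of missing calls (1.0 for an empty list)."""
    if not calls:
        return 1.0
    return sum(g is MISSING for g in calls) / len(calls)


def minor_allele_frequency(calls):
    """MAF from diploid dosages, ignoring missing calls."""
    observed = [g for g in calls if g is not MISSING]
    if not observed:
        return 0.0
    freq = sum(observed) / (2 * len(observed))
    return min(freq, 1.0 - freq)


def hwe_chisq_pvalue(calls):
    """Chi-square (1 df) HWE test; a simpler stand-in for the exact test."""
    observed = [g for g in calls if g is not MISSING]
    n = len(observed)
    if n == 0:
        return 1.0
    counts = [observed.count(k) for k in (0, 1, 2)]
    p = (counts[1] + 2 * counts[2]) / (2 * n)   # frequency of the counted allele
    q = 1.0 - p
    expected = [n * q * q, 2 * n * p * q, n * p * p]
    if min(expected) == 0.0:
        return 1.0  # monomorphic SNP: the test is uninformative
    stat = sum((o - e) ** 2 / e for o, e in zip(counts, expected))
    return math.erfc(math.sqrt(stat / 2.0))     # upper tail of chi-square, 1 df


def qc_filter(genotypes, max_ind_missing=0.10, max_snp_missing=0.05,
              hwe_p=0.001, min_maf=0.05):
    """Return (kept individual indices, kept SNP indices).

    genotypes: list of rows, one row per individual, one column per SNP.
    Default thresholds follow the example values quoted in the text.
    """
    kept_ind = [i for i, row in enumerate(genotypes)
                if missing_rate(row) <= max_ind_missing]
    kept_snp = []
    for j in range(len(genotypes[0])):
        column = [genotypes[i][j] for i in kept_ind]
        if missing_rate(column) > max_snp_missing:
            continue  # criterion 1: too many missing calls
        if hwe_chisq_pvalue(column) < hwe_p:
            continue  # criterion 2: significant deviation from HWE
        if minor_allele_frequency(column) < min_maf:
            continue  # criterion 3: low minor allele frequency
        kept_snp.append(j)
    return kept_ind, kept_snp
```

For haploid organisms the HWE step would be skipped and the allele-frequency denominator adjusted. For real data, an established tool such as PLINK (Purcell et al., 2007) is the more usual choice.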

Type: Chapter
Information: Genome-Wide Association Studies: From Polymorphism to Personalized Medicine, pp. 208–228

DOI: https://doi.org/10.1017/CBO9781107337459.016

Publisher: Cambridge University Press

Print publication year: 2016

References

Ashburner, M., Ball, C.A., Blake, J.A., et al. (2000). Gene ontology: tool for the unification of biology. Nature Genet., 25, 25–29.

Benjamini, Y. and Hochberg, Y. (1995). Controlling the false discovery rate: a practical and powerful approach to multiple testing. J. R. Statist. Soc. Ser. B (Method.), 57, 289–300.

Bohnert, R. and Rätsch, G. (2010). rQuant.web: a tool for RNA-Seq-based transcript quantitation. Nucleic Acids Res., 38, W348–W351.

Bolstad, B.M., Irizarry, R.A., Åstrand, M. and Speed, T.P. (2003). A comparison of normalization methods for high density oligonucleotide array data based on variance and bias. Bioinformatics, 19, 185–193.

Brem, R.B., Yvert, G., Clinton, R. and Kruglyak, L. (2002). Genetic dissection of transcriptional regulation in budding yeast. Science, 296, 752–755.

Broman, K.W., Wu, H., Sen, Ś. and Churchill, G.A. (2003). R/QTL: QTL mapping in experimental crosses. Bioinformatics, 19, 889–890.

Browning, S.R. and Browning, B.L. (2007). Rapid and accurate haplotype phasing and missing-data inference for whole-genome association studies by use of localized haplotype clustering. Am. J. Hum. Genet., 81, 1084–1097.

Cai, T.T., Li, H., Liu, W. and Xie, J. (2013). Covariate-adjusted precision matrix estimation with an application in genetical genomics. Biometrika, 100, 139–156.

Carey, V.J. (2013). GGtools: Genetics of Gene Expression with Bioconductor. R package version 4.6.2.

Chen, L.S., Sangurdekar, D.P. and Storey, J.D. (2011). trigger: Transcriptional Regulatory Inference from Genetics of Gene ExpRession. R package version 1.4.0.

Chen, M., Ren, Z., Zhao, H. and Zhou, H. (2015). Asymptotic normal estimation of covariate-adjusted Gaussian graphical model. J. Am. Stat. Ass. Theory Meth. (in press).

Da Huang, W., Sherman, B.T. and Lempicki, R.A. (2008). Systematic and integrative analysis of large gene lists using DAVID bioinformatics resources. Nature Protocols, 4, 44–57.

Delaneau, O., Zagury, J.-F. and Marchini, J. (2012). Improved whole-chromosome phasing for disease and population genetic studies. Nature Meth., 10, 5–6.

Dillies, M.-A., Rau, A., Aubert, J., et al. (2013). A comprehensive evaluation of normalization methods for Illumina high-throughput RNA sequencing data analysis. Brief. Bioinform., 14, 671–683.

Dunning, M.J., Smith, M.L., Ritchie, M.E. and Tavaré, S. (2007). beadarray: R classes and methods for Illumina bead-based data. Bioinformatics, 23, 2183–2184.

Eden, E., Navon, R., Steinfeld, I., Lipson, D. and Yakhini, Z. (2009). GOrilla: a tool for discovery and visualization of enriched GO terms in ranked gene lists. BMC Bioinform., 10, 48.

Fusi, N., Stegle, O. and Lawrence, N.D. (2012). Joint modelling of confounding factors and prominent genetic regulators provides increased accuracy in genetical genomics studies. PLoS Comput. Biol., 8, e1002330.

Gagnon-Bartsch, J.A. and Speed, T.P. (2012). Using control genes to correct for unwanted variation in microarray data. Biostatistics, 13, 539–552.

Gautier, L., Cope, L., Bolstad, B.M. and Irizarry, R.A. (2004). affy – analysis of Affymetrix GeneChip data at the probe level. Bioinformatics, 20, 307–315.

Guttman, M., Garber, M., Levin, J.Z., et al. (2010). Ab initio reconstruction of cell type-specific transcriptomes in mouse reveals the conserved multi-exonic structure of lincRNAs. Nature Biotechnol., 28, 503–510.

Haley, C.S., Knott, S.A. and Elsen, J. (1994). Mapping quantitative trait loci in crosses between outbred lines using least squares. Genetics, 136, 1195–1207.

Hamel, L.-P., Nicole, M.-C., Duplessis, S. and Ellis, B.E. (2012). Mitogen-activated protein kinase signaling in plant-interacting fungi: distinct messages from conserved messengers. Plant Cell Online, 24, 1327–1351.

Johnson, W.E., Li, C. and Rabinovic, A. (2007). Adjusting batch effects in microarray expression data using empirical Bayes methods. Biostatistics, 8, 118–127.

Kanehisa, M. and Goto, S. (2000). KEGG: Kyoto Encyclopedia of Genes and Genomes. Nucleic Acids Res., 28, 27–30.

Kang, H.M., Ye, C. and Eskin, E. (2008). Accurate discovery of expression quantitative trait loci under confounding from spurious and genuine regulatory hotspots. Genetics, 180, 1909–1925.

Katz, Y., Wang, E.T., Airoldi, E.M. and Burge, C.B. (2010). Analysis and design of RNA sequencing experiments for identifying isoform regulation. Nature Meth., 7, 1009–1015.

Kim, D., Pertea, G., Trapnell, C., et al. (2013). TopHat2: accurate alignment of transcriptomes in the presence of insertions, deletions and gene fusions. Genome Biol., 14, R36.

Kim, S. and Xing, E.P. (2012). Tree-guided group lasso for multi-response regression with structured sparsity, with an application to eQTL mapping. Ann. Appl. Stat., 6, 1095–1117.

Lander, E.S. and Botstein, D. (1989). Mapping Mendelian factors underlying quantitative traits using RFLP linkage maps. Genetics, 121, 185–199.

Lee, S., Zhu, J. and Xing, E.P. (2010). Adaptive multi-task lasso: with application to eQTL detection. In Advances in neural information processing systems, pp. 1306–1314.

Leek, J.T., Scharpf, R.B., Bravo, H.C., et al. (2010). Tackling the widespread and critical impact of batch effects in high-throughput data. Nature Rev. Genet., 11, 733–739.

Li, B., Ruotti, V., Stewart, R.M., Thomson, J.A. and Dewey, C.N. (2010a). RNA-Seq gene expression estimation with read mapping uncertainty. Bioinformatics, 26, 493–500.

Li, B., Chun, H. and Zhao, H. (2012a). Sparse estimation of conditional graphical models with application to gene networks. J. Am. Statist. Ass., 107, 152–167.

Li, C. and Wong, W.H. (2001). Model-based analysis of oligonucleotide arrays: model validation, design issues and standard error application. Genome Biol., 2, 1–11.

Li, J.J., Jiang, C.-R., Brown, J.B., Huang, H. and Bickel, P.J. (2011). Sparse linear modeling of next-generation mRNA sequencing (RNA-Seq) data for isoform discovery and abundance estimation. Proc. Natl Acad. Sci. USA, 108, 19867–19872.

Li, L., Zhang, X. and Zhao, H. (2012b). eQTL. In Quantitative Trait Loci (QTL). Springer, pp. 265–279.

Li, Y., Álvarez, O.A., Gutteling, E.W., et al. (2006). Mapping determinants of gene expression plasticity by genetical genomics in C. elegans. PLoS Genet., 2, e222.

Li, Y., Willer, C.J., Ding, J., Scheet, P. and Abecasis, G.R. (2010b). MaCH: using sequence and genotype data to estimate haplotypes and unobserved genotypes. Genet. Epidemiol., 34, 816–834.

Listgarten, J., Kadie, C., Schadt, E.E. and Heckerman, D. (2010). Correction for hidden confounders in the genetic analysis of gene expression. Proc. Natl Acad. Sci. USA, 107, 16465–16470.

Mackay, T.F., Stone, E.A. and Ayroles, J.F. (2009). The genetics of quantitative traits: challenges and prospects. Nature Rev. Genet., 10, 565–577.

Manolio, T.A. (2013). Bringing genome-wide association findings into clinical use. Nature Rev. Genet., 14, 549–558.

McCarthy, M.I., Abecasis, G.R., Cardon, L.R., et al. (2008). Genome-wide association studies for complex traits: consensus, uncertainty and challenges. Nature Rev. Genet., 9, 356–369.

Michaelson, J.J., Loguercio, S. and Beyer, A. (2009). Detection and interpretation of expression quantitative trait loci (eQTL). Methods, 48, 265–276.

Mortazavi, A., Williams, B.A., McCue, K., Schaeffer, L. and Wold, B. (2008). Mapping and quantifying mammalian transcriptomes by RNA-Seq. Nature Meth., 5, 621–628.

Nica, A.C. and Dermitzakis, E.T. (2008). Using gene expression to investigate the genetic basis of complex disorders. Hum. Molec. Genet., 17, R129–R134.

Obozinski, G., Wainwright, M.J. and Jordan, M.I. (2011). Support union recovery in high-dimensional multivariate regression. Ann. Statist., 39, 1–47.

Pastinen, T., Ge, B. and Hudson, T.J. (2006). Influence of human genome polymorphism on gene expression. Hum. Molec. Genet., 15, R9–R16.

Price, A.L., Patterson, N.J., Plenge, R.M., et al. (2006). Principal components analysis corrects for stratification in genome-wide association studies. Nature Genet., 38, 904–909.

Purcell, S., Neale, B., Todd-Brown, K., et al. (2007). PLINK: a tool set for whole-genome association and population-based linkage analyses. Am. J. Hum. Genet., 81, 559–575.

Richard, H., Schulz, M.H., Sultan, M., et al. (2010). Prediction of alternative isoforms from exon expression levels in RNA-Seq experiments. Nucleic Acids Res., 38, e112–e112.

Robertson, G., Schein, J., Chiu, R., et al. (2010). De novo assembly and analysis of RNA-Seq data. Nature Meth., 7, 909–912.

Rockman, M.V. and Kruglyak, L. (2006). Genetics of global gene expression. Nature Rev. Genet., 7, 862–872.

Salzman, J., Jiang, H. and Wong, W.H. (2011). Statistical modeling of RNA-Seq data. Statist. Sci., 26, 62–83.

Schadt, E.E., Lamb, J., Yang, X., et al. (2005). An integrative genomics approach to infer causal associations between gene expression and disease. Nature Genet., 37, 710–717.

Shabalin, A.A. (2012). Matrix eQTL: ultra fast eQTL analysis via large matrix operations. Bioinformatics, 28, 1353–1358.

Simon, N., Friedman, J., Hastie, T. and Tibshirani, R. (2013). A sparse-group lasso. J. Comput. Graph. Statist., 22, 231–245.

Smyth, G.K. (2005). Limma: linear models for microarray data. In Gentleman, R., Carey, V., Dudoit, S., Irizarry, R. and Huber, W. (Eds.), Bioinformatics and Computational Biology Solutions Using R and Bioconductor. Springer, New York, NY, pp. 397–420.

Stojmirović, A. and Yu, Y.-K. (2009). ITM probe: analyzing information flow in protein networks. Bioinformatics, 25, 2447–2449.

Stojmirović, A. and Yu, Y.-K. (2012). Information flow in interaction networks II: channels, path lengths, and potentials. J. Comput. Biol., 19, 379–403.

Sun, T. and Zhang, C.-H. (2012). Scaled sparse linear regression. Biometrika, 99, 879–898.

Sun, W. and Hu, Y. (2013). eQTL mapping using RNA-seq data. Statist. Biosci., 5, 198–219.

Suthram, S., Beyer, A., Karp, R.M., Eldar, Y. and Ideker, T. (2008). eQED: an efficient method for interpreting eQTL associations using protein networks. Molec. Syst. Biol., 4, 162.

Tibshirani, R. (1996). Regression shrinkage and selection via the lasso. J. R. Statist. Soc. Ser. B (Method.), 58, 267–288.

Trapnell, C., Williams, B.A., Pertea, G., et al. (2010). Transcript assembly and quantification by RNA-seq reveals unannotated transcripts and isoform switching during cell differentiation. Nature Biotechnol., 28, 511–515.

Tu, Z., Wang, L., Arbeitman, M.N., Chen, T. and Sun, F. (2006). An integrative approach for causal gene identification and gene regulatory pathway inference. Bioinformatics, 22, e489–e496.

Verbeke, L.P., Cloots, L., Demeester, P., Fostier, J. and Marchal, K. (2013). Epsilon: an eQTL prioritization framework using similarity measures derived from local networks. Bioinformatics, 29, 1308–1316.

Voevodski, K., Teng, S.-H. and Xia, Y. (2009). Spectral affinity in protein networks. BMC Syst. Biol., 3, 112.

Wang, X., Qin, L., Zhang, H., et al. (2015). A regularized multivariate regression approach for eQTL analysis. Statist. Biosci., 7, 129–146.

Witte, J.S. (2010). Genome-wide association studies and beyond. Annu. Rev. Publ. Health, 31, 9–20.

Xia, Z., Wen, J., Chang, C.-C. and Zhou, X. (2011). NSMAP: A method for spliced isoforms identification and quantification from RNA-Seq. BMC Bioinform., 12, 162.

Yang, C., Wang, L., Zhang, S. and Zhao, H. (2013). Accounting for non-genetic factors by low-rank representation and sparse regression for eQTL mapping. Bioinformatics, 29, 1026–1034.

Yin, J. and Li, H. (2011). A sparse conditional Gaussian graphical model for analysis of genetical genomics data. Ann. Appl. Statist., 5, 2630.

Zerbino, D.R. and Birney, E. (2008). Velvet: algorithms for de novo short read assembly using de Bruijn graphs. Genome Res., 18, 821–829.

Zou, H. and Hastie, T. (2005). Regularization and variable selection via the elastic net. J. R. Statist. Soc. Ser. B (Statist. Method.), 67, 301–320.

Zou, W., Aylor, D.L. and Zeng, Z.-B. (2007). eQTL viewer: visualizing how sequence variation affects genome-wide transcription. BMC Bioinform., 8, 7.