1. Introduction
The diminishing cost of high-throughput technologies such as whole genome transcript profiling, high-density genotyping and whole genome re-sequencing has shifted the focus of genomic sciences from data production to data interpretation. Foremost among the challenges in interpretation is functional gene annotation, through experimental validation or computational prediction. Even for the best-annotated genomes, a significant proportion of genes are yet to be functionally characterized (Peña-Castillo & Hughes, 2007; Costello et al., 2009); less than half in Drosophila (Costello et al., 2009).
Most knowledge regarding gene function in eukaryotes comes from mutagenesis, single-gene knock-outs and RNA interference (RNAi) knock-down experiments performed in yeasts, Drosophila, Caenorhabditis elegans, mouse and Arabidopsis (Winzeler et al., 1999; Alonso et al., 2003; Kamath & Ahringer, 2003; Bellen et al., 2004; Dietzl et al., 2007; Ni et al., 2009; Guan et al., 2010; Spirek et al., 2010). These approaches have provided functions for a large number of genes in many organisms and the basis for making gene function predictions based on gene sequence similarities.
However, screening large mutant collections for quantitative phenotypes is highly laborious. Furthermore, unique mutations in the same gene, or the same mutation in multiple genetic backgrounds can give different phenotypes, further complicating the interpretation of such screens (Flint & Mackay, 2009; Mackay et al., 2009; Dowell et al., 2010).
Computational methods for gene annotation complement experimental approaches. Computational methods rely on the detection of particular sequence motifs (e.g. a binding domain) (Hrmova & Fincher, 2009); strong orthology with a gene of known function in a closely related species; or ‘guilt-by-association’ (Bréhélin et al., 2010). The last approach is based on correlative evidence, such as the co-regulation of gene expression or the existence of known protein–protein interactions. In all cases, the functional annotation of a known gene is transferred to its interacting or correlated partner, providing a hypothesis that can be verified experimentally.
Traditionally, guilt-by-association annotation has been used in the context of environmental perturbations (Walker et al., 1999; Reverter et al., 2008; Vandepoele et al., 2009; Klie et al., 2010). A complementary approach is to utilize natural variation in genetically correlated transcriptional networks to identify co-regulated transcripts. Previously, we used genome-wide transcript profiles from 40 lines from the Drosophila Genetic Reference Panel (DGRP; Ayroles et al., 2009), a set of inbred lines recently derived from the wild, as a source of genetic variation in gene expression. The genetic variation among these inbred lines greatly exceeds that which can be obtained by mutagenesis screens or standard genetic crosses, while sampling multiple genetically identical individuals from each line reduces environmental variance. The genetically variable transcripts are highly correlated among the lines, forming 241 transcriptional co-expression modules (Ayroles et al., 2009). These co-expression modules were enriched for common Gene Ontology (GO) categories, expression in the same tissues, common transcription factor binding sites and associations of gene expression with the same quantitative traits.
These observations suggest that genetic correlation of gene expression with a co-expression module may be due to co-regulation and that transcripts genetically correlated with a target gene of known function are plausibly involved in the same biological process or molecular function as the target gene (Luo et al., 2007; Ayroles et al., 2009). Here, we test this hypothesis using seminal fluid proteins (SFPs) as the focal genes.
We chose SFPs as focal genes for two reasons. First, many of the gene products of the secretory tissues of the male reproductive tract that produce the SFPs are well understood in Drosophila melanogaster (Wolfner, 2009). This is especially true for the male accessory glands (AGs), which produce proteins collectively known as ACcessory gland Proteins (ACPs). ACPs are transferred to females in the seminal fluid and affect a number of post-mating processes (Wolfner, 2009), including sperm storage and maintenance (Neubaum & Wolfner, 1999; Tram & Wolfner, 1999; Ravi Ram & Wolfner, 2007, 2009), egg production and mating receptivity (Heifetz et al., 2000; Chapman et al., 2003; Liu & Kubli, 2003), female feeding behaviour (Carvalho et al., 2006) and sleep patterns (Isaac et al., 2010). Proteomic (Findlay et al., 2008, 2009) and gene expression (Swanson et al., 2001) studies have identified 187 SFPs, most of which are ACPs. Second, we observed strong genetic correlations in expression among the known ACPs (Ayroles et al., 2009), suggesting that new SFPs, and potentially genes important for the production or function of these proteins, could be found by analysing the correlation structure between genetically variable transcripts.
Using the DGRP expression data (Ayroles et al., 2009), we identified transcripts whose expression patterns correlated with known SFPs. These correlated transcripts are candidates for both previously unknown SFPs and genes that are required for regulation of SFP production. Very little is known about how SFP genes are regulated in the male; this method provides a means to identify candidate regulatory genes for further study. As a proof of principle, the only known transcription factor required for the expression of specific SFP genes (Xue & Noll, 2002) was among the candidate genes we identified. Although proteins encoded by regulatory genes would not necessarily be transferred to females during mating, and are therefore not SFPs per se, we refer to our set of candidate SFPs as cSFPs.
We identified 176 cSFP genes. For validation, we selected seven candidates with varying levels of correlation to known SFP genes and used quantitative real-time PCR (qRT-PCR) to validate the correlation patterns. We also used RT-PCR to test the tissue of expression for these seven genes. We propose that this method can be widely applied to similar datasets, beyond the example of the SFP functional annotation we present.
2. Methods
(i) Gene expression data
The gene expression data are from Ayroles et al. (2009). Whole genome expression was quantified using Affymetrix Drosophila 2.0 arrays for two replicate pools of 3–5-day-old mated males and females for each of 40 DGRP lines. We median-centred the perfect match (PM) data and removed probes that were identified as likely single feature polymorphisms. We used the median log2 signal intensity of the remaining PM probes in each probe set as the measure of expression. A total of 14 840 (78·9%) of the 18 767 transcripts on the array were expressed. Because we focus here on highly male-biased transcripts, we only used the male gene expression data to identify genetically variable transcripts. We fitted the following model to the expression data: Y=L+e, where Y is the median log2 signal intensity, L is the line effect and e is the residual. We identified 7151 transcripts as genetically variable at a False Discovery Rate (FDR)<0·01.
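The model Y=L+e can be read as a one-way analysis of variance with line as the only factor, fitted separately for each transcript. The sketch below illustrates that reading under stated assumptions: the text does not name the FDR procedure, so the Benjamini-Hochberg adjustment is used here purely as an example, and the function names and input layout are hypothetical.

```python
import numpy as np
from scipy import stats


def line_effect_pvalue(expr, line_ids):
    """One-way ANOVA p-value for the line term L in Y = L + e.

    expr     : median log2 intensities for one transcript, one per array
    line_ids : line label for each array (same length as expr)
    """
    expr = np.asarray(expr, dtype=float)
    line_ids = np.asarray(line_ids)
    # One group of replicate measurements per inbred line.
    groups = [expr[line_ids == line] for line in np.unique(line_ids)]
    return stats.f_oneway(*groups).pvalue


def benjamini_hochberg(pvalues):
    """Benjamini-Hochberg adjusted p-values, returned in input order.

    Assumption: the source does not state which FDR method was used.
    """
    p = np.asarray(pvalues, dtype=float)
    n = p.size
    order = np.argsort(p)
    scaled = p[order] * n / np.arange(1, n + 1)
    # Enforce monotonicity, working down from the largest p-value.
    q_sorted = np.minimum.accumulate(scaled[::-1])[::-1]
    q = np.empty(n)
    q[order] = np.clip(q_sorted, 0.0, 1.0)
    return q
```

Under this reading, a transcript would be called genetically variable when its adjusted p-value for the line effect falls below 0.01.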
The raw microarray data are deposited in the ArrayExpress database (http://www.ebi.ac.uk/arrayexpress) under accession number E-MEXP-1594. The DGRP stocks are available from the Bloomington Drosophila Stock Center (Bloomington, Indiana).
(ii) cSFPs
Of the 187 known SFP genes, 107 had genetically variable expression levels in the DGRP lines. We computed pairwise Pearson correlations between the 107 genetically variable SFPs and all 7151 genetically variable transcripts. We then calculated an ‘SFP score’ for each of the 7151 transcripts by tallying the number of significant correlations (P<0·01) with known SFPs, dividing by 107 and multiplying by 100. For a given transcript, a score of 100 indicates that it is correlated with all 107 known SFPs, and a score of 0·93 (1/107×100) indicates the absence of significant correlation between the focal gene and any of the other known SFP genes (i.e. a known SFP gene showing correlation only to itself). The thresholds used to compute this score are arbitrary, but the method is simple and intuitive, and it gives results similar to those of more sophisticated statistics such as the identification of eigengenes (Langfelder & Horvath, 2007) following the construction of co-expression gene networks and using the Principal Component Analysis (PCA) loadings to identify correlated transcripts.
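The SFP score described above can be sketched in a few lines. This is a minimal illustration, not the authors' code: it assumes that correlations are computed on one expression value per line (for example, line means) and that significance comes from the usual t test for a Pearson correlation, neither of which is specified in the text. The function and argument names are hypothetical.

```python
import numpy as np
from scipy import stats


def sfp_scores(sfp_expr, all_expr, alpha=0.01):
    """Percentage of known SFPs significantly correlated with each transcript.

    sfp_expr : (n_sfp, n_lines) expression of the known, variable SFPs
    all_expr : (n_transcripts, n_lines) expression of all variable transcripts
    """
    sfp = np.asarray(sfp_expr, dtype=float)
    allx = np.asarray(all_expr, dtype=float)
    n_lines = sfp.shape[1]

    def unit_rows(m):
        # Centre each row and scale it to unit length, so that the dot
        # product of two rows equals their Pearson correlation.
        m = m - m.mean(axis=1, keepdims=True)
        return m / np.sqrt((m ** 2).sum(axis=1, keepdims=True))

    r = unit_rows(allx) @ unit_rows(sfp).T      # (n_transcripts, n_sfp)
    r = np.clip(r, -1.0, 1.0)

    # Two-sided p-value from a t distribution with n_lines - 2 d.f.
    df = n_lines - 2
    with np.errstate(divide="ignore"):
        t = r * np.sqrt(df / (1.0 - r ** 2))
    p = 2.0 * stats.t.sf(np.abs(t), df)

    # Count significant correlations and express them as a percentage.
    return 100.0 * (p < alpha).sum(axis=1) / sfp.shape[0]
```

Because a known SFP is perfectly correlated with itself, its own score in this sketch can never fall below 100 divided by the number of known SFPs, which matches the 0·93 floor described in the text for 107 genes.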
In addition to the correlation structure, we used several criteria to identify transcripts as putative SFPs (proteins that are predominately or exclusively expressed in the male reproductive tract and likely to be transferred to females), or as potential regulatory genes (those that produce proteins unlikely to be transferred to females) but whose expression is also predominately limited to male reproductive tissues. We used FlyAtlas (Chintapalli et al., 2007), a database of tissue-specific expression for D. melanogaster, to examine the tissues of expression for each gene with an SFP score greater than 8. In addition, because SFPs are secreted proteins, we used SignalP software (http://www.cbs.dtu.dk/services/SignalP/) to identify the presence of predicted signal sequences. The program calculates the probability that the input amino acid sequence contains an N-terminal secretion signal. Here, we used the signal peptide probability score given by the SignalP-HMM prediction method. Signal peptides are usually 15–30 amino acids long and contain a stereotypical pattern of charged, hydrophobic and uncharged residues, although the amino acid sequence itself is not conserved (Emanuelsson et al., 2007). However, not all secreted proteins contain predicted signal sequences (Findlay et al., 2008), and not all proteins with secretion signals are secreted (Emanuelsson et al., 2007). Therefore, we do not exclude genes as SFP or ACP candidates based solely on a low SignalP score.
(iii) Experimental validation of cSFPs
We chose seven genes identified as cSFPs for validation of the guilt-by-association results as well as further characterization. These genes have a range of SFP scores and a few have predicted biochemical functions, though none were predicted to be involved with SFP function. In addition to the seven candidates, we also included a known ACP gene (CG9997; Swanson et al., 2001; Ravi Ram & Wolfner, 2007), and a known ejaculatory duct (ED) protein gene (Dup99B; Saudan et al., 2002), both of whose products are transferred to females, as positive controls. We expect cSFP genes, including those expressed in the ED or bulb, to correlate in expression with the known SFP, CG9997. We included CG34422 as a negative control, given its low SFP score and wide expression pattern across tissues, including the male AGs, brain, eye and hindgut. This gene should not show a significant correlation to CG9997 in the qRT-PCR experiment, in contrast to the seven cSFPs.
We independently validated the tissue-biased expression results from FlyAtlas (Chintapalli et al., 2007) for these 10 genes. We reared Canton-S males on standard yeast-glucose medium under uncrowded conditions at ~24°C. We dissected 50–60 testes (T), AGs, EDs, ejaculatory bulbs (EB) and male carcasses (C; no reproductive tract). Dissected tissues were placed directly into TRIzol Reagent (Invitrogen) on ice. We collected two biological replicates for each RNA extraction.
We used qRT-PCR to validate the correlation structure between the genes that had been inferred from the microarray experiment. We randomly selected 20 of the 40 DGRP lines used in the microarray study (Ayroles et al., 2009), and isolated total RNA from two biological replicates, each with 8–12 males of each line (3–7 days post-eclosion). We then estimated the correlation of gene expression with the known SFP, CG9997.
(iv) RNA extractions and cDNA synthesis
We extracted total RNA by grinding dissected tissues in 150 μl of TRIzol Reagent (Invitrogen), following the manufacturer's recommendations for RNA isolation, except that 0·5 ml of chloroform was used for every 1 ml of TRIzol. Total RNA was treated with DNase I (Invitrogen) and converted to cDNA with Superscript II Reverse Transcriptase (Invitrogen) and oligo-dT primers as recommended by the manufacturer. We used 500 ng of total RNA per 20 μl reverse transcription reaction. Negative controls without reverse transcriptase were tested once for all genes and all cDNA samples to exclude potential genomic DNA contamination.
(v) qRT-PCR
We quantified mRNA levels by qRT-PCR in 25 μl reactions with the SYBR green detection method (iQ SYBR Green Supermix, Bio-Rad) according to the protocol for the MyiQ Single-Color Real-Time PCR Detection System (Bio-Rad). Each reaction was performed with 2 pg of total cDNA on this system. We used the actin5C gene as an internal standard. We used Primer3 (http://frodo.wi.mit.edu/primer3/) to design transcript-specific primers to amplify 85–148 bp regions of the genes of interest. CG34422 primers were designed to encompass the common regions of alternative transcripts. The starting template concentration of each transcript was calculated from the standard curve of that primer pair according to the method described by Qiagen (http://www1.qiagen.com/literature/brochures/pcr/qt/1037490_ag_pcr_0206_int_lr.pdf). We used the linear regression model Y=mX+b to quantify transcript abundance, where Y is the critical threshold (Ct) value from the qRT-PCR experiment, m is the slope and b is the intercept of the standard curve, and X is the transcript abundance. We standardized this estimate by dividing by the transcript abundance of actin5C in the same sample.
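The standard-curve calculation amounts to fitting Y=mX+b for each primer pair and then inverting it, X=(Ct−b)/m, before dividing by the actin5C estimate. The sketch below follows the model exactly as stated in the text; the function names and the tuple layout for a fitted curve are hypothetical. Standard curves are commonly fitted against the log10 of template amount, in which case the inverted value would need to be exponentiated before normalisation; that step is not shown because the text does not describe it.

```python
def fit_standard_curve(abundances, ct_values):
    """Least-squares fit of Ct = m * X + b for one primer pair; returns (m, b)."""
    n = len(abundances)
    mean_x = sum(abundances) / n
    mean_y = sum(ct_values) / n
    sxx = sum((x - mean_x) ** 2 for x in abundances)
    sxy = sum((x - mean_x) * (y - mean_y)
              for x, y in zip(abundances, ct_values))
    m = sxy / sxx
    b = mean_y - m * mean_x
    return m, b


def abundance_from_ct(ct, m, b):
    """Invert the standard curve: X = (Ct - b) / m."""
    return (ct - b) / m


def normalised_abundance(ct_target, curve_target, ct_actin, curve_actin):
    """Target abundance divided by actin5C abundance in the same sample.

    curve_target and curve_actin are (m, b) tuples from fit_standard_curve.
    """
    target = abundance_from_ct(ct_target, *curve_target)
    actin = abundance_from_ct(ct_actin, *curve_actin)
    return target / actin
```

Each primer pair gets its own (m, b), so differences in amplification efficiency between the target gene and actin5C are absorbed by the separate curves rather than assumed away.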
(vi) GO analysis
We used GO analysis to assign functional categories to the cSFP genes tested. We computed the genetic correlations between each of the seven new focal genes and the remainder of the genetically variable transcriptome. We then performed a GO enrichment analysis for the genes most strongly correlated to the focal gene (P<0·001 and |r|>0·5). The conclusions regarding enrichment were the same if the threshold was increased to P<0·0001. We performed this analysis using DAVID 6.7 (Huang et al., 2009).
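The selection step that precedes the enrichment analysis is a joint filter on the P-value and on the absolute value of the correlation. A minimal sketch, assuming the correlations with one focal gene are already available as a mapping from transcript ID to (r, P); the function name and data layout are hypothetical, and the enrichment itself (run in DAVID) is not reproduced here.

```python
def select_go_input(correlations, p_max=0.001, r_min=0.5):
    """Return transcript IDs passing both thresholds, strongest first.

    correlations : dict mapping transcript ID -> (r, p) for one focal gene
    """
    selected = [
        (abs(r), gene)
        for gene, (r, p) in correlations.items()
        if p < p_max and abs(r) > r_min   # negative correlations also count
    ]
    selected.sort(key=lambda pair: pair[0], reverse=True)
    return [gene for _, gene in selected]
```

Re-running the filter with p_max=0.0001 gives the stricter gene list used for the sensitivity check mentioned in the text.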
3. Results and Discussion
Of the 187 known SFPs, 107 had genetically variable transcripts among the 40 DGRP lines (Ayroles et al., 2009). The 107 known SFPs were highly genetically correlated (Fig. 1), reinforcing the idea that gene co-expression may be a reflection of shared function. We attempted to cluster this correlation matrix further into modules using various clustering algorithms, including Modulated Modularity Clustering (MMC) (Stone & Ayroles, 2009), but did not find strong community structure in the graph resulting from this correlation matrix. In addition, we did not find evidence supporting the idea that genes sharing a similar GO term were more strongly correlated with each other than they were to the rest of the genes.
We then analysed the correlation matrix between the 107 known SFPs and 7151 transcripts that were genetically variable in males. We assigned an SFP score to each of the genetically variable transcripts based on the number of significant correlations with known SFPs (Supplementary Table 1 available at http://cambridge.journals.org/GRH). We next asked whether this approach would allow us to recover the known SFPs. We ranked the vector of SFP scores from the highest to the lowest and applied the filter that cSFPs should be expressed in male reproductive tissues based on FlyAtlas (Chintapalli et al., 2007) data. We found that 78% of the known SFPs are in the top 500 transcripts.
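The recovery check described above (rank by SFP score, apply the tissue filter, then ask what fraction of known SFPs falls in the top 500) can be sketched as follows. The function name, the dictionary of scores and the set used as a stand-in for the FlyAtlas tissue filter are all hypothetical, and the choice of denominator (known genes that survive the filter) is an assumption.

```python
def recovery_in_top(scores, known, top_n=500, expressed=None):
    """Fraction of known genes found among the top_n transcripts by score.

    scores    : dict mapping transcript ID -> SFP score
    known     : set of known SFP transcript IDs
    expressed : optional set of IDs passing the tissue-expression filter
    """
    candidates = [g for g in scores if expressed is None or g in expressed]
    ranked = sorted(candidates, key=lambda g: scores[g], reverse=True)
    top = set(ranked[:top_n])
    # Only known genes that were scored and passed the filter are counted.
    known_scored = set(known) & set(candidates)
    return len(top & known_scored) / len(known_scored)
```

A high value indicates that the score ranks the genes used to define it near the top, which is a necessary sanity check rather than independent validation.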
We identified 176 cSFP genes that have correlated expression patterns to at least 7 of the 107 genetically variable known seminal protein genes and are expressed in male reproductive tissues (Supplementary Table 1). A total of 37 of the 176 candidates have no known or predicted functions or GO terms. An additional 13 transcripts correspond to probe sets on the Affymetrix array but not to annotated genes, and could correspond to new genes. Independent confirmation of cSFP identification comes from a proteomic screen aimed at identifying male proteins transferred during mating (Findlay et al., 2008, 2009). Two candidate transcripts were confirmed as bona fide SFPs: CG34002 (with an SFP score of 15) and Sfp26Ad (with an SFP score of 41). Sfp26Ad was not annotated as a gene at the time we performed this experiment and corresponded to probe set 637742_at on the Affymetrix array.
We chose seven cSFP genes (CG9720, CG11828, CG31413, CG31493, CG31496, CG32985 and CG34002), as well as two positive control genes (the ACP gene CG9997 and the ED protein gene Dup99B) and one negative control gene (CG34422, with an SFP score of 0·93) for validation of the microarray correlation results using qRT-PCR in 20 of the DGRP lines. The candidate genes have SFP scores ranging from moderately low (8) to very high (42, the highest SFP score found) (Table 1). The qRT-PCR results confirmed the correlation between all seven cSFPs and the known ACP gene CG9997 across the 20 lines (Fig. 2). As predicted, expression of the negative control CG34422 was not genetically correlated with that of CG9997. In contrast, expression of the ED protein gene Dup99B, whose gene product is transferred with the seminal fluid to females, was genetically correlated with CG9997, demonstrating that non-ACP SFPs can also be identified with this method.
Table 1 gives SFP scores, secretion signal peptide probability and tissue of expression for these seven genes and for the positive and negative controls. Three genes with high SFP scores were not predicted to have secretion signals. These genes’ products may be secreted nevertheless, as has been seen in other cases (Findlay et al., 2008), or they may be non-SFP genes that are important for the regulation of other SFPs.
Among the seven genes, all that were predicted to be expressed in AGs (Chintapalli et al., 2007) were confirmed as expressed in that tissue (Table 1, Fig. 3). To gain insight into the possible biological processes and molecular functions of the candidate genes chosen for validation, we used a GO enrichment analysis implemented in DAVID 6.7 (Huang et al., 2009). For each candidate gene, we analysed the function of its most correlated transcripts (P<0·001 and |r|>0·5). Four of the seven candidate genes (CG11828, CG31413, CG31493 and CG34002) were significantly associated with serine-type endopeptidase inhibitor activity, a predicted function shared by several other SFPs (Wolfner, 2009). However, it is important to note that CG11828, CG31413 and CG31493 do not contain conserved protease domains but do contain other types of predicted conserved domains. No significant GO-class enrichment was observed for CG9720, CG31496 or CG32985.
It is possible that some of the cSFP genes are important for SFP expression and function but may not encode proteins that are transferred to females as part of the seminal fluid. As proof of principle that such genes can be identified by this method, our analysis detected paired (SFP score=16), which encodes a transcription factor important in AG development and ACP expression (Supplementary Table 2 available online at http://cambridge.journals.org/GRH). This Pax gene has a dual function in Drosophila: it acts first as a pair-rule gene in early embryo development (Nüsslein-Volhard & Wieschaus, 1980; Kilchherr et al., 1986) and later is required for viability and male fertility (Bertuccioli et al., 1996; Xue & Noll, 1996, 2000). AG formation and expression of at least two SFPs expressed in the AG (ACP26Aa and SP) both require the function of paired (Xue & Noll, 2000, 2002).
Guilt-by-association methods most frequently rely on clustering algorithms to identify the functional membership of a candidate gene or transcript (Aravind, 2000; Miozzi et al., 2008; Reverter et al., 2008; Klie et al., 2010). In its most common use, guilt-by-association is used to assign functions to any or all unannotated genes that respond to a given treatment or are differentially regulated under disease conditions. Here, we have demonstrated the use of guilt-by-association methods in another context: to identify genes in a specific functional class using correlated genetic variation in gene expression among wild-derived inbred lines. This method removes the need to rely on arbitrary clustering or on GO terms to assign candidate functions to new genes. Instead, a group of genes that has been annotated and functionally clustered experimentally is used to find correlated transcripts that can then be included in the group. In this case, we used SFPs, a group defined by a biological phenomenon rather than a biochemical function. As in potentially many other cases, for example, identifying genes involved in specific behaviours, GO terms do not define our selected group of genes as belonging to a biologically significant group. The group of genes we identified (cSFPs) has diverse GO functions (ranging from proteases to pro-hormones). A given cSFP gene could not be predicted as an SFP on the basis of GO membership.
SFP genes are well suited for this study since their expression is specific to, or highly biased in, the male reproductive tract, facilitating their confirmation as SFPs; and expression of the known SFPs is genetically variable in the population of lines surveyed. An increasing number of studies are taking advantage of natural genetic variation to better understand the genetic basis of phenotypic variation (Mackay et al., 2009). In the future, the availability of sequence information for the D. melanogaster population used in this study will allow us to associate co-expression with expression quantitative trait locus (eQTL) analysis (Mackay et al., 2009). This additional layer of information will further our understanding of what genetic factors are driving co-expression between SFP genes, and may lead us to rethink what information should be considered when annotating a segment of sequence.
To complement this study, and generalize the simple analysis presented in this manuscript, we have created a web tool (http://dgrp.statgen.ncsu.edu) that allows the user to input the Affymetrix Drosophila 2.0 ID of any focal gene of interest and retrieve a vector of genes, their ranked correlation with the focal gene, as well as the GO annotations of the correlated transcripts. This tool integrates FlyAtlas information (Chintapalli et al., 2007), allowing users to restrict the computation of correlations to genes expressed in a specific tissue or to genes with strong tissue-biased expression.
Many studies using natural genetic variation to study phenotypic variation also investigate variation in gene expression and gene co-expression (Mackay et al., 2009). However, this information is very rarely translated into hypothetical functional annotation for any unannotated genes involved. We advocate that such datasets be used more routinely, as patterns should emerge across studies and this information will greatly improve our understanding of genes, their function and regulation. In particular, directed analyses such as the one presented here, in which genes involved in an experimentally defined group are sought, may help to uncover pleiotropy among previously annotated genes and increase our understanding of how various biological systems function together.
This work was funded by National Institutes of Health grants R01 GM45146 (T. F. C. M.) and R01 HD038921 (M. F. W.). B. A. L. was supported by an NSF Predoctoral Fellowship. We thank G. Findlay and E. Kelleher for comments on the manuscript, and J. Mezey and MBG colleagues for use of the qPCR machine. This is a publication of the W. M. Keck Center for Behavioral Biology.