1. Introduction
A common task in microarray studies is to determine which genes are differentially expressed (DE) between two cell samples. Different rules for identifying DE genes have been adopted, many based on the gene-specific mean and the standard deviation (SD). Their main limitation originates from the low number of replicates. More recently, as an alternative to these methods, a number of empirical Bayes methods have been proposed (e.g. Efron et al., 2001; Lonnstedt & Speed, 2002; Gottardo et al., 2003; Newton et al., 2004). Empirical Bayes methods seem natural in the context of microarray experiments, because they summarize information from the whole dataset into prior parameters to be combined with means and SDs at the gene level (Lonnstedt & Britton, 2005).
One empirical Bayes approach to differential expression is the normal mixture model proposed by Lonnstedt & Speed (2002). This approach was later extended by Smyth (2004) to general linear models and modified into an empirical Bayes normal model (rather than a mixture model) for variance regularization. Smyth's (2004) model yields a posterior odds statistic formulated in terms of a moderated t-statistic, in which posterior residual SDs are used in place of ordinary SDs; the model can also account for correlation among technical replicates (Smyth et al., 2005). It is implemented in Limma (Smyth, 2005), a popular software package for gene expression analysis. Throughout this paper, we will refer to the empirical Bayes normal method offered in Limma (Lonnstedt & Speed, 2002; Smyth, 2004; Smyth et al., 2005) as the ‘BN’ method.
The underlying assumption of the BN method is a normal distribution of the mean expression ratios of DE genes, i.e. that the DE genes are symmetrically up- and down-regulated. This assumption is valid for genome-wide microarray experiments, where interest lies in the detection of all DE genes between two samples and where, typically, only a fraction of all the tested genes (the whole genome) is expected to be DE, with similar numbers of up- and down-regulated genes. However, in a number of microarray experiments, only up- or down-regulated genes are of interest. Examples are certain experiments designed to identify novel markers for molecular diagnosis and therapy of diseases (Suzuki et al., 2002; Kobayashi et al., 2004). Furthermore, in some experiments, regulation is expected in one direction only (either up or down) due to the design of the experiment, such as in certain ‘wild-type versus mutant’ microarrays (Kazmierczak et al., 2003; van Schaik et al., 2007) and in those designed to select DE genes among a group of potentially up- or down-regulated genes that have been pre-selected, either as part of a genome-wide microarray or by other screening methods (e.g. Hidden Markov Model promoter searches) (Kazmierczak et al., 2003). In such experimental designs, where only up- or down-regulated genes are relevant or expected, the normality assumption of the BN method is clearly violated.
As a long-tailed alternative to the normal mixture model of Lonnstedt & Speed (2002), which assumes that the DE genes are symmetrically up- and down-regulated, Bhowmick et al. (2006) proposed a Laplace mixture model (hereinafter referred to as the ‘BL’ method). Bhowmick et al. (2006) showed that asymmetric gene expression data fit better under the asymmetric BL method. However, while the performance of BL was similar to that of the Smyth (2004) and Lonnstedt & Speed (2002) methods, BL depends on a large number of replicates for acceptable parameter estimates. Hence, microarrays with a single relevant or expected direction of regulation lack an appropriate analytical tool.
Extreme value theory (EVT) has traditionally been used for risk and financial analysis and for studies of extreme events, such as intense rainfall and floods. An important feature of EVT is that it models the extreme behaviour (the tail of the distribution) rather than the average behaviour of a system, as classical statistics does. By focusing on the values located in the tail of the distribution, which diverge markedly from the mean of a dataset, EVT provides a natural framework for the detection of DE genes in an experiment where regulation is relevant or expected in a single direction. Therefore, our objective was to develop a new statistical method, based on EVT, for the analysis of differential expression in experiments interested in, or expecting, either up- or down-regulated genes. Application to two experimental datasets and simulation studies indicated better performance of the proposed empirical Bayes extreme value distribution (EVD) mixture model (BE) than of the two existing empirical Bayes methods (BN and BL).
2. Methods
(i) Model setup in the context of linear models
Consider a microarray experiment comparing the expression (i.e. mRNA transcript) levels of wild-type and mutant cells, where the mutant cells lack a gene encoding a positive regulator. We wish to identify genes that differ in their transcript levels between wild-type and mutant cells exposed to the same treatment. For the jth replicate of gene g on array i, we use the log ratio of the expressions:

Y_gij = log2(wild-type expression/mutant expression).
Let us assume that we have n arrays, where each array has G genes spotted on it and each gene is replicated m times. Therefore, the complete set of data from the experiment consists of Y_gij, g=1, …, G, i=1, …, n and j=1, …, m, and so Y_g can be viewed as a vector of the mn log ratios observed for gene g. We regard the Y_gij as random variables from a normal distribution with mean μ_g and variance σ_g², which, although not completely true, is convenient and found empirically to be roughly the case (Lonnstedt & Speed, 2002), i.e.

Y_gij | μ_g, σ_g² ~ N(μ_g, σ_g²).  (1)
A general microarray experiment can be represented by a linear model (Smyth, 2004), i.e. E(Y_g) = Xβ_g, where X is an nm×k design matrix specifying the experimental design and β_g is a vector of k regression coefficients. As we have m replicates of each gene on each array, our design matrix has m repeated rows corresponding to each set of m replicate spots. To simplify E(Y_g) = Xβ_g, let Ȳ_g be the n-vector of array means Ȳ_gi and let X̄ be the reduced n×k design matrix in which there is only one row, instead of m rows, for each array (Smyth et al., 2005). Then E(Ȳ_g) = X̄β_g. Now, let α_gq = c_q^T β_g, where c_q is a vector of known constants defining a contrast of biological interest and α_gq is the corresponding linear combination of the regression coefficients, and suppose that interest lies in testing H_0: α_gq = 0 (Smyth et al., 2005). For the rest of the paper, the subscript q will be suppressed for notational simplicity under the assumption that our hypothetical experiment has only one contrast of interest. We suppose that measurements from genes spotted on different arrays are independent. However, we acknowledge that, within an array, technical replicates of a gene are correlated. To account for correlation among replicates, we used a common correlation factor (Smyth et al., 2005). Let β̂_g be the estimator of β_g from fitting the linear model and let α̂_g = c^T β̂_g be the contrast estimator. Here, α̂_g is assumed to be a random variable from a normal distribution (as in (1)) with mean μ_g and variance ν_g²σ_g², where ν_g² is the unscaled variance of α̂_g that also accounts for the common correlation factor among replicates (Smyth et al., 2005), i.e.

α̂_g | μ_g, σ_g² ~ N(μ_g, ν_g²σ_g²).  (2)
The residual variance from fitting the linear model for gene g, s_g², was assumed to follow a gamma distribution (G; equivalent to the scaled chi-square distribution assumed by Smyth, 2004):

s_g² | σ_g² ~ G(d_g/2, d_g/(2σ_g²)),  (3)

where d_g is the residual degrees of freedom of the linear model for gene g.
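As a concrete illustration, the gene-level quantities used below (α̂_g, s_g², d_g, ν_g² and the common correlation factor) could be obtained with the limma package roughly as follows. This is a minimal sketch, not the authors' own implementation (which, as noted in Section 2(ii), is available on request); the objects MA and design and the replicate layout are assumptions.

library(limma)

# MA: an MAList of normalized log-ratios with m = 3 replicate spots per gene per array
# design: the reduced design matrix with one row per array (e.g. a single column of +1/-1
#         for a dye-swapped wild-type vs. mutant comparison)
corfit <- duplicateCorrelation(MA, design, ndups = 3, spacing = 1)   # common correlation factor
fit    <- lmFit(MA, design, ndups = 3, spacing = 1,
                correlation = corfit$consensus.correlation)

alpha.hat <- fit$coefficients[, 1]      # contrast estimators (single coefficient here)
s2.g      <- fit$sigma^2                # residual variances s_g^2
d.g       <- fit$df.residual            # residual degrees of freedom d_g
nu2.g     <- fit$stdev.unscaled[, 1]^2  # unscaled variances nu_g^2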
(ii) Empirical Bayes EVD mixture model (BE)
The EVD has three parameters: a location parameter, a; a scale parameter, b; and a shape parameter, c. It encompasses three classes of distributions, with types I, II and III widely known as the Gumbel, Fréchet and Weibull families, respectively (Coles, 2001). These are combined into a single-family model known as the generalized extreme value (GEV) family of distributions, in which the shape parameter (c) determines the type: the Fréchet and Weibull families have c>0 and c<0, respectively, while the Gumbel family corresponds to the limit c=0. Throughout the present paper, we will use EVD and GEV interchangeably, both indicating an unspecified family of EVD. In contrast, when we refer to a specific family we will use its name (i.e. Gumbel, Fréchet or Weibull distribution).
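For reference, the GEV distribution function in this parameterization (location a, scale b, shape c), in the standard form given by Coles (2001), is

$$G(z)=\exp\left\{-\left[1+c\left(\frac{z-a}{b}\right)\right]^{-1/c}\right\},\qquad \{z:1+c(z-a)/b>0\},$$

with the Gumbel case obtained as the limit $c\to 0$, $G(z)=\exp\{-\exp[-(z-a)/b]\}$.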
For a gene to be DE, the mean μ_g of its appropriately background-corrected and normalized log ratios should be statistically different from zero. Let I_g indicate whether the gene is DE (μ_g ≠ 0) or not (μ_g = 0), such that

I_g = 1 if μ_g ≠ 0 and I_g = 0 if μ_g = 0.

We assume that

μ_g | I_g = 0 ~ δ(0),  (4)

where δ(0) denotes the distribution which places unit mass at μ_g = 0. Because we expect the DE genes to be regulated in a single direction (either up or down), we assumed an EVD prior on μ_g when I_g = 1, i.e.

μ_g | I_g = 1 ~ GEV(a, b, c).  (5)
An inverse gamma (IG) prior distribution is assumed on σ_g², with σ_g² equivalent to a prior estimator s_0² with d_0 degrees of freedom (Smyth, 2004), i.e.

σ_g² ~ IG(d_0/2, d_0 s_0²/2).  (6)
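A minimal R sketch of drawing from these two priors, as needed later for the MC integration, is given below; the evd package is assumed for the GEV, and the hyperparameter values shown are placeholders rather than estimates from the paper.

library(evd)  # provides rgev(n, loc, scale, shape)

r <- 50000                             # number of Monte Carlo draws
a <- 1; b <- 0.5; shape.c <- 0.2       # GEV hyperparameters (placeholder values)
d0 <- 4; s02 <- 0.05                   # prior degrees of freedom and prior variance (placeholders)

mu.draws     <- rgev(r, loc = a, scale = b, shape = shape.c)      # mu_g | I_g = 1 ~ GEV(a, b, c)
sigma2.draws <- 1 / rgamma(r, shape = d0/2, rate = d0 * s02 / 2)  # sigma_g^2 ~ IG(d0/2, d0*s02/2)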
We wish to know whether the gene is DE, i.e. what the odds of differential expression are (more specifically, the natural logarithm of the odds, BE_g):

BE_g = log[Pr(I_g = 1 | α̂_g, s_g²)/Pr(I_g = 0 | α̂_g, s_g²)].

By Bayes' theorem,

Pr(I_g = 1 | α̂_g, s_g²) = Pr(I_g = 1) f(α̂_g, s_g² | I_g = 1) / [Pr(I_g = 1) f(α̂_g, s_g² | I_g = 1) + Pr(I_g = 0) f(α̂_g, s_g² | I_g = 0)].

Then, the BE_g statistic can be written as

BE_g = log{[Pr(I_g = 1)/(1 − Pr(I_g = 1))] · f(α̂_g, s_g² | I_g = 1)/f(α̂_g, s_g² | I_g = 0)},  (7)

where Pr(I_g = 1) is the proportion of DE genes (pDE) in the experiment. The joint densities f(α̂_g, s_g² | I_g = 1) and f(α̂_g, s_g² | I_g = 0) are:

f(α̂_g, s_g² | I_g = 1) = ∫∫ f(α̂_g | μ_g, σ_g², I_g = 1) f(s_g² | σ_g²) p(μ_g | I_g = 1) p(σ_g²) dμ_g dσ_g²,

f(α̂_g, s_g² | I_g = 0) = ∫ f(α̂_g | μ_g = 0, σ_g², I_g = 0) f(s_g² | σ_g²) p(σ_g²) dσ_g².
To evaluate the densities f(α̂_g, s_g² | I_g = 1) and f(α̂_g, s_g² | I_g = 0), and hence the BE statistic in (7), we used the distributions defined in equations (2)–(6), i.e. their respective likelihood or probability density functions:

f(α̂_g | μ_g, σ_g², I_g) = (2πν_g²σ_g²)^(−1/2) exp{−(α̂_g − μ_g)²/(2ν_g²σ_g²)},

where −∞ < α̂_g < ∞ and ν_g²σ_g² > 0;

f(s_g² | σ_g²) = [(d_g/(2σ_g²))^(d_g/2)/Γ(d_g/2)] (s_g²)^(d_g/2 − 1) exp{−d_g s_g²/(2σ_g²)},

where d_g/2 > 0, d_g/(2σ_g²) > 0 and s_g² > 0;

f(μ_g | I_g = 1) = (1/b)[1 + c(μ_g − a)/b]^(−1/c − 1) exp{−[1 + c(μ_g − a)/b]^(−1/c)},

defined on {μ_g : 1 + c(μ_g − a)/b > 0}, where −∞ < a < ∞, b > 0 and −∞ < c < ∞ (Coles, 2001; Panjer, 2006);

f(σ_g²) = [(d_0 s_0²/2)^(d_0/2)/Γ(d_0/2)] (σ_g²)^(−d_0/2 − 1) exp{−d_0 s_0²/(2σ_g²)},

where d_0/2 > 0, d_0 s_0²/2 > 0 and σ_g² > 0.
The integrals f(α̂_g, s_g² | I_g = 1) and f(α̂_g, s_g² | I_g = 0) cannot be evaluated analytically. Therefore, numerical approximation through the Monte Carlo (MC) integration method (Tanner, 1996) was applied to approximate f(α̂_g, s_g² | I_g = 1) and f(α̂_g, s_g² | I_g = 0) (denoted as E(f_Ig=1) and E(f_Ig=0), respectively):

E(f_Ig=1) = (1/r) Σ_{i=1…r} f(α̂_g | μ_gi, σ_gi², I_g = 1) f(s_g² | σ_gi²),

E(f_Ig=0) = (1/r) Σ_{i=1…r} f(α̂_g | μ_g = 0, σ_gi², I_g = 0) f(s_g² | σ_gi²),

where r denotes the number of iterations, while μ_gi and σ_gi² denote iid draws from the prior distributions of the gene-specific means and variances, respectively. Thereafter, the expected value of the BE statistic is:

E(BE_g) = log{[pDE/(1 − pDE)] · E(f_Ig=1)/E(f_Ig=0)}.
For an estimator to be useful, a measure of estimation error is required. The estimated Monte Carlo standard errors (MCSEs) of E(f_Ig=1) and E(f_Ig=0) (denoted as SE(f_Ig=1) and SE(f_Ig=0), respectively) were:

SE(f_Ig=1) = sd{f(α̂_g | μ_gi, σ_gi², I_g = 1) f(s_g² | σ_gi²)}/√r,

SE(f_Ig=0) = sd{f(α̂_g | μ_g = 0, σ_gi², I_g = 0) f(s_g² | σ_gi²)}/√r,

where sd{·} denotes the sample standard deviation of the integrand values over the r draws. Thereafter, by accounting for error propagation, the MCSE of BE_g (MCSE_g) was estimated by adding the uncertainties in E(f_Ig=1) and E(f_Ig=0), i.e. SE(f_Ig=1) and SE(f_Ig=0), in quadrature (Taylor, 1982):

MCSE_g = √{[SE(f_Ig=1)/E(f_Ig=1)]² + [SE(f_Ig=0)/E(f_Ig=0)]²},

since BE_g is a difference of logarithms and the uncertainty of log E(f) is approximately SE(f)/E(f).
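Putting these pieces together, a minimal R sketch of the MC approximation of BE_g and its MCSE for a single gene is given below. It reuses mu.draws, sigma2.draws and r from the prior-draw sketch above and alpha.hat, s2.g, d.g and nu2.g from the linear-model sketch in Section 2(i); the value of pDE is an assumption.

pDE <- 0.25   # assumed proportion of DE genes (fixed by the user, as in the BN method)
g   <- 1      # index of the gene being evaluated

# Integrand f(alpha.hat_g | mu, sigma^2) * f(s_g^2 | sigma^2), vectorized over the prior draws
integrand <- function(mu, sigma2) {
  dnorm(alpha.hat[g], mean = mu, sd = sqrt(nu2.g[g] * sigma2)) *
    dgamma(s2.g[g], shape = d.g[g] / 2, rate = d.g[g] / (2 * sigma2))
}

f1 <- integrand(mu.draws, sigma2.draws)   # I_g = 1: mu_g drawn from the GEV prior
f0 <- integrand(0,        sigma2.draws)   # I_g = 0: mu_g fixed at zero

E1 <- mean(f1); E0 <- mean(f0)
BE.g <- log(pDE / (1 - pDE)) + log(E1) - log(E0)   # MC estimate of the BE statistic

SE1 <- sd(f1) / sqrt(r); SE0 <- sd(f0) / sqrt(r)   # Monte Carlo standard errors
MCSE.g <- sqrt((SE1 / E1)^2 + (SE0 / E0)^2)        # uncertainties added in quadrature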
BE statistics were evaluated on a PC running Windows XP Pro SP2 with an AMD Athlon 64 X2 Dual Core Processor 4200+ (2200 MHz) and 2048 MB of RAM. For a dataset with 200 genes spotted in triplicate on four arrays (similar to the datasets described in Section 3(i)), the computing time for the BE method involving 50 000 iterations was roughly 15 s, while the BN and BL methods took roughly 1 s each. For a considerably larger dataset, with 10 000 genes also spotted in triplicate on four arrays, the computing time for the BE method with 50 000 iterations was roughly 16 min, while it was roughly 1 min for each of the BN and BL methods. The R code implementing the proposed methodology is available on request from the first author.
(iii) Estimation of hyperparameters
The BE statistic defined in Section 2(ii) uses hyperparameters estimated from the data. The parameters s_0² and d_0 are estimated from the s_g² following the procedure of Smyth (2004). The only exception was when d_0 could not be estimated as in Smyth (2004), indicating that there is no evidence that the underlying gene-specific variances σ_g² vary between genes; in Smyth (2004), d_0 was then set to positive infinity. In the evaluation of BE statistics, a d_0 equal to positive infinity precludes generation of a prior on σ_g². Therefore, in such situations, instead of positive infinity we set d_0 to 10^300, which is close to the largest positive double-precision number on a typical R platform. In terms of pDE, the BE method uses the same approach as the BN method: it fixes pDE and then estimates the remaining hyperparameters (a, b and c). The hyperparameters a, b and c are estimated simultaneously from the contrast estimators (α̂_g) of the pDE fraction of genes with the highest moderated t-statistics. The estimation involves maximum likelihood estimation (MLE) fitting of the GEV distribution, with the Nelder–Mead optimization method (Nelder & Mead, 1965). The MLE procedure, subject to the limitations discussed in Section 5, provides the means and standard errors (SEs) for the parameters a, b and c. The means are used to estimate the BE statistics, and the corresponding SEs provide an indication of the uncertainty around the estimated mean values. Genes with the top moderated t-statistics (estimated according to Smyth, 2004) were used instead of the top genes ranked by their ordinary t-statistics as they provided more stable estimation of a, b and c.
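A sketch of this estimation step is shown below, assuming alpha.top holds the contrast estimators of the selected top fraction of genes; the GEV negative log-likelihood is written out explicitly and minimized with optim's Nelder–Mead method, with approximate standard errors taken from the inverse of the numerically estimated Hessian.

# alpha.top: contrast estimators (alpha.hat) of the pDE fraction of genes with the
#            highest moderated t-statistics (assumed to have been selected already)
negloglik.gev <- function(par, x) {
  a <- par[1]; b <- par[2]; c <- par[3]
  if (b <= 0) return(Inf)
  z <- 1 + c * (x - a) / b
  if (any(z <= 0)) return(Inf)                  # outside the GEV support
  if (abs(c) < 1e-6) {                          # Gumbel limit, c -> 0
    y <- (x - a) / b
    return(length(x) * log(b) + sum(y) + sum(exp(-y)))
  }
  length(x) * log(b) + (1/c + 1) * sum(log(z)) + sum(z^(-1/c))
}

start <- c(a = mean(alpha.top), b = sd(alpha.top), c = 0.1)
mle   <- optim(start, negloglik.gev, x = alpha.top,
               method = "Nelder-Mead", hessian = TRUE)
abc    <- mle$par                         # point estimates of a, b and c
abc.se <- sqrt(diag(solve(mle$hessian)))  # approximate standard errors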
3. Application to experimental data
(i) Data description
To analyse the performance of BE as compared with BN and BL, we used two datasets generated in a previous two-color cDNA microarray study conducted to identify genes regulated by the sigma factor σB in the bacterium Listeria monocytogenes (Kazmierczak et al., 2003). In that study, an L. monocytogenes sigB null mutant (which lacks the σB protein) and a parent strain with an intact sigB gene (wild-type) were exposed to two stress conditions, namely osmotic stress and stationary phase, to identify genes with transcript levels affected by the sigB deletion under these two conditions. For each stress condition, two independent RNA isolates (biological replicates) for both wild-type and sigB mutant cells were dye-swapped, for a total of four arrays per stress condition. Each array included 211 test genes, and a number of non-hybridizing and normalization controls (for details see Kazmierczak et al., 2003), spotted in triplicate. Most (166) of the genes included on the array were identified by Hidden Markov Model promoter searches as being preceded by a putative σB-dependent promoter, while some genes (36) were included because of previous reports of their involvement in virulence or stress response. As σB is a positive regulator of gene expression with particular importance for regulating stress response and virulence genes, most genes in these two experiments are expected to show higher transcript levels in the wild-type strain than in the sigB deletion strain.
In their analysis, Kazmierczak et al. (2003) considered all individual spots as repetitions, generating 24 data points for each gene (3 spots per gene × 4 arrays × 2 channels per array), i.e. correlation among technical replicates was not considered. They reported findings for 208 of the 211 test genes, as three genes were spotted twice. Prior to analysis, cross-slide mean normalization (without background correction) and flooring were performed. Analysis with the Significance Analysis of Microarrays (SAM) program (Tusher et al., 2001) identified 51 (25%) and 41 (20%) genes with statistically significant, at least 1·5-fold differences in expression under osmotic stress and stationary-phase conditions, respectively.
Prior to our analysis of the two datasets of 211 genes, we performed background correction and normalization. The median background fluorescence intensities are usually recommended for correction of background noise because of their robustness to outliers. We, however, used the mean background intensities because the median background intensities had a bimodal distribution, with some spots having zero background while others were in the higher range of intensities (above 2^8), possibly due to the setup or limitations of the laser scanner used.
Two background correction procedures seemed appropriate for the data. The first, the normal-exponential convolution background correction model (NeBC), performed with an offset of 100, involves fitting the convolution of normal and exponential distributions to the foreground intensities using the background intensities as a covariate (also referred to as the normexp method in Smyth, 2005). The second procedure used was multiplicative background correction (MBC). This is a novel approach that involves logarithmic transformation of the intensity readings before the background correction and has been found (via a series of examples) to be superior to additive background correction and to no background correction (Zhang et al., 2006). Because MBC reportedly gives fewer false positives than conventional additive background correction (Zhang et al., 2006) and because its performance has never been contrasted with NeBC, we used (and compared) both background correction models in our study.
The normalization appropriate for the data was Lowess normalization (Cleveland & Devlin, 1988), with up-weighting of the background and normalization control spots known to be non-DE (http://bioconductor.org/packages/1.8/bioc/vignettes/limma/inst/doc/usersguide.pdf). Application of the two background correction procedures (NeBC and MBC) to each of the two stress-condition datasets (osmotic stress and stationary phase) provided a total of four real model-datasets used in our analyses.
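In limma, the NeBC preprocessing route described above could be sketched roughly as follows; RG is an assumed RGList of raw two-colour intensities (read in with mean background estimates) and w an assumed vector of spot weights up-weighting the control spots. The MBC route is not part of limma and is therefore not shown here.

library(limma)

# RG: RGList of raw intensities; w: spot weights favouring the non-DE control spots
RGb <- backgroundCorrect(RG, method = "normexp", offset = 100)    # NeBC ('normexp') correction
MA  <- normalizeWithinArrays(RGb, method = "loess", weights = w)  # Lowess normalization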
(ii) Results
In all four model-datasets, the normalized and background-corrected log2 ratios between the genes' expression values in wild-type and mutant cells (Y_gij) were distributed asymmetrically around zero and heavily skewed to the right. This was expected because up-regulation was anticipated for most of the tested genes. It was therefore reasonable to assume that the distribution of the mean expressions of DE genes follows an EVD. Hence, the BE method could be applied for inference about differential expression.
A critical issue in the MC integration methodology underlying the BE method is to determine the number of iterations that can safely be used as a basis for inference. We used 50 000 iterations, as they provided reasonable accuracy of the approximated BE statistics. The achieved MCSEs varied for different genes and model-datasets. The medians, followed by the ranges in parentheses, of the achieved MCSEs were 0·05 (0·01–0·42) and 0·03 (0·01–0·82) for the osmotic stress datasets corrected with the NeBC and MBC methods, respectively, and 0·38 (0·02–0·92) and 0·18 (0·02–0·52) for the stationary-phase datasets corrected with the NeBC and MBC methods, respectively. In all four model-datasets, the MCSEs were lowest (<0·1) for genes with a BE statistic around 0.
For each model-dataset, the gene-specific BN, BL and BE statistics were approximated. The biological meaning of the identified DE genes is important. Hence, for each of the four model-datasets, in Fig. 1 we show the values of the BN, BL and BE statistics plotted against the contrast estimators from the linear models (α̂_g) (also translated into fold changes, 2^α̂_g, for more intuitive interpretation) and against the previous results of Kazmierczak et al. (2003). In each model-dataset, the genes ranked very low by the BE statistic have a fold change below 1. At the same time, the BN statistic ranked some of the genes with very low fold changes highly, incorrectly suggesting down-regulation. The BL statistic gave ambiguous results, with high values for most genes, particularly in the stationary-phase data. It should be noted that, for approximation of the BN and BE statistics, we fixed the pDEs to those reported by Kazmierczak et al. (2003). Fixing the pDEs to different values would change the BN and BE vs. fold change plots: decreasing pDE would shift the plots to the right and down, whereas increasing pDE would shift the plots left and up on the x- and y-axes, respectively.
Table 1 shows the characteristics of the data and the values of the hyperparameters estimated for each of the four model-datasets. The ambiguous results of the BL method are probably due, at least in part, to a very high estimated probability that a gene is DE (w=1; Table 1). The prior variance distributions seem quite stable among the BN, BL and BE methods, except for the roughly doubled value of the scale parameter estimated for the BL method as compared with that estimated for the BN and BE methods. In contrast, the prior variances differ substantially between the background correction methods, being narrower for the data corrected with MBC, which may explain the smoother plots of the BN, BL and BE statistics following MBC. Also, interestingly, the correlation among technical replicates tends to be higher following NeBC than MBC, demonstrating the difference between these two procedures.
Notes to Table 1: a NeBC = normal-exponential convolution background correction method; b MBC = multiplicative background correction method; c DE = differentially expressed; d EVD = extreme value distribution; e IG = inverse gamma distribution; f N = normal distribution; g L = Laplace distribution; h w = the probability that a gene is DE estimated as part of the BL method (note that the BN and BE statistics use a fixed, user-defined pDE).
For the BE statistic, a natural choice of the optimal threshold (OT) above which a gene could be considered DE is 0. However, the actual OT depends on the imposed criteria, such as the costs of a false positive and a false negative. A typical approach in choosing a decision rule for a statistical test is to control the type I error probability while maintaining a certain power. A sensible, powerful and easy to interpret (Verhoeven et al., 2005) way to control type I error when multiple statistical tests are performed is the false discovery rate (FDR) (Benjamini & Hochberg, 1995). The FDR is the expected proportion of errors among the genes selected as DE. As a low FDR often comes at the cost of low sensitivity or power (i.e. a high false negative rate (FNR)), these should be controlled jointly (Pawitan et al., 2005). Because Kazmierczak et al. (2003) considered genes that had been pre-selected for their expected differential expression, we chose an FDR=0, i.e. no false positives were acceptable. The OT for BE (its 5th and 95th percentiles) was determined through simulation analysis for each of the four model-datasets (assuming that the pDEs reported by Kazmierczak et al. (2003) are true), and is shown in Fig. 1, together with the associated FNR. Genes whose BE statistic was above the 95th percentile of the OT could be considered DE with high certainty. Genes with a BE statistic between the 5th and 95th percentiles of the OT are likely to be DE. BE did rank some genes not previously identified by Kazmierczak et al. (2003) above the 95th percentile of the OT, whereas a few genes previously reported as DE by Kazmierczak et al. (2003) were ranked below the 5th percentile of the OT. However, for most of the genes for which the BE result differed from that reported by Kazmierczak et al. (2003), the findings of the BE method have been validated by other independent studies (elaborated in the Appendix).
4. Simulation studies
Based on the hyperparameters estimated from the data and shown in Table 1, 100 datasets were simulated for each of the four model-datasets. To ease the computational burden, we reduced the number of iterations in the MC integration underlying the BE statistics from 50 000 to 5000. This reduction was warranted by our empirical observation that genes with BE around zero (a natural cutoff value for separating DE and non-DE genes) have the lowest MCSEs. With 5000 iterations, the achieved MCSEs of these genes were below 0·2, so the reduction did not jeopardize the ability of BE to differentiate between DE and non-DE genes.
The performance of the BN, BL and BE statistics was tested under the Normal, Laplace and EVD models. The BE statistic did not do well under the asymmetric Laplace model and did even worse under the Normal model. That is expected, because the Normal model, and to a lesser extent the asymmetric Laplace model, permits both up- and down-regulated genes, while the EVD model allows only one direction of expression (either up- or down-regulation). Because the Normal and Laplace models do not fit data from experiments where only one direction of expression is expected, we present only the results of the simulation under the EVD model.
Simulated datasets (100 of them) were generated under different scenarios. These were analysed with BN, BL and BE, and the performance of the three statistics was evaluated based on FDR vs. FNR plots, where the lower left corner indicates perfect performance, with zero false positives and false negatives. Figure 2 shows FDR vs. FNR plots of the 100 simulated datasets, overlaid by the horizontal average curve and box plots showing the horizontal spread of the performance of the three statistics in the four simulated model-datasets. Averaging over simulated datasets horizontally mimics the situation where the experimenter chooses a tolerable FDR and tries to maximize power. Results for vertical averaging, corresponding to situations where an experimenter chooses an acceptable power while minimizing FDR, are not shown as they were very similar to the results for horizontal averaging. Figure 2 shows that the three statistics do very well in all four model-datasets. The only exception was BL: the analysis failed for several simulated datasets reproduced from the osmotic model-dataset background corrected with MBC. To ensure a fair comparison, the performance of the BL method in the analysis of this model-dataset is not plotted in Fig. 2. The curve showing the average FNR for different values of FDR (horizontal average curve) in Fig. 2 indicates that BE, on average, has better accuracy (correct classification of genes as DE and non-DE) in the simulated data than BN and BL. The horizontal box plots of the achieved FNR for various FDR show a narrower spread for BE than for the other two statistics, indicating better precision (repeatability of classification success) in the simulated data.
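For reference, a minimal sketch of how an FDR vs. FNR curve can be computed for one simulated dataset is given below; scores (the gene-level statistics, e.g. BE values) and is.de (the true DE indicators of the simulated genes) are assumed inputs.

# FDR and FNR over all possible cutoffs on the gene-level statistic
fdr.fnr.curve <- function(scores, is.de) {
  thresholds <- sort(unique(scores), decreasing = TRUE)
  t(sapply(thresholds, function(th) {
    called <- scores >= th                                   # genes declared DE at this cutoff
    fdr <- if (any(called)) sum(called & !is.de) / sum(called) else 0
    fnr <- sum(!called & is.de) / sum(is.de)                 # missed DE genes among true DE genes
    c(threshold = th, FDR = fdr, FNR = fnr)
  }))
}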
The robustness of the BE statistic, as compared with the BN and BL statistics, was tested by varying several key characteristics of a microarray experiment (pDE, n, m and G). The pDE was increased and decreased by 0·10. Figure 3 shows, not surprisingly, that as the pDE decreases all three B statistics perform worse, while they do better as the pDE increases. However, overall, BE consistently performed the best in these simulated datasets. In the osmotic stress data background corrected with MBC, BL failed under all three simulation scenarios (pDE−0·10, pDE and pDE+0·10) in several of the 100 simulated datasets and thus is not shown.
To test the robustness of the model to the number of arrays (n), usually representing the number of biological replicates, we simulated the four model-datasets with n=2, 4 and 8. Figure 4 shows that BE consistently performed better than BN and BL, but the margin of BE's superiority became smaller as the number of arrays increased. Interestingly, the improvement in performance gained with a larger number of arrays was more apparent in the model-datasets corrected with MBC. The wiggly plot of BL for n=2 in the osmotic stress and stationary-phase data corrected with NeBC is a consequence of outliers: in a few simulated datasets, a high FNR was estimated for different values of FDR. When the osmotic stress and stationary-phase data corrected with MBC were used, BL failed in the analysis of several simulated datasets with n=2 and 4 and with n=2, respectively.
We also tested the sensitivity of the model to the number of technical replicates (m); we simulated m=1, 3 and 6 (Fig. 5). BE did better in all four simulated model-datasets, and under all simulated numbers of technical replicates. BL failed in the analysis of several simulations of the MBC osmotic stress dataset when simulated with all three tested values of m (1, 3 and 6). Interestingly, in the stationary-phase data (and to a lesser extent in the osmotic stress data) corrected with NeBC, the average performance of all three statistics was better under the simulation with m=3 than m=6. This may be a consequence of overfitting to the empirical data in estimation of hyperparameters. The performance of all three statistics, particularly BN and BL, was remarkably good in the simulated stationary-phase data corrected with MBC.
Finally, we also tested the performance of the BE method in experiments characterized by a high number of tested genes but only a few DE genes (Fig. 6). We ran the models with the combinations (G=200, pDE=0·05) and (G=1000, pDE=0·01) (note the equal number (10) of DE genes in both settings). BE did better than the other two statistics, although the margin was not great, particularly in the stationary-phase data. That is expected because, as the number of genes increases and only a few genes are DE, the distribution of gene-specific means becomes closer to the normal distribution. BL failed in the analysis of several realizations of the osmotic stress data corrected with MBC under the scenario with G=200 and pDE=0·05.
As illustrated above, in all simulated settings, the performance of the BE model was at least as good as the performance of the other two models. BL failed in several simulated datasets under several simulation settings, mostly due to errors in integration and/or hyperparameter estimation. We would like to add here that we performed simulation analysis on the datasets generated under the modified Normal model that allows only mean ratios of expressions larger than 1 (i.e. higher gene expression in the wild-type as compared with the mutant). The results were similar to those obtained from simulation under the EVD model, so they are not shown here.
5. Discussion
We proposed a novel method (BE) for analysis of microarray data from experiments where only up- or down-regulated genes were expected or relevant. Its merit originates from an empirical Bayes framework and foundation in the EVT, as well as its good accuracy, precision and stability demonstrated in the simulated datasets. The main limitations of the BE method pertain to its sensitivity to the assumed pDE and the higher computational cost of its underlying numerical approximation through the MC integration method.
The main asset of empirical Bayes methods in the analysis of differential expression is their use of information from hundreds or thousands of simultaneously tested genes to support testing at the gene level. However, empirical Bayes methods do make distributional assumptions, particularly when the choice of priors for the parameters is limited to conjugate priors (Lonnstedt, 2001). We stepped out of the conjugate prior class and, in experiments where only up- or only down-regulated genes are expected or relevant, assumed that the mean expression ratios of DE genes follow the EVD. In microarrays expecting a single direction of regulation, the validity of this assumption rests on the fact that the distribution of the genes' true mean log expression ratios cannot be symmetric around zero. In experiments where only a certain direction of regulation is relevant, it is advantageous to use the BE over the BN method even if the gene expression data are more or less symmetric around zero. That is because, being restricted to one direction of expression due to the design of the experiment or the interest of the investigator, the genes characterized by such expressions are effectively ‘selected’ beyond a threshold expression and therefore show extreme behaviour.
The three families in the EVD (Gumbel, Fréchet and Weibull) have distinct forms of tail behaviour, with a finite upper end-point for the Weibull distribution and infinite ones for the Gumbel and Fréchet distributions (Coles, 2001). Furthermore, the density in the upper tail decays exponentially for the Gumbel distribution and polynomially for the Fréchet distribution (Coles, 2001). While, consequently, the three families give quite different representations of extreme value behaviour, it may not be obvious which best represents the data and the problem at hand. It is thus convenient to use the single GEV family rather than ‘guessing’ which of the three original EVD families is proper, because the data themselves (through inference about the shape parameter) determine the most appropriate type of tail behaviour (Coles, 2001).
To estimate the hyperparameters of the EVD, we applied the MLE procedure. While many techniques have been proposed for estimation of hyperparameters in extreme value models (including graphical techniques and moment-based and likelihood-based methods), the MLE approach is particularly appealing due to its all-round utility and adaptability (Coles, 2001). Indeed, our analyses showed that BE works well even when the hyperparameters a, b and c are estimated from a small sample of genes (~10), corresponding to a smaller pDE and/or a smaller number of tested genes G. However, caution is needed when MLE is used to estimate the means and SEs of the EVD hyperparameters. Because the end-points of the GEV distributions are functions of the parameter values, the regularity conditions required for the usual asymptotic properties of the maximum likelihood (ML) estimator are not satisfied by the GEV model (Coles, 2001). In particular, in experiments involving detection of down-regulated genes, if the value of the shape parameter (c) lies between −1 and −0·5, ML estimators are generally obtainable but do not have the standard asymptotic properties, while when c<−1, ML estimators are unlikely to be obtainable (Smith, 1985). Conversely, when c>−0·5, ML estimators are regular (have the usual asymptotic properties) (Smith, 1985). Therefore, in experiments designed to detect up-regulated genes, where c>0, the usual asymptotic properties immediately apply. For detection of down-regulated genes, it is reasonable to negate the contrast estimators from the gene-specific linear models before estimating the hyperparameters by MLE and subsequently approximating the BE statistics, i.e. to switch from modelling minima to modelling maxima.
While only one direction of differential expression is often expected in wild-type vs. mutant experiments, it is possible to observe a few genes with expression in the opposite direction. For example, while the sigma factor σB is a positive regulator of gene expression, some genes show higher transcript levels in a sigB null mutant, likely representing indirect effects, which may nevertheless be biologically relevant. In the bacterium Staphylococcus aureus, for instance, genes encoding a number of exoenzymes and toxins show higher transcript levels in a sigB null mutant strain than in the parent strain with an intact sigB gene, suggesting indirect negative regulation of these genes by σB (Bischoff et al., 2004). To test differential expression of a suspected biologically relevant gene regulated in the direction opposite to that of the majority of the genes in the experiment, its contrast estimator should be negated prior to applying the BE method. Otherwise, the BE method would rank this gene low, indicating non-differential expression.
Because there is no conjugate structure between the normal likelihood of a gene being DE and the EVD prior, it is impossible to evaluate BE analytically. Therefore, the MC integration approximation technique was applied. While this allowed us to base the choice of the prior on the nature of the problem, rather than on the available conjugate priors, it came at the price of a higher, but still reasonable (e.g. 16 min for 50 000 iterations in a dataset with 10 000 genes), computational cost, related to the large number of tested genes and the (usually) large number of iterations required to achieve reasonable accuracy of the BE statistics (because the accuracy increases only with the square root of the number of iterations). The MCSEs of the estimated BE statistics differed for various genes, being lowest for genes with a BE statistic around zero. That is as expected, because genes with a high mean expression ratio, corresponding to the upper tail of the EVD prior, will likely have a high BE statistic. Similarly, genes with a high variance, corresponding to the upper tail of the variance prior, will likely have a low BE statistic. At each iteration, the probability of drawing a sample from a tail is lower than from the body of the prior distribution, so these genes require more iterations to achieve sufficiently low MCSEs. If the BE statistic is used simply to identify DE genes, as in the present study, an even lower number of iterations may be sufficient. However, when BE is used with the purpose of precisely ranking genes by their strength of differential expression, a larger number of iterations may be necessary.
As discussed by Smyth (2004), it is possible to estimate pDE from the data, for example from the posterior odds through MLE or based on the P-values from t-statistics (moderated or ordinary) (as in Langaas et al., 2005 and Wu et al., 2006). Nevertheless, estimation of pDE is unstable (related to the collapse in estimation of the posterior odds caused by the boundary values pDE=1 and pDE=0 having positive probability) and may be sensitive to the particular prior distribution assumed for the contrast estimators from the linear models and to dependence between the genes (Smyth, 2004). To bypass these problems, Smyth (2004) and Lonnstedt & Speed (2002) fixed pDE and then estimated the remaining hyperparameters. The same approach has been adopted in the BE method. A wrong choice of pDE would not change the shape of the BE statistic vs. fold change plot (i.e. the ranking of the genes would stay intact), but it would move the BE values up or down on the y-axis (as in Lonnstedt & Speed, 2002). Also, through the influence of pDE on the hyperparameters of the EVD, it would change the fold change corresponding to BE=0. Accepting that BE=0 is a natural cutoff separating DE from non-DE genes, the fold change corresponding to this point represents the ‘meaningful fold change cutoff’ above which a gene could be considered meaningfully DE. This ‘meaningful fold change cutoff’ could give some indication of whether the chosen pDE is wrong. For example, in experiments expecting only up-regulated genes, a ‘meaningful fold change cutoff’ ⩽1 would indicate that the selected pDE is too high.
The accuracy and precision of BN, BL and BE were assessed based on the achieved FDR and FNR over 100 simulated datasets generated under the EVD model. In an experiment, a researcher could choose a tolerable FDR and try to minimize the FNR (i.e. maximize power), or choose a tolerable FNR and try to minimize the FDR. In the simulated datasets we mimicked both situations, and in both, the accuracy of the BE method was at least as good as, or better than, that of the other two tested methods. In addition to being more accurate, BE was also more precise than BN and BL in the simulated datasets. We also tested the sensitivity of BE to several key characteristics of a microarray experiment, namely the pDE, the number of arrays (usually representing the number of biological replicates) and the number of technical replicates, as well as its performance in experiments with a high number of tested genes but only a few DE genes. Overall, BE performed the best in all four simulated model-datasets and under all the tested simulation scenarios. When applied to real datasets, BE performed well, with most of the findings being validated by other independent studies (elaborated in the Appendix). This ‘reality check’ is a valuable addition to the simulation studies because the datasets used in the simulation analyses were replicated from the same model that BE is built upon, limiting the ability of the simulation studies to give a completely unbiased evaluation of BE performance.
6. Conclusions
In microarray experiments where only one direction of expression is expected or relevant (only up- or down-regulation), such as in certain experiments involving a wild-type vs. knockout design (Kazmierczak et al., 2003; van Schaik et al., 2007) and in some restricted-coverage arrays, including those performed with the purpose of identifying novel gene markers and drug targets (Suzuki et al., 2002; Kobayashi et al., 2004), the genes' true mean expression ratios are extreme in nature. EVT was designed specifically to study extreme behaviour and is therefore an excellent candidate for use in the analysis of differential expression in such experiments. In this paper, we proposed a new empirical Bayes method, BE, for analysis of differential expression that is based on EVT, and we compared its performance with that of two other empirical Bayes methods reportedly used in the analysis of such data. The first of the two methods, BN, is a very popular method that is, however, based on a distributional assumption that is invalid for experiments where only one direction of expression is anticipated or of interest. The distributional assumption of the other method, BL, is valid, but the method is unstable and its performance is merely comparable with that of BN. Based on the series of simulation analyses and the analyses of two real datasets, we believe that the BE method has greater accuracy, precision and stability than the BN and BL methods. It thus seems promising and useful for the analysis of differential expression in experiments where only up- or down-regulated genes are relevant or expected. Because custom, restricted-coverage microarray experiments, including those described here, are likely to become much more common in the future due to their possible use in therapeutic and diagnostic applications (Liu-Stratton et al., 2004), the development and use of appropriate bioinformatics tools will become vital, and the advantages of the BE method may become even more evident.
This work was funded by the National Institutes of Health (RO1-AI052151-01A1 to Kathryn J. Boor and M. W.) and the United States Department of Agriculture (USDA) National Needs Fellowship grant (2002-38420-11738S to S. R.).
Appendix
The appendix for this paper is available online at the journal's website (http://journals.cambridge.org/grh).