Book contents
- Frontmatter
- Dedication
- Contents
- Preface
- Acknowledgments
- Notation
- Part I Classic Statistical Inference
- Part II Early Computer-Age Methods
- Part III Twenty-First-Century Topics
- 15 Large-Scale Hypothesis Testing and FDRs
- 16 Sparse Modeling and the Lasso
- 17 Random Forests and Boosting
- 18 Neural Networks and Deep Learning
- 19 Support-Vector Machines and Kernel Methods
- 20 Inference After Model Selection
- 21 Empirical Bayes Estimation Strategies
- Epilogue
- References
- Author Index
- Subject Index
15 - Large-Scale Hypothesis Testing and FDRs
from Part III - Twenty-First-Century Topics
Published online by Cambridge University Press: 05 July 2016
Summary
By the final decade of the twentieth century, electronic computation fully dominated statistical practice. Almost all applications, classical or otherwise, were now performed on a suite of computer platforms: SAS, SPSS, Minitab, Matlab, S (later R), and others.
The trend accelerates when we enter the twenty-first century, as statistical methodology struggles, most often successfully, to keep up with the vastly expanding pace of scientific data production. This has been a two-way game of pursuit, with statistical algorithms chasing ever-larger data sets, while inferential analysis labors to rationalize the algorithms. Part III of our book concerns topics in twenty-first-century statistics.
The word “topics” is intended to signal selections made from a wide catalog of possibilities. Part II was able to review a large portion (though certainly not all) of the important developments during the postwar period. Now, deprived of the advantage of hindsight, our survey will be more illustrative than definitive.
For many statisticians, microarrays provided an introduction to large-scale data analysis. These were revolutionary biomedical devices that enabled the assessment of individual activity for thousands of genes at once — and, in doing so, raised the need to carry out thousands of simultaneous hypothesis tests, with the prospect of finding only a few interesting genes among a haystack of null cases. This chapter concerns large-scale hypothesis testing and the false-discovery rate, the breakthrough in statistical inference it elicited.
Large-Scale Testing
The prostate cancer data, Figure 3.4, came from a microarray study of n = 102 men, 52 prostate cancer patients and 50 normal controls. Each man's gene expression levels were measured on a panel of N = 6033 genes, yielding a 6033 × 102 matrix of measurements xij.
For each gene i, a two-sample t statistic ti (2.17) was computed comparing gene i's expression levels for the 52 patients with those for the 50 controls. Under the null hypothesis H0i that the patients' and the controls' responses come from the same normal distribution of gene i expression levels, ti follows a standard Student t distribution with 100 degrees of freedom, t100.
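The calculation above can be sketched in a few lines of code. The snippet below is a minimal illustration, not the study's actual analysis: it uses simulated null data in place of the real prostate expression matrix (so every gene is null by construction), and the function name `two_sample_t` is our own. It computes the pooled-variance two-sample t statistic for each of the N = 6033 genes at once; under H0i each statistic follows a t distribution with 52 + 50 − 2 = 100 degrees of freedom.

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated stand-in for the prostate data: rows are N = 6033 genes,
# columns are n = 102 men (52 patients, then 50 controls).
N, n1, n2 = 6033, 52, 50
X = rng.standard_normal((N, n1 + n2))

patients, controls = X[:, :n1], X[:, n1:]

def two_sample_t(a, b):
    """Pooled-variance two-sample t statistic, one value per row (gene)."""
    na, nb = a.shape[1], b.shape[1]
    mean_diff = a.mean(axis=1) - b.mean(axis=1)
    pooled_var = ((na - 1) * a.var(axis=1, ddof=1)
                  + (nb - 1) * b.var(axis=1, ddof=1)) / (na + nb - 2)
    return mean_diff / np.sqrt(pooled_var * (1 / na + 1 / nb))

t = two_sample_t(patients, controls)
print(t.shape)  # one t statistic per gene: (6033,)
```

With all genes null, the 6033 statistics should look like a sample from t100: mean near 0, standard deviation near sqrt(100/98) ≈ 1.01. The same per-gene statistics could also be obtained with `scipy.stats.ttest_ind` applied along the gene axis.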
Computer Age Statistical Inference: Algorithms, Evidence, and Data Science, pp. 271–297. Cambridge University Press, 2016.