Book contents
- Frontmatter
- Contents
- List of Contributors
- 1 Data-Intensive Computing: A Challenge for the 21st Century
- 2 Anatomy of Data-Intensive Computing Applications
- 3 Hardware Architectures for Data-Intensive Computing Problems: A Case Study for String Matching
- 4 Data Management Architectures
- 5 Large-Scale Data Management Techniques in Cloud Computing Platforms
- 6 Dimension Reduction for Streaming Data
- 7 Binary Classification with Support Vector Machines
- 8 Beyond MapReduce: New Requirements for Scalable Data Processing
- 9 Let the Data Do the Talking: Hypothesis Discovery from Large-Scale Data Sets in Real Time
- 10 Data-Intensive Visual Analysis for Cyber-Security
- Index
- References
9 - Let the Data Do the Talking: Hypothesis Discovery from Large-Scale Data Sets in Real Time
Published online by Cambridge University Press: 05 December 2012
Summary
Discovering Biological Mechanisms through Exploration
The availability of massive amounts of data in the biological sciences is forcing us to rethink the role of hypothesis-driven investigation in modern research. Soon thousands, if not millions, of whole-genome DNA and protein sequence data sets will be available thanks to continued improvements in high-throughput sequencing and analysis technologies. At the same time, high-throughput experimental platforms for gene expression, protein and protein fragment measurements, and others are driving experimental data sets to extreme scales. As a result, the biological sciences are undergoing a paradigm shift from hypothesis-driven to data-driven scientific exploration. In hypothesis-driven research, one begins with observations, formulates a hypothesis, then tests that hypothesis in controlled experiments. In a data-rich environment, however, one often begins with only a cursory hypothesis (such as that some class of molecular components is related to a cellular process) that may require evaluating hundreds or thousands of specific hypotheses rapidly. Performing this many physical experiments is generally intractable. However, data can often be brought to bear to rapidly evaluate and refine these candidate hypotheses into a small number of testable ones. Moreover, the amount of data required to discover and refine a hypothesis in this way often overwhelms conventional analysis software and hardware. Ideally, advanced hardware could help, but conventional batch-mode access models for high-performance computing are not amenable to real-time analysis within larger workflows. We present a model for a real-time, data-intensive hypothesis discovery process that unites parallel software applications, high-performance hardware, and visual representation of the output.
- Type: Chapter
- Book: Data-Intensive Computing: Architectures, Algorithms, and Applications, pp. 235–257
- Publisher: Cambridge University Press
- Print publication year: 2012