Bioinformatics in otolaryngology research. Part one: concepts in DNA sequencing and gene expression analysis

T J Ow; K Upadhyay; T J Belbin; M B Prystowsky; H Ostrer; R V Smith

doi:10.1017/S002221511400200X

Published online by Cambridge University Press: 16 September 2014

T J Ow*: Affiliation:
Department of Otorhinolaryngology – Head and Neck Surgery, Montefiore Medical Center and Albert Einstein College of Medicine, Bronx, New York, USA; Department of Pathology, Montefiore Medical Center and Albert Einstein College of Medicine, Bronx, New York, USA
K Upadhyay: Affiliation:
Department of Pathology, Montefiore Medical Center and Albert Einstein College of Medicine, Bronx, New York, USA
T J Belbin: Affiliation:
Department of Pathology, Montefiore Medical Center and Albert Einstein College of Medicine, Bronx, New York, USA
M B Prystowsky: Affiliation:
Department of Pathology, Montefiore Medical Center and Albert Einstein College of Medicine, Bronx, New York, USA
H Ostrer: Affiliation:
Department of Pathology, Montefiore Medical Center and Albert Einstein College of Medicine, Bronx, New York, USA; Department of Pediatrics, Montefiore Medical Center and Albert Einstein College of Medicine, Bronx, New York, USA
R V Smith: Affiliation:
Department of Otorhinolaryngology – Head and Neck Surgery, Montefiore Medical Center and Albert Einstein College of Medicine, Bronx, New York, USA; Department of Pathology, Montefiore Medical Center and Albert Einstein College of Medicine, Bronx, New York, USA; Department of Surgery, Montefiore Medical Center and Albert Einstein College of Medicine, Bronx, New York, USA
*: Address for correspondence: Dr Thomas J Ow, Department of Otorhinolaryngology – Head and Neck Surgery, Montefiore Medical Center, 3rd Floor MAP Building, 3400 Bainbridge Avenue, Bronx, New York 10467, USA E-mail: [email protected]

Abstract

Background:

Advances in high-throughput molecular biology, genomics and epigenetics, coupled with exponential increases in computing power and data storage, have led to a new era in biological research and information. Bioinformatics, the discipline devoted to storing, analysing and interpreting large volumes of biological data, has become a crucial component of modern biomedical research. Research in otolaryngology has evolved along with these advances.

Objectives:

This review highlights several modern high-throughput research methods, and focuses on the bioinformatics principles necessary to carry out such studies. Several examples from recent literature pertinent to otolaryngology are provided. The review is divided into two parts; this first part discusses the bioinformatics approaches applied in nucleotide sequencing and gene expression analysis.

Conclusion:

This paper demonstrates how high-throughput nucleotide sequencing and transcriptomics are changing biology and medicine, and describes how these changes are affecting otorhinolaryngology. Sound bioinformatics approaches are required to obtain useful information from the vast new sources of data.

Keywords

Bioinformatics; Otolaryngology; Sequencing Analysis, DNA; High-Throughput Nucleotide Sequencing; Gene Expression; Transcriptome

Type: Review Article
Information: The Journal of Laryngology & Otology, Volume 128, Issue 10, October 2014, pp. 848-858

DOI: https://doi.org/10.1017/S002221511400200X
Copyright: Copyright © JLO (1984) Limited 2014

Introduction

The twenty-first century is proving to be an era driven by information. The increase in computing power and speed, the overwhelming impact of the internet, and the abundance of data available during the last two decades have changed nearly every aspect of life. The staggering amount of readily available information has led to the establishment of ‘informatics’, a discipline that focuses on the storage, retrieval and processing of data.

The biological sciences have been greatly impacted by the ‘information age’, which has led to the development of the field of bioinformatics. Advances in genetics and molecular biology have been driven recently by methodologies that generate massive amounts of data in a short time. For example, next-generation sequencing techniques can provide the base pair sequence of an entire human genome in a matter of weeks. Microarray technology can assess the gene expression or methylation status of tens of thousands of genes, probe a million single nucleotide polymorphisms, or capture DNA fragments composing the entire human exome on a single array chip.

These advances have led to an equivalent explosion in published scientific information that is readily accessible to the medical and scientific community worldwide. Bioinformatics is a multidisciplinary field that integrates a vast array of subjects, including computer science, mathematics, statistics and biology, with the goal of optimising the acquisition, storage, analysis, interpretation and application of biological data. An important goal of translational research in medicine today is to utilise and interpret this massive amount of data to yield information that is applicable to patient care, ultimately resulting in improved strategies for diagnosis, prognostication and treatment.

The advances highlighted above have already had a great impact on medical practice and the field of otolaryngology. Otolaryngology is a subspecialty that focuses on several congenital, inflammatory, immunological, infectious and neoplastic disorders, and discoveries in biology gleaned from modern bioinformatics approaches will have an impact on the way several of these disorders are diagnosed and treated in the near future. This review highlights several of the most relevant applications of bioinformatics in otolaryngology. The goal of this review is to provide a basic understanding of bioinformatics to the clinical otolaryngologist. This paper briefly explains common modern molecular biology techniques, describes bioinformatics approaches for data analysis and interpretation, and provides several contemporary references as examples of these applications in otolaryngology.

The review has been divided into two parts. The first instalment details bioinformatics approaches to high-throughput nucleotide sequencing and gene expression analysis. Many principles involved in the bioinformatic approaches to these two types of data can be applied to other high-throughput molecular biology techniques. The second part of this review summarises several other modern genomic, epigenetic and molecular biology platforms, highlighting the considerations in bioinformatics for each. A glossary of terms specific to molecular biology, genomics and bioinformatics is provided as a reference for the reader (Table I).

Table I Glossary of terms

mRNA = messenger RNA; 2D = two-dimensional; cDNA = complementary DNA

Common bioinformatics applications in ENT

Bioinformatics is an immense field, and a full description of the breadth of applications that fall under the discipline of bioinformatics is well beyond the scope of this article. The first part of this review will focus on next-generation sequencing data and gene expression analysis, as several principles from these two platforms can be applied to other modern high-throughput systems. In the second part of the review, other important methodologies are summarised. The series concludes with a discussion of recent approaches used for integration of these data, and developments anticipated in the near future.

Bioinformatics in DNA sequencing

Over the course of approximately 50 years, our understanding of the human genome has grown exponentially. In 1953, Watson and Crick first reported their discovery of the double-helix structure of DNA, and proposed a mechanism for how heritable information was passed on from generation to generation.¹ It was not until Frederick Sanger developed the chain-termination method of DNA sequencing in 1977 that a rapid and reliable method of DNA sequencing became feasible.² Sanger sequencing techniques were used to carry out the Human Genome Project,³,⁴ which was completed at the turn of this century after more than a decade of work across several institutions.

As the Human Genome Project neared completion, the demand for high-throughput sequencing that required less time and cost to produce large amounts of DNA sequence increased dramatically. During the last decade, modern ‘next-generation’ sequencing techniques have evolved. These techniques ‘parallelise’ sequencing reactions on an enormous scale, such that several sequences from smaller fragments of the DNA of interest are sequenced simultaneously and then computationally aligned in order to arrive at the final sequence result.⁵,⁶ These techniques have now made it possible to sequence an entire human genome in a matter of weeks at a cost that is several orders of magnitude less than that of the Human Genome Project. Details of the methodology of next-generation sequencing are not discussed here, but for the purposes of this review it is important to emphasise that it has become affordable and efficient to produce large amounts of genomic sequencing data in a short time. Analysis and interpretation of these data require distinct bioinformatics approaches.

Analysis of DNA sequences

Whether one wants to sequence a single fragment of DNA, hundreds of genes at a time, or sequence the entire human exome or genome, the data generated must undergo specific processing and analysis in order to be interpretable and to ultimately produce useful information. Single, small fragments of DNA (of the order of tens to hundreds of base pairs) can be aligned to a specific reference sequence. Several tools exist for this process.

Perhaps the most frequently utilised tool is the Basic Local Alignment Search Tool (‘BLAST’), publicly available via the National Center for Biotechnology Information.⁷ This alignment tool was created in 1990,⁸ and remains one of the most highly utilised bioinformatics applications. Using this tool, one can enter a DNA sequence or protein sequence, which is then compared, using a heuristic algorithm, against several databases to create a library of sequence matches. The Basic Local Alignment Search Tool has numerous applications, but one particularly useful function is to align and map a DNA sequence to the reference database for the human genome.

While the Basic Local Alignment Search Tool is useful to evaluate single or small numbers of DNA sequences of short length, more complex sequencing projects, such as those applying next-generation sequencing platforms, require alignment tools that are much more complex and efficient. When analysing large amounts of DNA sequence, the goals are generally two-fold: (1) to obtain the sequence itself, and (2) to identify the position of the examined sequences in a reference genome (e.g. the human genome). Next-generation sequencing techniques typically provide millions of copies of relatively short lengths of sequence reads, which must be assembled into the final target sequence. Sequence assembly can follow a de novo process, or it can be accomplished via alignment to a reference.⁹ Figure 1 presents a simplified model of these two methods of sequence assembly.

Fig. 1 Simplified schematic of (a) de novo versus (b) alignment sequence assembly.

Traditional alignment programs are interfaced with tools such as the Basic Local Alignment Search Tool. They operate in a manner similar to that described above, but on a much larger scale, using an automated and iterative process. A significant amount of computing power and time is required to accomplish the ultimate alignment.¹⁰ Newer programs, such as Maq or Bowtie, use highly efficient algorithms to perform sequence alignment. A summary of these methods, as well as of other common open-source alignment tools, is provided in a review by Trapnell and Salzberg.¹⁰
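The idea of placing short reads on a reference can be illustrated with a small Python sketch. It is a toy seed-and-check approach that indexes short substrings (k-mers) of the reference; it is not the algorithm used by Maq or Bowtie, and the function names, the reference string and the mismatch threshold are assumptions made purely for illustration.

```python
from collections import defaultdict

def build_kmer_index(reference, k):
    """Map every k-mer in the reference to the positions where it starts."""
    index = defaultdict(list)
    for pos in range(len(reference) - k + 1):
        index[reference[pos:pos + k]].append(pos)
    return index

def map_read(read, reference, index, k, max_mismatches=1):
    """Return (start, mismatches) for the best placement of the read, or None."""
    best = None
    # Each k-mer of the read acts as a seed; every index hit implies a candidate start.
    for offset in range(len(read) - k + 1):
        for hit in index.get(read[offset:offset + k], []):
            start = hit - offset
            if start < 0 or start + len(read) > len(reference):
                continue
            window = reference[start:start + len(read)]
            mismatches = sum(a != b for a, b in zip(read, window))
            if mismatches <= max_mismatches and (best is None or mismatches < best[1]):
                best = (start, mismatches)
    return best

# Hypothetical reference and reads, for illustration only.
reference = "ACGTTGCATGCCGATAGGCTAACGT"
index = build_kmer_index(reference, k=4)
exact_hit = map_read("GCCGATAG", reference, index, k=4)         # read matching the reference
one_mismatch_hit = map_read("GCCGTTAG", reference, index, k=4)  # same read with one altered base
print(exact_hit, one_mismatch_hit)
```

Real aligners must also handle insertions, deletions, base quality scores and reads that map to several locations, which this sketch ignores.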

The alignment method for sequence assembly, rather than a de novo process, is the most commonly used approach for next-generation sequencing applications in medicine. De novo sequencing assembly is a process by which reads are aligned based on overlapping regions of the analysed fragments. Sequences generated with overlapping common regions are ‘aligned’ to piece together the ultimate read for the complete sequence of interest. These methods require assembly programs that use different algorithms from those of the alignment tools. De novo assembly is commonly applied to smaller genomes, such as those of bacteria. De novo sequencing of a large target sequence, such as that of the human genome, is still a major challenge.⁹ De novo sequencing methods and tools are further discussed in a review by Nagarajan and Pop.¹¹
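The overlap principle behind de novo assembly can be sketched in a few lines of Python. This is a greedy toy example under strong assumptions (error-free reads, a single strand, no repeats); the function names and the minimum overlap length are hypothetical, and production assemblers use considerably more sophisticated graph-based methods.

```python
def overlap(a, b, min_length):
    """Length of the longest suffix of a that equals a prefix of b (0 if below min_length)."""
    for length in range(min(len(a), len(b)), min_length - 1, -1):
        if a.endswith(b[:length]):
            return length
    return 0

def greedy_assemble(reads, min_length=3):
    """Repeatedly merge the pair of sequences that shares the largest overlap."""
    contigs = list(reads)
    while len(contigs) > 1:
        best_len, best_i, best_j = 0, None, None
        for i, a in enumerate(contigs):
            for j, b in enumerate(contigs):
                if i != j:
                    length = overlap(a, b, min_length)
                    if length > best_len:
                        best_len, best_i, best_j = length, i, j
        if best_len == 0:
            break  # nothing left to merge; remaining contigs stay separate
        merged = contigs[best_i] + contigs[best_j][best_len:]
        contigs = [c for idx, c in enumerate(contigs) if idx not in (best_i, best_j)]
        contigs.append(merged)
    return contigs

# Three hypothetical overlapping reads.
reads = ["ATGGCGT", "GCGTACCA", "ACCATTGA"]
contigs = greedy_assemble(reads)
print(contigs)
```

Because every pair of reads is compared, the cost grows quickly with the number of reads, which gives some intuition for why de novo assembly of large genomes remains difficult.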

For most DNA sequencing applications in otolaryngology, and medicine in general, the goal is to identify abnormalities in the DNA sequence, namely germline or somatic mutations. Once sequence data are aligned to a reference (for example, the latest build of the human genome), one can generate a catalogue of any differences between the DNA sequences and the reference. These are often referred to as ‘variants’; this generic term is used because these differences can be due to either mutation or known polymorphisms (i.e. normal differences in the DNA sequence that occur with a defined frequency in the population). When localised to one nucleotide, these polymorphisms are referred to as single nucleotide polymorphisms. The term ‘variants’ can also be used to describe changes in the copy number of genomic elements (known as copy number variations) or other structural changes, such as chromosomal translocations.

Variant calling (the process of identifying differences between the experimental sequence and the reference to which the sequence is aligned) might seem simple in principle, but there are several barriers to accurate variant calling when analysing a large volume of sequence reads (for example, when examining an exome or a genome). Differences between the DNA sequence and the reference can be secondary to error or misalignment, which can be influenced by several factors. For example, false variant calls are common when: insertion/deletion mutations (indels) are present; regions of DNA containing a high volume of repeated guanine/cytosine nucleotides (guanine-cytosine rich regions) are evaluated; or there are errors due to polymerase chain reaction artefacts in library construction or poor sequence signal, which is common at the ends of each sequence read.¹²
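A deliberately simplified Python sketch shows why read depth matters when calling variants. It assumes reads that are already aligned without indels, and it uses arbitrary depth and allele-fraction thresholds chosen only for illustration; it does not reproduce the probabilistic models of any published variant caller.

```python
from collections import Counter

def call_variants(reference, aligned_reads, min_depth=3, min_alt_fraction=0.3):
    """aligned_reads: list of (start, sequence) pairs aligned to the reference without indels."""
    pileup = [Counter() for _ in reference]
    for start, seq in aligned_reads:
        for offset, base in enumerate(seq):
            pileup[start + offset][base] += 1
    variants = []
    for pos, counts in enumerate(pileup):
        depth = sum(counts.values())
        if depth < min_depth:
            continue  # too few reads to support a call at this position
        for base, count in counts.items():
            if base != reference[pos] and count / depth >= min_alt_fraction:
                variants.append((pos, reference[pos], base, count, depth))
    return variants

# Hypothetical data: a true A>T change at position 4 and one isolated read error at position 7.
reference = "ACGTACGTAC"
aligned_reads = [
    (0, "ACGTTC"), (2, "GTTCGT"), (3, "TTCGTA"),
    (4, "ACGTAC"), (1, "CGTTCG"), (3, "TTCGAA"),
]
variants = call_variants(reference, aligned_reads)
print(variants)
```

The change supported by five of six reads is reported, whereas the mismatch seen in one of four reads falls below the assumed allele-fraction threshold and is treated as probable error.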

Variant calling for next-generation sequencing data requires alignment, adjustments for quality control and probability algorithms to define the confidence with which a variant is ‘called’. Several software tools exist for this process, such as the Genome Analysis Toolkit,¹³ VarScan or VarScan2,¹⁴ SOAPsnp (a member of the Short Oligonucleotide Analysis Package),¹⁵ and Atlas 2.¹⁶ The outputs from these analyses are commonly presented in a ‘variant call format’, or ‘VCF’ file, which is a text format that has been developed in order to standardise these data for use in large-scale projects, such as the 1000 Genomes Project. A description of the variant call format can be found on the 1000 Genomes website.¹⁷ Another step in analysing these data, particularly for whole genome or whole exome sequencing, is the determination of regions of minor or major structural variation (e.g. insertions, deletions or break points of translocations), which is much more complicated than the identification of single nucleotide variants. Programs that carry out these processes include Pindel,¹⁸ Dindel¹⁹ and some features in the Genome Analysis Toolkit.
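Because the variant call format is plain tab-delimited text, its basic layout can be shown with a short Python sketch. The parser below reads only the eight fixed columns (CHROM, POS, ID, REF, ALT, QUAL, FILTER, INFO) and ignores per-sample genotype columns; the example records are invented for illustration, and a maintained VCF library would be preferable for real data.

```python
def parse_vcf(lines):
    """Parse the eight fixed VCF columns from an iterable of text lines."""
    records = []
    for line in lines:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue  # skip meta-information and header lines
        chrom, pos, var_id, ref, alt, qual, filt, info = line.split("\t")[:8]
        info_fields = {}
        for entry in info.split(";"):
            if entry and entry != ".":
                key, _, value = entry.partition("=")
                info_fields[key] = value if value else True  # flags carry no value
        records.append({
            "chrom": chrom,
            "pos": int(pos),
            "id": var_id,
            "ref": ref,
            "alt": alt.split(","),
            "qual": None if qual == "." else float(qual),
            "filter": filt,
            "info": info_fields,
        })
    return records

# Hypothetical VCF content, for illustration only.
vcf_lines = [
    "##fileformat=VCFv4.1",
    "\t".join(["#CHROM", "POS", "ID", "REF", "ALT", "QUAL", "FILTER", "INFO"]),
    "\t".join(["chr1", "12345", ".", "C", "T", "50", "PASS", "DP=42;DB"]),
    "\t".join(["chr2", "67890", ".", "G", "A,T", ".", "q10", "DP=7"]),
]
records = parse_vcf(vcf_lines)
passing = [r for r in records if r["filter"] == "PASS"]
print(passing)
```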

Variant calling is followed by filtering and annotation. The goal of this part of the process is to identify sequence variants of interest. In otolaryngology research, these would be variants that are likely to cause functional alteration in a protein or phenotypic changes at the cellular level, and those that ultimately contribute to disease.

There are several methods of filtering and characterising genomic variants, and the process is tied to the specific scientific question. For example, if the goal is to identify significant mutations in a tumour sample, one might compare the variants called in the tumour DNA with the DNA from a sample of normal tissue (e.g. blood or adjacent mucosa) from the patient. This would improve the identification of acquired mutations in the tumour sample as opposed to variants that were normal polymorphisms harboured by the genome of the individual. Variants can be referenced to large single nucleotide polymorphism databases, such as the Single Nucleotide Polymorphism Database (‘dbSNP’) on the National Center for Biotechnology Information website²⁰ or the 1000 Genomes Project²¹,²² database,²³ in order to separate normal polymorphisms known to exist in the population from novel mutations.
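The tumour versus normal comparison described above amounts to a set operation, as the following Python sketch suggests. Variants are represented as (chromosome, position, reference allele, alternate allele) tuples; all values are hypothetical, and the known-polymorphism list stands in for a lookup against a resource such as dbSNP rather than querying any real database.

```python
def somatic_candidates(tumour_variants, normal_variants, known_polymorphisms):
    """Keep tumour variants absent from the matched normal sample and from known polymorphisms."""
    germline = set(normal_variants)
    known = set(known_polymorphisms)
    return sorted(v for v in set(tumour_variants) if v not in germline and v not in known)

# Hypothetical variant calls.
tumour = [
    ("chr9", 100, "C", "T"),
    ("chr17", 200, "G", "A"),
    ("chr2", 300, "A", "G"),   # also present in the normal sample
    ("chr5", 400, "T", "C"),   # listed as a known polymorphism
]
normal = [("chr2", 300, "A", "G")]
known = [("chr5", 400, "T", "C")]

candidates = somatic_candidates(tumour, normal, known)
print(candidates)
```

In practice the comparison must also account for differences in coverage between the two samples, since a variant can appear absent from the normal tissue simply because too few reads covered that position.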

The functional consequences of variants identified can be evaluated in a high-throughput fashion using variant annotation tools. These programs rapidly scan all reported variants in a project in order to determine if they occur in coding regions and if the variants cause significant alteration of the amino acid sequence in the translated protein (e.g. synonymous vs non-synonymous mutations). Again, several programs are available to annotate variants, including SIFT (Sorting Intolerant from Tolerant),²⁴ PolyPhen²⁵ and Annovar,²⁶ though there are many others.
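The distinction between synonymous and non-synonymous changes can be made concrete with a small Python sketch based on the standard genetic code. It classifies a single-base substitution within a coding sequence that is assumed to start in frame; the function name and the example sequence are hypothetical, and the sketch makes no attempt to predict functional impact in the way the annotation tools named above do.

```python
from itertools import product

# Standard genetic code, listed in the conventional T, C, A, G codon order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(codon): aa
               for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}

def classify_substitution(coding_sequence, position, alt_base):
    """Classify a single-base change; position is 0-based within an in-frame coding sequence."""
    codon_start = position - position % 3
    ref_codon = coding_sequence[codon_start:codon_start + 3]
    alt_codon = list(ref_codon)
    alt_codon[position % 3] = alt_base
    ref_aa = CODON_TABLE[ref_codon]
    alt_aa = CODON_TABLE["".join(alt_codon)]
    if ref_aa == alt_aa:
        return "synonymous"
    if alt_aa == "*":
        return "nonsense"  # introduces a premature stop codon
    return "missense"

# Hypothetical coding sequence: ATG GAA CTG TGG (Met Glu Leu Trp).
cds = "ATGGAACTGTGG"
silent = classify_substitution(cds, 5, "G")    # GAA -> GAG, both glutamate
changed = classify_substitution(cds, 3, "A")   # GAA -> AAA, glutamate to lysine
stopped = classify_substitution(cds, 11, "A")  # TGG -> TGA, tryptophan to stop
print(silent, changed, stopped)
```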

A review by Dolled-Filhart and colleagues summarises the process for analysing next-generation sequencing data, and highlights several additional tools available for alignment, variant calling and annotation.¹² Table II provides a list and brief description of several analysis tools that the authors have found useful for the processing and analysis of next-generation sequencing data.

Table II Selected analysis tools for evaluation of DNA sequencing data

SNP = single nucleotide polymorphism

To date, next-generation sequencing in otolaryngology has arguably had the largest impact in the fields of cancer biology and congenital deafness. In 2011, two articles reported results from whole exome sequencing of head and neck squamous cell carcinoma (SCC), and each independently discovered that mutations in the gene NOTCH1 frequently occurred in these tumours.²⁷,²⁸ Sequence details from both studies showed frequent missense and nonsense mutations, generally occurring upstream of the transmembrane domain of the protein. These studies also confirmed several mutations in genes known to be frequently altered in head and neck SCC (e.g. TP53, CDKN2A),²⁷,²⁸ and identified other mutations in novel genes (e.g. CASP8, FAT1).²⁸ Since these results were published, results from whole exome sequencing have been reported for both medullary thyroid carcinoma²⁹ and adenoid cystic carcinoma.³⁰ Each of these studies used next-generation sequencing to evaluate the whole exome of matched tumour–normal pairs, and used similar bioinformatics approaches to arrive at variant calls unique to these cancer types.

Research has also led to significant discoveries in the genetics of hearing loss. A number of genetic factors have been linked to both syndromic and non-syndromic deafness. Several of these factors are summarised in a recent review by Shearer and Smith.³¹ Next-generation sequencing techniques have been utilised to identify novel mutations associated with hearing loss. For example, a recent study using whole exome sequencing identified a novel mutation in COCH in a Chinese family with progressive autosomal dominant hearing loss.³² Another study, which evaluated 13 Korean families with autosomal recessive non-syndromic hearing loss, identified frequent variants in the gene MYO15A.³³ Along with discovery research, next-generation sequencing is also being advocated as a diagnostic tool. These methods have been proposed as means to: identify patients that carry mutations known to be associated with hearing loss, and discover novel variants in known genetic loci that have previously been shown to be critical for hearing.³⁴

New discoveries and applications using next-generation sequencing are changing our understanding of human disease, and improving approaches to diagnosis and treatment. Applications in otolaryngology are in their infancy. The methods in bioinformatics described above are crucial for the advancement of translational genomics.

Bioinformatics for analysis of high-throughput gene expression data

The ability to evaluate the global expression of thousands of genes in one experiment has been available for approximately two decades, since the advent of microarray technology.¹⁸,³⁵ The standard method for high-throughput gene expression analysis has been the microarray. In these techniques, RNA, usually messenger RNA (mRNA), is either copied or converted to complementary DNA using reverse transcription. These are then fluorescently labelled and hybridised to an array carrying thousands of probes corresponding to specific genes. The arrays are scanned with a laser that causes the hybridised samples to fluoresce; the relative intensity at each probe corresponds to the relative abundance of each transcript. These data can then be used to compare the relative expression of genes that are represented by the probes on the array.

RNA-Seq is a new technology that utilises next-generation sequencing to evaluate RNA expression using a different approach.³⁶ With RNA-Seq, RNA is harvested, and the RNA of interest is isolated (e.g. ribosomal RNA is often discarded, or perhaps only mRNA is isolated). The RNA is then reverse transcribed, and the transcripts are sequenced using next-generation techniques. Once the reads are assembled, the ‘transcriptome’ sequences can be aligned to the reference genome, and relative expression can be estimated from the depth of read coverage for each transcript.³⁶ Subsequent analysis of gene expression microarray data and RNA-Seq transcriptome data requires distinct bioinformatics approaches to glean useful information.

The first step in analysing microarray data is to normalise the data, in order to eliminate measurement noise that is systematic to the platform being utilised for the study. With tens of thousands of signals from each chip analysed, there are several sources of variation: there is normal variation between signals on an individual chip, between data gathered from different chips and between experimental batches.³⁷ Methodologies and programs used to normalise array data are a subject for an entire article. In general, normalisation processes employ several strategies, including methods that use internal controls built into each array (e.g. duplicate probes placed in different locations of an array) and approaches that measure ‘housekeeper’ probes that generally show low variability between samples. Normalisation also considers variation due to batch effect. Several statistical methods aim to adjust the array output values so that differences identified are not due to inherent or experimental variation; examples include global normalisation (which equalises the mean or median values for each array in the experiment), parametric linear regression methods and non-parametric methods.³⁸ There are many approaches and methods that can be employed to normalise microarray data. Reviews by Fan and Ren,³⁹ and Gusnanto and colleagues,³⁸ provide further details of some standard methods.
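Global normalisation, the simplest of the strategies mentioned above, can be sketched in Python as follows. The sketch rescales each array so that all arrays share the same median intensity; the array names and intensities are invented, and the choice of the median of the per-array medians as the common target is an assumption made for illustration.

```python
from statistics import median

def median_normalise(arrays):
    """arrays: dict mapping array name to a list of probe intensities in the same probe order."""
    medians = {name: median(values) for name, values in arrays.items()}
    target = median(medians.values())  # common median that every array is scaled to
    return {
        name: [value * target / medians[name] for value in values]
        for name, values in arrays.items()
    }

# Hypothetical raw intensities: chip_b was scanned twice as brightly as chip_a overall.
arrays = {
    "chip_a": [100.0, 200.0, 300.0],
    "chip_b": [200.0, 400.0, 600.0],
    "chip_c": [50.0, 100.0, 150.0],
}
normalised = median_normalise(arrays)
print(normalised)
```

After scaling, a probe that differs between arrays only because one chip was globally brighter no longer appears differentially expressed. Batch effects and intensity-dependent bias require the more elaborate methods covered in the cited reviews.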

Along with normalisation, a filtering process is often employed. The goal of filtering is to disregard any probes that may not provide meaningful data and would thus add ‘noise’ that could dilute the data of interest.³⁸ For example, if a fraction of probes show very low expression across all of the samples in the experiment, one may wish to eliminate these from the analysis. Conversely, one may want to eliminate probes that show extreme variation between samples. The filtering process depends on the goals of the experiment; if conducted appropriately, the filtering process can improve the signal-to-noise ratio, which is an inherent problem in microarray research.
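The two filtering rules given as examples above can be expressed in a short Python sketch. The thresholds, probe names and intensities are hypothetical, and using the range between the highest and lowest sample as the measure of ‘extreme variation’ is a simplifying assumption; the appropriate criteria depend on the experiment.

```python
def filter_probes(expression, min_expression, max_range=None):
    """expression: dict mapping probe name to a list of normalised values, one per sample."""
    kept = {}
    for probe, values in expression.items():
        if max(values) < min_expression:
            continue  # very low expression in every sample
        if max_range is not None and max(values) - min(values) > max_range:
            continue  # variation between samples judged too extreme
        kept[probe] = values
    return kept

# Hypothetical normalised intensities for three probes across three samples.
expression = {
    "probe_low": [2.0, 3.1, 2.5],
    "probe_ok": [55.0, 80.2, 61.3],
    "probe_wild": [10.0, 900.0, 40.0],
}
kept = filter_probes(expression, min_expression=10.0, max_range=500.0)
print(list(kept))
```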

After normalisation and filtering, data can be analysed to seek an answer to the experimental question. The methodology for interpreting microarray data is diverse, and largely depends on the goals of the experimental design. The following are some basic principles and approaches.

One starting point is to determine if a supervised or an unsupervised approach would be best. Unsupervised approaches are methods that examine inherent patterns within the entire dataset without outside bias. Unsupervised methods include feature determination (a common technique is called a principal component analysis), cluster determination and network determination.³⁷ Supervised analyses are methods that are used to determine which genes or groups of genes on the array can best differentiate between pre-determined groups (e.g. tumour tissue vs normal tissue, treated vs untreated). Statistical methods to perform a supervised analysis range from approaches that are fairly simple, such as basic parametric or non-parametric comparisons between the mean or median values of each probe between each group, to extremely complex, such as the use of support vector machines or Bayesian methods to segregate groups.³⁷ The authors point the reader to the concise review by Butte,³⁷ who highlights some common supervised and unsupervised methods for gene expression array analysis.
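The simplest kind of supervised analysis, a parametric comparison of each probe between two pre-defined groups, might look like the following Python sketch. It ranks genes by the absolute value of Welch's t statistic; the gene names and values are hypothetical, and converting the statistic into a p-value would require a statistics library that is not used here.

```python
from math import sqrt
from statistics import mean, variance

def welch_t(group_a, group_b):
    """Welch's t statistic for two groups with possibly unequal variances."""
    standard_error = sqrt(variance(group_a) / len(group_a) + variance(group_b) / len(group_b))
    return (mean(group_a) - mean(group_b)) / standard_error

def rank_genes(expression):
    """expression: dict mapping gene name to (tumour_values, normal_values)."""
    scores = {gene: welch_t(tumour, normal) for gene, (tumour, normal) in expression.items()}
    return sorted(scores.items(), key=lambda item: abs(item[1]), reverse=True)

# Hypothetical log-scale expression values in tumour versus normal tissue.
expression = {
    "gene_1": ([8.1, 7.9, 8.3], [5.0, 5.2, 4.9]),
    "gene_2": ([6.0, 6.4, 5.8], [6.1, 5.9, 6.3]),
}
ranked = rank_genes(expression)
print(ranked)
```

Because a test of this kind is repeated for every probe on the array, the resulting p-values need to be adjusted for multiple testing before any gene is declared differentially expressed.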

One concept that is important to discuss is that of the ‘false discovery rate’. This is a measure of error, which was developed approximately two decades ago coinciding with the increased application of microarray technology.⁴⁰ In brief, when one performs multiple statistical tests in an iterative fashion, there exists an inherent error with each test performed (each test generates a p-value, the probability of observing a result at least as extreme if there were no true effect, so accepting results below a given p-value threshold carries a defined risk of a type 1 error). As more tests are performed, the chance of ‘discovering’ a positive result simply by chance increases. Therefore, the stringency for calling ‘true positives’ over ‘false positives’ should increase as the number of tests increases. Traditional methods for adjusting the accepted error rate (namely, those that control the family-wise error rate, the most common of which is the Bonferroni correction) are very stringent. When applied to the thousands of data points generated with microarray data, these methods have a high likelihood of dismissing pertinent results. The false discovery rate is a less stringent correction than family-wise error rate methods, and is commonly reported as a q-value. The q-value can be interpreted in a manner similar to the p-value used in single-hypothesis testing, but it refers to the expected proportion of false positives among all results called significant at that threshold.
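The difference in stringency between the two corrections can be seen in a short Python sketch. It applies the Bonferroni threshold and computes Benjamini-Hochberg adjusted p-values, which are used here in the role of q-values; the five p-values are invented for illustration, whereas a real array experiment would involve thousands of tests.

```python
def bonferroni(p_values, alpha=0.05):
    """Flag tests that pass the Bonferroni-corrected threshold alpha / number of tests."""
    m = len(p_values)
    return [p <= alpha / m for p in p_values]

def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the original order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Work from the largest p-value down so that adjusted values never decrease with rank.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted

# Hypothetical p-values from five probe-level tests.
p_values = [0.001, 0.008, 0.012, 0.041, 0.20]
bonferroni_hits = sum(bonferroni(p_values))
q_values = benjamini_hochberg(p_values)
fdr_hits = sum(q <= 0.05 for q in q_values)
print(bonferroni_hits, fdr_hits, q_values)
```

With these values, two tests survive the Bonferroni correction while three have an adjusted value at or below 0.05, which illustrates why false discovery rate control retains more candidate genes.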

The steps highlighted above involve complex bioinformatics approaches to arrive at the end result in microarray analysis; namely, a list of genes corresponding to array probes that appear to be associated with the experimental condition of interest. These could be genes that cluster together in an unsupervised analysis, forming what appear to be biologically distinct entities, or genes that are differentially expressed between pre-defined categories (i.e. in a supervised analysis, groups are predetermined and genes that are expressed differently between the different categories are then identified). An example of a signature generated from a supervised analysis would be a list of genes which are differentially expressed in a set of tumours that respond to a specific therapy versus a set of tumours that do not respond.

Whether attained via unsupervised or supervised analyses, the results of these studies generate a list of genes of interest, which must subsequently be validated both internally and externally. Internal validation examines the expression of individual genes in the original test samples (with reverse-transcription polymerase chain reaction, for example). This is done in order to verify that what was seen on the array was truly from gene expression levels in the samples, as opposed to artefacts on the array or artefacts secondary to experimental methods. Additionally, the results should be validated externally; findings in the test set must often be verified on an additional sample set (e.g. a ‘repeated’ experiment on an independent cohort) in order to deem the results generalisable.

One important concept to mention is the problem of ‘overfitting’ when developing a gene expression signature. Because multiple variables are tested while developing these signatures (i.e. comparisons between thousands of probes), there is a relatively high likelihood that a statistical model generated from the dataset to describe a given cohort will only be applicable to the experimental cohort; the model is therefore ‘overfitted’ internally. This is the main reason why external validation is a required part of the process in generating gene expression signatures. The difficulty in replicating and validating results generated from gene expression microarray experiments is arguably one of the greatest barriers to the progress of these research endeavours.

Gene expression analysis has been applied in several areas of otolaryngology. By far the widest application has been in the field of head and neck cancer biology. A number of interesting bioinformatics approaches have been implemented in these studies. For example, in an early study by Chung et al., 60 tumour samples were examined using complementary DNA gene expression arrays.⁴¹ First, a pooled subset of 30 randomly selected samples from the test set was used as the reference (early generation gene expression arrays used a two-colour fluorescence system to compare a test sample to a reference sample). The gene expression data were normalised and filtered in an interesting manner. As part of the filtering process, in a subset of 10 samples, RNA was extracted from 2 different regions of the same tumour. When these paired samples were examined, genes that showed little intrinsic variance (i.e. variance between matched pairs) but high variance across unmatched samples were selected for further evaluation. An unsupervised hierarchical clustering analysis identified four distinct subtypes of tumours, which appeared to have distinct molecular characteristics. A supervised analysis using two specific statistical methods (k-nearest neighbour method and prediction analysis of microarrays) was carried out to develop a model that was 80 per cent accurate at predicting pathological lymph node status.⁴¹

In another early study, Belbin and colleagues examined 17 head and neck SCC samples.⁴² A gene expression profile derived using an unsupervised analysis was predictive for overall survival within the patient set.

In more recent examples of gene expression studies in head and neck SCC, Schlecht and colleagues used a supervised analysis of gene expression microarray data to demonstrate that human papillomavirus (HPV)-positive and HPV-negative head and neck SCC have distinct gene expression profiles.⁴³ Two studies examining HPV-negative oral cavity SCC demonstrated that a gene signature could be a better predictor of survival than clinicopathological factors.⁴⁴,⁴⁵ In the latter study, the authors notably developed a predictive gene expression signature using an iterative supervised method on a 97-patient training set, and validated the result on an external dataset.⁴⁵ These examples highlight how the bioinformatics principles described in this review have been employed in head and neck cancer research.

Gene expression analysis has not been limited to cancer biology. A study by Klenke and colleagues compared gene expression in cholesteatoma with that in ear canal skin, and demonstrated expression patterns consistent with chronic inflammation and characteristics similar to those of invasive tumours.⁴⁶ A study by Stankovic et al. reported a distinct transcriptional signature in nasal polyps associated with chronic sinusitis, compared with those in patients with aspirin-sensitive asthma.⁴⁷ Another interesting study, by Vambutas and colleagues, used gene expression arrays to evaluate peripheral blood mononuclear cells extracted and cultured from patients with autoimmune hearing loss.⁴⁸ That study revealed that stimulation of cultured peripheral blood mononuclear cells with perilymph extracted at the time of cochlear implantation led to differential expression of interleukin 1 receptor type II.⁴⁸ These studies demonstrate the wide potential applications of gene expression analysis in otolaryngology, and exemplify both the creative approaches and the bioinformatics challenges in analysing and applying these data.

The bioinformatics approaches to analysing gene expression data are complex. Figure 2 summarises the basic approaches to gene expression analysis and lists some tools that the authors have found useful.

Fig. 2 Steps for processing gene expression data, and list of selected tools for normalisation, filtering and analysis. PCA = principal component analysis

Conclusion

As evidenced by this review, high-throughput nucleotide sequencing and transcriptomics are changing biology and medicine. Gleaning useful information from these vast data sources requires sound bioinformatics approaches. Several of the bioinformatics principles described for next-generation sequencing and gene expression analysis are applicable to other high-throughput molecular biology techniques. The next part of this review highlights several other high-throughput platforms, and discusses recent approaches that seek to integrate data from multi-platform projects.

References

1 Watson, JD, Crick, FH. Molecular structure of nucleic acids: a structure for deoxyribose nucleic acid. Nature 1953;171:737–8

2 Sanger, F, Nicklen, S, Coulson, AR. DNA sequencing with chain-terminating inhibitors. Proc Natl Acad Sci USA 1977;74:5463–7

3 Lander, ES, Linton, LM, Birren, B, Nusbaum, C, Zody, MC, Baldwin, J et al. Initial sequencing and analysis of the human genome. Nature 2001;409:860–921

4 Venter, JC, Adams, MD, Myers, EW, Li, PW, Mural, RJ, Sutton, GG et al. The sequence of the human genome. Science 2001;291:1304–51

5 Shendure, J, Lieberman Aiden, E. The expanding scope of DNA sequencing. Nat Biotechnol 2012;30:1084–94

6 Mardis, ER. Next-generation DNA sequencing methods. Annu Rev Genomics Hum Genet 2008;9:387–402

7 Basic Local Alignment Search Tool (BLAST). In: http://blast.ncbi.nlm.nih.gov/Blast.cgi [22 September 2013]

8 Altschul, SF, Gish, W, Miller, W, Myers, EW, Lipman, DJ. Basic local alignment search tool. J Mol Biol 1990;215:403–10

9 Metzker, ML. Sequencing technologies – the next generation. Nat Rev Genet 2010;11:31–46

10 Trapnell, C, Salzberg, SL. How to map billions of short reads onto genomes. Nat Biotechnol 2009;27:455–7

11 Nagarajan, N, Pop, M. Sequence assembly demystified. Nat Rev Genet 2013;14:157–67

12 Dolled-Filhart, MP, Lee, M Jr, Ou-Yang, CW, Haraksingh, RR, Lin, JC. Computational and bioinformatics frameworks for next-generation whole exome and genome sequencing. ScientificWorldJournal 2013;2013:730210

13 McKenna, A, Hanna, M, Banks, E, Sivachenko, A, Cibulskis, K, Kernytsky, A et al. The Genome Analysis Toolkit: a MapReduce framework for analyzing next-generation DNA sequencing data. Genome Res 2010;20:1297–303

14 Koboldt, DC, Chen, K, Wylie, T, Larson, DE, McLellan, MD, Mardis, ER et al. VarScan: variant detection in massively parallel sequencing of individual and pooled samples. Bioinformatics 2009;25:2283–5

15 Li, R, Li, Y, Kristiansen, K, Wang, J. SOAP: short oligonucleotide alignment program. Bioinformatics 2008;24:713–14

16 Challis, D, Yu, J, Evani, US, Jackson, AR, Paithankar, S, Coarfa, C et al. An integrative variant analysis suite for whole exome next-generation sequencing data. BMC Bioinformatics 2012;13:8

17 1000 Genomes. VCF (Variant Call Format) version 4.1. In: http://www.1000genomes.org/wiki/Analysis/Variant%20Call%20Format/vcf-variant-call-format-version-41 [22 September 2013]

18 Ye, K, Schulz, MH, Long, Q, Apweiler, R, Ning, Z. Pindel: a pattern growth approach to detect break points of large deletions and medium sized insertions from paired-end short reads. Bioinformatics 2009;25:2865–71

19 Albers, CA, Lunter, G, MacArthur, DG, McVean, G, Ouwehand, WH, Durbin, R. Dindel: accurate indel calls from short-read data. Genome Res 2011;21:961–73

20 dbSNP home page. In: http://www.ncbi.nlm.nih.gov/projects/SNP/ [2 September 2013]

21 1000 Genomes Project Consortium, Abecasis, GR, Altshuler, D, Auton, A, Brooks, LD, Durbin, RM et al. A map of human genome variation from population-scale sequencing. Nature 2010;467:1061–73

22 1000 Genomes Project Consortium, Abecasis, GR, Auton, A, Brooks, LD, DePristo, MA, Durbin, RM et al. An integrated map of genetic variation from 1,092 human genomes. Nature 2012;491:56–65

23 1000 Genomes FTP directory. In: ftp://ftp-trace.ncbi.nih.gov/1000genomes/ftp/ [2 September 2013]

24 Kumar, P, Henikoff, S, Ng, PC. Predicting the effects of coding non-synonymous variants on protein function using the SIFT algorithm. Nat Protoc 2009;4:1073–81

25 Adzhubei, IA, Schmidt, S, Peshkin, L, Ramensky, VE, Gerasimova, A, Bork, P et al. A method and server for predicting damaging missense mutations. Nat Methods 2010;7:248–9

26 Wang, K, Li, M, Hakonarson, H. ANNOVAR: functional annotation of genetic variants from high-throughput sequencing data. Nucleic Acids Res 2010;38:e164

27 Agrawal, N, Frederick, MJ, Pickering, CR, Bettegowda, C, Chang, K, Li, RJ et al. Exome sequencing of head and neck squamous cell carcinoma reveals inactivating mutations in NOTCH1. Science 2011;333:1154–7

28 Stransky, N, Egloff, AM, Tward, AD, Kostic, AD, Cibulskis, K, Sivachenko, A et al. The mutational landscape of head and neck squamous cell carcinoma. Science 2011;333:1157–60

29 Agrawal, N, Jiao, Y, Sausen, M, Leary, R, Bettegowda, C, Roberts, NJ et al. Exomic sequencing of medullary thyroid cancer reveals dominant and mutually exclusive oncogenic mutations in RET and RAS. J Clin Endocrinol Metab 2013;98:E364–9

30 Ho, AS, Kannan, K, Roy, DM, Morris, LG, Ganly, I, Katabi, N et al. The mutational landscape of adenoid cystic carcinoma. Nat Genet 2013;45:791–8

31 Shearer, AE, Smith, RJ. Genetics: advances in genetic testing for deafness. Curr Opin Pediatr 2012;24:679–86

32 Gao, J, Xue, J, Chen, L, Ke, X, Qi, Y, Liu, Y. Whole exome sequencing identifies a novel DFNA9 mutation, C162Y. Clin Genet 2013;83:477–81

33 Woo, HM, Park, HJ, Baek, JI, Park, MH, Kim, UK, Sagong, B et al. Whole-exome sequencing identifies MYO15A mutations as a cause of autosomal recessive nonsyndromic hearing loss in Korean families. BMC Med Genet 2013;14:72

34 Yan, D, Tekin, M, Blanton, SH, Liu, XZ. Next-generation sequencing in genetic hearing loss. Genet Test Mol Biomarkers 2013;17:581–7

35 Schena, M, Shalon, D, Davis, RW, Brown, PO. Quantitative monitoring of gene expression patterns with a complementary DNA microarray. Science 1995;270:467–70

36 Wang, Z, Gerstein, M, Snyder, M. RNA-Seq: a revolutionary tool for transcriptomics. Nat Rev Genet 2009;10:57–63

37 Butte, A. The use and analysis of microarray data. Nat Rev Drug Discov 2002;1:951–60

38 Gusnanto, A, Calza, S, Pawitan, Y. Identification of differentially expressed genes and false discovery rate in microarray studies. Curr Opin Lipidol 2007;18:187–93

39 Fan, J, Ren, Y. Statistical analysis of DNA microarray data in cancer research. Clin Cancer Res 2006;12:4469–73

40 Benjamini, Y, Hochberg, Y. Controlling the false discovery rate: a practical and powerful approach to multiple testing. J R Statist Soc B 1995;57:289–300

41 Chung, CH, Parker, JS, Karaca, G, Wu, J, Funkhouser, WK, Moore, D et al. Molecular classification of head and neck squamous cell carcinomas using patterns of gene expression. Cancer Cell 2004;5:489–500

42 Belbin, TJ, Singh, B, Barber, I, Socci, N, Wenig, B, Smith, R et al. Molecular classification of head and neck squamous cell carcinoma using cDNA microarrays. Cancer Res 2002;62:1184–90

43 Schlecht, NF, Burk, RD, Adrien, L, Dunne, A, Kawachi, N, Sarta, C et al. Gene expression profiles in HPV-infected head and neck cancer. J Pathol 2007;213:283–93

44 Mendez, E, Houck, JR, Doody, DR, Fan, W, Lohavanichbutr, P, Rue, TC et al. A genetic expression profile associated with oral cancer identifies a group of patients at high risk of poor survival. Clin Cancer Res 2009;15:1353–61

45 Lohavanichbutr, P, Mendez, E, Holsinger, FC, Rue, TC, Zhang, Y, Houck, J et al. A 13-gene signature prognostic of HPV-negative OSCC: discovery and external validation. Clin Cancer Res 2013;19:1197–203

46 Klenke, C, Janowski, S, Borck, D, Widera, D, Ebmeyer, J, Kalinowski, J et al. Identification of novel cholesteatoma-related gene expression signatures using full-genome microarrays. PloS One 2012;7:e52718

47 Stankovic, KM, Goldsztein, H, Reh, DD, Platt, MP, Metson, R. Gene expression profiling of nasal polyps associated with chronic sinusitis and aspirin-sensitive asthma. Laryngoscope 2008;118:881–9

48 Vambutas, A, DeVoti, J, Goldofsky, E, Gordon, M, Lesser, M, Bonagura, V. Alternate splicing of interleukin-1 receptor type II (IL1R2) in vitro correlates with clinical glucocorticoid responsiveness in patients with AIED. PloS One 2009;4:e5293

49 TM4 Microarray Software Suite. In: http://www.tm4.org/ [6 August 2014]

50 SNOMAD: Standardization and NOrmalization of MicroArray Data. In: http://pevsnerlab.kennedykrieger.org/snomad.php [6 August 2014]

Table I Glossary of terms

Fig. 1 Simplified schematic of (a) de novo versus (b) alignment sequence assembly.

Table II Selected analysis tools for evaluation of DNA sequencing data

