Book contents
- Frontmatter
- Contents
- Preface
- Acknowledgments
- 1 The Central Dogma
- 2 RNA Secondary Structure
- 3 Comparing DNA Sequences
- 4 Predicting Species: Statistical Models
- 5 Substitution Matrices for Amino Acids
- 6 Sequence Databases
- 7 Local Alignment and the BLAST Heuristic
- 8 Statistics of BLAST Database Searches
- 9 Multiple Sequence Alignment I
- 10 Multiple Sequence Alignment II
- 11 Phylogeny Reconstruction
- 12 Protein Motifs and PROSITE
- 13 Fragment Assembly
- 14 Coding Sequence Prediction with Dicodons
- 15 Satellite Identification
- 16 Restriction Mapping
- 17 Rearranging Genomes: Gates and Hurdles
- A Drawing RNA Cloverleaves
- B Space-Saving Strategies for Alignment
- C A Data Structure for Disjoint Sets
- D Suggestions for Further Reading
- Bibliography
- Index
14 - Coding Sequence Prediction with Dicodons
Published online by Cambridge University Press: 05 June 2012
- Frontmatter
- Contents
- Preface
- Acknowledgments
- 1 The Central Dogma
- 2 RNA Secondary Structure
- 3 Comparing DNA Sequences
- 4 Predicting Species: Statistical Models
- 5 Substitution Matrices for Amino Acids
- 6 Sequence Databases
- 7 Local Alignment and the BLAST Heuristic
- 8 Statistics of BLAST Database Searches
- 9 Multiple Sequence Alignment I
- 10 Multiple Sequence Alignment II
- 11 Phylogeny Reconstruction
- 12 Protein Motifs and PROSITE
- 13 Fragment Assembly
- 14 Coding Sequence Prediction with Dicodons
- 15 Satellite Identification
- 16 Restriction Mapping
- 17 Rearranging Genomes: Gates and Hurdles
- A Drawing RNA Cloverleaves
- B Space-Saving Strategies for Alignment
- C A Data Structure for Disjoint Sets
- D Suggestions for Further Reading
- Bibliography
- Index
Summary
Once a new segment of DNA is sequenced and assembled, researchers are usually most interested in knowing what proteins, if any, it encodes. We have learned that much DNA does not encode proteins: some encodes catalytic RNAs, some regulates the rate of production of proteins by varying the ease with which transcriptases or ribosomes bind to coding sequences, and much has no known function. If study of proteins is the goal, how can their sequences be extracted from the DNA? This question is the main focus of gene finding or gene prediction.
One approach is to look for open reading frames (ORFs). An open reading frame is simply a sequence of codons beginning with ATG for methionine and ending with one of the stop codons TAA, TGA, or TAG. To gain confidence that an ORF really encodes a gene, we can translate it and search for homologous proteins in a protein database. However, there are several difficulties with this method.
It is ineffective in eukaryotic DNA, in which coding sequences for a single gene are interrupted by introns.
It is ineffective when the coding sequence extends beyond either end of the available sequence.
Random DNA contains many short ORFs that don't code for proteins. This is because one of every 64 random codons codes for M and three of every 64 are stop codons.
The proteins it detects will probably not be that interesting since they will be very similar to proteins with known functions.
- Type
- Chapter
- Information
- Genomic PerlFrom Bioinformatics Basics to Working Code, pp. 231 - 244Publisher: Cambridge University PressPrint publication year: 2002