Book contents
- Frontmatter
- Contents
- Preface
- Acknowledgments
- 1 The Central Dogma
- 2 RNA Secondary Structure
- 3 Comparing DNA Sequences
- 4 Predicting Species: Statistical Models
- 5 Substitution Matrices for Amino Acids
- 6 Sequence Databases
- 7 Local Alignment and the BLAST Heuristic
- 8 Statistics of BLAST Database Searches
- 9 Multiple Sequence Alignment I
- 10 Multiple Sequence Alignment II
- 11 Phylogeny Reconstruction
- 12 Protein Motifs and PROSITE
- 13 Fragment Assembly
- 14 Coding Sequence Prediction with Dicodons
- 15 Satellite Identification
- 16 Restriction Mapping
- 17 Rearranging Genomes: Gates and Hurdles
- A Drawing RNA Cloverleaves
- B Space-Saving Strategies for Alignment
- C A Data Structure for Disjoint Sets
- D Suggestions for Further Reading
- Bibliography
- Index
13 - Fragment Assembly
Published online by Cambridge University Press: 05 June 2012
- Frontmatter
- Contents
- Preface
- Acknowledgments
- 1 The Central Dogma
- 2 RNA Secondary Structure
- 3 Comparing DNA Sequences
- 4 Predicting Species: Statistical Models
- 5 Substitution Matrices for Amino Acids
- 6 Sequence Databases
- 7 Local Alignment and the BLAST Heuristic
- 8 Statistics of BLAST Database Searches
- 9 Multiple Sequence Alignment I
- 10 Multiple Sequence Alignment II
- 11 Phylogeny Reconstruction
- 12 Protein Motifs and PROSITE
- 13 Fragment Assembly
- 14 Coding Sequence Prediction with Dicodons
- 15 Satellite Identification
- 16 Restriction Mapping
- 17 Rearranging Genomes: Gates and Hurdles
- A Drawing RNA Cloverleaves
- B Space-Saving Strategies for Alignment
- C A Data Structure for Disjoint Sets
- D Suggestions for Further Reading
- Bibliography
- Index
Summary
In Chapter 3, we introduced the sequence assembly problem and discussed how to determine whether two fragments of a DNA sequence “overlap” when each may contain errors. In this chapter, we will consider how to use overlap information for a set of fragments to reassemble the original sequence. Of course, there is an infinite number of sequences that could give rise to a particular set of fragments. Our goal is to produce a reconstruction of the unknown source sequence that is reasonably likely to approximate the actual source well.
In Section 13.1 we will consider a model known as the shortest common superstring problem. This simple model assumes that our fragments contain no errors and that a short source sequence is more likely than a longer one. If the input set reflects the source exactly, then only exact overlaps between fragments are relevant to reconstruction. This approach's focus on exact matching allows us to use once more that extremely versatile data structure from Chapter 12, the suffix tree.
Beginning in Section 13.2, we consider another solution, a simplified version of the commonly used PHRAP program. PHRAP assumes that errors are present in the sequence files and furthermore that each base of each sequence is annotated with a quality value describing the level of confidence the sequencing machinery had in its determination of the base. It will not be possible to present all of the issues and options dealt with by PHRAP, but in Sections 13.3–13.7 we will present several important aspects of its approach in a Perl “mini-PHRAP”.
- Type
- Chapter
- Information
- Genomic PerlFrom Bioinformatics Basics to Working Code, pp. 196 - 230Publisher: Cambridge University PressPrint publication year: 2002