Appendix F - Earlier Applications of Hidden Markov Chain Models

John van der Hoek; Robert J. Elliott

doi:10.1017/9781108377423.017

Introduction

In this appendix some earlier applications are briefly described.

Markov chain models can be used to provide probability models for sequences of symbols. This will aid in genome annotation. The types of questions that can be asked include the following: Does a particular sequence belong to a particular family and what can one say about its internal structure? How can one discriminate between two sequences?
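As a minimal sketch of what "a probability model for a sequence of symbols" means here, the following Python function computes the log-probability of a DNA string under a first-order Markov chain. The function name `log_prob`, the dictionary layout, and the uniform placeholder parameters are assumptions made for illustration; they are not taken from the references cited in this appendix.

```python
import math

BASES = "ACGT"

def log_prob(seq, init, trans):
    """Log-probability of seq under a first-order Markov chain.

    init[b]       : assumed P(X_1 = b)
    trans[(a, b)] : assumed P(X_{n+1} = b | X_n = a)
    """
    lp = math.log(init[seq[0]])
    # Add the log transition probability for each adjacent pair of bases.
    for a, b in zip(seq, seq[1:]):
        lp += math.log(trans[(a, b)])
    return lp

# Placeholder parameters: every base and every transition equally likely.
uniform_init = {b: 0.25 for b in BASES}
uniform_trans = {(a, b): 0.25 for a in BASES for b in BASES}

score = log_prob("ACGT", uniform_init, uniform_trans)
```

Given two fitted parameter sets, one plausible way to compare how well each describes a sequence is to take the difference of the two log-probabilities; that comparison is a suggestion and is not described in the text above.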

Some general reviews are given in (Durbin et al., 1998, Chapters 2 and 3), (Robin et al., 2005, Chapters 1 and 2), but a more detailed review of observed Markov chains is provided by (Koski, 2001, Chapter 9). We have added some extra details to Koski's treatment.

A straightforward application of Markov chains to genome sequencing. This approach does not seem to work for the following reasons:

• The four bases A, T, G, C are not uniformly distributed in a sequence and the compositions vary within and between sequences.

• Various k-tuples of bases are not uniformly distributed. However, exons and introns are often separated on the basis of dinucleotide frequencies.

• It seems that higher-order chains need to be used, as the probability of a base at a particular location can depend on more than the immediately adjacent bases. In addition, the base composition can vary from one segment to another. The segmentation techniques for decomposing DNA sequences into homogeneous segments include hidden Markov models.
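The non-uniformity of bases and k-tuples noted in the points above can be checked empirically. The following is a minimal Python sketch that tabulates the relative frequencies of overlapping k-tuples in a sequence; the function name `ktuple_frequencies` and the toy input are hypothetical and serve only as an illustration.

```python
from collections import Counter

def ktuple_frequencies(seq, k):
    """Relative frequencies of overlapping k-tuples in seq.

    Returns an empty dict when seq is shorter than k.
    """
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    total = sum(counts.values())
    return {t: c / total for t, c in counts.items()}

# Toy example: single-base (k = 1) and dinucleotide (k = 2) frequencies.
base_freqs = ktuple_frequencies("AAAT", 1)
dinuc_freqs = ktuple_frequencies("AACG", 2)
```

With k = 1 this gives the base composition, and with k = 2 the dinucleotide frequencies mentioned above. Applying it to separate segments of a sequence is one simple way to see the composition varying from segment to segment.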

Frame-dependent Markov chains. These are used in the GeneMark software; information can be found at

http://genemark.biology.gatech.edu/GeneMark/gm_info.html

Mixture transition distribution chain of order k. These are called MTD(k) models. For a Markov chain of order $k$ with a state space of size $N$, there are $(N - 1)N^k$ entries in the transition matrix $A$ to be estimated (the column sums of $A$ are 1), plus the initial probabilities. With $N = 4$ and $k = 8$, we have $3 \cdot 4^8 = 196{,}608$, which is quite large. A further implication is that we may not have enough data to calibrate all these entries in $A$. We comment on estimation using sparse data below.
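The parameter count quoted above is easy to reproduce. The short Python sketch below evaluates the number of free transition-matrix entries for a chain of given order and state-space size; the function name `free_transition_parameters` is hypothetical, and the count excludes the initial probabilities.

```python
def free_transition_parameters(n_states, order):
    """Free entries in the transition matrix of an order-`order` chain.

    Each of the n_states**order contexts has a distribution over n_states
    symbols that sums to 1, leaving n_states - 1 free entries per context.
    """
    return (n_states - 1) * n_states ** order

# DNA alphabet (4 bases) for chains of order 1 through 8.
counts = {k: free_transition_parameters(4, k) for k in range(1, 9)}
```

The count grows by a factor of 4 with each extra order, from 12 entries at order 1 to 196,608 at order 8, which is the sparse-data difficulty described above.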
