Hostname: page-component-78c5997874-xbtfd Total loading time: 0 Render date: 2024-11-09T19:46:33.516Z Has data issue: false hasContentIssue false

Pattern Markov Chains: Optimal Markov Chain Embedding Through Deterministic Finite Automata

Published online by Cambridge University Press:  14 July 2016

Grégory Nuel*
Affiliation:
University of Evry
*
Current address: MAP5, UMR CNRS 8145, Université Paris Descartes, 45 rue des Saints-Pères, F-75006 Paris, France. Email address: [email protected]
Rights & Permissions [Opens in a new window]

Abstract

Core share and HTML view are not available for this content. However, as you have access to this content, a full PDF is available via the ‘Save PDF’ action button.

In the framework of patterns in random texts, the Markov chain embedding techniques consist of turning the occurrences of a pattern over an order-m Markov sequence into those of a subset of states into an order-1 Markov chain. In this paper we use the theory of language and automata to provide space-optimal Markov chain embedding using the new notion of pattern Markov chains (PMCs), and we give explicit constructive algorithms to build the PMC associated to any given pattern problem. The interest of PMCs is then illustrated through the exact computation of P-values whose complexity is discussed and compared to other classical asymptotic approximations. Finally, we consider two illustrative examples of highly degenerated pattern problems (structured motifs and PROSITE signatures), which further illustrate the usefulness of our approach.

MSC classification

Type
Research Article
Copyright
Copyright © Applied Probability Trust 2008 

References

Antzoulakos, D. L. (2001). Waiting times for patterns in a sequence of multistate trials. J. Appl. Prob. 38, 508518.CrossRefGoogle Scholar
Biggins, J. D. and Cannings, C. (1987). Markov renewal processes, counters and repeated sequences in Markov chains. Adv. Appl. Prob. 19, 521545.CrossRefGoogle Scholar
Chadjiconstantinidis, S., Antzoulakos, D. L. and Koutras, M. V. (2000). Joint distribution of successes, failures and patterns in enumeration problems. Adv. Appl. Prob. 32, 866884.CrossRefGoogle Scholar
Chryssaphinou, O. and Papastavridis, S. (1990). The occurrence of a sequence of patterns in repeated dependent experiments. Theory Prob. Appl. 35, 167173.Google Scholar
Crochemore, M. and Hancart, C. (1997). Automata for matching patterns. In Handbook of Formal Languages, Vol. 2, Linear Modeling: Background and Application, Springer, Berlin, pp. 399462.CrossRefGoogle Scholar
Crochemore, M. and Stefanov, V. T. (2003). Waiting time and complexity for matching patterns with automata. Inf. Proc. Lett. 87, 119125.CrossRefGoogle Scholar
Fu, J. C. (1996). Distribution theory of runs and patterns associated with a sequence of multi-state trials. Statistica Sinica 6, 957974.Google Scholar
Fu, J. C. and Chang, Y. M. (2002). On probability generating functions for waiting time distributions of compound patterns in a sequence of multistate trials. J. Appl. Prob. 30, 183208.Google Scholar
Fu, J. C. and Koutras, M. V. (1994). Distribution theory of runs: a Markov chain approach. J. Amer. Statist. Assoc. 89, 10501058.CrossRefGoogle Scholar
Fu, J. C. and Lou, W. Y. W. (2003). Distribution Theory of Runs and Patterns and Its Applications: A Finite Markov Chain Approach. World Scientific, Singapore.CrossRefGoogle Scholar
Gasteiger, E., Jung, E. and Bairoch, A. (2001). SWISS-PROT: Connecting biological knowledge via a protein database. Current Issues Molec. Biol. 3, 4755.Google Scholar
Glaz, J., Kulldorff, M., Pozdnyakov, V. and Steele, J. M. (2006). Gambling teams and waiting times for patterns in two-state Markov chains. J. Appl. Prob. 43, 127140.CrossRefGoogle Scholar
Guibas, L. J. and Odlyzko, A. M. (1981). String overlaps, pattern matching and transitive games. J. Combinatorial Theory A 30, 183208.CrossRefGoogle Scholar
Hopcroft, J. E., Motwani, R. and Ullman, J. D. (2001). Introduction to Automata Theory, Languages, and Computation, 2nd edn. ACM Press, New York.Google Scholar
Hulo, N. et al. (2006). The PROSITE database. Nucleic Acid Res. 34, D227D230.CrossRefGoogle ScholarPubMed
Lehoucq, R. B., Sorensen, D. C. and Yang, C. (1998). ARPACK Users' Guide. Solution of Large-Scale Eigenvalue Problems with Implicitly Restarted Arnoldi Methods. Society for Industrial and Applied Mathematics, Philadelphia, PA.CrossRefGoogle Scholar
Li, S.-Y. R. (1980). A martingale approach to the study of occurrence of sequence patterns in repeated experiments. Ann. Prob. 8, 11711176.CrossRefGoogle Scholar
Lou, W. Y. W. (1996). On runs and longest run tests: a method of finite Markov chain imbedding. J. Amer. Statist. Assoc. 91, 15951601.CrossRefGoogle Scholar
Marsan, L. and Sagot, M.-F. (2000). Algorithms for extracting structured motifs using a suffix tree with an application to promoter consensus identification. J. Comput. Biol. 7, 345362.CrossRefGoogle ScholarPubMed
Nicodeme, P., Salvy, B. and Flajolet, P. (2002). Motifs statistics. Theoret. Comput. Sci. 28, 593617.CrossRefGoogle Scholar
Nuel, G. (2006a). Effective p-value computations using finite Markov chain imbedding (FMCI): application to local score and to pattern statistics. Algo. Molec. Biol. 1.Google ScholarPubMed
Nuel, G. (2006b). Numerical solutions for patterns statistics on Markov chains. Statist. Appl. Gen. Molec. Biol. 5. Google ScholarPubMed
Nuel, G. (2007). Cumulative distribution function of a geometric Poisson distribution. J. Statist. Comput. Simul. 77.Google Scholar
Reinert, G., Schbath, S. and Waterman, M. (2000). Probabilistic and statistical properties of words, an overview. J. Comput. Biology 7, 146.CrossRefGoogle ScholarPubMed
Robin, S. and Daudin, J.-J. (1999). Exact distribution of word occurrences in a random sequence of letters. J. Appl. Prob. 36, 179193.CrossRefGoogle Scholar
Robin, S. and Daudin, J.-J. (2001). Exact distribution of the distances between any occurrences of a set of words. Ann. Inst. Statist. Math. 36, 895905.CrossRefGoogle Scholar
Robin, S. et al. (2002). Occurrence probability of structured motifs in random sequences. J. Comput. Biol. 9, 761773.CrossRefGoogle ScholarPubMed
Stefanov, V. T. (2000). On some waiting time problems. J. Appl. Prob. 37, 756764.CrossRefGoogle Scholar
Stefanov, V. T. (2003). The intersite distances between pattern occurrences in strings generated by general discrete- and continuous-time models: an algorithmic approach. J. Appl. Prob. 40, 881892.CrossRefGoogle Scholar
Stefanov, V. T. and Pakes, A. G. (1997). Explicit distributional results in pattern formation. Ann. Appl. Prob. 7, 666678.Google Scholar
Stefanov, V. T. and Pakes, A. G. (1999). Explicit distributional results in pattern formation. II. Austral. N. Z. J. Statist. 41, 7990, 254.CrossRefGoogle Scholar
Stefanov, V. T., Robin, S. and Schbath, S. (2007). Waiting times for clumps of patterns and for structured motifs in random sequences. Discrete Appl. Math. 155, 868880.CrossRefGoogle Scholar