Distribution of Clump Statistics for a Collection of Words

Donald E. K. Martin; Deidra A. Coleman

doi:10.1239/jap/1324046018

Part of: Markov processes; Distribution theory - Probability

Published online by Cambridge University Press: 14 July 2016

Affiliation (both authors): North Carolina State University
Postal address (both authors): Department of Statistics, North Carolina State University, 4272 SAS Hall, Raleigh, NC 27695-8203, USA.

Abstract

We give an efficient method based on minimal deterministic finite automata for computing the exact distribution of the number of occurrences and coverage of clumps (maximal sets of overlapping words) of a collection of words. In addition, we compute probabilities for the number of h-clumps, word groupings where gaps of a maximal length h between occurrences of words are allowed. The method facilitates the computation of p-values for testing procedures. A word is allowed to contain other words of the collection, making the computation more general, but also more difficult. The underlying sequence is assumed to be Markovian of an arbitrary order.
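To make the quantities in the abstract concrete, the following is a minimal sketch, not the automaton-based method of the paper. It assumes that two word occurrences overlap when they share at least one position, and it obtains the exact joint distribution of the number of clumps and their coverage by brute-force enumeration of all sequences, which is feasible only for short sequences. The function names `clump_stats` and `clump_distribution`, the first-order Markov model, and the `init`/`trans` dictionaries are illustrative assumptions; the paper allows an arbitrary Markov order.

```python
from collections import defaultdict
from itertools import product


def clump_stats(seq, words):
    """Return (number of clumps, coverage) of the non-empty `words` in `seq`.

    Assumption: occurrences overlap when they share at least one position.
    """
    # Every occurrence as a half-open interval [start, end), sorted by start.
    spans = sorted(
        (i, i + len(w))
        for w in set(words)
        for i in range(len(seq) - len(w) + 1)
        if seq.startswith(w, i)
    )
    clumps = coverage = 0
    cur_end = 0  # end of the clump currently being extended
    for start, end in spans:
        if start >= cur_end:
            # No shared position with the current clump: a new clump begins.
            clumps += 1
            coverage += end - start
            cur_end = end
        elif end > cur_end:
            # Overlaps the current clump and extends it to the right.
            coverage += end - cur_end
            cur_end = end
    return clumps, coverage


def clump_distribution(n, words, init, trans):
    """Exact joint distribution of (clumps, coverage) for sequences of
    length n >= 1 from a first-order Markov chain, by full enumeration.

    init[a] is P(X1 = a); trans[a][b] is P(X_{t+1} = b | X_t = a).
    """
    alphabet = sorted(init)
    dist = defaultdict(float)
    for letters in product(alphabet, repeat=n):
        p = init[letters[0]]
        for a, b in zip(letters, letters[1:]):
            p *= trans[a][b]
        dist[clump_stats("".join(letters), words)] += p
    return dict(dist)


if __name__ == "__main__":
    # Hypothetical two-letter example with independent, equally likely letters.
    init = {"A": 0.5, "B": 0.5}
    trans = {"A": {"A": 0.5, "B": 0.5}, "B": {"A": 0.5, "B": 0.5}}
    print(clump_stats("ATATA", ["ATA"]))
    print(clump_distribution(3, ["AA"], init, trans))
```

Because the enumeration visits every sequence, its cost grows exponentially in the sequence length. It is useful mainly as a reference against which a state-based computation can be checked on small cases.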

Keywords

Clumps of a pattern; coarsest partition; deterministic finite automaton

MSC classification

Secondary: 60E05: Distributions; 60J05: Discrete-time Markov processes on general state spaces

Type: Research Papers
Information: Journal of Applied Probability, Volume 48, Issue 4, December 2011, pp. 1049-1059

DOI: https://doi.org/10.1239/jap/1324046018
Copyright: © Applied Probability Trust 2011
