Exact distribution of word counts in shuffled sequences

Einar Andreas Rødland

doi:10.1239/aap/1143936143

Exact distribution of word counts in shuffled sequences

Part of: Combinatorics Distribution theory Markov processes Combinatorial probability

Published online by Cambridge University Press: 01 July 2016

Einar Andreas Rødland

Show author details

Einar Andreas Rødland*: Affiliation:
Rikshospitalet–Radiumhospitalet HF, University of Oslo
*: ∗ Postal address: Institute of Medical Microbiology, Centre for Molecular Biology and Neuroscience, University of Oslo, Rikshospitalet–Radiumhospitalet HF, N-0027 Oslo, Norway. Email address: [email protected]

Article contents

Abstract
References

Rights & Permissions

Abstract

Core share and HTML view are not available for this content. However, as you have access to this content, a full PDF is available via the ‘Save PDF’ action button.

In DNA sequences, specific words may take on biological functions as marker or signalling sequences. These may often be identified by frequent-word analyses as being particularly abundant. Accurate statistics is needed to assess the statistical significance of these word frequencies. The set of shuffled sequences - letter sequences having the same k-word composition, for some choice of k, as the sequence being analysed - is considered the most appropriate sample space for analysing word counts. However, little is known about these word counts. Here we present exact formulae for word counts in shuffled sequences.

Keywords

Sequence shuffling Markov chain word count exact distribution hypergeometric distribution generalised hypergeometric series moment generating function genome sequence analysis directed graph Euler path

MSC classification

Primary: 60C05: Combinatorial probability

Secondary: 05A15: Exact enumeration problems, generating functions 60J10: Markov chains (discrete-time Markov processes on discrete state spaces) 60J20: Applications of Markov chains and discrete-time Markov processes on general state spaces (social mobility, learning theory, industrial processes, etc.) 62E15: Exact distribution theory

Type: General Applied Probability
Information: Advances in Applied Probability , Volume 38 , Issue 1 , March 2006 , pp. 116 - 133

DOI: https://doi.org/10.1239/aap/1143936143 [Opens in a new window]
Copyright: Copyright © Applied Probability Trust 2006

References

Altschul, S. and Erickson, B. (1985). Significance of nucleotide sequence alignments: a method for random sequence permutation that preserves dinucleotide and codon usage. Molec. Biol. Evol. 2, 526–538.Google Scholar

Cowan, R. (1991). Expected frequencies of DNA patterns using Whittle's formula. J. Appl. Prob. 28, 886–892.Google Scholar

Dawson, R. and Good, I. (1957). Exact Markov probabilities from oriented linear graphs. Ann. Math. Statist. 28, 946–956.Google Scholar

Fitch, W. (1983). Random sequences. J. Molec. Biol. 163, 171–176.Google Scholar

Goodman, L. (1958). Exact probabilities and asymptotic relationships for some statistics from m-th order Markov chains. Ann. Math. Statist. 29, 476–490.Google Scholar

Kandel, D., Matias, Y., Unger, R. and Winkler, P. (1996). Shuffling biological sequences. Discrete Appl. Math. 71, 171–185.CrossRef Google Scholar

Prum, B., Rodolphe, F. and de Turckheim, E. (1995). Finding words with unexpected frequencies in deoxyribonucleic acid sequences. J. R. Statist. Soc. B 57, 205–220.Google Scholar

Robin, S. and Daudin, J. (1999). Exact distribution of word occurrences in a random sequence of letters. J. Appl. Prob. 36, 179–193.CrossRef Google Scholar

Robin, S. and Schbath, S. (2001). Numerical comparison of several approximations of the word count distribution in random sequences. J. Comput. Biol. 8, 349–359.Google Scholar

Schbath, S. (1995). Compound Poisson approximation of word counts in DNA sequences. ESAIM Prob. Statist. 1, 1–16.Google Scholar

Schbath, S. (1995). Étude asymptotique du nombre d'occurrences d'un mot dans une chaıcirc;ne de Markov et application à la recherche de mots de fréquence exceptionnelle dans les séquences d'ADN. , Université René Descartes, Paris V.Google Scholar

Tutte, W. (1948). The dissection of equilateral triangles into equilateral triangles. Proc. Camb. Philos. Soc. 44, 463–482.Google Scholar

Van Aardenne-Ehrenfest, T. and de Bruijn, N. (1951). Circuits and trees in oriented linear graphs. Simon Stevin 28, 203–217.Google Scholar

Whittle, P. (1955). Some distribution and moment formulae for the Markov chain. J. R. Statist. Soc. B 17, 235–242.Google Scholar

Zaman, A. (1984). Urn models for Markov exchangeability. Ann. Prob. 12, 223–229.Google Scholar

Article contents

Exact distribution of word counts in shuffled sequences

Abstract

Keywords

MSC classification

References

Save article to Kindle

Save article to Dropbox

Save article to Google Drive

Reply to: Submit a response

Your details

You have entered the maximum number of contributors

Conflicting interests