Error bounds on multivariate Normal approximations for word count statistics

Haiyan Huang

doi:10.1239/aap/1033662166

Error bounds on multivariate Normal approximations for word count statistics

Part of: Markov processes Distribution theory

Published online by Cambridge University Press: 01 July 2016

Haiyan Huang

Show author details

Haiyan Huang*: Affiliation:
University of Southern California
*: ∗ Current address: Department of Biostatistics, Harvard School of Public Health, Boston, MA 02115, USA. Email address: [email protected]

Article contents

Abstract
References

Get access

Rights & Permissions

Abstract

Given a sequence S and a collection Ω of d words, it is of interest in many applications to characterize the multivariate distribution of the vector of counts U = (N(S,w1), …, N(S,wd)), where N(S,w) is the number of times a word w ∈ Ω appears in the sequence S. We obtain an explicit bound on the error made when approximating the multivariate distribution of U by the normal distribution, when the underlying sequence is i.i.d. or first-order stationary Markov over a finite alphabet. When the limiting covariance matrix of U is nonsingular, the error bounds decay at rate O((log n) / √n) in the i.i.d. case and O((log n)3 / √n) in the Markov case. In order for U to have a nondegenerate covariance matrix, it is necessary and sufficient that the counted word set Ω is not full, that is, that Ω is not the collection of all possible words of some length k over the given finite alphabet. To supply the bounds on the error, we use a version of Stein's method.

Keywords

First-order stationary Markov chain full word set Stein's method word counts

MSC classification

Primary: 62E17: Approximations to distributions (nonasymptotic) 62E20: Asymptotic distribution theory 60J22: Computational methods in Markov chains 92D20: Protein sequences, DNA sequences

Type: General Applied Probability
Information: Advances in Applied Probability , Volume 34 , Issue 3 , September 2002 , pp. 559 - 586

DOI: https://doi.org/10.1239/aap/1033662166 [Opens in a new window]
Copyright: Copyright © Applied Probability Trust 2002

Access options

Get access to the full version of this content by using one of the access options below. (Log in options will check for institutional or personal access. Content may require purchase if you do not have access.)

References

Bhattacharya, R. N. and Ranga Rao, R. (1986). Normal Approximation and Asymptotic Expansions. Robert E. Krieger Publishing Co., Melbourne, FL (reprint of the 1976 original).Google Scholar

Biggins, J. D. and Cannings, C. (1987). Markov renewal processes, counters and repeated sequences in Markov chains. Adv. Appl. Math. 19, 521–545.Google Scholar

Breen, S., Waterman, M. S. and Zhang, N. (1985). Renewal theory for several patterns. J. Appl. Prob. 22, 228–234.CrossRef Google Scholar

Dembo, A. and Rinott, Y. (1994). Some examples of normal approximations by Stein's method. In Random Discrete Structures (IMA Vol. 76), Springer, New York, pp. 25–44.Google Scholar

Goetze, B. (1991). On the rate of convergence in the multivariate CLT. Ann. Prob. 19, 724–739.Google Scholar

Karlin, S. and Taylor, H. M. (1975). A First Course in Stochastic Processes, 2nd edn. Academic Press, New York.Google Scholar

Lundstrom, R. (1990). Stochastic models and statistical methods for DNA sequence data. , University of Southern California.Google Scholar

Prum, B., Rodolphe, F. and de Turckheim, È. (1995). Finding words with unexpected frequencies in deoxyribonucleic acid sequences. J. R. Statist. Soc. B 57, 205–220.Google Scholar

Reinert, G. and Schbath, S. (1998). Compound Poisson and Poisson process approximations for occurrences of multiple words in Markov chains. J. Comput. Biol. 5, 223–253.CrossRef Google Scholar PubMed

Reinert, G., Schbath, S. and Waterman, M. S. (2000). Probabilistic and statistical properties of words: an overview. J. Comput. Biol. 7, 1–46.CrossRef Google Scholar PubMed

Rinott, Y. and Rotar, V. (1996). A multivariate CLT for local dependence with n-1/2 log n rate and application to multivariate graph related statistics. J. Multivariate Anal. 56, 333–350.CrossRef Google Scholar

Rinott, Y. and Rotar, V. (1997). On coupling constructions and rates in the CLT for dependent summands with applications to the antivector model and weighted U-statistics. Ann. Appl. Prob. 7, 1080–1105.CrossRef Google Scholar

Robin, S. and Schbath, S. (2001). Numerical comparison of several approximations of the word count distribution in random sequences. J. Comput. Biol. 8, 349–359.CrossRef Google Scholar PubMed

Sazonov, V. V. (1968). On the multi-dimensional central limit theorem. Sankhyā A 30, 181–204.Google Scholar

Tanushev, M. (1997). Joint central limit theorem for renewals of competing patterns. , University of Southern California.Google Scholar

Waterman, M. S. (1995). Introduction to Computational Biology: Maps, Sequences and Genomes. Chapman and Hall, London.CrossRef Google Scholar

Article contents

Error bounds on multivariate Normal approximations for word count statistics

Abstract

Keywords

MSC classification

Access options

References

Save article to Kindle

Save article to Dropbox

Save article to Google Drive

Reply to: Submit a response

Your details

You have entered the maximum number of contributors

Conflicting interests