Exact Distribution of the Occurrence Number for *K*-tuples Over an Alphabet
of Non-Equal Probability Letters

Chan Zhou^{1} and Huimin Xie^{2}

zhouchan99@zju.edu.cn

szhmxie@pub.sz.jsinfo.net

Annals of Combinatorics 8 (4), pp. 499-506, December 2004

Abstract:

A nucleotide sequence can be regarded as a realization of the non-equal-probability
independently and identically distributed (niid) model, in which letters are drawn
independently with fixed but unequal probabilities. In this paper we derive
the exact distribution of the occurrence number of each *K*-tuple under
the niid model by means of the Goulden-Jackson cluster method. As an application,
the resulting probability function is used to obtain exact expectation curves [9],
and the exact approach is compared with the approximate solution.
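
To make the quantity in the abstract concrete, the following is a minimal sketch that computes the exact distribution of the occurrence number of one *K*-tuple in a sequence of length n under the niid model. It is not the Goulden-Jackson cluster method used in the paper: it is a plain dynamic program over a prefix-matching automaton, chosen only because it is short and easy to check. The function names `next_state` and `occurrence_distribution`, the choice to count overlapping occurrences, and the letter probabilities in the usage note are all assumptions made for illustration.

```python
from fractions import Fraction


def next_state(word, s, c):
    """Length of the longest prefix of `word` that is a suffix of word[:s] + c."""
    text = word[:s] + c
    for m in range(min(len(word), len(text)), 0, -1):
        if text.endswith(word[:m]):
            return m
    return 0


def occurrence_distribution(word, probs, n):
    """Return a list P where P[j] is the probability that `word` occurs
    exactly j times (overlaps counted) in an iid sequence of length n.

    `probs` maps each letter to its probability (assumed to sum to 1).
    """
    k = len(word)
    # dist[(s, j)] = probability of automaton state s with j occurrences so far
    dist = {(0, 0): Fraction(1)}
    for _ in range(n):
        new = {}
        for (s, j), p in dist.items():
            for c, pc in probs.items():
                t = next_state(word, s, c)
                key = (t, j + (t == k))  # reaching state k means one more match
                new[key] = new.get(key, 0) + p * pc
        dist = new
    result = [Fraction(0)] * (max(n - k + 1, 0) + 1)
    for (s, j), p in dist.items():
        result[j] += p
    return result
```

For example, with the hypothetical probabilities `{"A": Fraction(1, 2), "C": Fraction(1, 6), "G": Fraction(1, 6), "T": Fraction(1, 6)}`, calling `occurrence_distribution("AA", probs, 3)` gives the probabilities of 0, 1, and 2 occurrences of `AA` in a sequence of length 3. Using `Fraction` keeps the result exact; floats also work if exactness is not needed. The mean of the returned distribution should equal (n - K + 1) times the product of the letter probabilities of the word, which is a quick sanity check on any exact method.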

References:

1. J.F. Gentleman and R.C. Mullin, The distribution of the frequency of occurrence of nucleotide subsequences, based on their overlap capability, Biometrics 45 (1989) 35--42.

2. J.F. Gentleman, The distribution of the frequency of subsequences in alphabetic sequences, as exemplified by deoxyribonucleic acid, J. Roy. Statist. Soc. Ser. C 43 (1994) 401--414.

3. B.-L. Hao, Fractals from genomes--exact solutions of a biology-inspired problem, Physica A 282 (2000) 225--246.

4. J. Noonan and D. Zeilberger, The Goulden-Jackson cluster method: extensions, applications and implementations, J. Differ. Equations Appl. 5 (1999) 355--377.

5. O.E. Percus and P.A. Whitlock, Theory and application of Marsaglia's monkey test for pseudorandom number generators, ACM Transactions on Modeling and Computer Simulation 5 (1995) 87--100.

6. P.A. Pevzner, M.Y. Borodovsky, and A.A. Mironov, Linguistics of nucleotide sequences I: The significance of deviations from mean statistical characteristics and prediction of the frequencies of occurrence of words, J. Biomol. Struct. Dyn. 6 (1989) 1013--1026.

7. G. Reinert, S. Schbath, and M.S. Waterman, Probabilistic and statistical properties of words: an overview, J. Comput. Biol. 7 (2000) 1--46.

8. M.S. Waterman, Introduction to Computational Biology, Chapman & Hall, London, 1995.

9. H.-M. Xie and B.-L. Hao, Visualization of K-tuple distribution in procaryote complete genomes and their randomized counterparts, In: CSB Bioinformatics Conference Proceedings, IEEE Computer Society, Los Alamitos, CA, 2002, pp. 31--42.