Exact Distribution of the Occurrence Number for K-tuples Over an Alphabet of Non-Equal Probability Letters
Chan Zhou¹ and Huimin Xie²
¹Chu Kechen Honors College, Zhejiang University, Hangzhou 310027, China
²Department of Mathematics, Suzhou University, Suzhou 215006, China
Annals of Combinatorics 8 (4), pp. 499-506, December 2004
AMS Subject Classification: 60C05, 05A15, 92B99
A nucleotide sequence can be considered as a realization of the non-equal-probability independent and identically distributed (niid) model. In this paper we derive the exact distribution of the occurrence number of each K-tuple under the niid model by means of the Goulden-Jackson cluster method. An application of the resulting probability function to obtain exact expectation curves [9] is presented, accompanied by a comparison between the exact approach and the approximate solution.
Keywords: K-tuple, exact distribution, Goulden-Jackson cluster method, probability generating function, expectation curve of K-histogram
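
To make the quantity in the abstract concrete, the following is a minimal sketch that computes the exact distribution of the (overlapping) occurrence number of one K-tuple in a sequence of n independent letters with non-equal probabilities. It is not the Goulden-Jackson cluster method used in the paper; it is a plain dynamic program over a pattern-matching automaton, chosen only because it is short and easy to check. The function name `occurrence_distribution`, its parameters, and the letter probabilities in the usage note are illustrative assumptions, not taken from the paper.

```python
def occurrence_distribution(word, probs, n):
    """Return a list d where d[m] = P(word occurs exactly m times), counting
    overlapping occurrences, in n independent letters drawn according to
    `probs` (a dict letter -> probability). Hypothetical helper, not the
    paper's Goulden-Jackson computation."""
    k = len(word)

    # fail[i] = length of the longest proper border of word[:i] (KMP table).
    fail = [0] * (k + 1)
    for i in range(1, k):
        j = fail[i]
        while j and word[i] != word[j]:
            j = fail[j]
        if word[i] == word[j]:
            j += 1
        fail[i + 1] = j

    def step(state, letter):
        # state = number of leading letters of `word` currently matched.
        while state and (state == k or word[state] != letter):
            state = fail[state]
        if word[state] == letter:
            state += 1
        return state

    max_count = max(n - k + 1, 0)
    # dp[state][count] = probability of being in `state` with `count` hits.
    dp = [[0.0] * (max_count + 1) for _ in range(k + 1)]
    dp[0][0] = 1.0
    for _ in range(n):
        nxt = [[0.0] * (max_count + 1) for _ in range(k + 1)]
        for state in range(k + 1):
            for count, mass in enumerate(dp[state]):
                if mass == 0.0:
                    continue
                for letter, p in probs.items():
                    s = step(state, letter)
                    hit = 1 if s == k else 0  # a full match ends here
                    nxt[s][count + hit] += mass * p
        dp = nxt

    # Marginalize over the automaton state.
    return [sum(dp[s][c] for s in range(k + 1)) for c in range(max_count + 1)]
```

For example, `occurrence_distribution("ACA", {"A": 0.3, "C": 0.2, "G": 0.2, "T": 0.3}, 10)` returns the probabilities of 0, 1, ..., 8 occurrences of ACA in a length-10 sequence under those assumed letter probabilities. The mean of that distribution should agree with the elementary value (n - K + 1) times the product of the letter probabilities, which gives a quick sanity check.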


1. J.F. Gentleman and R.C. Mullin, The distribution of the frequency of occurrence of nucleotide subsequences, based on their overlap capability, Biometrics 45 (1989) 35--42.

2. J.F. Gentleman, The distribution of the frequency of subsequences in alphabetic sequences, as exemplified by deoxyribonucleic acid, J. Roy. Statist. Soc. Ser. C 43 (1994) 401--414.

3. B.-L. Hao, Fractals from genomes--exact solutions of a biology-inspired problem, Physica A 282 (2000) 225--246.

4. J. Noonan and D. Zeilberger, The Goulden-Jackson cluster method: extensions, applications and implementations, J. Differ. Equations Appl. 5 (1999) 355--377.

5. O.E. Percus and P.A. Whitlock, Theory and application of Marsaglia's monkey test for pseudorandom number generators, ACM Transactions on Modeling and Computer Simulation 5 (1995) 87--100.

6. P.A. Pevzner, M.Y. Borodovsky, and A.A. Mironov, Linguistics of nucleotide sequences I: The significance of deviations from mean statistical characteristics and prediction of the frequencies of occurrence of words, J. Biomol. Struct. Dyn. 6 (1989) 1013--1026.

7. G. Reinert, S. Schbath, and M.S. Waterman, Probabilistic and statistical properties of words: an overview, J. Comput. Biol. 7 (2000) 1--46.

8. M.S. Waterman, Introduction to Computational Biology, Chapman & Hall, London, 1995.

9. H.-M. Xie and B.-L. Hao, Visualization of K-tuple distribution in procaryote complete genomes and their randomized counterparts, In: CSB Bioinformatics Conference Proceedings, IEEE Computer Society, Los Alamitos, CA, 2002, pp. 31--42.