A Word Count Statistic in Computational Biology

Michael S. Waterman
University of Southern California
835 W. 37th Street
Los Angeles, CA 90089-1340
USA


Abstract

Sequence comparison and database searching are among the most frequent and useful activities in computational biology and bioinformatics. The goal is to discover relationships between sequences and thus to suggest previously unknown biological features. As biological sequence databases grow, more efficient comparison methods are required to carry out the large number of comparisons. The statistic considered in this talk is based on the number of k-words common to two random sequences. Estimates of significance use both Poisson and normal approximations to the distribution of the random variables.
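
The abstract does not spell out how the common k-words are counted, so the following is only a minimal sketch under one assumed reading: every pair of positions, one in each sequence, at which the same k-letter word starts is counted as a match. The function names kword_counts and shared_kword_count are hypothetical and chosen for illustration; they do not come from the talk.

```python
from collections import Counter


def kword_counts(seq, k):
    """Count every overlapping k-letter word in seq."""
    # A sequence shorter than k yields an empty range and an empty Counter.
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def shared_kword_count(a, b, k):
    """Number of position pairs (i, j) with a[i:i+k] == b[j:j+k].

    Assumption: this matching-pair count is one plausible reading of
    "the number of k-words common to two sequences".
    """
    counts_a = kword_counts(a, k)
    counts_b = kword_counts(b, k)
    # Each word contributes (occurrences in a) * (occurrences in b).
    return sum(n * counts_b[w] for w, n in counts_a.items() if w in counts_b)


if __name__ == "__main__":
    print(shared_kword_count("ACGTACGT", "CGTA", 2))
```

Counting words once per sequence and then multiplying the tallies keeps the work roughly linear in the combined sequence length, rather than comparing every pair of positions directly. Assessing significance with the Poisson or normal approximations mentioned above would require a model for the random sequences, which this sketch does not attempt.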