ON PATTERN FREQUENCY OCCURRENCES IN A MARKOVIAN SEQUENCE ∗†

Mireille Régnier
INRIA Rocquencourt
78153 Le Chesnay Cedex, France
Mireille.Regnier@inria.fr

Wojciech Szpankowski §
Department of Computer Science
Purdue University
W. Lafayette, IN 47907, U.S.A.
spa@cs.purdue.edu

Abstract

Consider a given pattern H and a random text T generated by a Markovian source. We study the frequency of pattern occurrences in a random text when overlapping copies of the pattern are counted separately. We present exact and asymptotic formulæ for moments (including the variance), and for the probability of r pattern occurrences in three different regions of r, namely: (i) r = O(1), (ii) the central limit regime, and (iii) the large deviations regime. To derive these results, we first construct certain language expressions that characterize pattern occurrences, which are later translated into generating functions. We then use analytical methods to extract the asymptotic behavior of the pattern frequency from the generating functions. These findings are of particular interest to molecular biology problems (e.g., finding patterns with unexpectedly high or low frequencies, and gene recognition), information theory (e.g., second-order properties of the relative frequency), and pattern matching algorithms (e.g., q-gram algorithms).

Key Words: Frequency of pattern occurrences, Markov source, autocorrelation polynomials, languages, generating functions, asymptotic analysis, large deviations.

∗ This paper was presented in part at the 1997 International Symposium on Information Theory, Ulm, Germany.
† This research was supported by NATO Collaborative Grant CRG.950060. Part of this work was done during the authors' visits to Purdue University and to INRIA, Rocquencourt.
This work was additionally supported by ESPRIT LTR Project No. 20244 (ALCOM-IT) and GREG "Motifs dans les Sequences".
§ This research was additionally supported by NSF Grants CCR-9201078, NCR-9206315 and NCR-9415491.