ON PATTERN FREQUENCY OCCURRENCES IN A MARKOVIAN SEQUENCE ∗†

Mireille Régnier ‡
INRIA
Rocquencourt
78153 Le Chesnay Cedex
France
Mireille.Regnier@inria.fr

Wojciech Szpankowski §
Department of Computer Science
Purdue University
W. Lafayette, IN 47907
U.S.A.
spa@cs.purdue.edu

Abstract

Consider a given pattern H and a random text T generated by a Markovian source. We study the frequency of pattern occurrences in a random text when overlapping copies of the pattern are counted separately. We present exact and asymptotic formulæ for moments (including the variance), and for the probability of r pattern occurrences in three different regions of r, namely: (i) r = O(1), (ii) the central limit regime, and (iii) the large deviations regime. In order to derive these results, we first construct certain language expressions that characterize pattern occurrences, which are later translated into generating functions. We then use analytical methods to extract the asymptotic behavior of the pattern frequency from the generating functions. These findings are of particular interest to molecular biology problems (e.g., finding patterns with unexpectedly high or low frequencies, and gene recognition), information theory (e.g., second-order properties of the relative frequency), and pattern matching algorithms (e.g., q-gram algorithms).

Key Words: Frequency of pattern occurrences, Markov source, autocorrelation polynomials, languages, generating functions, asymptotic analysis, large deviations.

∗ This paper was presented in part at the 1997 Intern. Symp. on Information Theory, Ulm, Germany.
† This research was supported by NATO Collaborative Grant CRG.950060. Part of this work was done during the authors' visits at Purdue University and at INRIA, Rocquencourt.
‡ This work was additionally supported by ESPRIT LTR Project No. 20244 (ALCOM-IT) and GREG "Motifs dans les Sequences".
§ This research was additionally supported by NSF Grants CCR-9201078, NCR-9206315 and NCR-9415491.