Pattern occurrences in multicomponent models

Massimiliano Goldwurm, Violetta Lonati
Dipartimento di Scienze dell’Informazione, Università degli Studi di Milano
Via Comelico 39/41, 20135 Milano – Italy, {goldwurm,lonati}@dsi.unimi.it

September 2004

Abstract

In this paper we determine some limit distributions of pattern statistics in rational stochastic models, defined by means of nondeterministic weighted finite automata. We present a general approach to analyze these statistics in rational models having an arbitrary number of connected components. We explicitly establish the limit distributions in the most significant cases; these are characterized by a family of unimodal density functions defined by polynomials over adjacent intervals.

Keywords: Automata and Formal Languages, Limit Distributions, Non-negative Matrices, Pattern Statistics, Rational Formal Series.

1 Introduction

This work presents some results on the limit distribution of pattern statistics. The major problem in this context is to estimate the frequency of pattern occurrences in a random text. This is a classical problem with applications in several research areas of computer science and biology: for instance, it is considered in connection with the search for motifs in DNA sequences [6, 14], while the earlier motivations are related to code synchronization [10] and approximate pattern matching [12, 18, 5]. In the usual setting, established in the seminal paper [11] and developed in many subsequent works (see for instance [15, 13, 3]), one considers a finite alphabet, a set of patterns over that alphabet, and a probabilistic source generating words at random over the alphabet, and studies the number of occurrences of elements of the pattern set in a word of given length generated by the source. Typical goals are the asymptotic evaluation of the moments of this number, its limit distribution (also in the local sense) and the corresponding large deviations.
These results depend in particular on the stochastic model of the source, which is usually assumed to be a Bernoulli or a Markovian model. A rather general result is given in [13], where Gaussian limit distributions are obtained for any regular set of patterns and any Markovian source, under a primitivity hypothesis on the associated stochastic matrix. This result is extended in [2] to the so-called rational stochastic model, where the text is generated at random according to a probability distribution defined by means of a rational formal series in non-commutative variables. In particular cases, this is simply the uniform distribution over the set of words of given length in an arbitrary regular language. For this reason, results for this model are also related to the analysis of additive functions over strings [9]. The rational stochastic model properly extends the Markovian models in the following sense: the frequency problem of regular patterns in a text generated in the Markovian model (as studied in [13]) is a special case of the frequency problem of a single symbol in a text over a binary alphabet generated in the rational stochastic model; it is also known that the two models are not equivalent [2]. We recall that extensions of the Markovian models have already been considered in the literature [3]. Furthermore, finding results under more general probabilistic assumptions is of interest since, for some applications, the Markovian models seem to be too restrictive.
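As a small illustration of the special case mentioned above (the uniform distribution over the words of given length in a regular language), the following sketch computes the exact distribution of the number of occurrences of one symbol by dynamic programming over a deterministic automaton. The function name, the automaton encoding and the example language (binary words with no two consecutive a's) are assumptions made for this illustration only; they are not taken from the cited works, and the brute-force count is not the analytic method used to derive limit distributions.

```python
from fractions import Fraction


def symbol_count_distribution(transitions, start, accepting, n, symbol="a"):
    """Exact distribution of the number of occurrences of `symbol` in a word
    drawn uniformly at random among the length-n words accepted by a DFA.

    `transitions` maps (state, letter) -> state; missing pairs are rejected.
    """
    # counts[state][k] = number of words read so far that lead from `start`
    # to `state` and contain exactly k occurrences of `symbol`.
    counts = {start: {0: 1}}
    for _ in range(n):
        nxt = {}
        for state, by_k in counts.items():
            for (src, letter), dst in transitions.items():
                if src != state:
                    continue
                bump = 1 if letter == symbol else 0
                row = nxt.setdefault(dst, {})
                for k, c in by_k.items():
                    row[k + bump] = row.get(k + bump, 0) + c
        counts = nxt

    # Keep only the words that end in an accepting state.
    totals = {}
    for state, by_k in counts.items():
        if state in accepting:
            for k, c in by_k.items():
                totals[k] = totals.get(k, 0) + c

    size = sum(totals.values())
    if size == 0:
        raise ValueError("the language contains no word of length n")
    return {k: Fraction(c, size) for k, c in sorted(totals.items())}


# Hypothetical example: words over {a, b} with no two consecutive a's.
# State 1 means "the last letter read was a".
NO_AA = {(0, "a"): 1, (0, "b"): 0, (1, "b"): 0}
dist = symbol_count_distribution(NO_AA, start=0, accepting={0, 1}, n=4)
for k, p in dist.items():
    print(k, p)
```

For this example language there are 8 admissible words of length 4, and the symbol a occurs 0, 1 or 2 times with probabilities 1/8, 1/2 and 3/8 respectively. Exact rational arithmetic is used so that the probabilities sum to 1 without rounding error.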