A. Sanfeliu et al. (Eds.): CIARP 2004, LNCS 3287, pp. 478–486, 2004. © Springer-Verlag Berlin Heidelberg 2004

A Fast Algorithm to Find All the Maximal Frequent Sequences in a Text

René A. García-Hernández, José Fco. Martínez-Trinidad, and Jesús Ariel Carrasco-Ochoa

National Institute of Astrophysics, Optics and Electronics (INAOE), Puebla, México
{renearnulfo,fmartine,ariel}@inaoep.mx

Abstract. One of the sequential pattern mining problems is to find the maximal frequent sequences in a database for a given support β. In this paper, we propose a new algorithm to find all the maximal frequent sequences in a text instead of a database. In comparison with typical sequential pattern mining algorithms, our algorithm avoids the joining, pruning and text scanning steps. Experiments show that it is possible to obtain all the maximal frequent sequences in a few seconds for medium-sized texts.

1 Introduction

Knowledge Discovery in Databases (KDD) is defined by Fayyad [1] as “the nontrivial process of identifying valid, novel, potentially useful and ultimately understandable patterns in data”. The key step in the knowledge discovery process is the data mining step, which, following Fayyad, consists of “applying data analysis and discovery algorithms that, under acceptable computational efficiency limitations, produce a particular enumeration of patterns over the data”. This definition has been extended to Text Mining (TM) as “applying text analysis and discovery algorithms that, under acceptable computational efficiency limitations, produce a particular enumeration of patterns over the text”. TM is thus the process that deals with the extraction of patterns from textual data. This definition is used by Feldman [2] to define Knowledge Discovery in Texts (KDT). In both KDD and KDT tasks, special attention must be paid to the performance of the algorithms, because they are applied to large amounts of information.
In particular, the KDT process needs to define simple structures that can be extracted from texts automatically and in a reasonable time. These structures must be rich enough to allow interesting KD operations [2].

Frequent sequences are of interest in areas such as data compression, human genome analysis and the KDD process. Some of these areas are more interested in the maximal frequent sequences (MFS), because they search for the longest pattern that could match or be extracted from the database. The sequential pattern mining problem is defined by Agrawal [3] as the problem of finding MFS in a database; this is a data mining problem. We are therefore interested in finding all the MFS in a text, in order to do text mining for the KDT process.

MFS have received special attention in TM because this kind of pattern can be extracted from text independently of the language. They are also human-readable patterns or descriptors of the text. MFS can be used to predict or to determine the causality of an event. For information retrieval systems, MFS can be used to find keywords; in this case, the MFS are key phrases. MFS allow constructing links between docu-