Protein Sequence Pattern Mining with Constraints

Pedro Gabriel Ferreira ⋆, Paulo J. Azevedo ⋆⋆

University of Minho, Department of Informatics
Campus of Gualtar, 4710-057 Braga, Portugal
{pedrogabriel,pja}@di.uminho.pt

Abstract. Biological sequence databases typically have a small alphabet, very long sequences, and a relatively small size (several hundred sequences). Considering these characteristics, we propose a new sequence mining algorithm (gIL). gIL was developed for linear sequence pattern mining and results from the combination of some of the most efficient techniques used in sequence and itemset mining. The algorithm is highly adaptable, allowing various types of features to be introduced smoothly and directly into the mining process, namely the extraction of rigid and arbitrary gap patterns. Both breadth-first and depth-first traversals are possible. Experimental evaluation on synthetic and real-life protein databases shows that our algorithm outperforms state-of-the-art algorithms. The use of constraints has also proved to be a very useful tool for specifying patterns of interest to the user.

1 Introduction

In the development of sequence pattern mining algorithms, two communities can be considered: the Data Mining community and the Bioinformatics community. The algorithms from the Data Mining community inherited some characteristics from association rule mining algorithms. They are best suited for data with many sequences (from hundreds of thousands to millions) of relatively small length (from 10 to 20) and an alphabet of thousands of events, e.g. [9, 7, 11, 1]. In the Bioinformatics community, algorithms are developed to be very efficient when mining a small number of sequences (on the order of hundreds) with large lengths (a few hundred). The alphabet size is typically very small (e.g., 4 for DNA and 20 for protein sequences). We highlight the Teiresias algorithm [6] as a standard.
⋆ Supported by a PhD Scholarship (SFRH/BD/13462/2003) from Fundação Ciência e Tecnologia.
⋆⋆ Supported by Fundação Ciência e Tecnologia - Programa de Financiamento Plurianual de Unidades de I & D, Centro de Ciências e Tecnologias da Computação - Universidade do Minho.

The major problem with sequence pattern mining is that it usually generates too many patterns. When databases attain considerable size or when the average