M. Chetty, A. Ngom, and S. Ahmad (Eds.): PRIB 2008, LNBI 5265, pp. 359–372, 2008. © Springer-Verlag Berlin Heidelberg 2008

Multi-relational Data Mining for Tetratricopeptide Repeats (TPR)-Like Superfamily Members in Leishmania spp.: Acting-by-Connecting Proteins

Karen T. Girão, Fátima C.E. Oliveira, Kaio M. Farias, Italo M.C. Maia, Samara C. Silva, Carla R.F. Gadelha, Laura D.G. Carneiro, Ana C.L. Pacheco, Michel T. Kamimura, Michely C. Diniz, Maria C. Silva, and Diana M. Oliveira

Núcleo Tarcisio Pimenta de Pesquisa Genômica e Bioinformática – NUGEN, Faculdade de Veterinária, Universidade Estadual do Ceará – UECE, Av. Paranjana, 1700 – Campus do Itaperi, Fortaleza, CE 60740-000 Brazil
diana.magalhaes@uece.br

Abstract. Multi-relational data mining (MRDM) looks for patterns that span multiple tables of a relational database, which suits complex, structured objects whose normalized representation requires multiple tables. We have applied MRDM methods (relational association rule discovery and probabilistic relational models) together with hidden Markov models (HMMs) and the Viterbi algorithm (VA) to mine tetratricopeptide repeat (TPR), pentatricopeptide repeat (PPR) and half-a-TPR (HAT) motifs in the genomes of the pathogenic protozoa Leishmania. TPR is a protein-protein interaction module, and TPR-containing proteins (TPRPs) act as scaffolds for the assembly of different multiprotein complexes. Our aim is to build a comprehensive panel of the TPR-like superfamily of Leishmania. Distributed relational state representations for complex stochastic processes were applied to the identification, clustering and classification of Leishmania genes, and we were able to detect 104 putative TPRPs, 36 PPRPs and 8 HATPs, which together comprise the TPR-like superfamily. We have also compared currently available resources (Pfam, SMART, SUPERFAMILY and TPRpred) with our approach (MRDM/HMM/VA).
Keywords: Multi-relational data mining, hidden Markov models, Viterbi algorithm, tetratricopeptide repeat motif, Leishmania proteins.

1 Introduction

Early efforts in bioinformatics concentrated on finding the internal structure of individual genome-wide data sets. With the explosion of the 'omics' technologies, comprehensive coverage of the multiple aspects of cellular/organellar physiology is progressing rapidly, generating vast amounts of data on mRNA profiles, protein/metabolic abundances, and protein interactions. Making sense of these data calls for a systems-level approach that integrates all of the known properties of a given class of components (e.g., protein abundance, localization, physical interactions) with computational methods able to combine large and heterogeneous sets of data [1]. A