M. Chetty, A. Ngom, and S. Ahmad (Eds.): PRIB 2008, LNBI 5265, pp. 359–372, 2008.
© Springer-Verlag Berlin Heidelberg 2008
Multi-relational Data Mining for Tetratricopeptide
Repeats (TPR)-Like Superfamily Members in Leishmania
spp.: Acting-by-Connecting Proteins
Karen T. Girão, Fátima C.E. Oliveira, Kaio M. Farias, Italo M.C. Maia,
Samara C. Silva, Carla R.F. Gadelha, Laura D.G. Carneiro, Ana C.L. Pacheco,
Michel T. Kamimura, Michely C. Diniz, Maria C. Silva, and Diana M. Oliveira
Núcleo Tarcisio Pimenta de Pesquisa Genômica e Bioinformática – NUGEN, Faculdade de
Veterinária, Universidade Estadual do Ceará – UECE, Av. Paranjana, 1700 – Campus do
Itaperi, Fortaleza, CE 60740-000 Brazil
diana.magalhaes@uece.br
Abstract. The multi-relational data mining (MRDM) approach looks for
patterns that involve multiple tables from a relational database made of
complex/structured objects whose normalized representation requires
multiple tables. We applied MRDM methods (relational association rule
discovery and probabilistic relational models) together with hidden Markov
models (HMMs) and the Viterbi algorithm (VA) to mine tetratricopeptide
repeat (TPR), pentatricopeptide repeat (PPR) and half-a-TPR (HAT) motifs in
the genomes of pathogenic protozoa of the genus Leishmania. TPR is a
protein-protein interaction module, and TPR-containing proteins (TPRPs) act
as scaffolds for the assembly of different multiprotein complexes. Our aim is
to build a comprehensive panel of the TPR-like superfamily of Leishmania.
Distributed relational state representations for complex stochastic processes
were applied to the identification, clustering and classification of Leishmania
genes, and we detected 104 putative TPRPs, 36 PPRPs and 8 HATPs, together
comprising the TPR-like superfamily. We also compared currently available
resources (Pfam, SMART, SUPERFAMILY and TPRpred) with our approach
(MRDM/HMM/VA).
Keywords: Multi-relational data mining, hidden Markov models, Viterbi
algorithm, tetratricopeptide repeat motif, Leishmania proteins.
1 Introduction
Early efforts in bioinformatics concentrated on finding the internal structure of
individual genome-wide data sets. With the explosion of the 'omics' technologies,
comprehensive coverage of the multiple aspects of cellular/organellar physiology is
progressing rapidly, generating vast amounts of data on mRNA profiles, protein and
metabolite abundances, and protein interactions. Making sense of these data calls for
a systems-level approach that integrates all of the known properties of a given class
of components (e.g., protein abundance, localization, physical interactions) with
computational methods able to combine large and heterogeneous sets of data [1]. A