Modelling Proteolytic Enzymes With Support Vector Machines

Lionel Morgado 1*, Carlos Pereira 1,2, Paula Veríssimo 3, António Dourado 1

1 Center for Informatics and Systems of the University of Coimbra, Polo II - University of Coimbra, 3030-290 Coimbra, Portugal
2 Instituto Superior de Engenharia de Coimbra, Quinta da Nora, 3030-199 Coimbra, Portugal
3 Department of Biochemistry and Center for Neuroscience and Cell Biology of the University of Coimbra, 3004-517 Coimbra, Portugal

Summary

The intense activity in proteomics during the last decade has produced huge amounts of data on proteins about which knowledge is still limited. Retrieving information from these proteins is the next step, and for that, computational techniques are indispensable. Although there is not yet a silver-bullet approach to the problem of enzyme detection and classification, machine learning formulations such as the state-of-the-art Support Vector Machine (SVM) appear among the most reliable options. An SVM-based framework for peptidase analysis that recognizes the hierarchies demarcated in the MEROPS database is presented. Feature selection with SVM-RFE is used to improve the discriminative models and to build classifiers that are computationally more efficient than alignment-based techniques.

1 Introduction

During the last decade massive amounts of protein data have been collected, making the proteomics field attractive to the data mining and machine learning communities. The automated classification of proteins has classically been done by means of sequence alignment methods like BLAST and PSI-BLAST [1], which search a database for homologues. These approaches are computationally expensive: the time taken to get a single prediction for a real-world sample on an ordinary computer can reach several minutes when large databases are used. Under these conditions, analyzing an average-size proteome with a few hundred thousand samples can take a month.
It is therefore important to find other means of obtaining answers in a more acceptable period. The Support Vector Machine (SVM) [18] is among the most successful methods applied to protein classification and appears to be a good candidate for solving the problem of peptidase categorization. Since protein classification is a fundamental task in biology, there is a vast body of work on discriminative classifiers dedicated to subjects such as homology detection [3, 4, 5, 6, 7, 8], structure recognition [9, 10, 11], and protein localization [12, 13], among others. Other important problems in molecular biology include peptidase detection and classification. Peptidases (also known as proteases or proteolytic enzymes) are proteins that can catalyze biochemical reactions like digestion, signal transduction or cell regulation, and represent around 2% of the proteins from

* To whom correspondence should be addressed. E-mail: lionel@dei.uc.pt

Journal of Integrative Bioinformatics, 8(3):170, 2011. http://journal.imbio.de doi:10.2390/biecoll-jib-2011-170
Copyright 2011 The Author(s). Published by Journal of Integrative Bioinformatics. This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivs 3.0 Unported License (http://creativecommons.org/licenses/by-nc-nd/3.0/).