Boosting classifier for predicting protein domain structural class

Kai-Yan Feng a, Yu-Dong Cai b, Kuo-Chen Chou c,*

a Imaging Science and Biomedical Engineering, Medical School, The University of Manchester, Manchester, M13 9PT, UK
b Biomolecular Sciences Department, University of Manchester Institute of Science and Technology, Post Box 88, Manchester, M60 1QD, UK
c Gordon Life Science Institute, 13784 Torrey Del Mar, San Diego, CA 92130, USA

Received 9 June 2005
Available online 27 June 2005

Abstract

A novel classifier, the so-called "LogitBoost" classifier, was introduced to predict the structural class of a protein domain according to its amino acid sequence. LogitBoost is characterized by a log-likelihood loss function, which reduces its sensitivity to noise and outliers, and by performing classification through the combination of many weak classifiers into a very strong and robust classifier. Jackknife cross-validation tests demonstrated that LogitBoost outperformed other classifiers, including the "support vector machine," a very powerful classifier widely used in the biological literature. It is anticipated that LogitBoost can also become a useful vehicle for classifying other attributes of proteins according to their sequences, such as subcellular localization and enzyme family class, among many others.

© 2005 Elsevier Inc. All rights reserved.

Keywords: Domain structural classification; Binary LogitBoost; One-vs-others LogitBoost; AdaBoost; Support vector machines; Neural network

Although the details of the three-dimensional structures of proteins and the domains therein are extremely complicated and irregular, their overall folding patterns are surprisingly simple, regular, and strikingly beautiful from an aesthetic point of view [1–4]. Many protein domains have similar or identical folding patterns even if their sequences are quite different [5–8].
About three decades ago, Levitt and Chothia tried to classify proteins into the following four structural classes: (1) all-α (Fig. 1A), formed essentially by α-helices; (2) all-β (Fig. 1B), formed essentially by β-strands; (3) α/β (Fig. 1C), containing both α-helices and β-strands that are largely interspersed, forming mainly parallel β-sheets; and (4) α + β (Fig. 1D), also containing both secondary structure elements, which, however, are largely segregated, forming mainly antiparallel β-sheets. The structural class has since become an important attribute for characterizing the overall folding type of a protein or its domain.

Prediction of protein structural class is an important topic in protein science (see, e.g., the review [9]). A series of previous studies have shown that some correlation exists between protein structural class and amino acid composition, and many efforts have been made to predict the structural classes of proteins based on their amino acid composition [10–20]. Here we introduce a novel approach, the so-called "LogitBoost" [21], for predicting protein structural classes. Because an individual domain is the most basic unit in structural classification [22], the present study focuses on protein domains.

* Corresponding author. E-mail address: kchou@san.rr.com (K.-C. Chou).
Biochemical and Biophysical Research Communications 334 (2005) 213–217. doi:10.1016/j.bbrc.2005.06.075. www.elsevier.com/locate/ybbrc

Boosting algorithms

Boosting was originally proposed to combine several weak classifiers in order to improve classification performance. Boosting has been used to solve various