A Framework for Predicting Proteins 3D Structures

Rehab Duwairi¹
Department of Computer Science and Engineering, Qatar University, Doha, Qatar
Rehab.Duwairi@qu.edu.qa

Amal Kassawneh
Department of Computer Science, Jordan University of Science and Technology, Irbid 22110, Jordan
amal_khassawneh@hotmail.com

¹ On sabbatical leave from Jordan University of Science and Technology, Irbid, Jordan (rehab@just.edu.jo)

Abstract

This paper proposes a framework for predicting protein three-dimensional structures from their primary sequences. The proposed method exploits the inherently multi-label and hierarchical nature of proteins to build a multi-label, hierarchical classifier for predicting protein folds. The classifier predicts protein folds in two stages: in the first stage it predicts the protein structural class, and in the second stage it predicts the protein fold. Compared with SVM, Naïve Bayes, and Boosted C4.5, our technique achieves higher accuracy than SVM and Naïve Bayes when using the composition, secondary structure, and hydrophobicity feature attributes, and higher accuracy than C4.5 when using the composition, secondary structure, hydrophobicity, and polarity feature attributes. MuLAM was used as the base classifier in the hierarchy of the implemented framework. Two major modifications were made to MuLAM: its pheromone update and term selection strategies were altered.

1. Introduction

Proteins are large molecules made up of subunits called amino acids. The 20 standard amino acids are distinguished by their chemical properties, which cause protein chains to fold into specific 3D structures that determine their functions in the cell. There are four levels of protein structure. The primary structure refers to the sequence of amino acids (called the backbone).
The secondary structure is the ordered structure created by hydrogen bonding within the protein backbone, which forms three major states: the α-helix, the β-sheet, and the loop. The tertiary structure is formed by the folding of a single polypeptide chain into 3D domains, and the quaternary structure involves the association of two or more polypeptide chains [1, 2].

One of the challenges in the biological sciences is to understand patterns in protein folding well enough to predict the 3D structure of a particular protein from its primary structure (the linear sequence of amino acids). There have been several approaches to protein three-dimensional structure prediction, including statistical techniques [3], neural networks [4], hidden Markov models [5], support vector machines [6], nearest neighbor methods [7, 10], and energy minimization [8].

Several approaches to predicting protein folds treat the problem as a single-label classification task [14, 45, 46], where there is only one class label to be predicted. The major contribution of our work is to use multi-label classification, where there are two or more class labels to be predicted, and to use a hierarchical classifier that predicts the structural classes and folds of proteins simultaneously, following the natural hierarchy of proteins. As a result, the classification contains one or more predictions, each involving a different class label from a different level of the protein hierarchy. MuLAM [11] was used at the protein structural class level and at the protein fold level. MuLAM is an ant colony optimization (ACO) algorithm [15].

The contributions of this paper are summarized as follows. A new framework for predicting protein folds, called MH-PRO, is introduced. This framework is based on a multi-label and hierarchical classifier. The way MuLAM adds a term to a classification rule was modified.
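The two-level prediction described above (structural class first, then a fold within that class) can be illustrated with a minimal Python sketch. It is not the MH-PRO implementation: the base classifier here is a simple nearest-centroid stand-in rather than MuLAM, and the names `NearestCentroid` and `TwoStageClassifier`, the two-dimensional feature vectors, and the labels `alpha`, `beta`, `A1`, `A2`, `B1`, `B2` are all hypothetical and chosen only for illustration.

```python
import math
from collections import defaultdict


class NearestCentroid:
    """Stand-in base classifier (an assumption, NOT MuLAM):
    predicts the label whose training centroid is closest."""

    def fit(self, X, y):
        sums, counts = {}, defaultdict(int)
        for x, label in zip(X, y):
            prev = sums.get(label, [0.0] * len(x))
            sums[label] = [s + v for s, v in zip(prev, x)]
            counts[label] += 1
        # Centroid = per-feature mean of the examples with that label.
        self.centroids = {
            label: [s / counts[label] for s in total]
            for label, total in sums.items()
        }
        return self

    def predict(self, x):
        return min(self.centroids,
                   key=lambda label: math.dist(x, self.centroids[label]))


class TwoStageClassifier:
    """Hypothetical two-level hierarchy: one model for the structural
    class, plus one fold model per structural class."""

    def __init__(self, base=NearestCentroid):
        self.base = base

    def fit(self, X, classes, folds):
        # Stage 1 model: trained on all examples, structural-class labels.
        self.class_model = self.base().fit(X, classes)
        # Stage 2 models: each trained only on examples of its own class.
        grouped = defaultdict(lambda: ([], []))
        for x, c, f in zip(X, classes, folds):
            grouped[c][0].append(x)
            grouped[c][1].append(f)
        self.fold_models = {
            c: self.base().fit(xs, fs) for c, (xs, fs) in grouped.items()
        }
        return self

    def predict(self, x):
        structural_class = self.class_model.predict(x)       # stage 1
        fold = self.fold_models[structural_class].predict(x)  # stage 2
        return structural_class, fold  # one label per hierarchy level


# Toy data: feature 0 separates the classes, feature 1 separates the folds.
X = [[1.0, 0.0], [1.0, 1.0], [0.0, 0.0], [0.0, 1.0]]
classes = ["alpha", "alpha", "beta", "beta"]
folds = ["A1", "A2", "B1", "B2"]

model = TwoStageClassifier().fit(X, classes, folds)
print(model.predict([0.9, 0.2]))
```

Each prediction returns one label per level of the hierarchy, and the fold model consulted in the second stage depends on the class chosen in the first, so an error at the class level propagates to the fold level in this simplified design.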
The original MuLAM relies on roulette wheel selection to choose the terms added to the current rule (this is a random technique). The current work utilizes a term-correlation technique for adding terms to a
978-1-4244-1968-5/08/$25.00 ©2008 IEEE 37