A Framework for Predicting Proteins 3D Structures
Rehab Duwairi¹
Department of Computer Science and
Engineering, Qatar University, Doha,
Qatar
Rehab.Duwairi@qu.edu.qa.
Amal Kassawneh
Department of Computer Science, Jordan
University of Science and Technology,
Irbid 22110, Jordan,
amal_khassawneh@hotmail.com.
¹ On sabbatical leave from Jordan University of Science and Technology, Irbid, Jordan (rehab@just.edu.jo)
Abstract
This paper proposes a framework for predicting
protein three-dimensional structures from their
primary sequences. The proposed method exploits the
intrinsically multi-label and hierarchical nature of
proteins to build a multi-label, hierarchical
classifier for predicting protein folds. The classifier
predicts protein folds in two stages: in the first stage it
predicts the protein structural class, and in the second
stage it predicts the protein fold. Compared with SVM,
Naïve Bayes, and Boosted C4.5, our technique achieves
higher accuracy than SVM and Naïve Bayes when using
the composition, secondary structure, and hydrophobicity
feature attributes, and higher accuracy than C4.5 when
using the composition, secondary structure,
hydrophobicity, and polarity feature attributes.
MuLAM was used as the base classifier in the hierarchy
of the implemented framework. Two major
modifications were made to MuLAM: its pheromone
update and term selection strategies were altered.
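The two-stage prediction described above can be sketched as follows. This is only an illustrative sketch under stated assumptions, not the MH-PRO implementation: the base classifier here is a toy nearest-centroid model standing in for MuLAM, the use of one fold model per structural class is an assumption about how the second stage is organized, and the feature vectors and the labels `alpha`, `beta`, `fold_a1`, etc. are made-up placeholders.

```python
from collections import defaultdict
from math import dist


class NearestCentroid:
    """Toy stand-in for the base classifier (hypothetical; NOT MuLAM)."""

    def fit(self, X, y):
        groups = defaultdict(list)
        for features, label in zip(X, y):
            groups[label].append(features)
        # One centroid (per-dimension mean) for each label.
        self.centroids = {
            label: [sum(col) / len(rows) for col in zip(*rows)]
            for label, rows in groups.items()
        }
        return self

    def predict(self, features):
        # Return the label whose centroid is closest to the input.
        return min(self.centroids,
                   key=lambda label: dist(features, self.centroids[label]))


class TwoStageClassifier:
    """Stage 1 predicts the structural class; stage 2 predicts the fold."""

    def fit(self, X, classes, folds):
        self.class_model = NearestCentroid().fit(X, classes)
        # Assumption: one fold model per structural class, trained only
        # on the proteins that belong to that class.
        by_class = defaultdict(lambda: ([], []))
        for features, cls, fold in zip(X, classes, folds):
            by_class[cls][0].append(features)
            by_class[cls][1].append(fold)
        self.fold_models = {
            cls: NearestCentroid().fit(xs, fs)
            for cls, (xs, fs) in by_class.items()
        }
        return self

    def predict(self, features):
        cls = self.class_model.predict(features)         # first stage
        fold = self.fold_models[cls].predict(features)   # second stage
        return cls, fold  # one label per level of the hierarchy


# Made-up two-dimensional feature vectors and placeholder labels.
X = [[0.9, 0.1], [0.8, 0.3], [0.1, 0.9], [0.2, 0.7]]
classes = ["alpha", "alpha", "beta", "beta"]
folds = ["fold_a1", "fold_a2", "fold_b1", "fold_b2"]

model = TwoStageClassifier().fit(X, classes, folds)
print(model.predict([0.85, 0.15]))
```

Restricting the second stage to the folds of the predicted class keeps each fold model small, at the cost that a first-stage error cannot be corrected in the second stage.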
1. Introduction
Proteins are large molecules made up of subunits
called amino acids. The chemical properties that
distinguish the 20 standard amino acids cause protein
chains to fold into specific 3D structures, which
determine their functions in the cell. There are four
levels of protein structure. The primary structure refers
to the sequence of amino acids (called the backbone). The
secondary structure is the ordered structure created by
hydrogen bonding within the protein backbone to form
three major states, namely the α-helix, the β-sheet,
and the loop. The tertiary structure is formed by the
folding of a single polypeptide chain to form 3D
domains, and the quaternary structure involves the
association of two or more polypeptide chains [1, 2].
One of the challenges in the biological sciences is to
understand patterns in protein folding well enough to
predict the 3D structure of a particular protein from its
primary structure (the linear sequence of amino acids).
There have been several approaches to protein
three-dimensional structure prediction, including
statistical techniques [3], neural networks [4], hidden
Markov models [5], support vector machines [6], nearest
neighbor methods [7, 10], and energy minimization [8].
Several approaches to predicting protein folds
treat the classification problem as a single-label
classification task [14, 45, 46], in which there is only
one class label to be predicted. The major contribution
of our work is to use multi-label classification, in which
two or more class labels are predicted, together with a
hierarchical classifier that predicts the structural
classes and folds of proteins simultaneously, mirroring
the natural hierarchy of proteins. As a result, the
classification contains one or more predictions, each
involving a class label from a different level of the
protein hierarchy. MuLAM [11] was used at both the
protein structural class level and the protein fold level.
MuLAM is an ant colony optimization (ACO) algorithm
[15]. The contributions of this paper are summarized as
follows:
• A new framework for predicting protein folds,
called MH-PRO, is introduced. This framework is
based on a multi-label and hierarchical classifier.
• The way MuLAM adds a term to a classification
rule was modified. The original MuLAM relies on
the roulette wheel selection technique for
selecting terms to be added to the current rule (this
is a random technique). The current work utilizes a
term-correlation technique for adding terms to a
978-1-4244-1968-5/08/$25.00 ©2008 IEEE 37