Journal of Theoretical and Applied Information Technology
© 2005 - 2009 JATIT. All rights reserved.
www.jatit.org
Vol6. No1. (pp 101 - 105)
A PROBABILISTIC NEURAL NETWORK APPROACH FOR
PROTEIN SUPERFAMILY CLASSIFICATION
PV NAGESWARA RAO¹ (NAGESH@GITAM.EDU), T UMA DEVI¹, DSVGK KALADHAR¹, GR SRIDHAR², ALLAM APPA RAO³
¹ GITAM University, ² Endocrine Society, Visakhapatnam, ³ JNTU, Kakinada, India
ABSTRACT
The protein superfamily classification problem, which consists of determining the superfamily membership
of a given unknown protein sequence, is important to biologists for many practical reasons, such as
drug discovery, prediction of molecular function and medical diagnosis. In this work, we propose a new
approach for protein classification based on a Probabilistic Neural Network and feature selection. Our goal
is to predict the functional family of novel protein sequences based on features extracted from the
protein's primary structure, i.e., from the sequence only. For this purpose, datasets extracted from the
Protein Data Bank (PDB), a curated protein family database, are used as training datasets. In the
experiments conducted, the performance of the classifier is compared with that of other known data mining
approaches and sequence comparison methods. The computational results show that the proposed method
performs better than the others and looks promising for problems with characteristics similar to this one.
Key words: Probabilistic Neural Network, Classification, Feature Extraction, Bioinformatics.
1. INTRODUCTION
Proteins are complex organic macromolecules
made up of amino acids. They are fundamental
components of all living cells and include many
substances, such as enzymes, structural elements
and antibodies, which are directly related to the
functioning of an organism[1]. Hence,
knowledge of the proteins' biological actions
(functions) is very important. Until recently, the
functions of the proteins could be identified only
by time-consuming and expensive experiments.
However, in the post-genomic era, with the huge
amount of available sources of information, new
challenges arise in protein function
characterization[2]. Moreover, computer-based
methods to assist in this process are becoming
increasingly important. The need for faster
sequence classification algorithms has been
demonstrated by Cameron G et al.[3].
2. RELATED WORKS
Techniques used for biological sequence
classification fall into two categories:
Similarity search: This approach classifies
unlabeled test sequences by searching for either
global or local similarities in the
sequences. Global similarity search involves
either pair-wise sequence comparison or multiple
sequence alignment. Local similarity search
finds patterns in sequences[4].
Machine Learning: This approach was surveyed
by Haussler D[5]. Various machine learning
techniques have been applied to biological
sequence classification. For example, hidden
Markov models have been used in gene
identification as well as protein family modeling.
Neural networks have been applied to the analysis
of biological sequences[6].
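As an illustration of the similarity search category described above, the following is a minimal sketch of classification by global pair-wise comparison. It is not the method proposed in this paper. It assumes a Needleman-Wunsch style global alignment score with a simple match/mismatch/gap scheme; the scoring values, the function names and the toy sequences are hypothetical and chosen only for the example.

```python
def global_alignment_score(a, b, match=1, mismatch=-1, gap=-2):
    """Global (Needleman-Wunsch) alignment score of sequences a and b.

    The scoring parameters are illustrative assumptions; practical tools
    usually use a substitution matrix and affine gap penalties instead.
    """
    # prev[j] holds the best score for aligning a[:i-1] with b[:j].
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * gap]  # aligning a[:i] against an empty prefix of b
        for j in range(1, len(b) + 1):
            pair = match if a[i - 1] == b[j - 1] else mismatch
            curr.append(max(prev[j - 1] + pair,   # align a[i-1] with b[j-1]
                            prev[j] + gap,        # gap in b
                            curr[j - 1] + gap))   # gap in a
        prev = curr
    return prev[-1]


def classify_by_similarity(query, labeled):
    """Assign the label of the most similar labeled sequence to query.

    labeled is a list of (sequence, label) pairs.
    """
    best_seq, best_label = max(
        labeled, key=lambda item: global_alignment_score(query, item[0])
    )
    return best_label


# Toy data, hypothetical and for illustration only.
training = [("MKVLAA", "family_A"), ("GGGGGG", "family_B")]
predicted = classify_by_similarity("MKVLA", training)
```

Each query is compared against every labeled sequence, so the cost grows with both the size of the database and the sequence lengths. This is one reason faster classification methods are of interest.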
3. THE PROPOSED METHOD
Feature Extraction: The majority of real-world
classification problems require supervised learning
where the underlying class probabilities and class-
conditional probabilities are unknown, and each
instance is associated with a class label. In real-
world situations, relevant features are often
unknown a priori. Therefore, many candidate