Journal of Theoretical and Applied Information Technology
© 2005 - 2009 JATIT. All rights reserved. www.jatit.org
Vol6. No1. (pp 101 - 105)

A PROBABILISTIC NEURAL NETWORK APPROACH FOR PROTEIN SUPERFAMILY CLASSIFICATION

PV NAGESWARA RAO¹ (NAGESH@GITAM.EDU), T UMA DEVI¹, DSVGK KALADHAR¹, GR SRIDHAR², ALLAM APPA RAO³
¹ GITAM University, ² Endocrine Society, Visakhapatnam, ³ JNTU, Kakinada, India

ABSTRACT

The protein superfamily classification problem, which consists of determining the superfamily membership of a given unknown protein sequence, is important to biologists for many practical reasons, such as drug discovery, prediction of molecular function and medical diagnosis. In this work, we propose a new approach to protein classification based on a Probabilistic Neural Network and feature selection. Our goal is to predict the functional family of novel protein sequences based on features extracted from the protein's primary structure, i.e., from the sequence only. For this purpose, datasets extracted from the Protein Data Bank (PDB), a curated protein family database, are used as training datasets. In the experiments conducted, the performance of the classifier is compared with other known data mining approaches and sequence comparison methods. The computational results show that the proposed method performs better than the others and looks promising for problems with similar characteristics.

Key words: Probabilistic Neural Network, Classification, Feature Extraction, Bioinformatics.

1. INTRODUCTION

Proteins are complex organic macromolecules made up of amino acids. They are fundamental components of all living cells and include many substances, such as enzymes, structural elements and antibodies, which are directly related to the functioning of an organism [1]. Hence, knowledge of the biological actions (functions) of proteins is very important.
Until recently, the functions of proteins could be identified only by time-consuming and expensive experiments. However, in the post-genomic era, with the huge amount of available sources of information, new challenges arise in protein function characterization [2]. Moreover, computer-based methods to assist in this process are becoming increasingly important. The need for faster sequence classification algorithms has been demonstrated by Cameron G et al. [3].

2. RELATED WORKS

Techniques used for biological sequence classification fall into two categories:

Similarity search: This approach classifies unlabeled test sequences by searching for either global or local similarities in the sequences. Global similarity search involves either pairwise sequence comparison or multiple sequence alignment. Local similarity search finds patterns in sequences [4].

Machine Learning: This approach was surveyed by Haussler D [5]. Various machine learning techniques have been applied to biological sequence classification. For example, hidden Markov models have been used in gene identification as well as in protein family modeling. Neural networks have been applied to the analysis of biological sequences [6].

3. THE PROPOSED METHOD

Feature Extraction: The majority of real-world classification problems require supervised learning, where the underlying class probabilities and class-conditional probabilities are unknown and each instance is associated with a class label. In real-world situations, relevant features are often unknown a priori. Therefore, many candidate