A Hybrid Approach for Indexing and Searching Protein Structures

Tarek F. Gharib
Faculty of Computer and Information Sciences, Ain Shams University, Cairo 11566, Egypt
E-mail: tgharib@eun.eg

Abstract: - Searching for structural similarities among proteins plays a central role in bioinformatics. Most bioinformatics tasks depend on investigating the sequence or structure of homologous proteins; these tasks range from predicting protein structure to determining the sites in a protein structure where a drug can attach. The protein structure comparison problem is therefore extremely important in many tasks: it can be used to determine the function of a protein, to cluster a given set of proteins by their structure, and to assess protein fold predictions. Protein Structure Indexing using Suffix Array and Wavelet (PSISAW) is a hybrid approach that retrieves similar proteins based on their structures. Indexing the protein structure is one approach to searching for protein similarities. Suffix arrays are used to index the protein structures, and a wavelet transform is used to compress the indexed database. Compressing the indexed database is intended to reduce both search time and memory usage, at the cost of some accuracy within an acceptable error rate. Experimental results on the Structural Classification of Proteins (SCOP) dataset show that the proposed approach outperforms existing similar techniques in memory utilization and search speed. The results show an improvement in memory usage of 50%.

Key-Words: - protein structures, indexing, suffix array, wavelet

1 Introduction

Searching for structural similarities plays a critical role in many applications, such as prediction of protein structure and function, classification of proteins, and drug design and discovery. Proteins with homologous sequences or structures can be inferred to share a common ancestor, which helps in better understanding the tree of life.
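To make the two building blocks named in the abstract concrete, the following is a minimal sketch of a suffix array lookup and of one level of a Haar wavelet transform. It is illustrative only and is not the PSISAW implementation: the encoding of a structure as a string over the letters H, E, and L, the choice of the Haar wavelet, and the function names `build_suffix_array`, `find_occurrences`, and `haar_level` are all assumptions made for this example.

```python
def build_suffix_array(text):
    """Return the start positions of all suffixes of text in sorted order.

    Naive construction (sorting the suffixes directly); adequate for a
    small illustration, not for a large database.
    """
    return sorted(range(len(text)), key=lambda i: text[i:])


def find_occurrences(text, suffix_array, pattern):
    """Return all start positions of pattern in text, using binary search."""
    m = len(pattern)

    # Lower bound: first suffix whose m-character prefix is >= pattern.
    lo, hi = 0, len(suffix_array)
    while lo < hi:
        mid = (lo + hi) // 2
        if text[suffix_array[mid]:suffix_array[mid] + m] < pattern:
            lo = mid + 1
        else:
            hi = mid
    start = lo

    # Upper bound: first suffix whose m-character prefix is > pattern.
    hi = len(suffix_array)
    while lo < hi:
        mid = (lo + hi) // 2
        if text[suffix_array[mid]:suffix_array[mid] + m] <= pattern:
            lo = mid + 1
        else:
            hi = mid

    return sorted(suffix_array[start:lo])


def haar_level(values):
    """Apply one level of the Haar transform to an even-length sequence.

    Returns (approximation, detail): pairwise averages and pairwise
    half-differences.
    """
    if len(values) % 2 != 0:
        raise ValueError("haar_level expects an even number of values")
    approximation = [(values[i] + values[i + 1]) / 2
                     for i in range(0, len(values), 2)]
    detail = [(values[i] - values[i + 1]) / 2
              for i in range(0, len(values), 2)]
    return approximation, detail


# Hypothetical structure string: H = helix, E = strand, L = loop.
structure = "HHEELLHHEE"
index = build_suffix_array(structure)
matches = find_occurrences(structure, index, "HHEE")

approximation, detail = haar_level([4.0, 2.0, 5.0, 5.0])
```

In this sketch, `matches` holds the positions at which the query pattern occurs, and keeping only `approximation` halves the number of stored values while discarding the information in `detail`, which is the kind of trade-off between memory and accuracy that lossy compression involves.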
Several methods have been proposed to compare protein structures and measure the degree of structural similarity between them. These methods are based on alignment of secondary structure elements as well as alignment of intra- and inter-molecular atomic distances [5]. The following are some of the reasons why the structure comparison problem is extremely important [7]:

1. For determining function: The function of a new protein can be determined by comparing its structure to known ones. That is, given a set of proteins whose folds have already been determined and whose functions are known, if a new protein has a fold highly similar to a known one, then its function will be similar as well. This type of problem calls for the design of search algorithms for 3D databases, where a match must be based on structural similarity. Analogous problems have already been studied in Computational Geometry and Computer Vision, where a geometric form or object has to be identified by comparing it to a set of known ones.

2. For clustering: Given a set of proteins and their structures, we may want to cluster them into families based on structural similarity. Furthermore, we may want to identify a consensus structure for each family. In this case, we would have to solve a multiple structure alignment problem.

3. For assessment of fold predictions: The Model Assessment Problem is the following: given a set of “tentative” folds for a protein and a “correct” one (determined experimentally), which of the guesses is closest to the true fold? This is, for example, the problem faced by the CASP (Critical Assessment of Structure Prediction) jurors, in a biennial competition where many research groups try to predict protein structure from sequence. The large number of predictions submitted

WSEAS TRANSACTIONS on COMPUTERS Tarek F. Gharib ISSN: 1109-2750 966 Issue 6, Volume 8, June 2009