Protein Structure Searching using Suffix Arrays

Tarek F. Gharib 1, A. Salah 2, I. M. El Henawy 2, Abdel-Badeeh M. Salem 1

1 Faculty of Computer and Information Sciences, Ain Shams University, Cairo 11566, Egypt
E-mail: tgharib@asunet.shams.edu.eg
2 Faculty of Computer and Information Systems, Zagazig University, Zagazig, Egypt

Abstract - Searching for protein similarities using structure-based queries plays a vital role in many applications, such as drug discovery and drug design, disease diagnosis and treatment, and protein classification. Indexing protein structures is one approach to searching them for similarities. In this paper we propose a method that reduces the memory space needed to store the indexed data without affecting other performance criteria. Our technique starts by extracting local feature vectors from the protein structures. The components of these vectors are then normalized. Finally, we use a generalized suffix array to index the vectors. The suffix array is used to return the maximal structural similarities as the result of a structure-based query. The experimental results, which are based on the Structural Classification of Proteins (SCOP) dataset, show that our method outperforms existing similar methods in memory utilization. Our results show an improvement in memory usage by a factor exceeding 50%.

Keywords: protein structures, indexing, suffix array

1 Introduction

The rapid growth of the Protein Data Bank (PDB), with current holdings of more than 40,000 in the last quarter of 2007, raises the need for new tools that perform protein similarity searching to clarify the similarities in the three-dimensional structures of related or similar proteins.
Most of these tools search the protein structure rather than the protein sequence, which is a sequence of amino acid molecules, because of the relation between a protein's shape and its function. In other words, proteins that have similar functionality have similar structures, even though they might not have the same primary structure [3]. But if a set of proteins have the same primary structure, then they will have the same functionality. The importance of the 3-D shape of a protein thus comes from the fact that its function depends on its shape rather than on its sequence (primary structure). A protein's primary structure is a sequence of letters that states the amino acids in the protein. A protein's secondary structure is a 3-D description of the protein as a sequence of local segments.

Searching protein structures has another problem besides the rapid growth in the number of proteins in the PDB, namely complexity. Protein structure alignment is an NP-hard problem, and many methods have been proposed to solve it.

Searching for similarities in a database is a problem that has been approached in several ways. It was first approached by sequence alignment [9], but the link between protein structure and functionality raised the need for structural alignment. That means we can search for partial structural similarities between proteins. Several approaches have been proposed to solve this problem. Pair-wise structural alignment algorithms can perform the alignment at the level of secondary structure elements (SSEs) or at the intra- and inter-molecular atomic level [4]. However, pair-wise alignment is not feasible for large databases with more than a few thousand proteins; PSI belongs to this class [6]. Database searching using information retrieval techniques [7] and indexing the protein structure using suffix trees [5] are examples of approaches that do not follow pair-wise alignment.

The protein structure index (PSI) method prunes unpromising proteins for a given protein query.
It is based on extracting a feature vector for each protein in the database and then indexing it using an R* tree. R* trees are used to prune the search space passed to the VAST structural alignment algorithm; this reduction in the search space reduces the searching time [6].

Protein Structure Indexing using Suffix Trees (PSIST) converts the 3-D structure to a sequence by extracting feature vectors for each protein in the database. The feature vector includes the distance between each two residues and the angle between their planes. Each protein is described as a list of vectors. Each vector is converted to a unique symbol, which maps the list of vectors to a sequence (string) that can be stored in a suffix tree, an indexing structure that speeds up searching [5].

In this paper, we present a method for indexing protein structures. The method starts by extracting feature vectors from the protein structure such that these vectors are invariant to translation and rotation. Each feature vector represents one residue; its components are the two torsion angles phi and psi and the distance between the Cα atom of this residue and the Cα atom of the previous residue. To reduce the vector space, we apply normalization. After normalizing the feature vectors, we convert them to a sequence of symbols.
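The pipeline outlined above (normalize each per-residue vector, map it to a symbol, index the symbol sequences) can be illustrated with a minimal Python sketch. It is not the implementation evaluated in this paper: the 10-degree angle bins, the 5 Å distance cap, the symbol encoding, the naive sort-based suffix array construction, and all function names are assumptions chosen only for illustration.

```python
# Illustrative sketch only: bin sizes, the distance cap, and all names are assumed.
ANGLE_BINS = 36   # assumed: 10-degree bins over [-180, 180)
DIST_BINS = 10    # assumed number of distance bins
MAX_DIST = 5.0    # assumed cap (in angstroms) on the C-alpha to C-alpha distance


def normalize(phi, psi, dist):
    """Map (phi, psi, dist) to three small integer bin indices."""
    a = int((phi + 180.0) // 10) % ANGLE_BINS
    b = int((psi + 180.0) // 10) % ANGLE_BINS
    d = min(int(dist / MAX_DIST * DIST_BINS), DIST_BINS - 1)
    return a, b, d


def to_symbol(bins):
    """Combine the three bin indices into one integer symbol."""
    a, b, d = bins
    return (a * ANGLE_BINS + b) * DIST_BINS + d


def encode(features):
    """Convert a protein's list of feature vectors into a symbol sequence."""
    return [to_symbol(normalize(*f)) for f in features]


def build_generalized_suffix_array(sequences):
    """Return (protein_id, offset) pairs sorted by the suffix they denote.

    Naive construction by direct sorting; fine for a demo, too slow for
    a real database.
    """
    suffixes = [(pid, off)
                for pid, seq in enumerate(sequences)
                for off in range(len(seq))]
    suffixes.sort(key=lambda s: sequences[s[0]][s[1]:])
    return suffixes


def search(sequences, sa, query):
    """Binary-search the suffix array for every occurrence of `query`."""
    m = len(query)

    def prefix(i):
        pid, off = sa[i]
        return sequences[pid][off:off + m]

    lo, hi = 0, len(sa)
    while lo < hi:                      # first suffix whose prefix >= query
        mid = (lo + hi) // 2
        if prefix(mid) < query:
            lo = mid + 1
        else:
            hi = mid
    start, hi = lo, len(sa)
    while lo < hi:                      # first suffix whose prefix > query
        mid = (lo + hi) // 2
        if prefix(mid) <= query:
            lo = mid + 1
        else:
            hi = mid
    return sorted(sa[start:lo])


# Toy data: (phi, psi, distance) per residue, with made-up values.
helix = (-60.0, -45.0, 3.8)
sheet = (-120.0, 130.0, 3.8)
proteins = [encode([helix, helix, sheet, helix]),
            encode([sheet, helix, helix, sheet])]
index = build_generalized_suffix_array(proteins)
hits = search(proteins, index, encode([helix, helix, sheet]))
print(hits)  # [(0, 0), (1, 1)]
```

Because nearby angle and distance values fall into the same bin, slightly different residues map to the same symbol, so an exact string match on the symbol sequence acts as an approximate structural match. Each hit is a (protein id, residue offset) pair. The index stores only one such pair per residue, which is the source of the memory advantage a suffix array has over a pointer-based suffix tree.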