Abstract—The amount of information produced by the field of biology has grown manifold and now requires extensive use of computational techniques to manage it. Within this sea of biological information, protein sequence similarity is key to detecting protein evolutionary relationships. Protein sequence similarity typically implies homology, which in turn may imply structural and functional similarities. In this work, we propose a learning method for detecting remote protein homology. The proposed method uses a transformation that converts protein sequences into fixed-dimensional representative feature vectors. Each feature vector records the sensitivity of a protein sequence to a set of amino acid substrings generated from the protein sequences of interest. These features are then used in conjunction with support vector machines to detect remote protein homology. The proposed method is tested and evaluated on two different benchmark protein datasets and is able to deliver improvements over most of the existing homology detection methods.

Keywords—Protein homology detection; support vector machine; string kernel.

I. INTRODUCTION

Recent years have witnessed a consistent surge in sequence information, driven by technological breakthroughs in large-scale sequencing projects. The main challenge facing biologists now is to interpret this newly generated sequence data. One way to achieve this goal is through protein homology detection. Much research has already been done on protein homology detection. Dynamic-programming-based alignment tools such as the Smith-Waterman algorithm [1] and approximations of it such as FASTA [2] and BLAST [3] have been widely used by biologists around the world. Statistical-model-based methods have also been developed, such as profiles [4] and hidden Markov models (HMMs) [5]-[6].
Iterative methods such as PSI-BLAST [7] and SAM [8] improved upon profile-based methods. The SVM-Fisher method [9], which combines an iterative HMM training scheme with the Support Vector Machine (SVM) [10]-[11], is currently among the well-known methods for detecting remote protein homology. Another HMM-based method is the HMM Combining Score (HMMcs) method [12]. HMMcs improves on SVM-Fisher, and both methods are appealing because they combine the rich biological information encoded in a profile HMM with the discriminative power of SVM classifiers. However, training an HMM generally requires a lot of data or prior knowledge [13].

This work was financially supported by the Research Affairs at the UAE University under contract no. 05-01-9-11/05.
Nazar Zaki is an Assistant Professor with the College of Information Technology, United Arab Emirates University (UAEU), Al-Ain 17555, UAE (phone: +971-50-7332135; fax: +971-3-7626309; e-mail: nzaki@uaeu.ac.ae).
Safaai Deris is a Professor with the Faculty of Computer Science and Information Systems, Universiti Teknologi Malaysia, 81310 Skudai, Johor, Malaysia (e-mail: safaai@fsksm.utm.my).

Recently, two string-based methods have been introduced. The first is the mismatch kernel method [13] and the second is the string kernel method designed by Zaki et al. [14]. In the second method, the authors introduced the application of the string kernel (SK) to classifying protein sequences. The string kernel approach has been shown to achieve good performance on text categorization tasks [15]. The basic idea is to compare two protein sequences by looking at common subsequences of a fixed length. These two methods perform well at classifying protein sequences; however, they incorporate no biological information and use no domain knowledge. They treat the protein dataset simply as a long string of text. Another known method is the SVM-Pairwise method developed by Liao et al. [16].
This method represents proteins by their pairwise sequence similarity scores. Its drawback is that the similarity scores are computed over the entire sequence. It would be more meaningful to split the sequence into substrings and then measure the similarity score based on sensitive and non-sensitive regions. In this paper, we combine the advantages of the string kernel with the incorporation of some biological knowledge through SVM-Pairwise concepts. The method uses a transformation that converts protein sequences into fixed-dimensional representative feature vectors, where each feature vector records the sensitivity of a protein sequence to a set of amino acid substrings generated from the protein sequences of interest. We call it the SVM String Scoring (SVM-SS) method.

Detecting Remote Protein Evolutionary Relationships via String Scoring Method
Nazar Zaki and Safaai Deris
International Journal of Biological and Medical Sciences 2:1 2007 59
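To make the transformation described in the introduction concrete, the following minimal Python sketch maps each protein sequence to a fixed-dimensional vector with one entry per substring drawn from the sequences of interest. It is an illustration under stated assumptions, not the authors' implementation: the introduction does not define how "sensitivity" is scored, so the function best_match_score (a hypothetical name) simply counts identical residues at the best ungapped placement of the substring along the sequence. The substring length k, the toy sequences, and all function names are likewise assumptions made for the example.

```python
from typing import Dict, List


def build_vocabulary(sequences: List[str], k: int) -> List[str]:
    """Collect the distinct length-k substrings of all sequences, in first-seen order."""
    vocab: Dict[str, None] = {}
    for seq in sequences:
        for i in range(len(seq) - k + 1):
            vocab.setdefault(seq[i:i + k], None)
    return list(vocab)


def best_match_score(seq: str, sub: str) -> int:
    """ASSUMED scoring, not taken from the paper: slide `sub` along `seq`
    without gaps and return the largest number of identical positions."""
    k = len(sub)
    best = 0
    for i in range(len(seq) - k + 1):
        matches = sum(a == b for a, b in zip(seq[i:i + k], sub))
        best = max(best, matches)
    return best


def feature_vector(seq: str, vocabulary: List[str]) -> List[int]:
    """One score per vocabulary substring, so every sequence gets the same dimension."""
    return [best_match_score(seq, sub) for sub in vocabulary]


# Toy sequences, chosen only for illustration.
toy_sequences = ["MKVLA", "MKILA", "GGSGG"]
toy_vocabulary = build_vocabulary(toy_sequences, k=3)
toy_features = [feature_vector(s, toy_vocabulary) for s in toy_sequences]
```

Because every vector has the same length as the vocabulary, the rows of toy_features could be passed directly to any SVM implementation as training inputs. A substitution-matrix score would be a natural replacement for the identity count if biological similarity between residues should contribute to the score.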