Speeding up Subcellular Localization by Extracting Informative Regions of Protein Sequences for Profile Alignment

Wei Wang, Man-Wai Mak and Sun-Yuan Kung

Abstract: The functions of proteins are closely related to their subcellular locations. In the post-proteomics era, the amount of gene and protein data grows exponentially, which necessitates the prediction of subcellular localization by computational means. This paper proposes mitigating the computational burden of alignment-based approaches to subcellular localization prediction by using the information provided by the N-terminal sorting signals. To this end, a cascaded fusion of cleavage site prediction and profile alignment is proposed. Specifically, the informative segments of protein sequences are identified by a cleavage site predictor. Then, only the informative segments are presented to a homology-based classifier for predicting the subcellular locations. Experimental results on a newly constructed dataset show that the method combines the strengths of both approaches and attains an accuracy higher than that obtained with full-length sequences. Moreover, the method reduces the computation time 20-fold. We believe that the method will be useful for biologists conducting large-scale protein annotation and for bioinformaticians performing preliminary investigations of new algorithms that involve pairwise alignments.

Index Terms: Subcellular localization; cleavage site prediction; profile alignment; protein sequences; support vector machines.

I. INTRODUCTION

A. Motivation of Subcellular Localization Prediction

Prediction of subcellular localization, which involves the computational prediction of where a protein resides in a cell, is a challenging task. Accurate prediction of subcellular locations can assist the prioritization of proteins for downstream analysis and the identification of drug targets.
Because of the rapid increase in the number of sequenced genomes, it is highly desirable to develop effective prediction methods so that newly found proteins can be put to use in drug development. A number of approaches to solving this problem have been proposed in the literature. These methods can be broadly divided into four categories: predictions based on sorting signals [1], [2], [3], [4], [5], global sequence properties [6], [7], [8], [9], homology [10], [11], [12], and other information in addition to sequences [13], [14].

Wei Wang and Man-Wai Mak are with the Department of Electronic and Information Engineering, The Hong Kong Polytechnic University, Hong Kong SAR (email: enmwmak@polyu.edu.hk). Sun-Yuan Kung is with the Department of Electrical Engineering, Princeton University, USA. This project was supported in part by the RGC of HKSAR, project Nos. 5264/09E and 5251/08E.

B. Approaches to Subcellular Localization Prediction

Prediction based on sorting signals determines the localization of proteins via the recognition of their N-terminal sorting signals. These cleavable peptides contain information that directs the protein either to the secretory pathway (in which case they are called signal peptides) or to mitochondria and chloroplasts (in which case they are called transit peptides). PSORT [1] and its extension WoLF PSORT [2], [3] are among the early methods that use the N-terminal information. PSORT is a knowledge-based program for predicting protein localization, and WoLF PSORT uses the information contained in sorting signals, amino acid composition and functional motifs to convert amino acid sequences into numerical localization features. More recent predictors such as TargetP [4], [5] use hidden Markov models and neural networks to learn the relationship between the subcellular locations and amino acid sequences.
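To make the idea of turning N-terminal information into numerical features concrete, the following is a minimal sketch and not the feature set of PSORT, WoLF PSORT or TargetP. It computes the amino acid composition of a fixed-length N-terminal segment. The function name `n_terminal_composition`, the default segment length of 30 residues and the toy sequence are all assumptions made for illustration.

```python
from collections import Counter
from typing import List

# The 20 standard amino acids, in a fixed order that defines the feature layout.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def n_terminal_composition(sequence: str, segment_length: int = 30) -> List[float]:
    """Return the 20-dimensional amino acid composition of the N-terminal segment.

    `segment_length` is a hypothetical parameter; real predictors choose the
    region of interest in their own ways (for example, via a predicted
    cleavage site).
    """
    # Keep only the first `segment_length` residues (the N-terminal end).
    segment = sequence[:segment_length].upper()
    # Count standard residues only; ambiguous symbols such as 'X' are ignored.
    counts = Counter(res for res in segment if res in AMINO_ACIDS)
    total = sum(counts.values())
    if total == 0:
        # Empty or fully non-standard segment: return an all-zero vector.
        return [0.0] * len(AMINO_ACIDS)
    # Relative frequency of each amino acid, in the order of AMINO_ACIDS.
    return [counts[aa] / total for aa in AMINO_ACIDS]


# Toy sequence (not a real protein): a short N-terminal stretch plus an acidic tail.
toy_sequence = "MKKLLAAALLSS" + "DEDEDEDE" * 5
features = n_terminal_composition(toy_sequence, segment_length=12)
```

A vector of this kind could be passed to any standard classifier. Restricting the computation to the N-terminal segment, rather than the full sequence, is the design choice being illustrated here.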
The second group of prediction methods is based on the fact that proteins of different subcellular compartments differ in global properties, such as their amino acid composition. One of the early studies that use amino acid composition is SubLoc [6]. This method converts full-length protein sequences into 20-dimensional amino acid composition vectors for classification by support vector machines. To incorporate the information of sequence order into the global properties, amino acid composition has been extended to amino-acid pair (dipeptide) compositions [7] and gapped amino-acid pair compositions [8]. One advantage of using global sequence properties is that genomic or EST (Expressed Sequence Tag) sequences without the N-terminus can be handled. It has been found that a simple odds-ratio statistic based on amino-acid composition and residue-pair frequencies can be used to discriminate between soluble intracellular and extracellular proteins [9].

The third group of prediction methods is based on the knowledge that homologs often share the same subcellular compartment. Given a query sequence, these methods use the sequence to search against databases for homologs [10], [11] and predict its subcellular location as the one to which the homologs belong. For example, Mak et al. [12] proposed a predictor called PairProSVM in which the profile of an unknown sequence is aligned with the profile of every training sequence to form a score vector for classification by support vector machines. It was found that profile alignment is more sensitive to the weak similarity between protein families than sequence alignment.

Some predictors not only use peptide sequences as input but also require extra information such as lexical context in database entries [13] or Gene Ontology entries [14]. Although studies have shown that this type of method can outperform sequence-based methods, the performance has