Biochemical and Biophysical Research Communications 305 (2003) 407–411
www.elsevier.com/locate/ybbrc

Nearest neighbour algorithm for predicting protein subcellular location by combining functional domain composition and pseudo-amino acid composition

Yu-Dong Cai a,* and Kuo-Chen Chou b

a Shanghai Research Centre of Biotechnology, Chinese Academy of Sciences, Shanghai 200233, China
b Upjohn Laboratories, Pfizer, Kalamazoo, MI 49007-4940, USA

* Corresponding author. Present address: Biomolecular Sciences Department, UMIST, P.O. Box 88, Manchester, M60 1QD, UK. Fax: +44-161-236-0409. E-mail address: y.cai@umist.ac.uk (Y.-D. Cai).

Received 21 April 2003

0006-291X/03/$ - see front matter. doi:10.1016/S0006-291X(03)00775-7

Abstract

In this paper, the Nearest Neighbour Algorithm (NNA) was developed for predicting protein subcellular location, based on an approach that combines the “functional domain composition” [K.C. Chou, Y.D. Cai, J. Biol. Chem. 277 (2002) 45765] with the pseudo-amino acid composition [K.C. Chou, Proteins Struct. Funct. Genet. 43 (2001) 246; Correction Proteins Struct. Funct. Genet. 44 (2001) 60]. Very high success rates were observed, suggesting that such a hybrid approach may become a useful high-throughput tool in the area of bioinformatics and proteomics.

© 2003 Elsevier Science (USA). All rights reserved.

Keywords: Hybrid algorithm; Protein subcellular location; Functional domain composition; Pseudo-amino acid composition; Bioinformatics; Proteomics

Given the sequence of a protein, how can one predict which subcellular location it belongs to? Because the localization of a protein in a cell is closely correlated with its biological function, and because the number of sequences entering databanks has been increasing rapidly, the importance of the problem has become self-evident. In particular, many more new protein sequences are expected to be derived soon because of the recent success of the human genome project, which has provided an enormous amount of genomic information in the form of three billion base pairs, assembled into tens of thousands of genes. The challenge of addressing this problem will therefore become even more urgent and critical in the very near future.

In fact, many efforts have been made to develop computational methods for quickly predicting the subcellular locations of proteins [1–14]. Of these methods, some [1,2] are based on the N-terminal sorting signals. Their merit is a clear biological implication [15]. However, as pointed out by Reinhardt and Hubbard [5], “In large genome analysis projects genes are usually automatically assigned and these assignments are often unreliable for the 5′-regions” and “This can lead to leader sequences being missing or only partially included, thereby causing problems for prediction algorithms depending on them.” Therefore, most of the existing algorithms are based on information derived from entire protein sequences rather than from their signal peptides alone. However, because of the difficulty posed by the extreme variance in sequence order and length, the majority of these algorithms are actually based on the amino acid composition of an entire protein chain. According to the classical definition, the amino acid composition consists of 20 components, representing the occurrence frequency of each of the 20 native amino acids in a given protein, and hence a protein is represented by a 20D (dimensional) vector [16,17]. Obviously, if the classical amino acid composition alone is used to represent a protein, all the sequence-order and sequence-length effects would be missed and the prediction method underlain with