Journal of Cellular Biochemistry 91:1197–1203 (2004)

Predicting Subcellular Localization of Proteins by Hybridizing Functional Domain Composition and Pseudo-Amino Acid Composition

Kuo-Chen Chou 1,2* and Yu-Dong Cai 3,4

1 Gordon Life Science Institute, Torrey Del Mar Drive, San Diego, California 92130
2 Tianjin Institute of Bioinformatics and Drug Discovery (TIBDD), Tianjin, China
3 Shanghai Research Centre of Biotechnology, Chinese Academy of Sciences, Shanghai 200233, China
4 Biomolecular Science Department, UMIST, P.O. Box 88, Manchester, M60 1QD, UK

Abstract

Recent advances in large-scale genome sequencing have led to the rapid accumulation of amino acid sequences of proteins whose functions are unknown. Since the functions of these proteins are closely correlated with their subcellular localizations, many efforts have been made to develop methods for predicting protein subcellular location. In this study, based on the strategy of hybridizing the functional domain composition and the pseudo-amino acid composition (Cai and Chou [2003]: Biochem. Biophys. Res. Commun. 305:407–411), the Intimate Sorting Algorithm (ISort predictor) was developed for predicting protein subcellular location. As a demonstration, the same plant and non-plant protein datasets studied by previous investigators were used. The overall success rate by the jackknife test for the plant protein dataset was 85.4%, and that for the non-plant protein dataset was 91.9%. These are so far the highest success rates achieved for the two datasets by following a rigorous cross-validation test procedure, further confirming that such a hybrid approach may become a very useful high-throughput tool in bioinformatics, proteomics, and molecular cell biology. J. Cell. Biochem. 91: 1197–1203, 2004. © 2004 Wiley-Liss, Inc.
Key words: Intimate Sorting Algorithm; protein subcellular location; functional domain composition; pseudo-amino acid composition; InterPro database

Given the sequence of a protein, how can we predict which subcellular location it belongs to? This is currently a very active topic in molecular and cell biology because the localization of a protein in a cell is closely correlated with its biological function (see, e.g., [Watson et al., 1987; Alberts et al., 1994; Lodish et al., 1995; Chou et al., 1998, 1999]). Also, the number of protein sequences entering databanks has been increasing rapidly. It is anticipated that many more new protein sequences will be derived soon owing to the recent success of the human genome project, which has provided an enormous amount of genomic information in the form of 3 billion bp, assembled into tens of thousands of genes. In view of this, the importance of dealing with such a problem is self-evident, and the challenge will become even more critical and urgent in the near future.

Indeed, many efforts have been made to develop computational methods for rapidly predicting the subcellular locations of proteins [Nakai and Kanehisa, 1992; Nakashima and Nishikawa, 1994; Cedano et al., 1997; Claros et al., 1997; Reinhardt and Hubbard, 1998; Chou and Elrod, 1999; Emanuelsson et al., 2000; Pan et al., 2003; Zhou and Doctor, 2003]. Of these methods, some [Nakai and Kanehisa, 1992; Claros et al., 1997] were based on the N-terminal sorting signals. Their merit is a clear biological basis: newly synthesized proteins in vivo are directed to their destination by an intrinsic signal sequence, whether they are to pass through a membrane into a particular organelle, to become integrated into the membrane, or to be exported out of the cell [Blobel, 1976]. However, as pointed out by Reinhardt and Hubbard [1998], “in large
*Correspondence to: Kuo-Chen Chou, Gordon Life Science Institute, 13784 Torrey Del Mar Drive, San Diego, CA 92130. E-mail: lifescience@san.rr.com
Received 12 October 2003; Accepted 24 October 2003
DOI 10.1002/jcb.10790