R3P-Loc: A compact multi-label predictor using ridge regression and random projection for protein subcellular localization

Shibiao Wan a, Man-Wai Mak a,∗, Sun-Yuan Kung b

a Department of Electronic and Information Engineering, The Hong Kong Polytechnic University, Hong Kong SAR, China
b Department of Electrical Engineering, Princeton University, New Jersey, USA

Abstract

Locating proteins within cellular contexts is of paramount significance in elucidating their biological functions. Computational methods based on knowledge databases (such as the gene ontology annotation (GOA) database) are known to be more efficient than sequence-based methods. However, knowledge-based methods typically face three problems: (1) knowledge databases are enormous and growing exponentially, (2) knowledge databases contain redundant information, and (3) the number of features extracted from knowledge databases is much larger than the number of data samples with ground-truth labels. These properties make the extracted features liable to contain redundant or irrelevant information, causing prediction systems to suffer from overfitting. To address these problems, this paper proposes an efficient multi-label predictor, namely R3P-Loc, which uses two compact databases for feature extraction and applies random projection (RP) to reduce the feature dimensions of an ensemble ridge regression (RR) classifier. Two new compact databases are created from the Swiss-Prot and GOA databases. These databases possess almost the same amount of information as their full-size counterparts but are much smaller. Experimental results on two recent datasets (eukaryote and plant) suggest that R3P-Loc can reduce the feature dimensions sevenfold and significantly outperform state-of-the-art predictors. This paper also demonstrates that the compact databases reduce memory consumption by a factor of 39 without degrading prediction accuracy.
For readers’ convenience, the R3P-Loc server is available online at http://bioinfo.eie.polyu.edu.hk/R3PLocServer/.

Keywords: Multi-location proteins; Compact databases; Protein subcellular localization; Random projection; Multi-label classification.

∗ Corresponding author
Email addresses: 10900600r@connect.polyu.hk (Shibiao Wan), enmwmak@polyu.edu.hk (Man-Wai Mak), kung@princeton.edu (Sun-Yuan Kung)

1. Introduction

Most eukaryotic proteins are synthesized in the cytosol and must be transported to the correct spatiotemporal cellular contexts to perform their biological functions. Knowledge of protein subcellular localization helps biologists elucidate the functions of proteins and identify drug targets [1, 2]. Mislocalization of proteins within cells may lead to a broad range of human diseases, such as breast cancer [3], kidney stones [4], Alzheimer’s disease [5], Bartter syndrome [6], primary human liver tumors [7], minor salivary gland tumors [8] and pre-eclampsia [9]. Conventionally, high-quality localization databases are obtained by wet-lab experiments such as cell fractionation, fluorescent microscopy imaging and electron microscopy, which are also regarded as the gold standard for validating subcellular localization. These methods, however, are laborious and costly, especially given the avalanche of newly discovered protein sequences in the post-genomic era. Therefore, computational methods are required to assist biologists in large-scale protein subcellular localization.

Recent decades have witnessed remarkable progress in computational methods for predicting the subcellular localization of proteins. These methods can be roughly divided into sequence-based and knowledge-based methods.
Sequence-based methods include: (1) sorting-signal-based methods [10, 11, 12], which use, for example, signal peptides identified by signal peptide predictors such as Signal-CF [13] and Signal-3L [14]; (2) amino-acid composition-based methods [15, 16, 17, 18, 19, 20]; and (3) homology-based methods [21, 22, 23]. Knowledge-based methods use information from knowledge databases, such as Gene Ontology (GO)¹ terms [24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34], Swiss-Prot keywords [35, 36], functional domains [37], or PubMed abstracts [38, 39]. Among them, GO-based methods have been shown to be superior to methods based on other features [27, 40, 41, 42].

Because some proteins can exist in more than one organelle in a cell [43, 44, 45, 46], recent research has focused on predicting both single- and multi-location proteins. In fact, multi-location proteins play important roles in some metabolic processes that take place in more than one cellular compartment, e.g., fatty acid β-oxidation in the peroxisome and mitochondria, and antioxidant defense in the cytosol, mitochondria and peroxisome [47].

¹ http://www.geneontology.org

Preprint submitted to xxxx Journal June 23, 2014