Predicting Nuclear Localization

John Hawkins, Lynne Davis, Mikael Bodén

ARC Centre for Complex Systems
School of Information Technology and Electrical Engineering
University of Queensland, QLD 4072, Australia.

January 11, 2007

Abstract

Nuclear localization of proteins is a crucial element in the dynamic life of the cell. It is complicated by the massive diversity of targeting signals and the existence of proteins that shuttle between the nucleus and cytoplasm. In spite of this, the majority of subcellular localization tools that predict nuclear proteins have been developed without involving dual localized proteins in the data sets. Hence, in general, the existing models are focused on predicting statically nuclear proteins, rather than nuclear localization itself. We present an independent analysis of existing nuclear localization predictors, using a non-redundant data set extracted from Swiss-Prot R50.0. We demonstrate that accuracy on truly novel proteins is lower than the previous estimations, and that existing models generalize poorly to dual localized proteins. We develop a model trained to identify nuclear proteins including dual localized proteins. The results suggest that using more recent data and including dual localized proteins improves the overall prediction. The final predictor Nucleo operates with a realistic success rate of 0.70 and a correlation coefficient of 0.38, as established on the independent test set.

Nucleo is available at: http://pprowler.itee.uq.edu.au
Contact: jhawkins@itee.uq.edu.au

1 Introduction

An essential feature of eukaryotic cells is the segregation of the genetic material from the rest of the cell via a nuclear membrane. The separation allows the cell to regulate those molecules that can interact with the genome through the process of membrane transport.
However, due to the role the nucleus plays in information processing, the transport mechanism itself must accommodate the import of housekeeping proteins, a process for exporting RNA, and a process by which information about the changing environment of the cell can be imported to affect the transcription of the genome [1]. Thus nuclear localization is much more than the mere functional compartmentalization that is observed in most subcellular localization processes. Nuclear localization is a complicated set of processes that play a crucial role in the dynamic self-regulation of the cell [2].

At present there are only two prediction services designed specifically to identify proteins imported into the nucleus: PredictNLS [3] and NucPred [4]. In addition to these specialized models, there are a number of general purpose subcellular localization predictors that include the nucleus in their list of targets. Hence, in order to investigate our current capacity to identify nuclear proteins and to benchmark our own predictor extensively, we include these general purpose predictors.

In this study we develop a model based on a Support Vector Machine (SVM) with a custom kernel to identify proteins that are localized to the nucleus, either temporarily or permanently. The kernel employs a composite spectrum (or multiple k-mer) encoding conjoined with a bit vector indicating the presence or absence of a range of sequence motifs known to be important for nuclear proteins. The model is evaluated and compared against the existing suite of nuclear localization predictors, using an independent non-redundant data set.
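To make the encoding concrete, the following is a minimal sketch of a composite spectrum representation joined with a motif bit vector, and of a linear kernel computed over it. It is an illustration under stated assumptions, not the Nucleo implementation: the choice of k values (1 and 2), the unweighted dot product, and the two regular expressions in `MOTIFS` are hypothetical stand-ins, since the k-mer lengths, kernel weighting, and motif set are not specified in this section.

```python
import re
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

# Hypothetical motif patterns, for illustration only. They loosely resemble
# basic-residue clusters but are NOT the motif set used by the authors.
MOTIFS = {
    "basic_run": re.compile(r"[KR]{4}"),
    "bipartite_like": re.compile(r"[KR]{2}.{10,12}[KR]{3}"),
}


def spectrum_features(seq, k):
    """Count occurrences of every k-mer over the 20 standard amino acids."""
    counts = {"".join(p): 0 for p in product(AMINO_ACIDS, repeat=k)}
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        if kmer in counts:  # k-mers with non-standard residues are skipped
            counts[kmer] += 1
    return [counts[key] for key in sorted(counts)]


def motif_bits(seq):
    """Return one bit per motif: 1 if the pattern occurs in seq, else 0."""
    return [1 if MOTIFS[name].search(seq) else 0 for name in sorted(MOTIFS)]


def encode(seq, ks=(1, 2)):
    """Concatenate the k-mer spectra for each k with the motif bit vector."""
    vec = []
    for k in ks:
        vec.extend(spectrum_features(seq, k))
    vec.extend(motif_bits(seq))
    return vec


def kernel(a, b, ks=(1, 2)):
    """Linear kernel: dot product of the two encoded feature vectors."""
    return sum(x * y for x, y in zip(encode(a, ks), encode(b, ks)))
```

A Gram matrix built by evaluating `kernel` on all pairs of training sequences could then be passed to any SVM implementation that accepts precomputed kernels. Enumerating every possible k-mer is practical only for small k; for larger k a sparse count dictionary would be the more natural choice.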