ORIGINAL ARTICLE

Aristoklis D. Anastasiadis · George D. Magoulas

Analysing the localisation sites of proteins through neural networks ensembles

Received: 21 July 2004 / Accepted: 10 January 2006
© Springer-Verlag London Limited 2006

Abstract Scientists involved in the area of proteomics are currently seeking integrated, customised and validated research solutions to expedite their work in proteomics analyses and drug discoveries. Some drugs and most of their cell targets are proteins, because proteins dictate biological phenotype. In this context, the automated analysis of protein localisation is more complex than the automated analysis of DNA sequences; nevertheless, the benefits to be derived are of the same or greater importance. To achieve this, choosing the right kind of method for these applications is crucial, especially when the data set is drastically imbalanced. In this paper we investigate the performance of some commonly used classifiers, such as K nearest neighbours and feed-forward neural networks with and without cross-validation, on a class of imbalanced problems from the bioinformatics domain. Furthermore, we construct ensemble-based schemes using the notion of diversity, and we empirically test their performance on the same problems. The experimental results favour neural network ensembles, as these achieve good generalisation and a significant improvement over single-classifier methods.

Keywords Feedforward neural networks · Neural ensembles · Protein localisation · Imbalanced datasets · K-nearest neighbour

1 Introduction

The ability to link known proteins through sequence similarity and cellular localisation is becoming increasingly important to accompany information on protein primary sequence and tertiary structure.
In particular, the study of protein localisation (in order to function properly, proteins must be transported to various locations within a particular cell) is considered very useful in the post-genomics and proteomics era, as it provides information about each protein that is complementary to the protein sequence and structure data [1].

Two of the most thoroughly studied single-cell organisms are the bacterium Escherichia coli (E. coli) and the brewer’s yeast Saccharomyces cerevisiae (S. cerevisiae). Both organisms use mainstream metabolic pathways, the majority of which are recognisably similar to the corresponding metabolic functions in all life forms, including higher eukaryotes. The entire genome sequence has been determined for both organisms. The relations between genetics and biochemistry, which constitute the fundamental processes of life in these single-cell organisms, serve as a foundation for ongoing investigations into the processes that operate in the more complex, higher forms of life [2]. E. coli and S. cerevisiae are both well-characterised organisms and can be efficiently manipulated genetically. They have a rapid growth rate and very simple nutritional requirements. Many studies have detailed both gene and protein expression of these organisms [3, 4]. Recently a neuro-fuzzy approach for functional genomics has been proposed. More precisely, the objective of this approach was to learn and predict the functional classes of the E. coli genes [5]. In this study, we are interested in a different problem, namely the prediction of the localisation sites of proteins in organisms such as E. coli and S. cerevisiae.

The first approach for predicting the localisation sites of proteins from their amino acid sequences was an expert system developed by Nakai and Kanehisa [6, 7]. Later, expert-identified features were combined with a probabilistic model, which could learn its parameters from a set of training data [8].
A. D. Anastasiadis (✉) · G. D. Magoulas
School of Computer Science and Information Systems, Birkbeck College, University of London, Malet Street, London, WC1E 7HX, UK
E-mail: gmagoulas@dcs.bbk.ac.uk
E-mail: aris@dcs.bbk.ac.uk
Tel.: +44-1895-203397
Fax: +44-1895-211686

Neural Comput & Applic (2006)
DOI 10.1007/s00521-006-0029-y

Better prediction accuracy has been achieved by using standard classification algorithms, such as K nearest neighbours (KNN), binary