MultiNNProm: A Multi-Classifier System for Finding Genes

Romesh Ranawana, Vasile Palade
University of Oxford, Computing Laboratory, Oxford, OX12HY, United Kingdom

Abstract. The computational identification of genes in DNA sequences has become an issue of crucial importance due to the large number of DNA molecules being currently sequenced. We present a novel neural network based multi-classifier system, MultiNNProm, for the identification of promoter regions in E.Coli¹ DNA sequences. The DNA sequences were encoded using four different encoding methods and were used to train four different neural networks. The classification results of these neural networks were then aggregated using a variation of the LOP method. The aggregating weights used within the modified LOP aggregating algorithm were obtained through a genetic algorithm. We show that the use of different neural networks, trained on the same set of data, could provide slightly varying results if the data were differently encoded. We also show that the combination of more neural classifiers provides us with better accuracy than the individual networks.

Keywords: Multi-Classifier Systems, Neural-Networks, E.Coli, Promoters, Genetic Algorithms

1 Introduction

The recognition of genes in strings of DNA has become an important issue due to the vast amount of gene sequences being currently produced around the world. Once a DNA molecule has been sequenced, it is necessary to first identify the genes and protein coding regions within the sequence, and then understand their functionality. It is both economically and practically impossible for these steps to be conducted manually, and thus, the development of accurate and efficient computational algorithms for the recognition of genes in DNA sequences has become an urgent necessity. It is impractical to design traditional algorithms for the recognition of genes within these sequences due to the complexity of the task at hand.

The main goal of gene classification is generalization, that is, the ability to induce a concept that accurately classifies genes not included in the training set. The difficulty in achieving good generalization lies in the fact that a learner cannot measure its generalization ability directly; instead, it must rely on its inductive bias to produce an accurate classifier. Machine learning approaches offer promising solutions to such complex problems because of their ability to identify patterns and genetic concepts automatically. Some of the more successful examples of gene recognition methods include the use of neural networks (Snyder and Stormo, 1995; Uberbacher and Mural, 1991; Ma et al., 2002), Decision Trees (Salzberg et al., 1997), Genetic Programming (Koza and Andre, 1995), Hidden Markov Models (Henderson et al., 1997; Kulp et al., 1996; Krogh, 1996; Birney, 2001) and Fuzzy Logic (Ohno-Machado et al., 1998; Woolf and Yang, 2000).

Promoter regions are known to be situated immediately before genes. Thus, the successful identification of a promoter leads to the identification of the starting position of a gene. We propose a neural network based multi-classifier system, MultiNNProm, for the identification of these promoter sites in E.Coli¹ DNA sequences. The proposed system contains four neural networks, to which the same DNA sequence is presented using four different encoding methods. The outputs of the individual neural networks are then passed through a probability function and finally aggregated by a combination function. This combination function accumulates the results to decide whether the presented sequence is an E.Coli promoter or not.

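The pipeline described above (one sequence, several encodings, one classifier per encoding, a weighted combination of the outputs) can be illustrated with a minimal sketch. It assumes that LOP refers to the logarithmic opinion pool and shows only its standard weighted form, not the variation used in MultiNNProm. The two encoding tables, the function names and the fixed example weights are hypothetical: this section does not specify the four encodings, and in the proposed system the weights come from a genetic algorithm rather than being set by hand.

```python
import math

# Hypothetical encodings for illustration only; the four encodings used by
# MultiNNProm are not specified in this section.
ONE_HOT = {"A": (1, 0, 0, 0), "C": (0, 1, 0, 0), "G": (0, 0, 1, 0), "T": (0, 0, 0, 1)}
TWO_BIT = {"A": (0, 0), "C": (0, 1), "G": (1, 0), "T": (1, 1)}


def encode(sequence, table):
    """Flatten a DNA string into a numeric input vector using a lookup table."""
    return [value for base in sequence.upper() for value in table[base]]


def log_opinion_pool(probabilities, weights, eps=1e-12):
    """Combine per-classifier promoter probabilities with a weighted
    logarithmic opinion pool (standard form, assumed here).

    probabilities[i] is classifier i's estimate that the sequence is a
    promoter; weights[i] is its aggregation weight. Returns the pooled
    promoter probability, normalised over the two classes.
    """
    if len(probabilities) != len(weights):
        raise ValueError("one weight is required per classifier")
    # Weighted sums of log-probabilities for each class; eps guards log(0).
    log_pos = sum(w * math.log(max(p, eps)) for p, w in zip(probabilities, weights))
    log_neg = sum(w * math.log(max(1.0 - p, eps)) for p, w in zip(probabilities, weights))
    # Subtract the maximum before exponentiating for numerical stability.
    shift = max(log_pos, log_neg)
    pos = math.exp(log_pos - shift)
    neg = math.exp(log_neg - shift)
    return pos / (pos + neg)


def is_promoter(probabilities, weights, threshold=0.5):
    """Final decision: promoter if the pooled probability reaches the threshold."""
    return log_opinion_pool(probabilities, weights) >= threshold


if __name__ == "__main__":
    # Same sequence, two different input representations.
    print(len(encode("ACGT", ONE_HOT)), len(encode("ACGT", TWO_BIT)))
    # Made-up outputs of four classifiers and equal hand-set weights.
    outputs = [0.9, 0.8, 0.7, 0.6]
    weights = [0.25, 0.25, 0.25, 0.25]
    print(is_promoter(outputs, weights))
```

Because the pool works on log-probabilities, a single classifier that is very confident a sequence is not a promoter pulls the combined estimate down more strongly than it would under a plain weighted average, which is one reason the choice of weights matters.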
¹ Escherichia coli: a species of bacterium normally present in the intestinal tract of humans and other animals; it is sometimes pathogenic and can be a threat to food safety.