The Influence of Noisy Patterns in the Performance of Learning Methods in the Splice Junction Recognition Problem
Ana C. Lorena, Gustavo E. A. P. A. Batista, André C. P. L. F. de Carvalho, Maria C. Monard
Universidade de São Paulo, Instituto de Ciências Matemáticas e de Computação, Laboratório de Inteligência Computacional
Av. Trabalhador São Carlense, 400, São Carlos, Brazil
{aclorena, gbatista, andre, mcmonard}@icmc.usp.br
Abstract
Since the beginning of the Human Genome Project, which aims at sequencing all human genetic information, a large amount of sequence data has been generated. Much attention is now given to the analysis of this data. A large part of this analysis is carried out with intelligent computational techniques. However, many genetic databases are characterized by the presence of noisy data, which can degrade the performance of the computational techniques applied. This work studies the influence of noisy data on the training of three different learning methods: Decision Trees, Artificial Neural Networks and Support Vector Machines. The task investigated is the recognition of splice junctions in DNA sequences, which is part of the gene identification problem. Results indicate that eliminating noisy patterns from the dataset can improve the learning algorithms' performance, with no significant reduction in their generalization ability.
1. Introduction
The use of computational techniques to solve biological problems has been explored in several works. Historical contributions, like the MYCIN system [4], used in the diagnosis of infectious diseases and the prescription of treatments for them, were achieved through this interdisciplinary cooperation.
Nowadays, many of the computational applications in the Biology domain are dedicated to the analysis of genetic data. This trend culminated in the emergence of a new research field, Bioinformatics.
Bioinformatics can be defined as the application of computational, mathematical and statistical techniques to the management of biological data [1]. Interest in this area has intensified with the efforts involved in the Human Genome Project, which aims at sequencing the entire human genome (its genetic information). In parallel, the genomes of other selected species are also being sequenced, to help in understanding the human genetic information.
This process is creating a large amount of data that needs to be analyzed. Performing this task in laboratories is laborious and costly. The use of computational techniques may largely reduce the cost and effort associated with these tasks. Intelligent algorithms from Machine Learning (ML) are among the most frequently used techniques [6].
This work investigates the use of three different ML techniques, Decision Trees [11], Artificial Neural Networks [8] and Support Vector Machines [12], for the splice junction recognition problem, which is part of the gene identification task.
Since many of the genetic databases generated are characterized by the presence of noisy patterns [7], the influence of these cases on the learning process is also evaluated. To this end, patterns considered less reliable (possibly noisy) were carefully removed using the Tomek links heuristic [18]. Although ML techniques are expected to cope with noisy data, the presence of such patterns can affect the algorithms' induction of the target hypothesis [10].
This work is organized as follows: Section 2 gives a brief overview of the Bioinformatics field and the main concepts necessary for understanding the computational applications in this domain. Section 3 describes the splice junction recognition problem. Section 4 describes the data preprocessing employed in this work. Section 5 presents the learning methods considered.
Section 6 describes the experiments conducted, and the results obtained are presented in Section 7. Section 8 analyzes the results presented and concludes this paper.