Mining Several Databases with an Ensemble of Classifiers

Seppo Puuronen
University of Jyväskylä, P. O. Box 35, FIN-40351 Jyväskylä, Finland
phone: +358 14 603 028; fax: +358 14 603 011
e-mail: sepi@jytko.jyu.fi

Vagan Terziyan
Kharkov State Technical University of Radioelectronics, 14 Lenin Avenue, 310166 Kharkov, Ukraine
e-mail: vagan@kture.cit-ua.net

Alexander Logvinovsky
Kharkov State Technical University of Radioelectronics, 14 Lenin Avenue, 310166 Kharkov, Ukraine
e-mail: vagan@kture.cit-ua.net

Abstract

The results of knowledge discovery in databases can vary depending on the data mining method used. There are several ways to select the most appropriate data mining method dynamically. One proposed method clusters the whole domain area into competence areas of the methods. A metamethod is then used to decide which data mining method should be used with each database instance. However, when knowledge is extracted from several databases, knowledge discovery may produce conflicting results even if the separate databases are consistent. At least two types of conflicts may arise. The first type is created by data inconsistency within the area of the intersection of the databases. The second type is created when the metamethod selects different data mining methods with inconsistent competence maps for the objects of the intersected part. We analyze these two types of conflicts and their combinations and suggest ways to handle them.

1. Introduction

Modern database technology enables the storage of huge amounts of data, but it does not yet offer high-level support for analyzing, understanding, or visualizing the stored data in intelligent ways. Data mining is a nontrivial knowledge extraction process that reveals valid, previously unknown, and comprehensive information from databases by discovering useful patterns that are not directly obvious to the user [4].
Research on knowledge discovery in databases has grown rapidly in recent years, producing several new data mining methods and techniques for their evaluation, learning, and integration [3,5,6,9,10,11]. Some of these methods are static and do not analyze a new instance in its context. Dynamic data mining methods, such as those in [8,13,14], take into account the context of a new instance and can even benefit from an ensemble of classifiers. Most data mining algorithms assume a single data set, but practitioners working on real-world applications usually have to discover knowledge from several databases [7]. Knowledge discovery can thus be seen as the task of applying several data mining methods to several databases. The brute-force application of statistical methods does not help, because such methods are unable to take into account their own context and the context of the knowledge they are processing. We develop methods that can take into account inconsistencies of numerical data when the most appropriate method or methods are chosen.

One technique for handling a single database using dynamic integration of multiple classifiers is proposed in [12] and developed further in [13]. The technique consists of two phases: the training phase and the classification phase. During the training phase, the characteristics and classifications of the training instances are collected using the Jackknife method into the performance matrix $Q_{n \times m}$, where $n$ is the number of training instances, $m$ is the number of classification methods, and $q_{ji}$ is equal to 1 if the classification produced by method $i$ for instance $j$ is incorrect and 0 otherwise. In the classification phase, a new instance is classified using the weighted $k$-nearest neighbor algorithm. The weights of the $k$ nearest neighbors are calculated as a function of the distances between the neighbors and the new instance, and the most appropriate method is selected based on the weights and the values of the matrix $Q_{n \times m}$.
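The two phases described above can be illustrated with a minimal Python sketch. It is not the implementation from [12] or [13]. Several details are assumptions made only for illustration: the Jackknife is interpreted as leave-one-out, the neighbor weights are taken as inverse distances, the selected method is the one with the lowest weighted error among the neighbors, and the two toy classifiers (`nearest_centroid`, `one_nn`) as well as all function names are hypothetical.

```python
import math
from collections import defaultdict

def nearest_centroid(X_train, y_train, x):
    """Toy classifier (hypothetical): label of the closest class mean."""
    groups = defaultdict(list)
    for xi, yi in zip(X_train, y_train):
        groups[yi].append(xi)
    centroids = {c: [sum(col) / len(pts) for col in zip(*pts)]
                 for c, pts in groups.items()}
    return min(centroids, key=lambda c: math.dist(centroids[c], x))

def one_nn(X_train, y_train, x):
    """Toy classifier (hypothetical): label of the nearest training instance."""
    j = min(range(len(X_train)), key=lambda t: math.dist(X_train[t], x))
    return y_train[j]

def performance_matrix(X, y, methods):
    """Training phase: Q[j][i] = 1 if method i misclassifies instance j
    when trained on all other instances (leave-one-out, an assumption)."""
    Q = []
    for j in range(len(X)):
        X_rest, y_rest = X[:j] + X[j + 1:], y[:j] + y[j + 1:]
        Q.append([int(method(X_rest, y_rest, X[j]) != y[j])
                  for method in methods])
    return Q

def select_method(X, Q, x_new, k=3, eps=1e-9):
    """Classification phase: pick the method with the lowest
    distance-weighted error over the k nearest training instances.
    Inverse-distance weights are an assumption of this sketch."""
    dists = [math.dist(xi, x_new) for xi in X]
    neighbors = sorted(range(len(X)), key=lambda j: dists[j])[:k]
    weights = [1.0 / (dists[j] + eps) for j in neighbors]
    total = sum(weights)
    errors = [sum(w * Q[j][i] for w, j in zip(weights, neighbors)) / total
              for i in range(len(Q[0]))]
    best = min(range(len(errors)), key=lambda i: errors[i])
    return best, errors
```

The new instance would then be classified by the selected method, for example `methods[best](X, y, x_new)`. Other weighting functions and selection rules fit the same structure.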
In a way, the performance matrix is used as a competence map of the classifiers included in the ensemble.

The above technique is designed for a single database and cannot be applied as such to several databases, because it cannot handle conflicts arising within the intersecting areas of the databases. There are at least two types of problematic conflicts. The first type arises when different classification results are caused by data inconsistency within the intersecting part of the databases. The second type arises when the technique selects different data mining methods (classifiers) for the instances of the intersected part and the competence maps of the classification methods are inconsistent.
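To make the two conflict types concrete, the following Python sketch flags them for the instances shared by two databases. It is only one possible reading of the description above, under stated assumptions: each database is a hypothetical dictionary mapping an instance key to an `(attributes, class_label)` record, the intersection is the set of shared keys, and `select_a` and `select_b` are hypothetical callables returning the method chosen for an instance in each database. A disagreement between the selected methods is treated only as a candidate for the second type, since the sketch does not compare the competence maps themselves.

```python
def find_conflicts(db_a, db_b, select_a, select_b):
    """Return {key: [conflict kinds]} for instances present in both databases.

    "data"   : the stored records differ (candidate first-type conflict).
    "method" : different methods are selected (candidate second-type conflict).
    """
    conflicts = {}
    for key in sorted(db_a.keys() & db_b.keys()):
        kinds = []
        if db_a[key] != db_b[key]:
            kinds.append("data")
        if select_a(key) != select_b(key):
            kinds.append("method")
        if kinds:
            conflicts[key] = kinds
    return conflicts
```

An instance may be flagged with both kinds at once, which corresponds to a combination of the two conflict types.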