Aggregating Multiple Instances in Relational Database Using Semi-Supervised Genetic Algorithm-based Clustering Technique

Rayner Alfred 1,2 and Dimitar Kazakov 1

1 York University, Computer Science Department, Heslington YO105DD York, England
2 On Study Leave from Universiti Malaysia Sabah, School of Engineering and Information Technology, 88999, Kota Kinabalu, Sabah, Malaysia
{ralfred, kazakov}@cs.york.ac.uk

Y. Ioannidis, B. Novikov and B. Rachev (Eds.): Local Proceedings of ADBIS 2007, pp. 136-147 © Technical University of Varna, 2007

Abstract. In solving the classification problem in relational data mining, traditional methods such as C4.5 and its variants usually require the data to be transformed from multiple tables into a single table. Unfortunately, we may lose some information when we join tables with a high degree of one-to-many association. Data transformation therefore becomes tedious trial-and-error work, and the classification results are often unpromising, especially when the number of tables and the degree of one-to-many association are large. In this paper, we propose a genetic semi-supervised clustering technique as a means of aggregating data in multiple tables for the classification problem in relational databases. This algorithm is suitable for classification of datasets with a high degree of one-to-many association. It can be used in two ways. One is user-controlled clustering, where the user controls the result of clustering by varying the compactness of the spherical clusters. The other is automatic clustering, where a non-overlap clustering strategy is applied. In this paper, we use the latter method to dynamically cluster multiple instances, as a means of aggregating them, and illustrate the effectiveness of this method using the semi-supervised genetic algorithm-based clustering technique.

Keywords: Aggregation, Clustering, Semi-supervised Clustering, Genetic Algorithm, Relational Data Mining.

1 Introduction

Relational databases require effective and efficient ways of extracting patterns from content stored in multiple tables. In this process, significant features must be extracted from datasets stored in multiple tables with one-to-many relationships. In a relational database, a record stored in a target table can be associated with one or more records stored in another table due to the one-to-many association constraint. Traditional data mining tools require data in relational databases to be transformed into an attribute-value format by joining multiple tables. However, with the large