Aggregating Multiple Instances in Relational Database
Using Semi-Supervised Genetic Algorithm-based
Clustering Technique
Rayner Alfred¹,² and Dimitar Kazakov¹
¹ York University, Computer Science Department, Heslington,
YO10 5DD York, England
² On Study Leave from Universiti Malaysia Sabah,
School of Engineering and Information Technology,
88999, Kota Kinabalu, Sabah, Malaysia
{ralfred, kazakov}@cs.york.ac.uk
Abstract. In solving the classification problem in relational data mining,
traditional methods, such as C4.5 and its variants, usually require data
stored in multiple tables to be transformed into a single table.
Unfortunately, we may lose some information when we join tables with a high
degree of one-to-many association. Data transformation therefore becomes
tedious trial-and-error work, and the classification result is often not very
promising, especially when the number of tables and the degree of one-to-many
association are large. In this paper, we propose a genetic semi-supervised
clustering technique as a means of aggregating data in multiple tables for the
classification problem in relational databases. This algorithm is suitable for
classification of datasets with a high degree of one-to-many associations. It can
be used in two ways. One is user-controlled clustering, where the user may
control the result of clustering by varying the compactness of the spherical
cluster. The other is automatic clustering, where a non-overlapping clustering
strategy is applied. In this paper, we use the latter method to dynamically
cluster multiple instances, as a means of aggregating them, and illustrate the
effectiveness of this method using the semi-supervised genetic algorithm-based
clustering technique.
Keywords: Aggregation, Clustering, Semi-supervised Clustering, Genetic
Algorithm, Relational Data Mining.
1 Introduction
Relational databases require effective and efficient ways of extracting patterns
from content stored in multiple tables. In this process, significant features must be
extracted from datasets stored in multiple tables linked by one-to-many
relationships. In a relational database, a record stored in a target table can be associated with one
or more records stored in another table due to the one-to-many association constraint.
Traditional data mining tools require data in relational databases to be transformed
into an attribute-value format by joining multiple tables. However, with the large
Y. Ioannidis, B. Novikov and B. Rachev (Eds.): Local Proceedings of ADBIS 2007, pp. 136-147
© Technical University of Varna, 2007