The Australasian Data Mining Workshop Copyright 2002

Building a Data Mining Query Optimizer

Raj P. Gopalan, Tariq Nuruddin, Yudho Giri Sucahyo
School of Computing
Curtin University of Technology
Bentley, Western Australia 6102
{raj, nuruddin, sucahyoy}@computing.edu.au

ABSTRACT
In this paper, we describe our research into building an optimizer for association rule queries. We present a framework for the query processor and report on the progress of our research so far. An extended SQL syntax is used for expressing association rule queries, and query trees of operators in an extended relational algebra are used for their internal representation. The placement of constraints in the query tree is discussed. We have developed an efficient algorithm called CT-ITL for the lower-level implementation of frequent item set generation, which is the most critical step of association rule mining. The performance evaluations show that our algorithm compares well with the most efficient algorithms currently available. We also discuss further steps needed to reach our goal of integrating the optimizer with database systems.

Keywords
Knowledge discovery and data mining, query optimization, association rules, frequent item sets.

1. INTRODUCTION
Data mining is used to automatically extract structured knowledge from large data sets. Among the different topics of research in data mining, the efficient discovery of association rules has been one of the most active. Association rules identify relationships among sets of items in a transaction database. They have a number of applications, such as increasing the effectiveness of marketing, inventory control and stock location on the shop floor.

Since the introduction of association rules in [1], many researchers have developed increasingly efficient algorithms for their discovery [3], [8], [11], [19]. As the computational complexity of association rule discovery is very high, most of the research effort has focused on the design of efficient algorithms.
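To make the notions of frequent item set and association rule concrete, the following is a minimal Python sketch. It is a brute-force illustration only and is not CT-ITL or any other algorithm from the literature cited here. The toy transactions, the function names and the thresholds are all assumptions introduced for this example. It uses the usual definitions: the support of an item set is the fraction of transactions containing it, and the confidence of a rule X -> Y is support(X ∪ Y) / support(X).

```python
from itertools import combinations

# Hypothetical toy transaction database (illustrative, not from the paper).
TRANSACTIONS = [
    {"bread", "milk"},
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"milk", "butter"},
    {"bread", "milk", "butter", "jam"},
]


def support(itemset, transactions):
    """Fraction of transactions that contain every item in itemset."""
    itemset = frozenset(itemset)
    return sum(1 for t in transactions if itemset <= t) / len(transactions)


def frequent_itemsets(transactions, min_support):
    """Brute force: test every non-empty subset of the item universe.

    With n distinct items there are 2**n - 1 candidates, which is why
    practical algorithms avoid this exhaustive enumeration.
    """
    items = sorted(set().union(*transactions))
    result = {}
    for size in range(1, len(items) + 1):
        for candidate in combinations(items, size):
            s = support(candidate, transactions)
            if s >= min_support:
                result[frozenset(candidate)] = s
    return result


def association_rules(frequent, min_confidence):
    """Derive rules lhs -> rhs from frequent item sets.

    Every subset of a frequent item set is itself frequent, so the
    lookup frequent[lhs] always succeeds.
    """
    rules = []
    for itemset, s in frequent.items():
        for size in range(1, len(itemset)):
            for lhs in combinations(sorted(itemset), size):
                lhs = frozenset(lhs)
                confidence = s / frequent[lhs]
                if confidence >= min_confidence:
                    rules.append((lhs, itemset - lhs, s, confidence))
    return rules
```

On the toy data, calling `frequent_itemsets(TRANSACTIONS, 0.6)` keeps the three single items bread, milk and butter and the three pairs formed from them, while jam and the three-item set fall below the threshold. The exponential number of candidates in the brute-force loop is one way to see why frequent item set generation dominates the cost of association rule mining.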
Existing algorithms generally use simple flat files as input and are not integrated with database systems. Imielinski and Mannila [9] identified the need to develop data mining systems similar to the DBMSs that manage business applications. They also suggested developing features such as query languages and query optimizers for these systems.

A number of tightly coupled integration schemes between data mining and database systems have been reported. Agrawal and Shim described the integration of data mining with the IBM DB2/CS RDBMS using User Defined Functions (UDFs) [2]. An exploration of different architectural alternatives for database integration, with comparisons of their performance, storage overhead, parallelization and inter-operability, was presented in [17]. Recently, the impact of file structures and systems software support on mining algorithms was discussed in [16]. However, most of these proposals focus on enhancing existing DBMSs for data mining and do not deal with the problem of integrating the data mining algorithms with database technology to support data mining applications.

Several researchers have proposed query languages for discovering association rules. Imielinski et al. introduced the MINE operator as a query language primitive that can be embedded in a general programming language to provide an Application Programming Interface for database mining [10]. Han et al. proposed DMQL as another query language for data mining of relational databases [7]. Meo et al. described MINE RULE as an extension to SQL, including examples dealing with several variations of the association rule specifications [12]. However, all these proposals focus on language specifications rather than on algorithms or techniques for optimizing the queries.

Chaudhuri suggested that data mining should take advantage of the primitives for data retrieval provided by database systems [4].
However, the operators used for implementing SQL are not sufficient to support data mining applications [9]. Meo et al. gave the semantics of the MINE RULE operator using a set of nested relational algebra operators [12]. Probably because their objective was only to describe the semantics of MINE RULE, the expressions of their algebra are far too complex for the internal representation of queries or for performing optimization.

In this paper, we discuss our research into building a mining query optimizer that can be integrated with database systems using a common algebra for the internal representation of both data mining and database queries. We describe the general framework of the optimizer and report on our progress so far. We have focused our efforts on the critical components needed for building the optimizer. We specify an extended SQL syntax for expressing association rule queries. The queries are represented internally as query trees of an extended relational algebra. The algebraic operators are grouped into modules to simplify the query tree. Several constraints that reduce the number of association rules generated can be integrated with the different modules. Alternative execution plans may be generated by assigning algorithms to implement the various operations in the query tree, and an optimal plan can then be chosen from among them based on cost estimates.

We have developed an efficient algorithm called CT-ITL for the lower-level implementation of frequent item set generation, which is the most critical step of association rule mining. The performance evaluations show that our algorithm compares well with the most efficient algorithms currently available. The algorithm and comparisons of its performance with other well-known algorithms on some typical test data sets are presented. We