Simultaneous Optimization of Complex Mining Tasks with a Knowledgeable Cache

Ruoming Jin, Kaushik Sinha, Gagan Agrawal
Department of Computer Science and Engineering
Ohio State University, Columbus OH 43210
{jinr,sinhak,agrawal}@cse.ohio-state.edu

ABSTRACT

With an increasing use of data mining tools and techniques, we envision that a Knowledge Discovery and Data Mining System (KDDMS) will have to support and optimize for the following scenarios: 1) Sequence of Queries: A user may analyze one or more datasets by issuing a sequence of related complex mining queries, and 2) Multiple Simultaneous Queries: Several users may be analyzing a set of datasets concurrently, and may issue related complex queries. This paper presents a systematic mechanism to optimize for the above cases, targeting the class of mining queries involving frequent pattern mining on one or multiple datasets. We present a system architecture and propose new algorithms to simultaneously optimize multiple such queries and to use a knowledgeable cache to store and utilize past query results. We have implemented and evaluated our system with both real and synthetic datasets. Our experimental results show that our techniques can achieve a speedup of up to a factor of 9, compared with systems that neither support caching nor optimize for multiple queries.

Categories and Subject Descriptors: H.2.8 [Database Applications]: Data Mining

General Terms: Algorithms

Keywords: Frequent pattern mining, multiple query optimization, knowledgeable cache

1. INTRODUCTION

As the amount of data available for analysis in both scientific and commercial domains is increasing dramatically, efficiency in the data mining process is likely to become a crucial issue.
With an increasing use of data mining tools and techniques, we envision that a Knowledge Discovery and Data Mining System (KDDMS) will have to support and optimize for the following scenarios:

• Sequence of Queries: A user may analyze one or more datasets by issuing a sequence of related complex mining queries. This may be due to the iterative and exploratory nature of the process, where the mining parameters and constraints are modified until the desired insights are gained from the dataset(s).

• Multiple Simultaneous Queries: Several users may be analyzing a set of datasets concurrently, and may issue related complex queries.

In this paper, we focus on the problem of efficiently evaluating an important class of complex mining queries in a query-intensive environment, where one needs to optimize multiple simultaneous queries as well as a sequence of related queries. The class of complex mining queries we target consists of those involving frequent pattern mining on one or multiple datasets. In particular, we show how multiple simultaneous queries can be optimized, and how the results from past mining queries can be utilized to evaluate the current ones. Due to the complexity and characteristics of such queries, simultaneous optimization of multiple queries and caching of their results is challenging, and quite different from the existing work in this area.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. KDD'05, August 21–24, 2005, Chicago, Illinois, USA. Copyright 2005 ACM 1-59593-135-X/05/0008 ...$5.00.
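To make the idea of reusing past mining results concrete, the following is a minimal sketch of the simplest possible case: a single dataset and queries that differ only in their minimum support threshold. It is not the system proposed in this paper. The class name `SupportCache`, its methods, and the toy support values are hypothetical and introduced only for illustration. The sketch relies on one well-known fact: the frequent itemsets at a higher support threshold are a subset of those at a lower threshold, so a result mined at the lower threshold can answer the stricter query by filtering.

```python
from typing import Dict, FrozenSet, Optional

Itemset = FrozenSet[str]


class SupportCache:
    """Hypothetical cache holding the frequent itemsets (with supports)
    mined at the lowest minimum-support threshold seen so far."""

    def __init__(self) -> None:
        self.min_support: Optional[float] = None
        self.itemsets: Dict[Itemset, float] = {}

    def store(self, min_support: float, itemsets: Dict[Itemset, float]) -> None:
        # A result mined at a lower threshold subsumes results mined at
        # higher thresholds, so only the lowest-threshold result is kept.
        if self.min_support is None or min_support < self.min_support:
            self.min_support = min_support
            self.itemsets = dict(itemsets)

    def answer(self, min_support: float) -> Optional[Dict[Itemset, float]]:
        # Cache miss: nothing stored, or the query asks for itemsets that
        # are less frequent than anything the cached result can contain.
        if self.min_support is None or min_support < self.min_support:
            return None
        # Cache hit: filter the cached itemsets by the stricter threshold.
        return {s: sup for s, sup in self.itemsets.items() if sup >= min_support}


# Toy example with made-up support values for a past query at threshold 0.2.
cache = SupportCache()
cache.store(0.2, {
    frozenset({"a"}): 0.6,
    frozenset({"b"}): 0.4,
    frozenset({"a", "b"}): 0.25,
})

hit = cache.answer(0.3)   # answered from the cache without touching the data
miss = cache.answer(0.1)  # None: the dataset would have to be mined again
```

Queries with complex conditions over multiple datasets, which are the focus of this paper, cannot be answered by such a simple filter, which is why a more knowledgeable cache is needed.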
1.1 Related Work

The need for supporting and optimizing such scenarios has been well recognized in database and OLAP systems. Views have been used to optimize a sequence of database operations [5], and similarly, techniques such as reducing common subexpressions [12, 11] have been used. However, because the nature of mining operations is very different from the nature of database and OLAP operations, these techniques cannot be applied to a KDDMS.

Some efforts have been made towards addressing these issues for mining environments. Nag et al. have studied how a knowledgeable cache can be used to help perform interactive discovery of association rules [9]. They maintain a cache to record (in)frequent itemsets with their support levels, and then modify the frequent itemset mining algorithm to utilize the itemsets in the cache. The focus of their research is on frequent itemset mining without complex mining conditions. Ng et al. have studied constrained association rule mining [10]. In their method, multiple queries can be merged into a single query for evaluation. Hipp and Guntzer have argued that execution of data mining queries with constraints can be very expensive [6]. Therefore, they have proposed precomputing the frequent itemsets at certain support levels to answer constrained itemset mining queries. However, in these studies, sequences of queries and multiple simultaneous queries have not been studied together, and the techniques involving the use of a knowledgeable cache have been restricted to simple data mining queries.

1.2 Preliminaries: SQL Extensions, Algebra, and M-Table for Frequent Pattern Mining on Multiple Datasets

Frequent pattern mining focuses on discovering frequently appearing sub-structures in datasets. The structures explored include