Toward Multidatabase Mining: Identifying Relevant Databases

Huan Liu, Senior Member, IEEE, Hongjun Lu, and Jun Yao

Abstract: Various tools and systems for knowledge discovery and data mining have been developed and are available for applications. However, when we are immersed in heaps of databases, an immediate question is where we should start mining. It is not true that the more databases there are, the better for data mining; this holds only when the databases involved are relevant to the task at hand. In this paper, breaking away from the conventional data mining assumption that many databases be joined into one, we argue that the first step for multidatabase mining is to identify the databases that are most likely relevant to an application; without doing so, the mining process can be lengthy, aimless, and ineffective. A measure of relevance is thus proposed for mining tasks whose objective is to find patterns or regularities about certain attributes. An efficient algorithm for identifying relevant databases is described. Experiments are conducted to verify the measure's performance and to exemplify its application.

Index Terms: Multiple databases, data mining, query, relevance measure.

1 INTRODUCTION

With more and more databases being created, an increasingly pressing issue is how to make efficient use of them. To address this issue, there has been a recent surge of research interest in knowledge discovery and data mining [1], [10], [7], [33]. While researchers are trying to develop efficient algorithms to cope with large volumes of data, little work has been devoted to the data aspect of the knowledge discovery process. In most organizations, data is rarely collected and stored specifically for the purpose of mining knowledge; it is usually a byproduct of other tasks [32]. Furthermore, with the development of technologies, it is not uncommon for an organization to have a large number of database systems and diverse data sources.
Although most data mining algorithms assume a single data set, in real-world applications practitioners have to face the problem of discovering knowledge from multiple databases. One way to do so is a brute-force approach: join the available tables into a single large table to which existing data mining techniques or tools can be applied. There are several problems with this approach in real-world applications.

First, database integration itself is still a problematic area, especially where the source domains differ.

Second, all tables with foreign key references need to be joined together to produce a single combined table. The resulting table, in terms of both the number of records and the number of attributes, will be much larger than the original individual tables. The increase in data size not only prolongs the running time of mining algorithms, but also affects their behavior. From the viewpoint of statistics, joining a relevant database with an irrelevant one makes it more difficult to find useful patterns, because the search space is enlarged by irrelevant attributes. For example, suppose there are two binary-valued databases, each with N attributes, that one of them is irrelevant, and that N/2 attributes appear in both. Working only on the relevant database, the hypothesis space is $2^{2^{N}}$; after joining, the table has N + N - N/2 = 3N/2 distinct attributes and the hypothesis space is $2^{2^{3N/2}}$. This simplified analysis does not yet consider the missing values introduced by joining, a factor that will certainly increase the difficulty of data mining, too.

Third, if databases are joined and data mining algorithms are applied, the users face the problem of identifying interesting patterns among a large number of discovered rules. In practice, it is all too easy to discover a huge number of patterns in a database [20], [26], [28]; however, it is difficult for users to search through all the discovered patterns for useful ones.
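The counting argument in the second problem above can be checked numerically. The following is a minimal sketch, assuming that the hypothesis space is taken to be the set of all Boolean functions over the binary attributes, which is what a size of $2^{2^{N}}$ corresponds to. The function names and the choice N = 10 are illustrative and do not come from the paper. Because the sizes themselves are astronomically large, the sketch works with their base-2 logarithms.

```python
def joined_attribute_count(n_attributes: int, n_shared: int) -> int:
    """Distinct attributes after joining two tables of n_attributes each
    that have n_shared attributes in common (hypothetical helper)."""
    return n_attributes + n_attributes - n_shared


def log2_hypothesis_space(n_binary_attributes: int) -> int:
    """log2 of the number of Boolean functions over n binary attributes.

    There are 2**n distinct attribute-value combinations, and a Boolean
    function assigns one of two labels to each, giving 2**(2**n) functions.
    The base-2 logarithm of that count is therefore 2**n.
    """
    return 2 ** n_binary_attributes


if __name__ == "__main__":
    n = 10                      # illustrative value, not from the paper
    shared = n // 2             # N/2 attributes appear in both databases

    joined = joined_attribute_count(n, shared)   # 3N/2 attributes
    before = log2_hypothesis_space(n)            # log2 of 2^(2^N)
    after = log2_hypothesis_space(joined)        # log2 of 2^(2^(3N/2))

    print(f"attributes: {n} before joining, {joined} after joining")
    print(f"log2(hypothesis space): {before} before, {after} after")
    print(f"growth factor of the hypothesis space: 2^{after - before}")
```

Even for this small N, the exponent itself grows by a factor of $2^{N/2}$, which is why joining an irrelevant database is so costly under this simplified analysis.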
Redundant, useless, or uninteresting patterns are generated even more easily when many of the databases are irrelevant to the mining task. Therefore, as in any effective knowledge discovery process, the first important step in mining multiple databases is indeed to select those databases that are relevant to a specific mining task.

In this paper, we address the relevance problem: identifying the databases relevant to a particular data mining task among multiple databases. Without loss of generality, we call each database a relation or a table and assume that 1) a specific mining task is related to the property of certain attributes, which can be expressed using query predicates, and 2) the higher order correlations of attributes with a query are at least partially reflected in their first order correlations. It is important to note that, unlike conventional query processing, our task is to select the databases relevant to a given query predicate rather than to identify the databases matching the query.

IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, VOL. 13, NO. 4, JULY/AUGUST 2001

• H. Liu is with the Department of Computer Science and Engineering, Arizona State University, PO Box 875406, Tempe, AZ 85287-5406. E-mail: hliu@asu.edu.
• H. Lu is with the Department of Computer Science, Hong Kong University of Science and Technology, Clear Water Bay, Kowloon, Hong Kong, China. E-mail: luhj@cs.ust.hk.
• J. Yao is with Mokonet Internet Inc., 40-31 3FL, 68th St., Woodside, NY 11377. E-mail: yaojun@excite.com.

Manuscript received 2 Sept. 1997; accepted 2 Mar. 2000; posted to Digital Library 6 Apr. 2001. For information on obtaining reprints of this article, please send e-mail to: tkde@computer.org, and reference IEEECS Log Number 105570.

1041-4347/01/$10.00 © 2001 IEEE