Subdue: Compression-Based Frequent Pattern Discovery in Graph Data

Nikhil S. Ketkar
University of Texas at Arlington
ketkar@cse.uta.edu

Lawrence B. Holder
University of Texas at Arlington
holder@cse.uta.edu

Diane J. Cook
University of Texas at Arlington
cook@cse.uta.edu

ABSTRACT
The majority of existing algorithms that mine graph datasets target complete, frequent subgraph discovery. We describe the graph-based data mining system Subdue, which uses a heuristic algorithm to discover subgraphs that are not only frequent but also compress the graph dataset. The rationale behind a compression-based methodology for frequent pattern discovery is to produce a small number of highly interesting patterns rather than a large number of patterns from which the interesting ones must then be identified. We perform an experimental comparison of Subdue with the graph mining systems gSpan and FSG on the Chemical Toxicity and Chemical Compounds datasets that are provided with gSpan. We also present results on the performance of the Subdue system on the Mutagenesis and KDD 2003 Citation Graph datasets. An analysis of the results indicates that Subdue can efficiently discover best-compressing frequent patterns, which are fewer in number but can be of higher interest.

1. INTRODUCTION
Recently, an increasing body of research has focused on developing algorithms to mine graph datasets. A graph representation provides a natural way to express relationships within data. Graph-based data mining expresses data in the form of graphs and focuses on the discovery of interesting subgraph patterns.

Graph-based data mining has been successfully applied to various application domains including protein analysis [19], chemical compound analysis [1], link analysis [13] and web searching [16]. A number of varied techniques and methodologies have been applied to mining interesting subgraph patterns from graph datasets.
These include mathematical graph theory based approaches like FSG [10] and gSpan [20], greedy search based approaches like Subdue [2] or GBI [12], inductive logic programming (ILP) approaches like WARMR [3], inductive database approaches like MolFea [15] and kernel function based approaches [8].

Mathematical graph theory based approaches mine a complete set of subgraphs, mainly using a support or frequency measure. The initial work in this area was the AGM [6] system, which uses the Apriori level-wise approach. FSG takes a similar approach and further optimizes the algorithm for improved running times. gFSG [9] is a variant of FSG which enumerates all geometric subgraphs from the database. gSpan uses DFS codes for canonical labeling and is much more efficient in both memory and computation than previous approaches. Instead of mining all subgraphs, CloseGraph [21] mines only closed subgraphs. A graph G is closed in a dataset if there exists no proper supergraph of G that has the same support as G. Gaston [14] efficiently mines graph datasets by first considering frequent paths, which are transformed to trees, which are in turn transformed to graphs. FFSM [5] is a graph mining system which uses an algebraic graph framework to address the underlying problem of subgraph isomorphism.

In comparison to mathematical graph theory based approaches, which are complete, greedy search based approaches use heuristics to evaluate the solution. The two pioneering works in the field are Subdue and GBI. Subdue uses MDL-based compression heuristics, and GBI uses an empirical graph size-based heuristic. The empirical graph size definition depends on the size of the extracted patterns and the size of the compressed graph.

Another methodology in this field is that of inductive logic programming, which has the advantage of the extensive descriptive power of first-order logic. The first graph-based system to combine the ILP method with Apriori-like level-wise search was WARMR.
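The definition of a closed subgraph given above for CloseGraph can be illustrated with a minimal sketch. It is not CloseGraph's algorithm: it assumes that every vertex carries a unique label, so that a pattern can be written as a set of edges and the supergraph test reduces to a proper-superset test, and it assumes that the support of every candidate pattern has already been computed. The names `edge`, `closed_patterns` and the support values are hypothetical and chosen only for illustration.

```python
from typing import Dict, FrozenSet, List, Tuple

# Assumption: vertices have unique labels, so an undirected edge is a pair of
# labels and a pattern is simply a set of such edges.
Edge = Tuple[str, str]
Pattern = FrozenSet[Edge]


def edge(u: str, v: str) -> Edge:
    """Normalise an undirected edge so (u, v) and (v, u) compare equal."""
    return (u, v) if u <= v else (v, u)


def closed_patterns(support: Dict[Pattern, int]) -> List[Pattern]:
    """Return the patterns that have no proper super-pattern of equal support.

    Only the patterns listed in `support` are considered as candidate
    supergraphs (an assumption of this sketch).
    """
    closed = []
    for p, sup_p in support.items():
        has_equal_super = any(
            p < q and sup_q == sup_p  # p < q: p is a proper subset of q
            for q, sup_q in support.items()
        )
        if not has_equal_super:
            closed.append(p)
    return closed


ab = frozenset({edge("A", "B")})
bc = frozenset({edge("B", "C")})
abc = frozenset({edge("A", "B"), edge("B", "C")})

# Hypothetical supports: number of dataset graphs containing each pattern.
support = {ab: 3, bc: 2, abc: 2}

result = closed_patterns(support)
```

In this toy input the pattern B-C is not closed, because its supergraph A-B-C occurs in the same number of graphs, whereas A-B is closed because its only supergraph has lower support. With general labeled graphs the subset test would have to be replaced by a subgraph isomorphism test.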
The major advantage of ILP approaches is their high representational power. WARMR was used for carcinogenesis prediction of chemical compounds [7].

Another promising direction in the field of graph-based data mining is that of inductive databases, a new generation of databases that are capable of dealing not only with data but also with patterns or regularities within the data. Data mining in such a framework is an interactive querying process. The inductive database framework is especially interesting for bioinformatics and chemoinformatics, because of the large and complex databases that exist in these domains and the lack of methods to gain scientific knowledge from them. The pioneering work in this field was the MolFea system, which is based on the level-wise version space algorithm. MolFea is the Molecular Feature miner, which mines for linear fragments in chemical compounds.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. OSDM'05, August 21, 2005, Chicago, Illinois, USA. Copyright 2005 ACM 1-59593-210-0/05/08 ...$5.00.

Lastly, the kernel function based approaches have been used to a certain extent for mining graph datasets. The kernel function