1 6 Tree and Graph Mining Dimitrios Katsaros Aristotle University, Greece Copyright © 2005, Idea Group Inc., distributing in print or electronic forms without written permission of IGI is prohibited. Yannia Manolopoulos Aristotle Univeristy, Greece INTRODUCTION During the past decade, we have witnessed an explosive growth in our capabilities to both generate and collect data. Various data mining techniques have been pro- posed and widely employed to discover valid, novel and potentially useful patterns in these data. Data mining involves the discovery of patterns, associations, changes, anomalies, and statistically significant structures and events in huge collections of data. One of the key success stories of data mining re- search and practice has been the development of effi- cient algorithms for discovering frequent itemsets – both sequential (Srikant & Agrawal, 1996) and non- sequential (Agrawal & Srikant, 1994). Generally speak- ing, these algorithms can extract co-occurrences of items (taking or not taking into account the ordering of items) in an efficient manner. Although the use of sets (or sequences) has effectively modeled many applica- tion domains, like market basket analysis, medical records, a lot of applications have emerged whose data models do not fit in the traditional concept of a set (or sequence), but require the deployment of richer ab- stractions, like graphs or trees. Such graphs or trees arise naturally in a number of different application domains including network intrusion, semantic Web, behavioral modeling, VLSI reverse engineering, link analysis and chemical compound classification. Thus, the need to extract complex tree-like or graph- like patterns in massive data collections, for instance, in bioinformatics, semistructured or Web databases, be- came a necessity. The class of exploratory mining tasks, which deal with discovering patterns in massive databases representing complex interactions among entities, is called Frequent Structure Mining (FSM) (Zaki, 2002). In this article we will highlight some strategic appli- cation domains where FSM can help provide significant results and subsequently we will survey the most impor- tant algorithms that have been proposed for mining graph-like and tree-like substructures in massive data collections. BACKGROUND As a motivating example for graph mining consider the problem of mining chemical compounds to discover recurrent (sub) structures. We can model this scenario using a graph for each compound. The vertices of the graphs correspond to different atoms and the graph edges correspond to bonds among the atoms. We can assign a label to each vertex, which corresponds to the atom involved (and maybe to its charge) and a label to each edge, which corresponds to the type of the bond (and maybe to information about the 3D orientation). Once these graphs have been generated, recurrent sub- structures become frequently occurring subgraphs. These graphs can be used in various tasks, for instance, in classifying chemical compounds (Deshpande, Kuramochi, & Karypis, 2003). Another application domain where graph mining is of particular interest arises in the field of Web usage analysis (Nanopoulos, Katsaros, & Manolopoulos, 2003). Although various types of usage (traversal) pat- terns have been proposed to analyze the behavior of a user (Chen, Park, & Yu, 1998), they all have one very significant shortcoming; they are one-dimensional pat- terns and practically ignore the link structure of the site. In order to perform finer usage analysis, it is possible to look at the entire forward accesses of a user and to mine frequently accessed subgraphs of that site. Looking for examples where tree mining has been successfully applied, we can find a wealth of them. A characteristic example is XML, which has been a very popular means for representing and storing information of various kinds, because of its modeling flexibility. Since tree-structured XML documents are the most widely occurring in real applications, one would like to discover the commonly occurring subtrees that appear in the collections. This task could benefit applications, like database caching (Yang, Lee, & Hsu, 2003), storage in relational databases (Deutsch, Fernandez, & Suciu, 1999), building indexes and/or wrappers (Wang & Liu, 2000) and many more.