Tree and Graph Mining
Dimitrios Katsaros
Aristotle University, Greece
Copyright © 2005, Idea Group Inc., distributing in print or electronic forms without written permission of IGI is prohibited.
Yannis Manolopoulos
Aristotle University, Greece
INTRODUCTION
During the past decade, we have witnessed an explosive
growth in our capabilities to both generate and collect
data. Various data mining techniques have been pro-
posed and widely employed to discover valid, novel and
potentially useful patterns in these data. Data mining
involves the discovery of patterns, associations, changes,
anomalies, and statistically significant structures and
events in huge collections of data.
One of the key success stories of data mining re-
search and practice has been the development of effi-
cient algorithms for discovering frequent itemsets –
both sequential (Srikant & Agrawal, 1996) and non-
sequential (Agrawal & Srikant, 1994). Generally speak-
ing, these algorithms can extract co-occurrences of
items (taking or not taking into account the ordering of
items) in an efficient manner. Although sets (or
sequences) have effectively modeled many application
domains, such as market basket analysis and medical
records, many applications have emerged whose data
models do not fit the traditional concept of a set (or
sequence) but require richer abstractions, such as
graphs or trees. Such graphs or trees
arise naturally in a number of different application
domains including network intrusion, semantic Web,
behavioral modeling, VLSI reverse engineering, link
analysis and chemical compound classification.
Thus, extracting complex tree-like or graph-like
patterns from massive data collections, for instance in
bioinformatics, semistructured or Web databases, has
become a necessity. The class of exploratory mining tasks,
which deal with discovering patterns in massive databases
representing complex interactions among entities, is
called Frequent Structure Mining (FSM) (Zaki, 2002).
In this article we will highlight some strategic appli-
cation domains where FSM can help provide significant
results and subsequently we will survey the most impor-
tant algorithms that have been proposed for mining
graph-like and tree-like substructures in massive data
collections.
BACKGROUND
As a motivating example for graph mining, consider the
problem of mining chemical compounds to discover
recurrent (sub)structures. We can model this scenario
using a graph for each compound. The vertices of the
graphs correspond to different atoms and the graph
edges correspond to bonds among the atoms. We can
assign a label to each vertex, which corresponds to the
atom involved (and maybe to its charge) and a label to
each edge, which corresponds to the type of the bond
(and maybe to information about the 3D orientation).
Once these graphs have been generated, recurrent
substructures become frequently occurring subgraphs. These
subgraphs can be used in various tasks, for instance, in
classifying chemical compounds (Deshpande,
Kuramochi, & Karypis, 2003).
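The modeling step described above can be sketched in a few lines of Python. This is only a minimal illustration under stated assumptions: the three toy compounds, the function names edge_patterns and frequent_edges, and the restriction to single labeled edges (the smallest possible subgraphs) are all hypothetical choices made here, not part of any algorithm surveyed in this article. Real frequent subgraph miners grow such patterns into larger subgraphs and must handle subgraph isomorphism.

```python
from collections import Counter

# Hypothetical toy compounds. Each graph is (vertex labels, labeled edges):
# vertex labels are atom symbols, edge labels are bond types.
water = ({0: "O", 1: "H", 2: "H"},
         [(0, 1, "single"), (0, 2, "single")])
methanol = ({0: "C", 1: "O", 2: "H", 3: "H", 4: "H", 5: "H"},
            [(0, 1, "single"), (0, 2, "single"), (0, 3, "single"),
             (0, 4, "single"), (1, 5, "single")])
formaldehyde = ({0: "C", 1: "O", 2: "H", 3: "H"},
                [(0, 1, "double"), (0, 2, "single"), (0, 3, "single")])


def edge_patterns(graph):
    """Return the distinct labeled edges (atom, bond, atom) of one graph."""
    labels, edges = graph
    patterns = set()
    for u, v, bond in edges:
        # Bonds are undirected, so order the two atom labels canonically.
        a, b = sorted((labels[u], labels[v]))
        patterns.add((a, bond, b))
    return patterns


def frequent_edges(graphs, min_support):
    """Keep edge patterns that occur in at least min_support graphs."""
    support = Counter()
    for g in graphs:
        # A graph contributes at most once to the support of a pattern.
        support.update(edge_patterns(g))
    return {p: s for p, s in support.items() if s >= min_support}


result = frequent_edges([water, methanol, formaldehyde], min_support=2)
print(result)
```

With a minimum support of two graphs, only the H-O and C-H single bonds survive in this toy database; the C-O single and double bonds each appear in one compound only.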
Another application domain where graph mining is
of particular interest arises in the field of Web usage
analysis (Nanopoulos, Katsaros, & Manolopoulos,
2003). Although various types of usage (traversal) pat-
terns have been proposed to analyze the behavior of a
user (Chen, Park, & Yu, 1998), they all share one
significant shortcoming: they are one-dimensional
patterns and practically ignore the link structure of the site.
In order to perform finer usage analysis, it is possible to
look at the entire forward accesses of a user and to mine
frequently accessed subgraphs of that site.
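A rough sketch of this idea follows, assuming a hypothetical four-page site and three hypothetical user sessions. The notion of a "forward access" is simplified here to a move along an existing site link to a page not yet visited in the session; the cited works define it more carefully. The sketch counts only single links, the smallest subgraphs, and the names site_links, forward_edges and frequent_links are illustrative.

```python
from collections import Counter

# Hypothetical site link structure: page -> pages it links to.
site_links = {"A": {"B", "C"}, "B": {"D"}, "C": {"D"}, "D": set()}

# Hypothetical user sessions, each a sequence of visited pages.
sessions = [
    ["A", "B", "D"],
    ["A", "B", "A", "C"],  # "B" -> "A" is a backward move and is ignored
    ["A", "C", "D"],
]


def forward_edges(session, links):
    """Return the site links traversed forward in one session."""
    if not session:
        return set()
    seen = {session[0]}
    edges = set()
    for src, dst in zip(session, session[1:]):
        # Simplifying assumption: forward means a real site link
        # leading to a page not visited earlier in this session.
        if dst not in seen and dst in links.get(src, set()):
            edges.add((src, dst))
        seen.add(dst)
    return edges


def frequent_links(all_sessions, links, min_support):
    """Keep links traversed forward in at least min_support sessions."""
    support = Counter()
    for s in all_sessions:
        support.update(forward_edges(s, links))
    return {e: n for e, n in support.items() if n >= min_support}


frequent = frequent_links(sessions, site_links, min_support=2)
print(frequent)
```

Because each session is reduced to a set of site edges rather than a flat page sequence, the branching in the second session (A to B and A to C) is preserved, which a one-dimensional pattern would lose.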
Tree mining has also been applied successfully in many
settings. A characteristic example is XML, which has become
a very popular means for representing and storing information
of various kinds because of its modeling flexibility.
Since tree-structured XML documents are the most
widely occurring in real applications, one would like to
discover the commonly occurring subtrees that appear
in the collections. This task could benefit applications
such as database caching (Yang, Lee, & Hsu, 2003), storage
in relational databases (Deutsch, Fernandez, & Suciu,
1999), building indexes and/or wrappers (Wang & Liu,
2000) and many more.
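To make the notion of commonly occurring subtrees concrete, the following sketch counts subtrees over a small hypothetical XML collection using Python's standard xml.etree.ElementTree module. It rests on simplifying assumptions: only complete subtrees are considered (an element together with all of its descendants), only tag names are compared, and child order matters. The documents and the names encode and frequent_subtrees are illustrative and are not taken from the cited works.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Hypothetical XML collection (tags only, no text or attributes).
docs = [
    "<book><title/><author><name/></author></book>",
    "<article><title/><author><name/></author><year/></article>",
    "<book><title/><year/></book>",
]


def encode(elem, out):
    """Encode the subtree rooted at elem as a string and record it in out."""
    code = elem.tag + "(" + ",".join(encode(c, out) for c in elem) + ")"
    out.add(code)
    return code


def frequent_subtrees(xml_docs, min_support):
    """Keep subtree encodings found in at least min_support documents."""
    support = Counter()
    for text in xml_docs:
        seen = set()  # a document counts once per distinct subtree
        encode(ET.fromstring(text), seen)
        support.update(seen)
    return {t: n for t, n in support.items() if n >= min_support}


frequent = frequent_subtrees(docs, min_support=2)
print(frequent)
```

In this toy collection the subtree author(name()) occurs in two of the three documents, so it is reported as frequent alongside the single-node subtrees, while the three differently shaped document roots are not.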