International Journal of Database Theory and Application
Vol. , No. , March, 2009

Finding Frequently Occurred Tree Patterns without Candidate Subtrees Maintenance

Juryon Paik, Dept. of Computer Eng., Sungkyunkwan Univ., wise96@ece.skku.ac.kr
Junghyun Nam, Dept. of Computer Science, Konkuk Univ., jhnam@kku.ac.kr
Ung Mo Kim, Dept. of Computer Eng., Sungkyunkwan Univ., umkim@ece.skku.ac.kr

Abstract

The most commonly adopted approach to finding valuable information in tree data is to extract the subtree patterns that occur frequently in it. Because mining frequent tree patterns has a wide range of applications, such as XML mining, web usage mining, bioinformatics, and network multicast routing, many algorithms have recently been proposed to find such patterns. However, existing tree mining algorithms suffer from several serious pitfalls when finding frequent tree patterns in massive tree datasets. The major problems are (1) the high computational cost of candidate maintenance, (2) repeated scans of the input dataset, and (3) heavy memory dependency. These problems arise because most of these algorithms are based on the well-known Apriori algorithm and rely on the anti-monotone property for candidate generation and frequency counting. To solve these problems, we adopt a pattern-growth approach rather than the Apriori approach, and we extract maximal frequent subtree patterns instead of all frequent subtree patterns. We present several new theorems and evaluate the effectiveness of the proposed algorithm in comparison with previous work.

1. Introduction

1.1. Motivation

One of the most general approaches to modeling complex structured data is to represent the data as a tree structure. In the database area [10, 13], XML documents are rooted trees where the nodes represent elements or attributes and the edges represent element-subelement and attribute-value relationships.
In Web traffic mining, access trees are used to represent the access patterns of different users [2]. In the analysis of molecular evolution, an evolutionary tree (or phylogeny) is used to describe the evolutionary history of certain species [18]. In computer networking, multicast trees are used for packet routing [4].

With the ever-increasing amount of available tree data, the ability to extract valuable information from it becomes increasingly important and desirable. However, the problem of finding information in tree data has not been extensively studied, in spite of its applicability to a variety of problems. The first step toward finding information in trees is to mine the subtrees that occur frequently in them. Frequent subtrees in a database of trees provide useful knowledge in many cases, such as gaining general information about data sources, mining association rules, classification and clustering, and supporting standard database indexing [5]. However, discovering the frequent subtrees that appear in a large-scale tree dataset is not an easy task. As observed by Chi et al. [7], due to combinatorial explosion, the number of frequent subtrees usually grows exponentially with the size (number of nodes) of the tree and, therefore, mining all frequent subtrees becomes infeasible.
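The combinatorial explosion mentioned above can be made concrete with a small counting sketch. It is only an illustration, not part of the proposed algorithm: it counts the connected (induced) subtrees of a single rooted tree, which bounds the pool of patterns a miner could have to consider, and it ignores isomorphism between subtrees. The function names `count_connected_subtrees` and `complete_binary_tree`, and the dictionary-based tree encoding, are assumptions made for this example.

```python
from typing import Dict, List

# Hypothetical encoding for this sketch: node id -> list of child ids.
Tree = Dict[int, List[int]]


def count_connected_subtrees(children: Tree, root: int) -> int:
    """Count all connected subtrees of a rooted tree.

    For a node v, the number of connected subtrees whose topmost node
    is v equals the product over v's children c of (1 + ways(c)):
    each child branch is either left out or contributes one of the
    subtrees topped at c. Summing over all nodes gives the total.
    """
    total = 0

    def ways_topped_at(node: int) -> int:
        nonlocal total
        ways = 1
        for child in children.get(node, []):
            ways *= 1 + ways_topped_at(child)
        total += ways
        return ways

    ways_topped_at(root)
    return total


def complete_binary_tree(depth: int) -> Tree:
    """Build a complete binary tree with heap-style ids 1..2^(depth+1)-1."""
    n = 2 ** (depth + 1) - 1
    return {i: [c for c in (2 * i, 2 * i + 1) if c <= n]
            for i in range(1, n + 1)}


if __name__ == "__main__":
    # Print how the subtree count grows as the tree gets larger.
    for depth in range(0, 6):
        tree = complete_binary_tree(depth)
        print(len(tree), count_connected_subtrees(tree, 1))
```

For a three-node tree (a root with two children) the count is 6: three single nodes, two root-child pairs, and the whole tree. Each additional level roughly squares the number of subtrees topped at the root, which is the kind of growth that makes enumerating all frequent subtrees impractical.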