Mining Frequent Closed Unordered Trees Through Natural Representations

José L. Balcázar, Albert Bifet and Antoni Lozano

Universitat Politècnica de Catalunya, {balqui,abifet,antoni}@lsi.upc.edu

Abstract. Many knowledge representation mechanisms consist of link-based structures; they may be studied formally by means of unordered trees. Here we consider the case where labels on the nodes are nonexistent or unreliable, and propose data mining processes focusing on just the link structure. We propose a representation of ordered trees, describe a combinatorial characterization and some properties, and use them to propose an efficient algorithm for mining frequent closed subtrees from a set of input trees. We then focus on unordered trees and show that intrinsic characterizations of our representation provide a way of avoiding the repeated exploration of unordered trees; finally, we give an efficient algorithm for mining frequent closed unordered trees.

1 Introduction

Trees, in a number of variants, are basically connected acyclic undirected graphs, with some additional structural notions like a distinguished vertex (root) or labelings on the vertices. They are frequently a great compromise between graphs, which offer richer expressivity, and strings, which offer very efficient algorithmics. From AI to Compilers, through XML dialects, trees are now ubiquitous in Informatics.

One form of data analysis contemplates the search for frequent (or so-called “closed”) substructures in a dataset of structures. In the case of trees, there are two broad kinds of subtrees considered in the literature: subtrees which are just induced subgraphs, called induced subtrees, and subtrees where contraction of edges is allowed, called embedded subtrees.
In these contexts, the process of “mining” usually refers, nowadays, to a process of identifying which common substructures appear particularly often, or particularly correlated with other substructures, with the purpose of inferring new information implicit in a (large) dataset.

Partially supported by the 6th Framework Program of EU through the integrated project DELIS (#001907), by the EU PASCAL Network of Excellence, IST-2002-506778, by the MEC TIN2005-08832-C03-03 (MOISES-BAR), MCYT TIN2004-07925-C03-02 (TRANGRAM), and CICYT TIN2004-04343 (iDEAS) projects.

In our case, the dataset would consist of a large set (more precisely, bag) of trees; algorithms for mining embedded labeled frequent trees include TreeMiner [22], which finds all embedded ordered subtrees that appear with a