Bootstrapping for Hierarchical Document Classification

Giordano Adami, ITC-irst, via Sommarive 18, 38050 Povo, Italy, gioadami@itc.it
Paolo Avesani, ITC-irst, via Sommarive 18, 38050 Povo, Italy, avesani@itc.it
Diego Sona, ITC-irst, via Sommarive 18, 38050 Povo, Italy, sona@itc.it

ABSTRACT
Managing the hierarchical organization of data is starting to play a key role in the knowledge management community because of the large amount of human resources needed to create and maintain these organized repositories of information. The machine learning community has partly addressed this problem by developing hierarchical supervised classifiers that help maintainers categorize new resources within given hierarchies. Although such learning models succeed in exploiting relational knowledge, they are highly demanding in terms of labeled examples, because the number of categories grows with the size of the corresponding hierarchy. Hence, creating new directories or modifying existing ones requires a substantial investment.

This paper proposes a semi-automatic process (interleaved with human suggestions) whose aim is to minimize (simplify) the work required of administrators when creating, modifying, and maintaining directories. Within this process, bootstrapping a taxonomy with examples is a critical factor for the effective exploitation of any supervised learning model. For this reason we propose a method for the bootstrapping¹ process that makes a first categorization hypothesis for a set of unlabeled documents with respect to a given empty hierarchy of concepts. Based on a revision of Self-Organizing Maps, namely TaxSOM, the proposed model performs an unsupervised classification, exploiting the a priori knowledge encoded in a taxonomy structure at both the terminological and the topological level. The ultimate goal of TaxSOM is to create the preconditions for successfully training a supervised classifier.
Categories and Subject Descriptors
F.1.1 [Computation by Abstract Devices]: Models of Computation - Self-modifying machines; H.1.2 [Models and Principles]: User/Machine Systems; H.3 [Information Storage and Retrieval]: Miscellaneous; I.2.6 [Artificial Intelligence]: Learning; I.5.3 [Pattern Recognition]: Clustering; I.5.4 [Pattern Recognition]: Applications - Text Processing

General Terms
Algorithms, Design, Experimentation.

Keywords
TaxSOM, constrained clustering, k-means, taxonomy bootstrapping process, text categorization.

¹ The term bootstrapping is not related to sampling theory. It refers to a more general concept, i.e. "to promote or develop by initiative and effort with little or no assistance" (taken from the Merriam-Webster dictionary).

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee.
CIKM’03, November 3–8, 2003, New Orleans, Louisiana, USA.
Copyright 2003 ACM 1-58113-723-0/03/0011 ...$5.00.

1. INTRODUCTION
Recent trends in knowledge management highlight the interest in organizing documents or other sources of knowledge in hierarchies of categories [4]. Taxonomies and Web directories are well-known examples of this type of structured indexing (see for example Google's Web directory). There are two basic elements in a taxonomy: a hierarchy of categories and a collection of documents. Categories are described both by linguistic labels denoting the "meaning" of the nodes and by their relationships with other categories. Documents are classified within one or more categories, depending on whether the problem is single-class or multi-class.
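As an illustration only, the two elements of a taxonomy described above can be sketched as a small data structure. This is not taken from the paper: the class name `Category`, its fields, the helper methods, and the toy labels are all assumptions made for the example.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Category:
    """Hypothetical taxonomy node: a linguistic label plus links to sub-categories."""
    label: str
    children: list[Category] = field(default_factory=list)
    documents: list[str] = field(default_factory=list)  # ids of documents classified here

    def add_child(self, label: str) -> Category:
        """Create a sub-category with the given label and return it."""
        child = Category(label)
        self.children.append(child)
        return child

    def walk(self, path: tuple[str, ...] = ()):
        """Yield (path, node) pairs for this node and all its descendants, depth-first."""
        here = path + (self.label,)
        yield here, self
        for child in self.children:
            yield from child.walk(here)


# Toy hierarchy; the labels are illustrative, not from the paper.
root = Category("Top")
science = root.add_child("Science")
physics = science.add_child("Physics")
arts = root.add_child("Arts")

# Multi-class setting: the same document may be placed in more than one category.
physics.documents.append("doc-1")
arts.documents.append("doc-1")

paths = ["/".join(p) for p, _ in root.walk()]
```

In this sketch a node's relationships with other categories are captured only by its parent-child links, and a taxonomy whose `documents` lists are all empty corresponds to the "empty hierarchy" mentioned in the abstract.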
In this scenario, a typical task is to position new documents within a given hierarchy of categories. Many supervised document classifiers have been designed to automate this task. However, there is a strong precondition: a "proper" number of labeled documents for every category is required as a training set. This issue is known in the literature as the bootstrapping problem [20].

Bootstrapping a hierarchical structure of categories with a correct set of labeled examples is a critical step in the deployment of automated classifiers, because the number of labeled examples required to train a supervised learning algorithm grows with the size of the taxonomy. For example, the most popular Web directories, such as Google [14], Yahoo! [27], and Looksmart [19], have large hierarchical structures with many thousands of nodes, i.e. categories.

Although in real-world solutions the development of such structured indexes is an evolving process (that is, document classification and structure definition are interleaved processes), we look at the bootstrapping task as a process that takes as input an empty taxonomy and a set of candidate examples and produces as output a labeling hypothesis for