CatRelate: A New Hierarchical Document Category Integration Algorithm by Learning Category Relationships¹

Shanfeng Zhu¹, Christopher C. Yang², and Wai Lam²

¹ Bioinformatics Center, Institute for Chemical Research, Kyoto University, Japan
zhusf@kuicr.kyoto-u.ac.jp
² Department of Systems Engineering and Engineering Management, The Chinese University of Hong Kong, Hong Kong
{yang,wlam}@se.cuhk.edu.hk

Abstract. We address the problem of integrating documents from a source catalog into a master catalog. Existing approaches either treat it as a flat category integration problem, ignoring the useful hierarchy information in the catalogs, or handle it hierarchically but without a rigorous model. In contrast, our method is based on correctly identifying relationships among categories, namely Match, Disjoint, SubConcept, SuperConcept, and Overlap, which derive from the relations between sets in set theory. Compared with the traditional Match/NotMatch relationship used in the literature, this formulation is more expressive. The relationships among categories are first learned probabilistically and then refined by considering the hierarchy context. Our preliminary experiments show that the method helps to identify category relationships correctly and thus increases the accuracy of document integration.

1 Introduction

With the development of the WWW, the amount of information (such as documents and Web pages) has increased dramatically. To organize this information effectively, hierarchical categorization, which classifies documents into a hierarchy of categories, has been widely adopted, for example in Yahoo! Directory and Google Directory. The rapid growth of the Internet and e-commerce has spurred the interest of individuals and enterprises in integrating information from different sources, each organized in its own hierarchy.
¹ The work described in this paper was substantially supported by a grant from the Research Grants Council of the Hong Kong Special Administrative Region, China (Project No: CUHK 4179/03E) and a CUHK Strategic Grant (No: 4410001).

How to integrate documents organized in one taxonomy (the source catalog) into documents organized according to another taxonomy (the master catalog) efficiently and effectively is becoming increasingly important. This problem was first proposed and studied by Agrawal and Srikant [1]. They collapsed the hierarchical structures of the catalogs into flat structures, and extended