Journal of Intelligent Information Systems, 18:2/3, 153–172, 2002
© 2002 Kluwer Academic Publishers. Manufactured in The Netherlands.

A Probabilistic Framework for the Hierarchic Organisation and Classification of Document Collections

ALEXEI VINOKOUROV alexei@cs.rhul.ac.uk
Department of Computer Science, Royal Holloway, University of London, Egham, Surrey, TW20 0EX, UK

MARK GIROLAMI mark.girolami@paisley.ac.uk
School of Communication and Information Technologies, University of Paisley, High Street, Paisley, PA1 2BE, UK

Abstract. This paper presents a probabilistic mixture modeling framework for the hierarchic organisation of document collections. It is demonstrated that the probabilistic corpus model which emerges from the automatic or unsupervised hierarchical organisation of a document collection can be further exploited to create a kernel which boosts the performance of state-of-the-art support vector machine document classifiers. It is shown that the performance of such a classifier is further enhanced when employing the kernel derived from an appropriate hierarchic mixture model used for partitioning a document corpus rather than the kernel associated with a flat non-hierarchic mixture model. This has important implications for document classification when a hierarchic ordering of topics exists. This can be considered as the effective combination of documents with no topic or class labels (unlabeled data), labeled documents, and prior domain knowledge (in the form of the known hierarchic structure) to provide enhanced document classification performance.

Keywords: hierarchical probabilistic clustering, probabilistic latent semantic analysis, text categorization, support vector machines

1. Introduction

The notion of hierarchy is of great importance in areas such as Pattern Recognition, Machine Learning, Artificial Intelligence and, of course, Information Retrieval (IR).
Hierarchy embodies the principle of divide-and-conquer, which takes a complex problem and breaks it up into a number of simpler sub-problems whose solutions, when combined, provide a solution to the original complex problem. A tree-structured mixture of simple experts was developed in Jordan (1994) to perform classification on complex problems, and it gave improved results on particularly challenging pattern recognition problems (Jordan, 1994). A similar mixture of experts architecture was employed in Weigend et al. (1999) for text classification, with consistently improved results reported on the particular test corpora utilized. Somewhat more recently, a hierarchic structure composed of state-of-the-art support vector machines (SVM) (Vapnik, 1995; Dumais et al., 1998) was employed for the classification of a large collection of web page summaries in Dumais and Chen (2000).

Complex classification tasks, in many cases, benefit from the adoption of a hierarchic approach to the problem. It is then natural to consider the benefit that the task of unsupervised