Richard Freeman, Hujun Yin and Nigel M. Allinson, "Self-Organising Maps for Tree View Based Hierarchical Document Clustering", in Proceedings of the IEEE IJCNN'02, Honolulu, Hawaii, 12-17 May, 2002, vol. 2, pp. 1906-1911. For more details and enhancements please refer to the journal papers listed on http://www.rfreeman.net/

Self-Organising Maps for Tree View Based Hierarchical Document Clustering

Richard Freeman, Hujun Yin, Nigel M. Allinson
University of Manchester Institute of Science and Technology (UMIST), Department of Electrical Engineering and Electronics, PO Box 88, Manchester, M60 1QD, United Kingdom
Email: , {H.Yin, allinson}@umist.ac.uk

Abstract – In this paper we investigate the use of Self-Organising Maps (SOM) for document clustering. Previous methods using the SOM to cluster documents have used two-dimensional maps. This paper presents a hierarchical and growing method using a series of one-dimensional maps instead. Using this type of SOM is an efficient method for clustering documents and browsing them in a dynamically generated tree of topics. These topics are automatically discovered for each cluster, based on the set of documents in that cluster. We demonstrate the efficiency of the method using different sets of real-world web documents.

I. INTRODUCTION

The number of documents available digitally on corporate Intranets is continuously increasing. This makes it more and more difficult for company employees to manually organise, sort and retrieve documents located on their corporate Intranet. In this paper we address this issue by proposing a method that can autonomously organise documents, using a series of independently trained Self-Organising Maps (SOM) to automatically cluster unstructured web documents. Features, words and terms will be used interchangeably throughout. In section II we provide a brief introduction to document clustering.
Section III gives a description and review of previous work on the SOM and its variations applied to the document-clustering problem. In section IV we introduce the proposed method for document clustering, together with some experimental results. Finally, in section V we conclude and describe future directions of this work.

II. DOCUMENT CLUSTERING

Document clustering involves automatically grouping related documents together. This is done without any prior training or external knowledge, and is purely based on inferring suitable classes for a given set of documents. A typical document clustering pre-processing phase uses the Vector Space Model (VSM) [1]. There are two main phases: parsing and indexing. Parsing turns the text documents into a succession of words. The words are filtered using a basic "stop list" of common English words. This is used to discard words with little information, such as "the", "because" and "it", that do not significantly contribute to discriminating between documents. A plural stemming or suffix stemming algorithm is then applied to the remaining words [2]. In the indexing phase, each document is represented in the VSM, where the frequency of occurrence of each word (or term) in each document is recorded in a documents-versus-terms matrix. These values are then generally weighted using the term frequency multiplied by the inverse document frequency, as shown in equation (1). This gives the less frequent terms more weight than the more frequent ones. Following Shannon's information theory, the less frequent the word, the more information value it possesses; this permits a better coverage of the input documents.
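The parsing and indexing phases described above, including the TF-IDF weighting of equation (1), can be sketched as follows. This is a minimal illustration, not the paper's implementation: the stop list is a toy subset, the suffix stemmer is a naive stand-in for the stemming algorithm of [2], and the sample corpus is invented.

```python
import math
import re

# Toy stop list; a real system would use a much fuller list of common English words.
STOP_WORDS = {"the", "because", "it", "a", "an", "and", "of", "to", "in", "is", "on"}

def stem(word):
    """Naive suffix stemmer (illustrative only, not a full stemming algorithm)."""
    for suffix in ("ing", "es", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def parse(text):
    """Parsing phase: lowercase, split into words, drop stop words, then stem."""
    words = re.findall(r"[a-z]+", text.lower())
    return [stem(w) for w in words if w not in STOP_WORDS]

def tfidf_matrix(documents):
    """Indexing phase: build the documents-versus-terms matrix,
    weighted as W_ij = tf_ij * log(N / df_j), i.e. equation (1)."""
    docs = [parse(d) for d in documents]
    n = len(docs)
    vocab = sorted({t for d in docs for t in d})
    df = {t: sum(1 for d in docs if t in d) for t in vocab}
    return vocab, [[d.count(t) * math.log(n / df[t]) for t in vocab] for d in docs]

# Invented three-document corpus for illustration.
docs = ["The cat sat on the mat", "The dog chased the cat", "Dogs and cats"]
vocab, w = tfidf_matrix(docs)
```

Note that a term occurring in every document (here "cat") receives weight zero, reflecting the intuition that ubiquitous terms carry no discriminating information.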
W_ij = tf_ij × log(N / df_j)    (1)

where:
W_ij – weight of term t_j in document d_i
tf_ij – frequency of term t_j in document d_i
N – total number of documents in the collection
df_j – number of documents containing term t_j

Once the set of document vectors has been created, we can use hierarchical or non-hierarchical techniques from cluster analysis. Hierarchical clustering places documents into a hierarchical structure that is built dynamically. Non-hierarchical methods, or flat partitioning, divide documents into a set of flat clusters. Document clusters are usually created based on a pre-defined criterion or an error measure between the documents. The most common measures used for evaluating the similarity of two documents are the Manhattan distance, the Euclidean distance and the cosine correlation. These measures are then used with the chosen clustering method, for example hierarchical methods such as the Single-Link, Complete-Link or Group Average Link methods [3]. The SOM offers a "neural" alternative to clustering, with an additional topology-preserving property. We shall now introduce the general SOM algorithm and then proceed with some of the previous work on document clustering using the SOM.

III. RELATED WORK

1) Introduction to the SOM

The Self-Organising Map (SOM), first introduced by Kohonen [4], is one of the most widely applied Artificial Neural Networks (ANN). It has successfully been used in a variety of