Improved Semantic Mapping and SOM applied to Document Organization

Renato Fernandes Corrêa 1, Teresa Bernarda Ludermir 1
1 Center of Informatics, Federal University of Pernambuco
P.O. Box 7851, Cidade Universitária, Recife - PE, Brazil, 50.732-970
{rfc, tbl}@cin.ufpe.br

Abstract

This paper presents and evaluates a new version of the Semantic Mapping method applied to the construction of a hybrid document organization system based on Self-Organizing Maps. The hybrid system uses reduced document vectors generated by Semantic Mapping to train the SOM, thus reducing the training time without compromising the quality of the document map. We test the hybrid system with different versions of Semantic Mapping, together with the corresponding SOM system, on document organization of the Reuters-21578 v1.0 and 20 Newsgroup collections. The performance of the systems was measured in terms of text categorization effectiveness on the test set and training time. The experimental results show that the new version of Semantic Mapping is superior to the previous one.

1. Introduction

The large volume of today's document collections has increased the need for fast-trainable and effective document organization systems that organize collections automatically and allow knowledge discovery over them.

The self-organizing map (SOM) [1] represents high-dimensional spaces in a two-dimensional or one-dimensional fashion, preserving the most important neighborhood relationships between vectors. Owing to this property, SOM-based document organization systems have been proposed and shown to be relevant for archiving unstructured text collections, allowing graphical visualization of the document space, browsing, and searching for a specific item [2].
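To make the neighborhood-preserving behavior of the SOM concrete, the following is a minimal sketch of the classic online SOM update on a small rectangular grid. It is not the implementation used in this paper: the function names (`train_som`, `best_matching_unit`), the linear decay schedules, the Gaussian neighborhood, and the toy four-dimensional "document vectors" are all assumptions chosen for illustration.

```python
import math
import random


def best_matching_unit(weights, x):
    """Return the grid position of the unit whose weights are closest to x."""
    return min(
        weights,
        key=lambda u: sum((w - v) ** 2 for w, v in zip(weights[u], x)),
    )


def train_som(vectors, rows, cols, epochs=20, lr0=0.5, seed=0):
    """Train a rows x cols SOM with the online update rule (illustrative)."""
    rng = random.Random(seed)
    dim = len(vectors[0])
    sigma0 = max(rows, cols) / 2.0  # assumed initial neighborhood radius
    # One weight vector per map unit, initialised uniformly in [0, 1).
    weights = {
        (r, c): [rng.random() for _ in range(dim)]
        for r in range(rows)
        for c in range(cols)
    }
    total_steps = epochs * len(vectors)
    step = 0
    for _ in range(epochs):
        order = list(vectors)
        rng.shuffle(order)
        for x in order:
            # Learning rate and radius shrink linearly (an assumed schedule).
            decay = 1.0 - step / total_steps
            lr = lr0 * decay
            sigma = max(sigma0 * decay, 0.5)
            bmu = best_matching_unit(weights, x)
            for (r, c), w in weights.items():
                # Units close to the winner on the grid move more toward x.
                grid_d2 = (r - bmu[0]) ** 2 + (c - bmu[1]) ** 2
                h = math.exp(-grid_d2 / (2.0 * sigma ** 2))
                for i in range(dim):
                    w[i] += lr * h * (x[i] - w[i])
            step += 1
    return weights


# Toy data: two groups of similar vectors standing in for document vectors.
docs = [
    [0.9, 0.8, 0.0, 0.1],
    [1.0, 0.7, 0.1, 0.0],
    [0.0, 0.1, 0.9, 1.0],
    [0.1, 0.0, 0.8, 0.9],
]
som = train_som(docs, rows=3, cols=3)
placements = [best_matching_unit(som, d) for d in docs]
```

Each document is placed on the map at its best-matching unit. Every training step visits all map units and all vector components, which is why the dimensionality of the document vectors has a direct effect on training time.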
A problem that hinders the application of SOM to the organization of large document collections is its computational time complexity; consequently, the major drawback of SOM-based systems is the large amount of time and resources required to train the document map. For the SOM algorithm in particular, this problem has been addressed by the WEBSOM project [2]. The WEBSOM methodology scales up well even to very large datasets thanks to (a) a fast dimensionality reduction method called Random Mapping [2], (b) shortcuts in the computation of the SOM algorithm, and (c) a method to estimate large maps from trained small ones, progressively increasing the size of the SOM archive. In WEBSOM, efforts to reduce the size of the document-by-term matrix have concentrated mainly on the number of dimensions, using Random Mapping to reduce the dimensionality of the row vectors representing documents.

A document map [2] consists of a SOM trained with the document vectors. The quality of a document map is strongly influenced by the document representation and the dimensionality reduction method: if the semantic similarity between documents is clearly expressed by the similarity between document vectors, then better-quality document maps are generated. This observation motivated the proposal of a feature extraction method called Semantic Mapping (SM) [3][4]. The combination of SM and SOM yields a hybrid system that outperforms SOM systems using Random Mapping and performs close to SOM systems using Principal Component Analysis (PCA) in text categorization [3][4][5].

The objective of this paper is to present and evaluate an improved version of the Semantic Mapping method. This version of SM is better than the previous one because it is deterministic and improves the performance of the hybrid system in document organization. The experiments are carried out with the Reuters-21578 v1.0 and 20 Newsgroup collections.
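As an illustration of the Random Mapping step mentioned above in connection with WEBSOM, the sketch below projects a term-weight document vector onto a lower-dimensional space using a random matrix whose columns are normalized to unit length. This follows the commonly described form of random mapping and may differ from the exact variant used in WEBSOM; the function names (`random_projection_matrix`, `random_map`), the Gaussian sampling, and the toy sizes are assumptions.

```python
import math
import random


def random_projection_matrix(n_terms, n_dims, seed=0):
    """Return one random unit-length column of size n_dims per term."""
    rng = random.Random(seed)
    columns = []
    for _ in range(n_terms):
        col = [rng.gauss(0.0, 1.0) for _ in range(n_dims)]
        norm = math.sqrt(sum(v * v for v in col))
        columns.append([v / norm for v in col])
    return columns


def random_map(doc_vector, columns):
    """Project a term-weight vector: the weighted sum of its terms' columns."""
    n_dims = len(columns[0])
    reduced = [0.0] * n_dims
    for weight, col in zip(doc_vector, columns):
        if weight:  # skip absent terms; document vectors are mostly zeros
            for i in range(n_dims):
                reduced[i] += weight * col[i]
    return reduced


# Toy example: 6 terms reduced to 3 dimensions.
columns = random_projection_matrix(n_terms=6, n_dims=3)
reduced = random_map([0.0, 2.0, 0.0, 0.0, 1.0, 0.0], columns)
```

Because the columns are drawn at random, two runs with different seeds give different reduced vectors for the same document, which is the kind of non-determinism that a deterministic feature extraction method avoids.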
Eighth International Conference on Hybrid Intelligent Systems. 978-0-7695-3326-1/08 $25.00 © 2008 IEEE. DOI 10.1109/HIS.2008.107

The performance of the hybrid systems using Semantic