Improved Semantic Mapping and SOM applied to Document Organization
Renato Fernandes Corrêa¹, Teresa Bernarda Ludermir¹
¹Center of Informatics, Federal University of Pernambuco
P.O. Box 7851, Cidade Universitária, Recife - PE, Brazil, 50.732-970
{rfc, tbl}@cin.ufpe.br
Abstract
This paper presents and evaluates a new version of
the Semantic Mapping method applied to the
construction of a hybrid document organization system
based on Self-Organizing Maps. The hybrid system
uses reduced document vectors generated by Semantic
Mapping to train the SOM, thus reducing the
training time without compromising the quality of the
document map. We test the hybrid system with different
versions of Semantic Mapping and the corresponding
SOM system in document organization of the
Reuters-21578 v1.0 and 20 Newsgroups collections. The
performance of the systems was measured in terms of
text categorization effectiveness on the test set and
training time. The experimental results show that the
new version of Semantic Mapping is superior to the
previous one.
1. Introduction
The growing volume of document collections has
increased the need for document organization systems
that are fast to train and effective, able to organize
collections automatically and to support knowledge
discovery over them.
The self-organizing map (SOM) [1] represents
high-dimensional spaces in a two-dimensional or
one-dimensional fashion, preserving the most important
neighborhood relationships between vectors.
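The idea can be illustrated with a minimal sketch of SOM training in Python. This is not the implementation used in this paper: the function names (best_matching_unit, train_som), the rectangular grid, the Gaussian neighborhood, and the linear decay of the learning rate and radius are assumptions chosen for illustration only.

```python
import math
import random


def best_matching_unit(weights, x):
    """Return the grid position of the unit whose weight vector is closest to x."""
    return min(
        weights,
        key=lambda u: sum((wi - xi) ** 2 for wi, xi in zip(weights[u], x)),
    )


def train_som(vectors, rows, cols, epochs=20, lr0=0.5, sigma0=None, seed=0):
    """Train a small rectangular SOM; returns a dict {(row, col): weight vector}."""
    rng = random.Random(seed)
    dim = len(vectors[0])
    sigma0 = sigma0 or max(rows, cols) / 2.0
    # Each map unit starts with a random weight vector in [0, 1).
    weights = {
        (r, c): [rng.random() for _ in range(dim)]
        for r in range(rows)
        for c in range(cols)
    }
    total_steps = epochs * len(vectors)
    step = 0
    for _ in range(epochs):
        order = list(vectors)
        rng.shuffle(order)
        for x in order:
            # Learning rate and neighborhood radius shrink linearly over time.
            frac = step / total_steps
            lr = lr0 * (1.0 - frac)
            sigma = max(sigma0 * (1.0 - frac), 0.5)
            bmu = best_matching_unit(weights, x)
            for (r, c), w in weights.items():
                # Units close to the winner on the grid are pulled harder
                # toward x, which is what preserves neighborhood relations.
                grid_d2 = (r - bmu[0]) ** 2 + (c - bmu[1]) ** 2
                h = math.exp(-grid_d2 / (2.0 * sigma ** 2))
                for i in range(dim):
                    w[i] += lr * h * (x[i] - w[i])
            step += 1
    return weights


if __name__ == "__main__":
    # Toy data: two groups of 3-dimensional vectors mapped onto a 1 x 4 map.
    data = [[0.0, 0.1, 0.0], [0.1, 0.0, 0.1], [1.0, 0.9, 1.0], [0.9, 1.0, 0.9]]
    som = train_som(data, rows=1, cols=4)
    for x in data:
        print(x, "->", best_matching_unit(som, x))
```

In this sketch every training step visits all map units and all vector dimensions, so the cost of training grows with the map size, the number of documents, and the dimensionality of the document vectors.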
Owing to this property, SOM-based document
organization systems have been proposed and shown to
be useful for archiving unstructured text
collections, allowing graphical visualization of the
document space, browsing, and searching for a specific
item [2].
A problem that hinders the application of SOM to
the organization of large document collections is its
computational time complexity; consequently, the major
drawback of SOM-based systems is the large amount of
time and resources required to train the document
map.
In the specific case of the SOM algorithm, this problem
has been addressed by the WEBSOM project [2]. The
WEBSOM methodology scales well even to very large
datasets owing to (a) a fast dimensionality reduction
method called Random Mapping [2], (b) shortcuts in the
computation of the SOM algorithm, and (c) a method for
estimating large maps from small trained ones,
progressively increasing the size of the SOM archive.
In WEBSOM, efforts to reduce the size of the
document-by-term matrix have concentrated mainly on
the number of dimensions, using Random Mapping to
reduce the dimensionality of the row vectors
representing documents.
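A rough sketch of this kind of dimensionality reduction is given below. It is a generic Gaussian random projection, not necessarily the exact Random Mapping variant used in WEBSOM: the function names (random_mapping_matrix, reduce_document), the Gaussian entries, and the unit-length normalization are assumptions made for illustration.

```python
import math
import random


def random_mapping_matrix(n_terms, n_reduced, seed=0):
    """Assign each term a random unit-length vector in the reduced space."""
    rng = random.Random(seed)
    matrix = []
    for _ in range(n_terms):
        row = [rng.gauss(0.0, 1.0) for _ in range(n_reduced)]
        norm = math.sqrt(sum(v * v for v in row)) or 1.0
        matrix.append([v / norm for v in row])
    return matrix


def reduce_document(doc_vector, matrix):
    """Project one row of the document-by-term matrix to the reduced space.

    The reduced vector is the sum of the random term vectors, each weighted
    by the corresponding term weight in the document.
    """
    n_reduced = len(matrix[0])
    reduced = [0.0] * n_reduced
    for weight, term_vector in zip(doc_vector, matrix):
        if weight:  # skip terms absent from the document
            for j in range(n_reduced):
                reduced[j] += weight * term_vector[j]
    return reduced


if __name__ == "__main__":
    # Hypothetical document with 6 term weights, reduced to 3 dimensions.
    R = random_mapping_matrix(n_terms=6, n_reduced=3)
    doc = [1.0, 0.0, 2.0, 0.0, 0.0, 3.0]
    print(reduce_document(doc, R))
```

The projection depends only on the random seed and not on the content of the collection, which makes it cheap to compute but non-deterministic across seeds.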
A document map [2] consists of a SOM trained
with the document vectors. The quality of a document
map is strongly influenced by the document
representation and dimensionality reduction methods:
if the semantic similarity between documents is clearly
expressed by the similarity between document vectors,
better-quality document maps are generated. This
observation motivated the proposal of a feature
extraction method called Semantic Mapping
(SM) [3][4].
The combination of SM and SOM yields a hybrid
system that outperforms SOM systems using Random
Mapping and performs close to SOM systems using
Principal Component Analysis (PCA) in text
categorization [3][4][5].
The objective of this paper is to present and
evaluate an improved version of the Semantic Mapping
method. This version of SM is better than the previous
one because it is deterministic and improves the
performance of the hybrid system in document
organization.
The experiments are carried out with the
Reuters-21578 v1.0 and 20 Newsgroups collections. The
performance of the hybrid systems using Semantic
Eighth International Conference on Hybrid Intelligent Systems
978-0-7695-3326-1/08 $25.00 © 2008 IEEE
DOI 10.1109/HIS.2008.107