Visual Mapping of Text Collections through a Fast High Precision Projection
Technique
Fernando Vieira Paulovich, Luis Gustavo Nonato, Rosane Minghim
Institute of Mathematics and Computer Science
University of São Paulo, São Carlos, SP, Brazil
{paulovic, gnonato, rminghim}@icmc.usp.br
Haim Levkowitz
University of Massachusetts Lowell, USA
haim@cs.um.edu
Abstract
This paper introduces Least Square Projection (LSP),
a fast technique for projecting multi-dimensional data
onto lower dimensions, developed and successfully tested
for creating maps of text collections based on their content.
Current solutions are based either on computationally
expensive dimension reduction with no proper guarantee of
the outcome, or on faster techniques that need some form
of post-processing to recover information lost during
the process. LSP is based on least square approximation,
a technique originally employed for surface modeling and
reconstruction. Least square approximations can compute
the coordinates of a set of projected points from a
reduced number of control points with defined geometry.
We extend the concept to general data sets. Performing
the projection requires only a small number of distance
calculations, and no repositioning of the final points is
needed to obtain satisfactory precision in the final
solution. Textual information is typically a difficult data
type to handle because of its inherently high dimensionality.
We employ document corpora as a benchmark to demonstrate
the ability of LSP to group and separate documents by
their content with high precision.
1. Introduction
The problem of multi-dimensional projection has concerned
many researchers because of the large variety of
applications that could benefit from a bi-dimensional (2D)
placement, by similarity, of points defined in a space with
many dimensions.
Techniques based on sub-space decomposition suffer
from high computational times, while faster projections
lose more information, which then has to be recovered
by repositioning the points, usually with
a force-based placement strategy.
Applying projections to the mapping and visualization of
textual documents poses an extra challenge, also tackled
by many, because vector space representations of texts
can easily reach thousands of dimensions, even for
data sets with only a few hundred documents. Nevertheless,
automatically extracting the meaning of documents and
their relationships to other documents is an application
with important ramifications for visual analysis. Our goal
is to create a surface-based map of documents that can be
explored to find patterns and relevant information within
text collections.
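To make the dimensionality claim concrete, the following minimal sketch
builds a plain term-frequency vector space from a few strings. It is an
illustration only: the function name term_vectors and the naive
whitespace tokenization are assumptions made here, not the
preprocessing used in this paper. Each distinct term becomes one
dimension, so the vector length equals the vocabulary size.

```python
from collections import Counter


def term_vectors(docs):
    """Return (vocabulary, vectors) with one term-frequency vector per document.

    Hypothetical helper for illustration: tokens are lower-cased,
    whitespace-separated words, with no stemming or stop-word removal.
    """
    tokenized = [doc.lower().split() for doc in docs]
    # One dimension per distinct term found anywhere in the collection.
    vocab = sorted({term for tokens in tokenized for term in tokens})
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        # Counter returns 0 for terms that do not occur in this document.
        vectors.append([counts[term] for term in vocab])
    return vocab, vectors


docs = [
    "projection of multi-dimensional data",
    "visual mapping of text collections",
    "least square approximation for surface reconstruction",
]
vocab, vectors = term_vectors(docs)
print(len(docs), "documents,", len(vocab), "dimensions")
```

Even in this toy case the number of dimensions exceeds the number of
documents; with real corpora the vocabulary grows far faster than the
document count.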
This paper presents a solution for multi-dimensional
projection that, besides being fast, positions the final
points with high precision, avoiding both force-based
placement and the calculation of the full distance matrix
between points in the original space. The technique is
capable of grouping related documents and of separating
those groups well, which makes the display useful for
further user exploration.
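The idea summarized in the abstract, recovering the coordinates of all
projected points from a few control points through a least square
approximation, can be illustrated with a small sketch. This is not the
authors' implementation (LSP itself is presented in Section 3). It
assumes one plausible formulation: every point should lie at the
centroid of its k nearest neighbours in the original space, a few
control points are pinned to given 2D positions, and the resulting
overdetermined linear system is solved through its normal equations.
The names lsp_sketch, nearest and solve are hypothetical. The
brute-force neighbour search is used for brevity only; it does compute
all pairwise distances, which the technique described in this paper
avoids.

```python
import math


def nearest(X, i, k):
    """Indices of the k points closest to X[i] (brute force, for brevity)."""
    ranked = sorted((math.dist(X[i], X[j]), j) for j in range(len(X)) if j != i)
    return [j for _, j in ranked[:k]]


def solve(M, B):
    """Solve M P = B by Gauss-Jordan elimination with partial pivoting."""
    n = len(M)
    aug = [M[r][:] + B[r][:] for r in range(n)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(aug[r][c]))
        aug[c], aug[p] = aug[p], aug[c]
        pivot = aug[c][c]
        aug[c] = [v / pivot for v in aug[c]]
        for r in range(n):
            if r != c and aug[r][c] != 0.0:
                f = aug[r][c]
                aug[r] = [vr - f * vc for vr, vc in zip(aug[r], aug[c])]
    return [row[n:] for row in aug]


def lsp_sketch(X, controls, k=3):
    """Place the rows of X in 2D; controls maps point index -> (x, y)."""
    n = len(X)
    rows, rhs = [], []
    # Neighbourhood rows (assumption): p_i - mean(neighbours of i) = 0.
    for i in range(n):
        row = [0.0] * n
        row[i] = 1.0
        for j in nearest(X, i, k):
            row[j] = -1.0 / k
        rows.append(row)
        rhs.append([0.0, 0.0])
    # Control-point rows: p_c = given 2D position.
    for idx, (x, y) in controls.items():
        row = [0.0] * n
        row[idx] = 1.0
        rows.append(row)
        rhs.append([float(x), float(y)])
    # Normal equations (A^T A) P = A^T b give the least squares solution.
    AtA = [[sum(r[a] * r[b] for r in rows) for b in range(n)] for a in range(n)]
    Atb = [[sum(r[a] * t[c] for r, t in zip(rows, rhs)) for c in range(2)]
           for a in range(n)]
    return solve(AtA, Atb)


# Two well-separated groups of five points in a 4-dimensional space.
X = [[0, 0, 0, 0], [1, 0, 0, 0], [0, 1, 0, 0], [0, 0, 1, 0], [0, 0, 0, 1],
     [10, 10, 10, 10], [11, 10, 10, 10], [10, 11, 10, 10],
     [10, 10, 11, 10], [10, 10, 10, 11]]
controls = {0: (0.0, 0.0), 4: (1.0, 1.0), 5: (10.0, 0.0), 9: (11.0, 1.0)}
for point in lsp_sketch(X, controls):
    print([round(v, 3) for v in point])
```

Only the control points need 2D positions in advance; the remaining
points are pulled toward their neighbours by the system itself, so no
force-based repositioning step follows. A practical version would use a
sparse solver rather than dense elimination.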
The next section reviews the work on text mapping for
comparison with the application targeted here. It also
highlights the classes of techniques previously employed
to form visual representations of text sets. LSP
is presented in Section 3 and our results in Section 4,
followed by conclusions and further developments of the
present work.
Proceedings of the Information Visualization (IV’06)
0-7695-2602-0/06 $20.00 © 2006 IEEE