Visual Mapping of Text Collections through a Fast High Precision Projection Technique

Fernando Vieira Paulovich, Luis Gustavo Nonato, Rosane Minghim
Institute of Mathematics and Computer Science
University of São Paulo, São Carlos, SP, Brazil
{paulovic, gnonato, rminghim}@icmc.usp.br

Haim Levkowitz
University of Massachusetts Lowell, USA
haim@cs.um.edu

Abstract

This paper introduces Least Square Projection (LSP), a fast technique for projection of multi-dimensional data onto lower dimensions developed and tested successfully in the context of creation of text maps based on their content. Current solutions are either based on computationally expensive dimension reduction with no proper guarantee of the outcome or on faster techniques that need some sort of post-processing for recovering information lost during the process. LSP is based on least square approximation, a technique originally employed for surface modeling and reconstruction. Least square approximations are capable of computing the coordinates of a set of projected points based on a reduced number of control points with defined geometry. We extend the concept for general data sets. In order to perform the projection, a small number of distance calculations is necessary and no repositioning of the final points is required to obtain a satisfactory precision of the final solution. Textual information is a typically difficult data type to handle, due to its intrinsic dimensionality. We employ document corpora as a benchmark to demonstrate the capabilities of the LSP to group and separate documents by their content with high precision.

1. Introduction

The problem of multi-dimensional projection has been the concern of many researchers, due to the large variety of applications that could make use of a bi-dimensional (2D) positioning by similarity of points defined in a space with many dimensions.
Techniques based on sub-space decomposition suffer from high computational times, while faster projections incur greater information loss, which has to be recovered by repositioning the points, usually with a force-based placement strategy.

The application of projections to textual document mapping and visualization poses an extra challenge, also tackled by many, because vector space representations of texts can easily reach thousands of dimensions, even for data sets with a few hundred documents. However, automatically extracting the meaning of document contents and their relationships to other documents is an application with important ramifications in visual analysis. Our goal is to create a surface-based map of documents that can be explored to find patterns and relevant information within text collections.

This paper presents a solution for multi-dimensional projection that, besides being fast, results in a high precision of final positioning of points, avoiding force-based placement and also avoiding the calculation of the full distance matrix between points in the original space. The technique is capable of grouping related documents as well as separating those groups well, making the display very useful for further user exploration.

The next section reviews the work on text mapping for comparison with the application targeted here. It also highlights the classes of techniques previously employed for the formation of visual representations of text sets. LSP is presented in Section 3. Our results are presented in Section 4. These are followed by conclusions and further developments of the present work.

Proceedings of the Information Visualization (IV’06) 0-7695-2602-0/06 $20.00 © 2006 IEEE
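The central idea stated in the abstract, computing the coordinates of all projected points from a reduced number of control points by least squares, can be illustrated with a short sketch. This is not the authors' implementation: the neighbourhood-averaging constraint, the function name `lsp_sketch`, the parameter `k`, the toy data, and the hand-picked control positions are all assumptions made for illustration, and the brute-force neighbour search used here for brevity is exactly the kind of full distance computation the technique is said to avoid. Section 3 gives the actual formulation.

```python
import numpy as np

def lsp_sketch(X, control_idx, control_pos, k=5):
    """Hypothetical least-squares placement (illustrative assumption only).

    Each point is asked to lie at the mean of its k nearest neighbours,
    and each control point is asked to lie at a prescribed low-D position.
    The combined overdetermined system is solved in the least-squares sense.
    Requires k < number of points.
    """
    n = X.shape[0]

    # Simplification: brute-force squared distances between all pairs.
    sq = (X ** 2).sum(axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * X @ X.T
    np.fill_diagonal(d2, np.inf)           # a point is not its own neighbour
    nbrs = np.argsort(d2, axis=1)[:, :k]   # indices of the k nearest points

    # Neighbourhood rows: y_i - (1/k) * sum_{j in nbrs[i]} y_j = 0
    L = np.eye(n)
    for i in range(n):
        L[i, nbrs[i]] -= 1.0 / k

    # Control rows: y_c = control_pos[c]
    m = len(control_idx)
    C = np.zeros((m, n))
    C[np.arange(m), control_idx] = 1.0

    A = np.vstack([L, C])
    b = np.vstack([np.zeros((n, control_pos.shape[1])), control_pos])
    Y, *_ = np.linalg.lstsq(A, b, rcond=None)
    return Y

# Toy data (assumption): two well-separated groups in 50 dimensions.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(size=(20, 50)),
               rng.normal(size=(20, 50)) + 20.0])

# One control point per group, with 2D positions chosen by hand.
control_idx = np.array([0, 20])
control_pos = np.array([[-5.0, 0.0], [5.0, 0.0]])

Y = lsp_sketch(X, control_idx, control_pos, k=5)
```

In this toy setting the two groups share no neighbours, so the control constraints can be met exactly and `Y[control_idx]` should coincide with `control_pos` up to numerical error. In a realistic setting the control positions would themselves come from some projection of the control points rather than being chosen by hand.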