Distributed Dimension Reduction Algorithms for Widely Dispersed Data ∗

Faisal N. Abu-Khzam †, Nagiza Samatova ‡§, George Ostrouchov ‡, Michael A. Langston †¶, and Al Geist ‡

∗ Research sponsored by the Laboratory Directed Research and Development Program of Oak Ridge National Laboratory (ORNL), managed by UT-Battelle, LLC for the U.S. Department of Energy under Contract No. DE-AC05-00OR22725.
† Department of Computer Science, University of Tennessee, Knoxville, TN 37996–3450.
‡ Computer Science and Mathematics Division, Oak Ridge National Laboratory, P.O. Box 2008, Oak Ridge, TN 37831–6367.
§ Communicating author: samatovan@ornl.gov.
¶ This author's research is supported in part by the National Science Foundation under grants EIA–9972889 and CCR–0075792, by the Office of Naval Research under grant N00014–01–1–0608, and by the Tennessee Center for Information Technology Research under award E01–0178–081.

Abstract

It is well known that information retrieval, clustering and visualization can often be improved by reducing the dimensionality of high dimensional data. Classical techniques offer optimality but are much too slow for extremely large databases. The problem becomes harder yet when data are distributed across geographically dispersed machines. To address this need, an effective distributed dimension reduction algorithm is developed. Motivated by the success of the serial (non-distributed) FastMap heuristic of Faloutsos and Lin, the distributed method presented here is intended to be fast, accurate and reliable. It runs in linear time and requires very little data transmission. A series of experiments is conducted to gauge how the algorithm's emphasis on minimal data transmission affects solution quality. Stress function measurements indicate that the distributed algorithm is highly competitive with the original FastMap heuristic.

Keywords: Data Mining, Distributed Databases, Information Systems, Parallel and Distributed Algorithms

1 Introduction

The points of a set S in a d-dimensional space often belong to an embedded manifold of dimension d′ ≪ d. Classic dimension reduction techniques [3, 8, 5] compute an optimal k-dimensional representation of S for a specified k ≤ d and a given optimality criterion. Techniques related to principal components [3] begin with coordinates of the points, whereas those related to multidimensional scaling [8, 5] begin with a complete set of pairwise distances. All of these require at least quadratic running time, making them reasonable reduction candidates only as long as S is not too large.

The focus of this paper, however, is on the case in which S is of some immense size N, with its elements distributed across a modest number s of locations. This models a variety of timely environments, for example, when massive data sets reside on a number of different, geographically dispersed machines. It is usually impractical or impossible to bring such data sets to a central location. Thus, our main objective is to reduce dimensionality in a way that does not require moving all the data, but only some much smaller representation of the data. A similar approach is taken in [7]. A reduction in dimensionality has been shown to help in data mining and related applications. For example, it can assist in effective data visualization and reveal the way the data are clustered [4, 6].

One of the major challenges researchers face in dealing with massive sets of data is algorithm scalability as the sets grow in size. Algorithms that scale as Ω(N²) or higher quickly become computationally infeasible. Moreover, in parallel and distributed algorithms, the cost of data transmission often dominates the execution time. For these reasons, we seek a distributed dimension reduction algorithm that not only runs in linear or almost-linear time, but also requires as little data communication as possible.
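As a rough illustration of why quadratic methods stop scaling, the sketch below counts the pairwise distances that a method starting from a complete distance set would need for N points, N(N-1)/2 in total. The function names and the figure of 8 bytes per stored distance are assumptions made for this example and do not come from the paper.

```python
def pairwise_distance_count(n: int) -> int:
    """Number of unordered point pairs among n points: n(n-1)/2."""
    return n * (n - 1) // 2


def distance_storage_bytes(n: int, bytes_per_value: int = 8) -> int:
    """Bytes needed to hold every pairwise distance once.

    bytes_per_value = 8 is an assumed double-precision size.
    """
    return pairwise_distance_count(n) * bytes_per_value


if __name__ == "__main__":
    # Growth is quadratic: a 1000-fold increase in n gives
    # roughly a 1,000,000-fold increase in the number of pairs.
    for n in (10**3, 10**6, 10**9):
        pairs = pairwise_distance_count(n)
        size = distance_storage_bytes(n)
        print(f"N = {n:>13,}  pairs = {pairs:>25,}  bytes = {size:>27,}")
```

For a million points this already amounts to about 5 × 10^11 distances, which is the kind of cost a linear-time method avoids.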

Among the various alternatives available, we have chosen for exploitation the attractive FastMap heuristic [2]. It can be interpreted as an approximation to principal components that operates on pairwise distances rather than coordinates. FastMap is a linear-time serial algorithm. Even when data objects (points) are specified only by their d-dimensional coordinates, as they are in our case, FastMap runs in linear time and can serve as a dimension reduction algorithm [6]. We therefore wish to study the potential