Browsing Large Scale Cheminformatics Data with Dimension Reduction

Judy Qiu 1, Jong Youl Choi 1,2, Seung-Hee Bae 1,2, Thilina Gunarathne 1,2, Geoffrey Fox 1,2, Bin Cao 2, David Wild 2
1 Pervasive Technology Institute, 2 School of Informatics and Computing, Indiana University, Bloomington, IN, U.S.A.
{xqiu, jychoi, sebae, gcf}@indiana.edu

The SALSA project (salsahpc.indiana.edu) investigates new programming models for parallel multicore computing and Cloud/Grid computing. It aims to develop and apply parallel and distributed cyberinfrastructure to support large-scale data analysis. We demonstrate this with a life sciences project and present PubChemBrowse, a customized visualization tool for cheminformatics research.

A tool for visualizing large-scale, high-dimensional data is highly valuable for scientific discovery in many fields. We present a novel 3D data point browser that displays complex properties of massive data on commodity clients. Much as GIS browsers do for Earth and environmental data, it lets users navigate a space in which chemical compounds with similar properties lie near one another. PubChemBrowse is built around in-house high-performance parallel MDS (Multi-Dimensional Scaling) and GTM (Generative Topographic Mapping) [1] [2] services and supports fast interaction with an external biochemical repository database. We provide robust deterministic annealing, and interpolation for adding additional points. The browser will scale up to the 60 million points of the full NIH PubChem.

We demonstrate the following key features of PubChemBrowse:
i) A lightweight 3D data visualization client for browsing large (a few million points), high-dimensional data, backed by high-performance cloud technology [3]. It displays various kinds of metadata as extra information.
ii) Online data fetching by connecting to a remote external system, Chem2Bio2RDF [4], which is an integrated repository of chemogenomic and systems chemical biology data.
iii) Research results for drug discovery, obtained by mining cause-effect relationships between large numbers of chemical compounds and diseases.

Figure 1. Architecture of PubChemBrowse

With PubChemBrowse, one can easily identify points of interest by color or select a group of points distinguished by its structural distribution in 3D space. Additional functions include browsing the data by rotating, zooming, or panning the 3D space to search for details. Dynamically updating point labels and adding new data points are supported by sending online SPARQL queries to the Chem2Bio2RDF system. With our tool, researchers can easily browse very large datasets.

We have developed parallel MDS and GTM algorithms [1] [2] to visualize large, high-dimensional data. As shown in Figure 2, we processed 0.1 million PubChem data points with 166 dimensions and used parallel interpolation algorithms to speed up the process for up to 2M PubChem points.
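
The idea of mapping a sample with MDS and then interpolating the remaining points into the same 3D space can be illustrated with a minimal Python sketch. This is not the parallel MDS/GTM implementation behind PubChemBrowse. It assumes classical (eigendecomposition-based) MDS in place of the in-house algorithms, a simple inverse-distance k-nearest-neighbor rule for interpolation, and random binary vectors standing in for 166-dimensional PubChem data. All function names are hypothetical.

```python
import numpy as np


def pairwise_distances(a, b):
    """Euclidean distances between the rows of a and the rows of b."""
    sq = (a * a).sum(axis=1)[:, None] + (b * b).sum(axis=1)[None, :] - 2.0 * a @ b.T
    return np.sqrt(np.maximum(sq, 0.0))  # clip tiny negatives from round-off


def classical_mds(dist, dim=3):
    """Classical MDS: embed points in `dim` dimensions from a distance matrix.

    Stand-in for the paper's parallel MDS (assumption for illustration only).
    """
    n = dist.shape[0]
    centering = np.eye(n) - np.ones((n, n)) / n
    gram = -0.5 * centering @ (dist ** 2) @ centering  # double-centered Gram matrix
    eigvals, eigvecs = np.linalg.eigh(gram)
    top = np.argsort(eigvals)[::-1][:dim]  # indices of the largest eigenvalues
    return eigvecs[:, top] * np.sqrt(np.maximum(eigvals[top], 0.0))


def interpolate(new_points, sample_points, sample_coords, k=5):
    """Place each new point at the inverse-distance-weighted mean of the
    3D coordinates of its k nearest sample points (simplified assumption)."""
    dist = pairwise_distances(new_points, sample_points)
    nearest = np.argsort(dist, axis=1)[:, :k]
    weights = 1.0 / (np.take_along_axis(dist, nearest, axis=1) + 1e-9)
    weights /= weights.sum(axis=1, keepdims=True)
    return (sample_coords[nearest] * weights[:, :, None]).sum(axis=1)


# Synthetic stand-in data: random 166-dimensional binary vectors.
rng = np.random.default_rng(0)
sample = rng.integers(0, 2, size=(200, 166)).astype(float)
extra = rng.integers(0, 2, size=(1000, 166)).astype(float)

# Full MDS only on the small sample, then cheap interpolation for the rest.
sample_coords = classical_mds(pairwise_distances(sample, sample), dim=3)
extra_coords = interpolate(extra, sample, sample_coords, k=5)
print(sample_coords.shape, extra_coords.shape)
```

The design point the sketch tries to convey is cost: the full embedding needs all pairwise distances within the sample, while each interpolated point only needs its distances to the sample, so the second stage can be run independently (and hence in parallel) per point.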