World Applied Sciences Journal 29 (Data Mining and Soft Computing Techniques): 53-59, 2014
ISSN 1818-4952
© IDOSI Publications, 2014
DOI: 10.5829/idosi.wasj.2014.29.dmsct.10

Visualization of Chemical Space Using Principal Component Analysis

B. Firdaus Begam and J. Satheesh Kumar
Department of Computer Applications, School of Computer Science and Engineering, Bharathiar University, Coimbatore, Tamilnadu, India

Corresponding Author: B. Firdaus Begam, Department of Computer Applications, School of Computer Science and Engineering, Bharathiar University, Coimbatore, Tamilnadu, India.

Abstract: Principal component analysis (PCA) is one of the multivariate methods most widely used by chemists to visualize chemical space in new dimensions when analysing data. In multivariate data analysis, the relationships among variables with a large number of characteristics can be considered. PCA provides a compact view of the variation in a chemical data matrix, which helps in creating a better Quantitative Structure Activity Relationship (QSAR) model. It highlights the dominating pattern in the matrix through principal components and graphical representation. This paper focuses on the mathematical aspects of principal components and on the role of PCA in identifying dominant hidden patterns of drug-likeness in the Maybridge dataset, based on Lipinski's rule of five (RO5).

Key words: Principal Component Analysis, Load plot, Score plot, Biplot

INTRODUCTION

Principal Component Analysis (PCA) is a multivariate statistical approach for analysing data in a lower-dimensional space. PCA uses a vector space transformation to view datasets from a higher-dimensional space in a lower-dimensional space. PCA is an application of linear algebra that was first formulated by Karl Pearson in 1901 [1]. It has since been rediscovered in many diverse scientific fields by Fisher and MacKenzie [2], Wold [3] and Hotelling [4]. In 1960, Malinowski adopted PCA for chemical applications, followed later by many other chemists [5]. PCA is one of the multivariate methods and is a member of the multidimensional factorial methods [6].

According to Jolliffe, “The central idea of PCA is to reduce the dimensionality of a data set consisting of a large number of interrelated variables, while retaining as much as possible of the variation present in the data set. This is achieved by transforming to a new set of variables, the principal components (PCs), which are uncorrelated and which are ordered so that the first few retain most of the variation present in all of the original variables” [7].

The goals of PCA include simplification, prediction, redundancy removal, feature extraction, un-mixing and data compression. PCA gives a simplified view of larger datasets by building a new model of selected objects and variables with minimal loss of information. It can also be used to build prediction models, since PCA acts as an exploratory tool for data analysis by calculating the variance among uncorrelated variables [8].

Chemical space (drug space) is defined by the descriptors calculated for each molecule [2] and stored in a multidimensional space. Chemical space can be visualized in a lower-dimensional space based on principal components [9]. PCA is an effective way to analyse this space and identify hidden or dominant patterns among drug-like molecules.

Importance of PCA: Data analysis through bivariate analysis provides the correlation between two variables (descriptors) X_i and Y_i. The pairwise correlation between descriptors is represented by the Pearson correlation coefficient (r), which lies between -1 and +1, as in eq. (1). The degree of dependency or redundancy is analysed through this range.
A correlation coefficient of +1 indicates that the variables are perfectly positively correlated, a coefficient of -1 indicates that they are perfectly negatively correlated, and a coefficient of zero indicates that they are not linearly correlated. The correlations among the variables are represented through scatter plots [10].
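As an illustration of the pairwise correlation described above, the following minimal Python sketch computes the Pearson coefficient from the standard sample definition (the covariance of the two variables divided by the product of their standard deviations). It assumes that this is the form intended by eq. (1). The function name `pearson_r` and the descriptor values are hypothetical and serve only as an example; they are not taken from the Maybridge dataset.

```python
import numpy as np


def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length 1-D samples."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    dx = x - x.mean()  # deviations of X from its mean
    dy = y - y.mean()  # deviations of Y from its mean
    # Sum of cross-products divided by the product of the root sums of squares.
    return float((dx * dy).sum() / np.sqrt((dx ** 2).sum() * (dy ** 2).sum()))


# Hypothetical descriptor values for five molecules (illustrative only).
mol_weight = np.array([180.2, 250.3, 310.4, 410.5, 480.6])
log_p = np.array([1.2, 2.1, 2.9, 4.2, 4.8])

r_pos = pearson_r(mol_weight, log_p)   # descriptors rise together: r close to +1
r_neg = pearson_r(mol_weight, -log_p)  # flipping one descriptor flips the sign of r

# NumPy's built-in correlation matrix gives the same pairwise value off the diagonal.
r_matrix = np.corrcoef(np.vstack([mol_weight, log_p]))

print(f"r = {r_pos:.3f}, sign-flipped r = {r_neg:.3f}")
```

With many descriptors, the same `np.corrcoef` call returns the full matrix of pairwise coefficients, which is one way to inspect redundancy among descriptors before applying PCA.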