World Applied Sciences Journal 29 (Data Mining and Soft Computing Techniques): 53-59, 2014
ISSN 1818-4952
© IDOSI Publications, 2014
DOI: 10.5829/idosi.wasj.2014.29.dmsct.10
Corresponding Author: B. Firdaus Begam, Department of Computer Applications,
School of Computer Science and Engineering, Bharathiar University, Coimbatore, Tamilnadu, India.
Visualization of Chemical Space Using Principal Component Analysis
B. Firdaus Begam and J. Satheesh Kumar
Department of Computer Applications, School of Computer Science and Engineering,
Bharathiar University, Coimbatore, Tamilnadu, India
Abstract: Principal component analysis (PCA) is one of the multivariate methods most widely used by chemists
to visualize chemical space in new dimensions for data analysis. In multivariate data analysis, the
relationships among many variables (characteristics) can be considered together. PCA provides a
compact view of the variation in a chemical data matrix, which helps in creating better Quantitative
Structure-Activity Relationship (QSAR) models. It highlights the dominating patterns in the matrix through
principal components and graphical representation. This paper focuses on the mathematical aspects of
principal components and on the role of PCA on the Maybridge dataset in identifying dominating hidden
patterns of drug-likeness based on Lipinski's rule of five (RO5).
Key words: Principal Component Analysis, Load plot, Score plot, Biplot
INTRODUCTION

Principal Component Analysis (PCA) is a multivariate statistical approach for analysing data in a
lower-dimensional space. PCA uses a vector space transformation to project datasets from a
higher-dimensional space onto a lower-dimensional space. PCA is an application of linear algebra [2]
and was first introduced by Karl Pearson in 1901 [1]. It has since been rediscovered in many diverse
scientific fields by Fisher and MacKenzie [2], Wolf [3] and Hotelling [4]. In 1960, PCA was taken up by
Malinowski for chemical applications and later by many other chemists [5]. PCA is one of the
multivariate methods and is a member of the multidimensional factorial methods [6].

According to Jolliffe, “The central idea of PCA is to reduce the dimensionality of a data set
consisting of a large number of interrelated variables, while retaining as much as possible of the
variation present in the data set. This is achieved by transforming to a new set of variables, the
principal components (PCs), which are uncorrelated and which are ordered so that the first few retain
most of the variation present in all of the original variables” [7]. The goals of principal component
analysis include simplification, prediction, redundancy removal, feature extraction, un-mixing, data
compression and related tasks. PCA gives a simplified view of large datasets by building a new model
of selected objects and variables with minimal loss of information. It can also be used for building
prediction models, since PCA acts as an exploratory tool for data analysis by calculating the variance
among variables that are uncorrelated [8].

Chemical space (drug space) is defined by the set of descriptors calculated for each molecule and
stored in a multidimensional space. Chemical space can be visualized in a lower-dimensional space based
on principal components [9]. Analysing this space to identify hidden or dominating patterns of
drug-like molecules is done effectively by applying PCA.

Importance of PCA: Bivariate data analysis provides the correlation between two variables/descriptors
X_i and Y_i. The pairwise correlations between descriptors are represented by the Pearson correlation
coefficient (r), which lies between -1 and +1, as in eq. (1). The degree of dependency or redundancy
is analysed through this range. A correlation coefficient of +1 indicates that the variables are
perfectly positively correlated, a coefficient of -1 indicates that they are perfectly negatively
correlated and a coefficient of zero indicates that they are not linearly correlated. The correlations
among the variables are represented through scatter plots [10].
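
The pairwise correlation described above can be illustrated with a short Python sketch. It assumes the standard sample definition of the Pearson coefficient (the covariance of the two descriptors divided by the product of their standard deviations) and uses NumPy. The function name pearson_r and the descriptor values are hypothetical and serve only as an illustration; they are not taken from the Maybridge dataset.

```python
import numpy as np


def pearson_r(x, y):
    """Pearson correlation coefficient r between two 1-D sequences."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    if x.ndim != 1 or x.shape != y.shape or x.size < 2:
        raise ValueError("x and y must be 1-D sequences of equal length >= 2")
    # Deviations of each descriptor from its mean.
    dx = x - x.mean()
    dy = y - y.mean()
    denom = np.sqrt((dx ** 2).sum() * (dy ** 2).sum())
    if denom == 0.0:
        raise ValueError("r is undefined when a descriptor is constant")
    return float((dx * dy).sum() / denom)


# Hypothetical descriptor values for five molecules (illustrative only).
mol_weight = [180.2, 250.3, 310.4, 405.5, 480.6]
log_p = [1.1, 1.9, 2.4, 3.6, 4.2]

r = pearson_r(mol_weight, log_p)                # close to +1 for this made-up data
r_pos = pearson_r([1, 2, 3, 4], [2, 4, 6, 8])   # perfectly positive relation
r_neg = pearson_r([1, 2, 3, 4], [8, 6, 4, 2])   # perfectly negative relation
print(r, r_pos, r_neg)
```

For a full descriptor matrix, np.corrcoef(data, rowvar=False) returns all pairwise coefficients at once; descriptor pairs with |r| close to 1 carry largely redundant information, which is the redundancy that PCA removes.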