Identification of Groupings of Graph Theoretical Molecular Descriptors Using a Hybrid
Cluster Analysis Approach
Stavros L. Taraviras,†,‡ Ovidiu Ivanciuc,*,§,| and Daniel Cabrol-Bass*,†,⊥
Chimiometry and Molecular Modeling Group, Arômes, Synthèses et Interactions Lab,
University of Nice-Sophia Antipolis, 28 Avenue Valrose, 06108 Nice Cedex 2, France, and Department of
Organic Chemistry, Faculty of Chemical Technology, University “Politehnica” of Bucharest,
Oficiul 12 CP 243, 78100 Bucharest, Romania
Received December 16, 1999
A large number of structural molecular descriptors of various forms have been proposed and
tested over the years. Very often, different descriptors represent more or less the same aspects of molecular
structure and thus have diminished discriminating power for identifying the different structural
features that might contribute to the molecular property or activity of interest. It is therefore essential that
noncorrelated descriptors be employed to ensure the widest and least inflated possible coverage of
chemical space. The usual approach for reducing the number of descriptors and employing noncorrelated
(or orthogonal) descriptors involves principal component analysis (PCA) or other factor analytical techniques.
In this work we present an approach for determining relationships (groupings) among 240 graph-theoretical
descriptors, as a means of selecting nonredundant ones, based on the application of cluster analysis (CA).
To remove the inherent biases and particularities of different CA algorithms, several clustering solutions
obtained with these algorithms were “hybridized” to obtain a reliable overall solution describing how the
interrelationships within the data are structured. The calculated correlation coefficients between descriptors
were used as a reference for discussing the different CA methods employed, and the resulting clusters
of descriptors were statistically analyzed to derive the intercorrelations between the different operators,
weighting schemes, and matrices used for the computation of these descriptors.
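As an illustration of the general idea of grouping descriptors by their intercorrelation, the following minimal sketch clusters a few toy descriptors using complete-linkage agglomerative clustering on the distance 1 - |r|, where r is the Pearson correlation coefficient. It is not the hybrid procedure of this work: the function names, the toy values, the choice of linkage, and the `max_distance` threshold are all assumptions made for the example.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def cluster_descriptors(columns, max_distance=0.1):
    """Complete-linkage agglomerative clustering of descriptors.

    columns: dict mapping a descriptor name to its values over the same molecules.
    The distance between two descriptors is 1 - |r|. Merging stops once the
    closest pair of clusters is farther apart than max_distance (assumed cutoff).
    """
    names = list(columns)
    dist = {(a, b): 1.0 - abs(pearson(columns[a], columns[b]))
            for a in names for b in names if a != b}
    clusters = [[name] for name in names]

    def linkage(c1, c2):
        # Complete linkage: the largest pairwise distance between two clusters.
        return max(dist[(a, b)] for a in c1 for b in c2)

    while len(clusters) > 1:
        d, i, j = min((linkage(clusters[i], clusters[j]), i, j)
                      for i in range(len(clusters))
                      for j in range(i + 1, len(clusters)))
        if d > max_distance:
            break
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters


# Hypothetical descriptor values for five molecules (not taken from the paper).
toy = {
    "D1": [1.0, 2.0, 3.0, 4.0, 5.0],
    "D2": [2.0, 4.0, 6.0, 8.0, 10.5],      # nearly proportional to D1
    "D3": [5.0, 1.0, 4.0, 2.0, 3.0],
    "D4": [-5.1, -1.0, -4.0, -2.0, -3.0],  # nearly the negative of D3
}
groups = cluster_descriptors(toy)
print(groups)  # [['D1', 'D2'], ['D3', 'D4']]
```

Using |r| rather than r places strongly anticorrelated descriptors in the same group, since they carry essentially the same information. One descriptor per group could then be retained as a nonredundant representative.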
INTRODUCTION
Molecular Diversity and Molecular Descriptors. Combinatorial
chemistry was introduced, as an alternative to the
rational molecular design approach, to allow the synthesis
of a wide range of specific and diverse compounds. These
compounds are then screened quickly and reliably against a
plethora of biological, or other, targets as part of the
molecular discovery process. Under this rationale, it would
seem to be the ideal way to identify new lead structures if
the molecules synthesized and the ones chosen for screening
represent an evenly distributed cross-section of all potentially
bioactive (for the case of pharmaceutical compounds)
structures. This requirement is the basis for the concept of
molecular diversity. The obvious approach is to characterize
the chemical structures in terms of molecular descriptors
based on their physicochemical or biological properties, their
substructures, or any other possibly relevant and calculable
features. Then one needs only to require that the ranges and
occurrences of these various features adequately describe the
universe of molecules, that is, cover the chemical space of
interest. This is the foundation of the similar property
principle, which states that chemically (structurally) similar
molecules are often expected to exhibit similar biological
and physicochemical profiles.¹ The standard procedure
followed involves the computation of molecular descriptors
selected from classes of quantifiable two-dimensional (2D)
and three-dimensional (3D) molecular parameters. Examples
of the former include atom pairs, molecular fingerprints,
and topological indices, whereas examples of the latter
include receptor recognition features, molecular shape, and
pharmacophores.²⁻¹⁰ Generally, the 2D structural descriptors
are considered to capture most of the essential information
about the chemical structure, and in most of the comparative
studies conducted thus far they have been found to perform
better than, or at least as well as, the more complicated and
resource-demanding 3D descriptors.³⁻¹⁶ In addition, the 2D
molecular structure is the essential basis of chemical synthesis
strategies, both chemically and conceptually, and the same
applies to patent claims,
where the molecules are treated as 2D objects. For assessing
the diversity of a collection of pharmacologically important
compounds, we must bear in mind that these compounds
have certain characteristics which facilitate their participation
in 3D physiological/biochemical events, namely, they are
small in size and molecular weight and they do not contain
“exotic” compounds, properties which go along well with
the 2D molecular description.
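To make the notion of structural similarity between 2D descriptions concrete, the sketch below compares hypothetical molecular fingerprints, each represented as the set of its "on" bit positions. The Tanimoto coefficient used here is a common similarity measure for fingerprints but is an assumption of this example, as are the function name and the bit values; the text above does not prescribe a particular measure.

```python
def tanimoto(fp_a, fp_b):
    """Tanimoto coefficient between two fingerprints given as sets of on-bits."""
    a, b = set(fp_a), set(fp_b)
    if not a and not b:
        return 1.0  # convention assumed here for two empty fingerprints
    return len(a & b) / len(a | b)


# Hypothetical fingerprints (on-bit positions), not derived from real molecules.
fp1 = {1, 4, 7, 9}
fp2 = {1, 4, 7, 12}
fp3 = {2, 3, 5}

print(tanimoto(fp1, fp2))  # 0.6 (three shared bits out of five distinct bits)
print(tanimoto(fp1, fp3))  # 0.0 (no shared bits)
```

Under the similar property principle, a pair such as fp1 and fp2 would be expected to show more similar property profiles than fp1 and fp3, and a diverse collection would be one in which most pairwise similarities are low.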
Topological Indices. The topological indices (TIs) are 2D
molecular descriptors whose values are associated with the
structural constitution of a chemical compound. They are
computed directly from the molecular graph of the particular
compound, and they offer several important advantages: they
* Corresponding author.
† University of Nice-Sophia Antipolis.
‡ Present address: SGI (Silicon Graphics), European HQ, 1530 Arlington Business Park, Reading, RG7 4SB, U.K.
§ University “Politehnica” of Bucharest.
| E-mail: o•ivancicu@chim.upb.ro.
⊥ Telephone: +33 (0)4 92 07 61 20. E-mail: cabrol@unice.fr.
1128 J. Chem. Inf. Comput. Sci. 2000, 40, 1128-1146
10.1021/ci990149y CCC: $19.00 © 2000 American Chemical Society
Published on Web 08/26/2000