Identification of Groupings of Graph Theoretical Molecular Descriptors Using a Hybrid Cluster Analysis Approach

Stavros L. Taraviras,†,‡ Ovidiu Ivanciuc,*,§,| and Daniel Cabrol-Bass*,†,⊥

Chimiometry and Molecular Modeling Group, Arômes, Synthèses et Interactions Lab, University of Nice-Sophia Antipolis, 28 Avenue Valrose, 06108 Nice Cedex 2, France, and Department of Organic Chemistry, Faculty of Chemical Technology, University “Politehnica” of Bucharest, Oficiul 12 CP 243, 78100 Bucharest, Romania

Received December 16, 1999

A great many structural molecular descriptors, of various forms, have been proposed and tested over the years. Very often different descriptors represent, more or less, the same aspects of molecular structure and, thus, have diminished power to discriminate between the different structural features that might contribute to the molecular property or activity of interest. It is therefore essential that noncorrelated descriptors be employed to ensure the widest and least inflated possible coverage of the chemical space. The most usual approach for reducing the number of descriptors and employing noncorrelated (or orthogonal) descriptors involves principal component analysis (PCA) or other factor analytical techniques. In this work we present an approach for determining relationships (groupings) among 240 graph-theoretical descriptors, as a means of selecting nonredundant ones, based on the application of cluster analysis (CA). To remove the inherent biases and particularities of different CA algorithms, several clustering solutions obtained with these algorithms were “hybridized” to give a reliable overall picture of how the interrelationships within the data are structured.
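As an illustration only, and not the authors' implementation, the hybridization idea can be sketched in Python as follows. The sketch rests on several assumptions that are not taken from the text: the dissimilarity between two descriptors is 1 - |r| (r being their Pearson correlation coefficient), three simple linkage rules (single, complete, average) stand in for the different CA algorithms, the number of clusters is fixed in advance, and two descriptors are placed in the same hybrid group only when every clustering solution puts them together. The descriptor names (d0 to d3) and their values are hypothetical toy data.

```python
import math
from itertools import combinations


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def mean(values):
    return sum(values) / len(values)


def agglomerate(dist, k, linkage):
    """Agglomerative clustering: merge the two closest clusters until k remain.

    `linkage` reduces all pairwise distances between two clusters to a single
    number: min (single), max (complete) or mean (average linkage).
    """
    clusters = [[i] for i in range(len(dist))]
    while len(clusters) > k:
        def gap(pair):
            a, b = pair
            return linkage([dist[i][j] for i in clusters[a] for j in clusters[b]])
        a, b = min(combinations(range(len(clusters)), 2), key=gap)
        clusters[a] += clusters.pop(b)  # a < b, so index a is still valid
    return clusters


def hybrid_groups(solutions, n):
    """Group items that share a cluster in every solution (assumed strict rule)."""
    signature = {}
    for i in range(n):
        # Tuple of the cluster index that item i occupies in each solution.
        key = tuple(next(idx for idx, c in enumerate(sol) if i in c)
                    for sol in solutions)
        signature.setdefault(key, []).append(i)
    return sorted(signature.values())


# Hypothetical descriptor values for six molecules (toy data).
descriptors = {
    "d0": [1, 2, 3, 4, 5, 6],
    "d1": [2, 4, 6, 8, 10, 12.5],   # nearly collinear with d0
    "d2": [5, 3, 6, 1, 4, 2],
    "d3": [10, 6, 12, 2, 8, 4.5],   # nearly collinear with d2
}
names = list(descriptors)
cols = [descriptors[name] for name in names]
size = len(cols)

# Assumed dissimilarity: 1 - |r|, so strongly (anti)correlated descriptors are close.
dist = [[1 - abs(pearson(cols[i], cols[j])) for j in range(size)]
        for i in range(size)]

solutions = [agglomerate(dist, 2, rule) for rule in (min, max, mean)]
groups = [[names[i] for i in g] for g in hybrid_groups(solutions, size)]
print(groups)
```

Requiring unanimous agreement is the most conservative way to combine the solutions; a looser rule (for example, agreement in a majority of solutions) would give larger groups at the cost of less certain membership. One representative per group could then be retained as a nonredundant descriptor.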
The calculated correlation coefficients between descriptors were used as a reference for a discussion of the different CA methods employed, and the resulting clusters of descriptors were statistically analyzed to derive the intercorrelations between the different operators, weighting schemes, and matrices used for the computation of these descriptors.

INTRODUCTION

Molecular Diversity and Molecular Descriptors. Combinatorial chemistry was introduced, as an alternative to the rational molecular design approach, to allow the synthesis of a wide range of specific and diverse compounds. These compounds are then screened quickly and reliably against a plethora of biological, or other, targets as part of the molecular discovery process. Under this rationale, the ideal way to identify new lead structures would seem to be for the molecules synthesized, and those chosen for screening, to represent an evenly distributed cross-section of all potentially bioactive structures (in the case of pharmaceutical compounds). This requirement is the basis for the concept of molecular diversity. The obvious approach is to characterize the chemical structures in terms of molecular descriptors based on their physicochemical or biological properties, their substructures, or any other possibly relevant and calculable features. One then needs only to require that the ranges and occurrences of these various features adequately describe the universe of molecules, that is, cover the chemical space of interest. This is the foundation of the similar property principle, which states that chemically (structurally) similar molecules can often be expected to exhibit similar biological and physicochemical profiles.¹ The standard procedure involves the computation of molecular descriptors selected from classes of quantifiable two-dimensional (2D) and three-dimensional (3D) molecular parameters.
Examples of the former include atom pairs, molecular fingerprints, and topological indices, whereas examples of the latter include receptor recognition features, molecular shape, and pharmacophores.²⁻¹⁰ Generally, the 2D structural descriptors are considered to incorporate the major part of the essential information about the chemical structure, and in most of the comparative studies conducted thus far they have been found to perform better than, or at least as well as, the more complicated and resource-demanding 3D descriptors.³⁻¹⁶ In addition, the 2D molecular structure is the essential basis of chemical synthesis strategies, both chemically and conceptually, and the same applies to patent claims, where the molecules are treated as 2D objects. In assessing the diversity of a collection of pharmacologically important compounds, we must bear in mind that these compounds have certain characteristics that facilitate their participation in 3D physiological/biochemical events, namely, they are small in size and molecular weight and they do not contain “exotic” compounds, properties that agree well with a 2D molecular description.

* Corresponding author.
† University of Nice-Sophia Antipolis.
‡ Present address: SGI (Silicon Graphics), European HQ, 1530 Arlington Business Park, Reading, RG7 4SB, U.K.
§ University “Politehnica” of Bucharest.
| E-mail: o•ivancicu@chim.upb.ro.
⊥ Telephone: +33 (0)4 92 07 61 20. E-mail: cabrol@unice.fr.

J. Chem. Inf. Comput. Sci. 2000, 40, 1128-1146. DOI 10.1021/ci990149y. © 2000 American Chemical Society. Published on Web 08/26/2000.

Topological Indices. The topological indices (TIs) are 2D molecular descriptors whose values are associated with the structural constitution of a chemical compound. They are computed directly from the molecular graph of the particular compound, and they offer several important advantages: they