J. Chem. Inf. Comput. Sci. 1983, 23, 14-22

Computerized Chemical Structure-Handling Techniques in Structure-Activity Studies and Molecular Property Prediction

DAVID BAWDEN
Pfizer Central Research, Sandwich, Kent CT13 9NJ, England

Received May 26, 1982

Applications of computerized structure-handling techniques to studies of the relationship between molecular properties and chemical structure are reviewed. These applications include physicochemical property prediction, structure-activity correlation, large-scale substructural analysis, and structural classification. A comparison of types of structural descriptor is made, with reference to statistical amenability, interpretability, feature selection and reduction, and mixing of descriptor types. Implications for chemical information systems are discussed.

INTRODUCTION

Techniques for automatic handling of chemical structure representations, within computerized chemical information systems, are well-developed for information storage and retrieval.1,2 Such techniques are being increasingly applied to studies of structure-property relationships, involving a variety of molecular properties and statistical methods. Some aspects of these applications are discussed here. The aim of the author has been to give a comprehensive coverage of all significant applications reported in the literature up to early 1982.

Many of the structure-handling facilities required for such studies are largely the same as for information storage and retrieval (structure input-output etc.) and are not dealt with in detail. A distinctive feature is the automatic analysis of structural representations to identify aspects of molecular structure which may be relevant to the structure-property problem under consideration, and this will receive most attention here.
In a few cases, noncomputerized studies will be discussed, where there is a clear potential for computerization or where a useful comparison with computerized techniques may be made.

A generalized view of computer-aided structure-property studies is shown schematically in Figure 1. The computer-readable file of structural representations is analyzed to generate structural features, for which the presence/absence indication or occurrence counts will be variables in subsequent analyses. This process is termed "feature derivation" or "feature perception". "Descriptor generation" is a more general term for the derivation of all variables to be used in an analysis: these may include descriptors other than purely structural, e.g., parameters from molecular orbital calculations, or calculated structural quantities, e.g., topological indexes. Some aspects of the use of mixed descriptor sets are discussed below.

Substructures thus derived may be used in one of three ways.

(i) Property estimation, by summation of substructural contributions to a thermochemical or physicochemical property: The contributions corresponding to the structural features perceived in the molecular structure of interest are extracted automatically from a data base of substructural contributions. No statistical analysis is carried out in this procedure, although the values of the substructural contributions may have been obtained initially from such an analysis.
Such procedures amount to an automation of the well-known additivity schemes for thermochemical and physicochemical properties.3 They may be particularly important in calculating molecular properties rapidly for large groups of structures, e.g., thermodynamic parameters for computer-aided synthesis planning4 or lipophilicity parameters for physicochemical property-biological activity correlation.5 If the structural features perceived correspond to substituents on a common parent structure, such a procedure would be used in conjunction with a data base of substituent constants as an aid to the Hansch form of linear free energy relationship studies.

(ii) Structure-property correlation, with the occurrence (or presence/absence indication) of features within molecular structures used as variables in statistical analyses for correlation of structure with physicochemical properties or biological activities: Dependent upon the type of features used, and the statistical analysis employed, this computer-aided procedure may correspond to established techniques of quantitative structure-activity relationships (QSAR), e.g., Free-Wilson analysis,6 or may be an entirely new departure, e.g., the substructural analysis methodology, first described by Cramer et al.7

(iii) Structural classification, in the widest sense, involving some assessment of the similarities within a set of structures, based on structural features: Although molecular property data is not directly involved in such a procedure, the classification obtained may subsequently be applied in SAR studies, qualitative or quantitative.

Areas such as computer-aided synthetic planning8 and computerized elucidation of reaction mechanisms9 will not be discussed per se, although certain relevant aspects of these studies will be noted. Similarly, the use of substructures in spectral simulation and interpretation (see, for example, ref 10-12) is not discussed here.
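As an illustration of use (i), the following minimal Python sketch sums substructural contributions over the occurrence counts of perceived features. It assumes that feature perception has already been carried out, so each structure arrives as a plain list of fragment labels. The fragment names, the contribution values, and the function names are hypothetical, chosen only for illustration; they are not taken from any published additivity scheme or from the systems reviewed here.

```python
from collections import Counter

# Hypothetical "data base" of substructural contributions.
# The numbers are illustrative placeholders, not real property increments.
CONTRIBUTIONS = {"CH3": 0.5, "CH2": 0.25, "OH": -1.0}


def occurrence_counts(fragments):
    """Count how often each perceived fragment occurs in one structure."""
    return Counter(fragments)


def presence_absence(counts, feature_order):
    """Turn occurrence counts into a 0/1 vector over a fixed feature list."""
    return [1 if counts.get(feature, 0) > 0 else 0 for feature in feature_order]


def estimate_property(counts, contributions):
    """Additive estimate: sum of (occurrence count x contribution)."""
    missing = sorted(f for f in counts if f not in contributions)
    if missing:
        # Fail loudly rather than silently ignoring an unparameterized feature.
        raise KeyError(f"no contribution stored for: {missing}")
    return sum(n * contributions[f] for f, n in counts.items())


# 1-Butanol written as an already-perceived fragment list (an assumption:
# a real system would derive this from the stored structure representation).
butanol = ["CH3", "CH2", "CH2", "CH2", "OH"]
counts = occurrence_counts(butanol)
estimate = estimate_property(counts, CONTRIBUTIONS)
vector = presence_absence(counts, ["CH3", "CH2", "OH", "NH2"])
```

The same occurrence counts or presence/absence vector could equally serve as the variables in use (ii); only the final summation step is specific to property estimation. Raising an error for a fragment with no stored contribution is a design choice of this sketch, made so that an incomplete table cannot quietly bias the estimate.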
One constant problem in this area is that it is frequently unclear whether the molecular property under investigation is affected by the whole structure (as in the case of a bulk physical property) or only by a constituent substructure (as in the case of a specific biological activity). In some cases both factors may be involved, as in the case of a series of compounds with specific pharmacological activity (invoked by a substructure) modified by the compounds' distribution properties (such as the partition coefficient; bulk properties of the total structure). It is therefore necessary to be able to identify and analyze either or both of these structural effects.

OVERVIEW OF TYPES OF STRUCTURAL FEATURES

The types of structural features used in structure-property studies are now outlined briefly. It should be noted that the subdivisions used, although convenient, are far from precise delimitations, as will be noted below.

(i) Simple Features. The simplest types of structural feature which may be used are counts of the most basic structural units present: total number of atoms, bonds, or rings, occurrence of particular atoms, multiple bonds, etc. Descriptors of this sort are in general too crude to be used alone in structure-property studies, since they are insufficiently discriminating between structures (other than in a poorly structured data set), and their use can make interpretation difficult. A study in which features of this type were used alone in correlation of

0095-2338/83/1623-0014$01.50/0 © 1983 American Chemical Society