Mining Statistically Significant Molecular Substructures for Efficient Molecular Classification

Sayan Ranu* and Ambuj K. Singh*

Department of Computer Science, University of California, Santa Barbara, California

Received January 28, 2009

The increased availability of large repositories of chemical compounds has created new challenges in designing efficient molecular querying and mining systems. Molecular classification is an important problem in drug development where libraries of chemical compounds are screened and molecules with the highest probability of success against a given target are selected. We have developed a technique called GraphSig to mine significantly over-represented molecular substructures in a given class of molecules. GraphSig successfully overcomes the scalability bottleneck of mining patterns at a low frequency. Patterns mined by GraphSig display correlation with biological activities and serve as an excellent platform on which to build molecular analysis tools. The potential of GraphSig as a chemical descriptor is explored, and support vector machines are used to classify molecules described by patterns mined using GraphSig. Furthermore, the over-represented patterns are more informative than features generated exhaustively by traditional fingerprints; this has potential in providing scaffolds and lead generation. Extensive experiments are carried out to evaluate the proposed techniques, and empirical results show promising performance in terms of classification quality. An implementation of the algorithm is available free for academic use at http://www.uweb.ucsb.edu/sayan/software/GraphSig.tar.

INTRODUCTION

In modern pharmaceutical research, molecular classification plays a key role in lead generation and lead optimization. 1,2 Consequently, molecular classification has been a vibrant research field.
Various techniques have been developed, including neural network techniques, 3,4 Bayesian models, 5,6 kernel-based methods, 7-9 cell-based or statistical partitioning methods, 10 decision trees or recursive partitioning based methods, 11-13 and subgraph mining-based techniques. 14 Typically, the goal of the mining process is to infer chemical or biological properties of a molecule from its structure, popularly known as the quantitative structure-activity relationship (QSAR) approach, 15,16 so that a reduced set of potential hits can be selected from large virtual libraries. 17

The QSAR approach is based on the assumption that the molecular structure is a good indicator of chemical properties. Therefore, a crucial step in analyzing molecular structures is to extract the chemically informative content embedded in the structure. A number of classification approaches are based on a feature vector representation of the molecules, 5,8,9,12-14 and consequently, extensive research has been performed on analyzing molecular structures and representing them in a virtual space. Toward this goal, chemical descriptors or fingerprints have been widely used to represent molecules in the form of a feature vector, which counts or indicates the presence of certain features in a molecule. 14,18-20 Typically, these features are either pharmacophoric features, such as hydrogen bond donors, hydrogen bond acceptors, and hydrophobic centers, or small substructures. The features can therefore be viewed as reference points for describing molecules, and thus, it is of utmost importance to choose them wisely.

Existing vector-based approaches, in the context of mapping molecules to chemical descriptors, can be broadly grouped into two sets: structural keys 18,23 and hashed fingerprints. 19,24 Fingerprints characterize a molecule by generating a fixed-width bit vector.
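To make the notion of a fixed-width bit vector concrete, the following minimal sketch hashes short linear atom paths of a toy molecular graph into a bit vector. It is an illustration under stated assumptions, not the scheme of any particular fingerprint package: the function names `linear_paths` and `hashed_fingerprint`, the parameters `n_bits` and `max_len`, and the choice of SHA-256 as the hash are all hypothetical, and cycles and bond orders are ignored for brevity.

```python
import hashlib


def linear_paths(atoms, bonds, max_len):
    """Return the set of linear atom-label paths with at most max_len atoms.

    atoms: list of element symbols, indexed by atom id (illustrative input format).
    bonds: list of (i, j) index pairs; bond orders are ignored in this sketch.
    """
    # Build an undirected adjacency list for the molecular graph.
    adj = {i: [] for i in range(len(atoms))}
    for a, b in bonds:
        adj[a].append(b)
        adj[b].append(a)

    paths = set()

    def extend(path):
        forward = [atoms[i] for i in path]
        # A path and its reverse describe the same feature, so keep one
        # canonical orientation (the lexicographically smaller one).
        paths.add("-".join(min(forward, forward[::-1])))
        if len(path) == max_len:
            return
        for nxt in adj[path[-1]]:
            if nxt not in path:  # simple paths only, no revisiting atoms
                extend(path + [nxt])

    for start in range(len(atoms)):
        extend([start])
    return paths


def hashed_fingerprint(atoms, bonds, n_bits=64, max_len=3):
    """Hash every enumerated path into a fixed-width list of 0/1 bits."""
    bits = [0] * n_bits
    for path in linear_paths(atoms, bonds, max_len):
        digest = hashlib.sha256(path.encode("utf-8")).digest()
        index = int.from_bytes(digest[:4], "big") % n_bits
        bits[index] = 1  # distinct paths may collide on the same bit
    return bits


if __name__ == "__main__":
    # Heavy-atom skeleton of ethanol: C-C-O.
    ethanol_atoms = ["C", "C", "O"]
    ethanol_bonds = [(0, 1), (1, 2)]
    fp = hashed_fingerprint(ethanol_atoms, ethanol_bonds)
    print(sorted(linear_paths(ethanol_atoms, ethanol_bonds, 3)))
    print(sum(fp), "of", len(fp), "bits set")
```

Because the vector width is fixed, distinct paths can collide on the same bit; this is the price paid for compactness, and the width chosen here (64 bits) is far smaller than would be used in practice.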
Typically, fingerprints are generated by enumerating all cycles and linear paths up to a certain size and hashing each of the structural features into the vector. As a result, fingerprints are able to encode a large number of features in a relatively compact manner. On the other hand, structural keys contain a dictionary of features that are used to screen a molecule and generate its vector representation. The dictionary is known a priori, and its features are chosen based on domain information. A variant of this technique takes a dynamic approach to building the dictionary. 14,21,22 The dictionary is populated by mining features from the given database of molecules and therefore, like fingerprints, is able to adapt automatically to any molecular database.

* E-mail: sayan@cs.ucsb.edu (S.R.); ambuj@cs.ucsb.edu (A.K.S.).
J. Chem. Inf. Model. 2009, 49, 2537–2550. 10.1021/ci900035z CCC: $40.75 © 2009 American Chemical Society. Published on Web 10/28/2009.

In recent years, promising classification results have been observed in this dynamic setting, where features mined from the data are employed to characterize molecules. One popular approach is to mine frequent substructures from molecules with known chemical and biological properties. 21,22,25 The mined structures are then used to represent molecules in the virtual space in the form of feature vectors or histograms that indicate the presence or absence of the mined features. Typically in this approach, a frequency threshold is supplied, and all substructures that occur at a frequency above the given threshold are used as part of the descriptor. However, frequent substructures may not provide the best characterization of the molecules since frequency may not always be