Stripmining For Molecules

Mark J. Embrechts a (embrem@rpi.edu), Fabio Arciniegas a , Muhsin Ozdemir a , Michinari Momma a , Curt M. Breneman b , Larry Lockwood b , Kristin P. Bennett c , and Robert H. Kewley d

Departments of a Decision Sciences and Engineering Systems, b Chemistry, and c Mathematical Sciences, Rensselaer Polytechnic Institute, Troy, NY 12180. d United States Military Academy, West Point, NY

Abstract - QSAR (Quantitative Structure-Activity Relationship) problems deal with “in-silico” chemical design for the virtual invention of novel pharmaceuticals. The goal of QSAR is to predict the bioactivities of molecules based on a set of descriptive features. QSAR problems are notoriously challenging for machine learning because a typical QSAR predictive data mining problem is characterized by a large number of descriptive features (300-1000), often for a relatively small number of molecules (50-300). This paper introduces data strip mining for QSAR modeling. Strip mining is a general approach to feature selection and predictive modeling based on successive stages of feature elimination, each driven by a sensitivity analysis of a predictive model.

I. INTRODUCTION

Efficient discovery and creation of novel pharmaceuticals depends on the ability to explore and quantify the relationships between molecular structure and function, particularly biological activity, but also human toxicity and drug absorption and distribution throughout the body [14]. The pharmaceutical industry uses combinatorial synthesis and high-throughput assay techniques that experimentally examine thousands of potential drug candidates per week. Since very large numbers of molecules must be screened for potential activity against a particular disease, any means of focusing the library of molecules is of great interest.
Virtual high-throughput screening (VHTS) is a means of accomplishing this goal, and the implementation of such virtual bioactivity screening relies on the development of predictive and reliable QSAR (Quantitative Structure-Activity Relationship) models. The goal of QSAR is to predict the bioactivity of molecules based on a set of descriptive features. QSAR is a difficult task since most of the chemical effects that influence bioactivity are not directly available as molecular descriptors. Several types of descriptors are traditionally used in QSAR investigations, including 2D, electro-topological and 3D descriptors. Modeling performance can be enhanced by using the newer Transferable Atom Equivalent (TAE) descriptors. The TAE methodology for generating descriptors, developed by one of the authors [4], provides a way to quickly and inexpensively generate a large set of predictive descriptors suitable for both classification and regression studies. Feature selection is essential since the set of candidate features is very large for a relatively small set of molecules. “Data strip mining” provides a method for extracting important features from representative datasets based on successive sensitivity analyses with a random gauge variable. Data strip mining can be categorized as a standard wrapper approach [16] and was found to work well even though there are few observations of the predicted property compared to the number of available descriptors. The selected descriptors are then used in three different predictive modeling approaches (neural networks, partial least squares (PLS) [22], and support vector machines [20]) and produce good results on two challenging benchmark data sets: the Lombardo blood-brain barrier data [18] and the HIV (human immunodeficiency virus) reverse-transcriptase inhibitor dataset (HIVrt) [11].
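The idea of successive sensitivity analyses with a random gauge variable can be illustrated with a minimal sketch. Several choices here are assumptions made for illustration only and are not taken from the paper: a ridge regression stands in for the predictive model, the sensitivity of a feature is taken as the magnitude of its standardized weight, and a feature is dropped when its sensitivity does not exceed that of a freshly drawn random gauge column. The function names `standardize`, `fit_ridge`, and `strip_mine` are hypothetical.

```python
import random

def standardize(X):
    """Scale every column to zero mean and unit standard deviation."""
    n, p = len(X), len(X[0])
    cols = []
    for j in range(p):
        col = [row[j] for row in X]
        mu = sum(col) / n
        sd = (sum((v - mu) ** 2 for v in col) / n) ** 0.5 or 1.0
        cols.append([(v - mu) / sd for v in col])
    return [[cols[j][i] for j in range(p)] for i in range(n)]

def fit_ridge(X, y, lam=1e-3):
    """Ridge weights for centered data (stand-in model, an assumption)."""
    n, p = len(X), len(X[0])
    # Augmented normal equations: (X^T X + lam*I | X^T y).
    A = [[sum(X[i][a] * X[i][b] for i in range(n)) + (lam if a == b else 0.0)
          for b in range(p)] + [sum(X[i][a] * y[i] for i in range(n))]
         for a in range(p)]
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(p):
        piv = max(range(col, p), key=lambda r: abs(A[r][col]))
        A[col], A[piv] = A[piv], A[col]
        for r in range(p):
            if r != col:
                f = A[r][col] / A[col][col]
                A[r] = [u - f * v for u, v in zip(A[r], A[col])]
    return [A[j][p] / A[j][j] for j in range(p)]

def strip_mine(X, y, stages=5, seed=0):
    """Return indices of features that survive successive elimination stages."""
    rng = random.Random(seed)
    y_mean = sum(y) / len(y)
    yc = [v - y_mean for v in y]
    kept = list(range(len(X[0])))
    for _ in range(stages):
        # Append a fresh random gauge column as the last feature.
        Xg = standardize([[row[j] for j in kept] + [rng.gauss(0.0, 1.0)]
                          for row in X])
        w = fit_ridge(Xg, yc)
        # For a linear model on standardized inputs, the output change per
        # unit change of a feature is proportional to |weight|.
        sens = [abs(v) for v in w]
        gauge = sens[-1]
        survivors = [j for j, s in zip(kept, sens[:-1]) if s > gauge]
        # Stop when nothing would survive or nothing is eliminated.
        if not survivors or len(survivors) == len(kept):
            break
        kept = survivors
    return kept

# Synthetic demonstration: only features 0 and 3 influence the response.
rng = random.Random(1)
X = [[rng.gauss(0.0, 1.0) for _ in range(8)] for _ in range(60)]
y = [3.0 * r[0] - 2.0 * r[3] + 0.1 * rng.gauss(0.0, 1.0) for r in X]
kept = strip_mine(X, y)
print("surviving features:", kept)
```

The gauge column plays the role of a noise floor: a descriptor that the model treats as no more influential than pure noise is a candidate for removal. A nonlinear model such as a neural network would require a perturbation-based sensitivity measure in place of the weight magnitude used in this sketch.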
Until recently, PLS has been considered one of the best methods for handling large numbers of non-orthogonal descriptors in chemical QSAR problems.

This paper is organized as follows: section 2 introduces data strip mining sensitivity analysis for feature selection, section 3 describes the learning algorithms used, section 4 describes the benchmark datasets, section 5 describes the TAE method for the generation of descriptors, and section 6 discusses the computational results.

0-7803-7278-6/02/$10.00 ©2002 IEEE