Stripmining For Molecules
Mark J. Embrechts^a (embrem@rpi.edu), Fabio Arciniegas^a, Muhsin Ozdemir^a,
Michinari Momma^a, Curt M. Breneman^b, Larry Lockwood^b,
Kristin P. Bennett^b, and Robert H. Kewley^d
Departments of ^a Decision Sciences and Engineering Systems, ^b Chemistry, and
^c Mathematical Sciences, Rensselaer Polytechnic Institute, Troy, NY 12180.
^d United States Military Academy, West Point, NY
Abstract - QSAR (Quantitative Structure-Activity
Relationship) problems deal with “in-silico” chemical design
for the virtual invention of novel pharmaceuticals. The goal
of QSAR is to predict the bioactivities of molecules based on
a set of descriptive features. QSAR problems are notoriously
challenging for machine learning because a typical QSAR
predictive data mining problem set is characterized by a
large number of descriptive features (300-1000), often for a
relatively small number of molecules (50-300). This paper
introduces data strip mining for QSAR modeling. Strip
mining is a general approach for feature selection and
predictive modeling based on successive stages of feature
elimination, each driven by a sensitivity analysis of a
predictive model.
I. INTRODUCTION
Efficient discovery and creation of novel
pharmaceuticals depends on the ability to explore
and quantify the relationships between molecular
structure and function – particularly biological
activity – but also human toxicity and drug
absorption/distribution throughout the body [14].
The pharmaceutical industry uses combinatorial
synthesis and high-throughput assay techniques that
experimentally examine thousands of potential drug
candidates per week. Since there are very large
numbers of molecules that must be screened for
potential activity against a particular disease, any
means of focusing the library of molecules is of great
interest. Virtual high-throughput screening (VHTS)
is a means of accomplishing this goal, and the
implementation of such virtual bioactivity screening
relies on the development of predictive and reliable
QSAR (Quantitative Structure-Activity Relationship)
models. The goal of QSAR is to predict the
bioactivity of molecules based on a set of descriptive
features.
QSAR is a difficult task since most of the
chemical effects that influence bioactivity are not
directly available as molecular descriptors. Several
types of descriptors are traditionally used in QSAR
investigations, including 2D, electro-topological and
3D descriptors. Modeling performance can be
enhanced through utilization of the newer
Transferable Atom Equivalent (TAE) descriptors.
The TAE methodology for generating descriptors,
developed by one of the authors [4], provides a way
to quickly and inexpensively generate a large set of
predictive descriptors suitable for both classification
and regression studies.
Feature selection is essential since the set of
candidate features is very large for a relatively small
set of molecules. “Data strip mining” provides a
method for extracting important features from
representative datasets based on successive
sensitivity analyses with a random gauge variable.
Data strip mining can be categorized as a standard
wrapper approach [16] and was found to work well
in spite of the fact that there are few observations of
the predicted property compared to the number of
available descriptors. The selected descriptors are
then used in three different predictive modeling
approaches (neural networks, partial least squares
(PLS) [22] and support vector machines [20]) and
produce good results on two challenging benchmark
data sets: the Lombardo blood-brain barrier data [18]
and the HIV (human immunodeficiency virus)
reverse-transcriptase inhibitor dataset (HIVrt) [11].
Until recently, PLS was considered one of the
best methods for handling large numbers of non-
orthogonal descriptors in chemical QSAR problems.
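
As a rough illustration of the gauge-variable idea described above, the following Python sketch repeatedly fits a model, measures a sensitivity for every descriptor and for an appended random gauge column, and discards descriptors that are less sensitive than the gauge. It is a minimal sketch under stated assumptions, not the procedure used in this paper: a linear model trained by gradient descent stands in for the predictive model, the absolute weight on a standardized descriptor stands in for the sensitivity, the "drop anything below the gauge" rule is assumed, and the data are synthetic. The names strip_mine, fit_linear, standardize and make_demo are hypothetical.

```python
import random
import statistics


def standardize(column):
    """Scale a column to zero mean and unit (population) standard deviation."""
    mean = statistics.fmean(column)
    sd = statistics.pstdev(column) or 1.0  # guard against constant columns
    return [(v - mean) / sd for v in column]


def fit_linear(columns, y, lr=0.1, epochs=300):
    """Least-squares fit by batch gradient descent (stand-in predictive model).

    Columns are assumed standardized, so the intercept is simply mean(y).
    Returns one weight per column.
    """
    n, p = len(y), len(columns)
    y_mean = statistics.fmean(y)
    w = [0.0] * p
    for _ in range(epochs):
        resid = [
            sum(w[j] * columns[j][i] for j in range(p)) + y_mean - y[i]
            for i in range(n)
        ]
        for j in range(p):
            grad = sum(resid[i] * columns[j][i] for i in range(n)) / n
            w[j] -= lr * grad
    return w


def strip_mine(columns, y, seed=0, max_rounds=10):
    """Successive elimination against a random gauge column (assumed rule)."""
    rng = random.Random(seed)
    kept = list(range(len(columns)))
    for _ in range(max_rounds):
        # Fresh gauge each round: pure noise, unrelated to the response.
        gauge = standardize([rng.random() for _ in y])
        cols = [standardize(columns[j]) for j in kept] + [gauge]
        w = fit_linear(cols, y)
        threshold = abs(w[-1])  # sensitivity of the gauge variable
        survivors = [j for j, wj in zip(kept, w) if abs(wj) > threshold]
        if len(survivors) == len(kept) or not survivors:
            break  # nothing more to strip, or everything would be stripped
        kept = survivors
    return kept


def make_demo(n=60, p=8, seed=1):
    """Synthetic data: only descriptors 0 and 1 influence the response."""
    rng = random.Random(seed)
    columns = [[rng.gauss(0, 1) for _ in range(n)] for _ in range(p)]
    y = [
        3.0 * columns[0][i] - 2.0 * columns[1][i] + rng.gauss(0, 0.1)
        for i in range(n)
    ]
    return columns, y


columns, y = make_demo()
selected = strip_mine(columns, y)
print("descriptors kept:", selected)
```

In this toy setting the two informative descriptors carry weights far above that of the gauge, so they survive every round; how many uninformative descriptors are removed depends on the random draw, which is one reason a practical procedure would presumably average sensitivities over several trained models.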
This paper is organized as follows: section 2
introduces data strip mining sensitivity analysis for
feature selection, section 3 describes the learning
algorithms used, section 4 describes the benchmark
datasets, section 5 describes the TAE method for the
generation of descriptors, and section 6 discusses the
computational results.
0-7803-7278-6/02/$10.00 ©2002 IEEE