The Hidden Component of Size in Two-Dimensional Fragment Descriptors:
Side Effects on Sampling in Bioactive Libraries
Steven L. Dixon* and Ryan T. Koehler†
Telik, Inc., 750 Gateway, South San Francisco, California 94080
Received December 16, 1998
We have carried out a number of sampling experiments in libraries of bioactive compounds to
illustrate how size biases introduced by two-dimensional (2D) fragment distance functions may
provide misleading information about the diversity of compound subsets. The number of
different biological targets covered by a given subset is used as a measure of bioactive diversity,
and it is considered to be the relevant property with which 2D diversity should correlate. Since
the nature of the size biases depends on the way in which 2D distance is computed, we
investigated three different methods of calculating distance. Use of 1-Tanimoto as a dissimilarity
measure leads to the spurious conclusion that collections of structurally small compounds are
inherently more diverse than other collections which may cover a broader range of sizes and
more biological targets. XOR or squared Euclidean distance, by contrast, shows a preference
for subsets of structurally larger compounds, but this does not appear to have as many adverse
consequences in terms of target coverage. A simple product of 1-Tanimoto and XOR tends to
equalize the opposing size effects of the two component distance functions and leads to a
relatively unbiased means of comparing structures. Results here suggest that careful
consideration should be given to the way in which chemical structures are compared whenever
2D fragment descriptors are used.
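The three distance functions named above can be illustrated with a toy sketch. It is not the authors' implementation: it assumes each structure is represented as a set of fragment identifiers (a binary fingerprint), uses the raw XOR count (which equals squared Euclidean distance for binary vectors), and forms the product without any normalization. The function names and the example fingerprints are hypothetical.

```python
def tanimoto_distance(a, b):
    """1 - Tanimoto: one minus (shared fragments / fragments in either set)."""
    union = len(a | b)
    if union == 0:  # assumption: two empty fingerprints count as identical
        return 0.0
    return 1.0 - len(a & b) / union


def xor_distance(a, b):
    """Number of fragments present in exactly one of the two structures."""
    return len(a ^ b)


def product_distance(a, b):
    """Unnormalized product of the two distances (an assumption of this sketch)."""
    return tanimoto_distance(a, b) * xor_distance(a, b)


# Hypothetical fingerprints: integers stand in for fragment identifiers.
# Scenario 1: each pair differs by exactly two fragments.
small_a, small_b = frozenset({1, 2, 3}), frozenset({1, 2, 4})
large_a, large_b = frozenset(range(30)), frozenset(range(1, 31))

# Scenario 2: each pair shares the same fraction (1/5) of its fragments.
small_c, small_d = frozenset({1, 2, 3}), frozenset({3, 4, 5})
large_c, large_d = frozenset(range(30)), frozenset(range(20, 50))

pairs = {
    "small, 2 differing fragments": (small_a, small_b),
    "large, 2 differing fragments": (large_a, large_b),
    "small, 1/5 shared": (small_c, small_d),
    "large, 1/5 shared": (large_c, large_d),
}

for label, (a, b) in pairs.items():
    print(
        f"{label}: 1-Tanimoto={tanimoto_distance(a, b):.3f}, "
        f"XOR={xor_distance(a, b)}, product={product_distance(a, b):.3f}"
    )
```

In the first scenario the XOR distance is the same for both pairs, yet 1-Tanimoto rates the small pair as far more dissimilar. In the second, 1-Tanimoto is the same for both pairs, yet XOR rates the large pair as far more dissimilar. The sketch only shows the opposing directions of the two size effects; it does not attempt to reproduce the sampling results reported here.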
Introduction
As the boundaries of combinatorial chemistry and high-throughput biological screening have
expanded, so too has the need for fast and meaningful computer-based comparisons of molecular
structures. With corporate libraries converging on one million compounds and virtual libraries
containing orders of magnitude more, such methodologies have become an absolute necessity. Any
means of comparing chemical structures is valid only to the extent that it reflects intuitive
notions about similarity that have evolved over decades in the field of medicinal chemistry.
These ideas are embodied in the similar property principle,¹ which states that compounds with
similar structures will tend to exhibit similar physicochemical and biological properties. This
concept provides much of the framework on which modern lead optimization is built.
Curiously, though the similar property principle makes no claims regarding dissimilar
structures, it is also used as the basis for essentially all work in the field of molecular
diversity.²⁻⁴ Basically, the converse of the principle is used to infer that dissimilar-looking
structures will exhibit dissimilar properties. Though this may be true to some extent, there is
not always a valid, global relationship⁵ between structural dissimilarity and differences in
measured properties such as biological activity. In general, as compounds become more diverse
structurally, we are progressively less certain of how they compare to one another in terms of
biological activity.⁵ For these reasons, we must be careful in drawing conclusions about
diversity based solely on calculated measures of dissimilarity, and we should bear in mind that
biological targets provide the ultimate scale on which diversity is usually measured.
Without some a priori knowledge of the structural features that govern activity, or at least
knowledge of the best sets of descriptors to use when dealing with specific targets, diversity
in any true bioactive sense is not something that can be easily manipulated by choice of
compounds. There are, however, some basic controllable factors that can have an effect on
bioactive properties and certain minimum requirements that should be met in this regard. In
particular, when selecting compounds from a library on the basis of dissimilarity, one should
be confident that gross structural biases are not being introduced in the process. There is
evidence⁶,⁷ to suggest that such biases may be introduced when two-dimensional (2D) fragment
descriptors are used to measure diversity. Specifically, we are concerned with the way in which
widely utilized 2D distance functions introduce biases related to the overall size of compounds
and how this may ultimately affect the bioactive properties of the subsets selected.
To investigate these effects, we first define appropriate scales on which to measure the
properties identified as molecular size, bioactive diversity, and 2D structural diversity. We
then carry out three basic types of sampling experiments in libraries of compounds with
established pharmacological endpoints. In each experiment, one of the three above properties is
varied in a systematic fashion, and the responses in the other two properties are analyzed.
This provides a means of isolating the specific effects of 2D size biases and allows a
determination to be made as to whether these effects
* To whom correspondence should be addressed. E-mail: sdixon@telik.com.
† E-mail: koehler@telik.com.
2887 J. Med. Chem. 1999, 42, 2887-2900
10.1021/jm980708c CCC: $18.00 © 1999 American Chemical Society
Published on Web 07/07/1999