The Hidden Component of Size in Two-Dimensional Fragment Descriptors: Side Effects on Sampling in Bioactive Libraries

Steven L. Dixon* and Ryan T. Koehler

Telik, Inc., 750 Gateway, South San Francisco, California 94080

Received December 16, 1998

We have carried out a number of sampling experiments in libraries of bioactive compounds to illustrate how size biases introduced by two-dimensional (2D) fragment distance functions may provide misleading information about the diversity of compound subsets. The number of different biological targets covered by a given subset is used as a measure of bioactive diversity, and it is considered to be the relevant property with which 2D diversity should correlate. Since the nature of the size biases depends on the way in which 2D distance is computed, we investigated three different methods of calculating distance. Use of 1-Tanimoto as a dissimilarity measure leads to the spurious conclusion that collections of structurally small compounds are inherently more diverse than other collections which may cover a broader range of sizes and more biological targets. XOR or squared Euclidean distance, by contrast, shows a preference for subsets of structurally larger compounds, but this does not appear to have as many adverse consequences in terms of target coverage. A simple product of 1-Tanimoto and XOR tends to equalize the opposing size effects of the two component distance functions and leads to a relatively unbiased means of comparing structures. Results here suggest that careful consideration should be given to the way in which chemical structures are compared whenever 2D fragment descriptors are used.

Introduction

As the boundaries of combinatorial chemistry and high-throughput biological screening have expanded, so too has the need for fast and meaningful computer-based comparisons of molecular structures.
With corporate libraries converging on one million compounds and virtual libraries containing orders of magnitude more, such methodologies have become an absolute necessity.

Any means of comparing chemical structures is valid only to the extent that it reflects intuitive notions about similarity that have evolved over decades in the field of medicinal chemistry. These ideas are embodied in the similar property principle,[1] which states that compounds with similar structures will tend to exhibit similar physicochemical and biological properties. This concept provides much of the framework on which modern lead optimization is built. Curiously, though the similar property principle makes no claims regarding dissimilar structures, it is also used as the basis for essentially all work in the field of molecular diversity.[2-4] Basically, the converse of the principle is used to infer that dissimilar-looking structures will exhibit dissimilar properties. Though this may be true to some extent, there is not always a valid, global relationship[5] between structural dissimilarity and differences in measured properties such as biological activity. In general, as compounds become more diverse structurally, we are progressively less certain of how they compare to one another in terms of biological activity.[5]

For these reasons, we must be careful in drawing conclusions about diversity based solely on calculated measures of dissimilarity, and we should bear in mind that biological targets provide the ultimate scale on which diversity is usually measured. Without some a priori knowledge of the structural features that govern activity, or at least knowledge of the best sets of descriptors to use when dealing with specific targets, diversity in any true bioactive sense is not something that can be easily manipulated by choice of compounds.
There are, however, some basic controllable factors that can have an effect on bioactive properties and certain minimum requirements that should be met in this regard. In particular, when selecting compounds from a library on the basis of dissimilarity, one should be confident that gross structural biases are not being introduced in the process. There is evidence[6,7] to suggest that this sort of thing may be happening when two-dimensional (2D) fragment descriptors are used to measure diversity. Specifically, we are concerned with the way in which widely utilized 2D distance functions introduce biases related to the overall size of compounds and how this may ultimately impact upon the bioactive properties of the subsets selected.

To investigate these effects, we first define appropriate scales on which to measure the properties identified as molecular size, bioactive diversity, and 2D structural diversity. We then carry out three basic types of sampling experiments in libraries of compounds with established pharmacological endpoints. In each experiment, one of the three above properties is varied in a systematic fashion, and the responses in the other two properties are analyzed. This provides a means of isolating the specific effects of 2D size biases and allows a determination to be made as to whether these effects

* To whom correspondence should be addressed. E-mail: sdixon@telik.com. E-mail: koehler@telik.com.

J. Med. Chem. 1999, 42, 2887-2900. DOI: 10.1021/jm980708c. © 1999 American Chemical Society. Published on Web 07/07/1999.
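As an illustration of the three 2D distance functions named in the abstract (1-Tanimoto, XOR, and their simple product), the following minimal Python sketch computes each one for binary fragment fingerprints. It rests on assumptions that are not taken from the paper: fingerprints are represented as sets of "on" bit indices, the function names are hypothetical, and the product is left unscaled. For binary vectors, the XOR count equals the squared Euclidean distance.

```python
from typing import FrozenSet

Fingerprint = FrozenSet[int]  # assumed representation: indices of the bits that are set


def tanimoto_distance(a: Fingerprint, b: Fingerprint) -> float:
    """1 - Tanimoto: one minus (shared on-bits / on-bits in either fingerprint)."""
    union = len(a | b)
    if union == 0:
        return 0.0  # two empty fingerprints are treated as identical
    return 1.0 - len(a & b) / union


def xor_distance(a: Fingerprint, b: Fingerprint) -> int:
    """Number of bits set in exactly one fingerprint.

    For 0/1 vectors this equals the squared Euclidean distance.
    """
    return len(a ^ b)


def product_distance(a: Fingerprint, b: Fingerprint) -> float:
    """Unscaled product of 1 - Tanimoto and XOR (any scaling is an assumption)."""
    return tanimoto_distance(a, b) * xor_distance(a, b)


if __name__ == "__main__":
    # Toy fingerprints, invented for illustration only.
    small_a, small_b = frozenset(range(0, 3)), frozenset(range(2, 5))
    large_a, large_b = frozenset(range(0, 30)), frozenset(range(10, 40))
    for label, x, y in [("small pair", small_a, small_b),
                        ("large pair", large_a, large_b)]:
        print(label,
              tanimoto_distance(x, y),
              xor_distance(x, y),
              product_distance(x, y))
```

In this toy case the small pair (3 bits each, 1 shared) scores higher than the large pair (30 bits each, 20 shared) under 1-Tanimoto, while the large pair scores higher under XOR. That is the direction of the opposing size effects described in the abstract, though a two-pair example says nothing about how the functions behave on real libraries.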