Large-Scale Systematic Analysis of 2D Fingerprint Methods and Parameters to Improve Virtual Screening Enrichments

Madhavi Sastry,‡ Jeffrey F. Lowrie,† Steven L. Dixon,† and Woody Sherman*,†

Schrödinger, 120 West 45th Street, 17th Floor, New York, New York 10036 and Schrödinger, Sanali Infopark, 8-2-120/113, Banjara Hills, Hyderabad 500034, Andhra Pradesh, India

Received February 10, 2010

A systematic virtual screening study on 11 pharmaceutically relevant targets has been conducted to investigate the interrelation between 8 two-dimensional (2D) fingerprinting methods, 13 atom-typing schemes, 13 bit scaling rules, and 12 similarity metrics using the new cheminformatics package Canvas. In total, 157 872 virtual screens were performed to assess the ability of each combination of parameters to identify actives in a database screen. In general, fingerprint methods, such as MOLPRINT2D, Radial, and Dendritic that encode information about local environment beyond simple linear paths outperformed other fingerprint methods. Atom-typing schemes with more specific information, such as Daylight, Mol2, and Carhart were generally superior to more generic atom-typing schemes. Enrichment factors across all targets were improved considerably with the best settings, although no single set of parameters performed optimally on all targets. The size of the addressable bit space for the fingerprints was also explored, and it was found to have a substantial impact on enrichments. Small bit spaces, such as 1024, resulted in many collisions and in a significant degradation in enrichments compared to larger bit spaces that avoid collisions.

INTRODUCTION

Virtual screening 1-3 is a vital element of modern drug discovery, and the debate continues 4-9 on the relative merits of approaches that incorporate three-, two-, and even one-dimensional 8 (3D, 2D, and 1D, respectively) chemical information and representations.
While docking 10-14 is frequently applied when suitable structural models of the target are available, purely ligand-based techniques, such as pharmacophore matching, 15-18 shape-based screening, 19-22 and 2D fingerprint similarity, 4,5,23,24 provide alternative and complementary approaches, particularly when a target structural model is not available but active ligand molecules are at hand. Also, due to their relatively high throughput, ligand-based methods are attractive when the number of compounds to screen is large and a fast turnaround is needed. Many of these and other virtual screening methods 25-29 may be available to a modeler, resulting in a multitude of choices, and decisions about which strategies to pursue may not be straightforward.

Though 3D approaches are routinely viewed as holding the greatest promise, Occam’s razor ultimately governs many of the decisions in a drug discovery campaign, and 2D fingerprints continue to be widely used in industry, sometimes with more success than 3D shape or docking methods. 9 Yet even when the focus is narrowed to 2D fingerprints, there are still numerous possibilities to consider due to the wide variety of available fingerprinting methods, atom-typing schemes, bit scaling rules, and similarity metrics. This is an important consideration for users of the cheminformatics package Canvas, 30 where more than 10 000 combinations of these four types of variables are possible. While previous studies have not explored every conceivable combination of these parameters in large-scale virtual screening experiments, systematic investigations of reasonable subspaces have provided practical guidelines for modelers who are facing a bewildering array of choices. 31-34 Our intention in this work is to expand on the domain of previous studies by considering all of the aforementioned factors simultaneously.
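
To make the size of this parameter space concrete, the short Python sketch below enumerates the four-way grid using only the counts quoted in the abstract (8 fingerprint methods, 13 atom-typing schemes, 13 bit scaling rules, 12 similarity metrics, 11 targets). The function names are hypothetical, and the integer indices stand in for the actual Canvas option names, which are not listed here. The second function rests on an assumption that is not stated in this section: that one of the eight methods (for example a fixed-key fingerprint) is run under a single atom-typing scheme. Under that assumption the arithmetic reproduces the reported total of 157 872 screens, whereas the unrestricted grid over 11 targets would give 178 464.

```python
from itertools import product

# Counts quoted in the abstract.
N_FINGERPRINTS = 8
N_ATOM_TYPINGS = 13
N_BIT_SCALINGS = 13
N_METRICS = 12
N_TARGETS = 11


def full_grid_size():
    """Count every (fingerprint, atom typing, bit scaling, metric) combination."""
    grid = product(
        range(N_FINGERPRINTS),   # placeholder indices, not real option names
        range(N_ATOM_TYPINGS),
        range(N_BIT_SCALINGS),
        range(N_METRICS),
    )
    return sum(1 for _ in grid)


def screens_if_one_method_has_single_typing():
    """Total screens over all targets under a labeled ASSUMPTION.

    Assumption (not stated in the text): exactly one fingerprint method
    is paired with a single atom-typing scheme, while the other seven
    are paired with all thirteen.
    """
    typed_pairs = (N_FINGERPRINTS - 1) * N_ATOM_TYPINGS + 1
    per_target = typed_pairs * N_BIT_SCALINGS * N_METRICS
    return per_target * N_TARGETS


if __name__ == "__main__":
    print("full grid:", full_grid_size())
    print("full grid x targets:", full_grid_size() * N_TARGETS)
    print("restricted total:", screens_if_one_method_has_single_typing())
```

The full grid is consistent with the statement that more than 10 000 combinations are possible. The restricted count is offered only as one reading of the reported total, not as a description of how the screens were actually organized.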
In this work, we have carried out a very large number of screens, incorporating over 1000 active ligands that span 11 pharmaceutically important targets, and more than 24 000 decoys from the MDL Drug Data Report (MDDR). 35 The goal was to identify general trends, such as which fingerprint methods perform well irrespective of the values of other variables, as well as specific combinations that are recommended to maximize the success of virtual screening efforts and specific combinations to avoid. All combinations of 8 fingerprint methods, 13 atom-typing schemes, 13 bit scaling rules, and 12 similarity metrics were explored. We present the aggregate results as well as an analysis of each parameter.

* Corresponding author. Telephone: 646-366-9555. Fax: 646-366-9550. E-mail: Woody.Sherman@schrodinger.com.
‡ Schrödinger, Andhra Pradesh, India.
† Schrödinger, New York, New York.
J. Chem. Inf. Model. 2010, 50, 771–784. 10.1021/ci100062n. 2010 American Chemical Society. Published on Web 05/07/2010.

METHODS

Fingerprint Types. Table 1 summarizes the eight types of fingerprints studied in this paper, as implemented in the cheminformatics package Canvas. 30 With the exception of MACCS, 36 Canvas fingerprints are encoded by hashing each chemical pattern into an addressable space of user-controlled size and storing only the “on” bits. A 32-bit fingerprint (the default in Canvas) is therefore represented by a list of integers on the interval [1, 2^32], where 2^32 is the size of the addressable space, and each integer represents the position of an “on” bit in this space. This sparse encoding contrasts with some other fingerprint implementations, 37-39 where hashing is