© XXXX American Chemical Society | dx.doi.org/10.1021/ci200235e | J. Chem. Inf. Model. XXXX, XXX, 000–000
ARTICLE | pubs.acs.org/jcim

Anatomy of High-Performance 2D Similarity Calculations

Imran S. Haque, Vijay S. Pande, and W. Patrick Walters*,§

Department of Computer Science and Department of Chemistry, Stanford University, Stanford, California 94305, United States
§ Vertex Pharmaceuticals Incorporated, 130 Waverley Street, Cambridge, Massachusetts 02139, United States

Supporting Information

INTRODUCTION

A large variety of methods in chemical informatics, including compound selection, [1,2] clustering, and ligand-based virtual screening, depend on pairwise compound similarities as a critical subroutine. Continuing increases in the size of chemical databases (e.g., 35 million nominally purchasable compounds in ZINC [3] or nearly 1 billion possible compounds under 13 heavy atoms in GDB-13 [4]) create immense demands on computer power to run these algorithms. Consequently, there has been significant interest in the development of fast methods to compute chemical similarity. Previous work has focused on the use of specialized hardware, [5–7] clever data structures, [8] or approximation techniques [9] to accelerate large-scale pairwise similarity comparison using a variety of similarity methods.

So-called "two-dimensional" bit-vector Tanimoto similarities are particularly interesting by virtue of their dominant position in terms of similarity metrics used in the field. These similarity measures represent molecules by long (hundreds to thousands of bits long) binary vectors representing the presence or absence of chemical features and compute pairwise compound similarity as a similarity coefficient between pairs of such vectors. Past work has examined high-level algorithmic strategies to perform large-scale searches in sublinear time using complex data structures or bounds on the similarity measure to eliminate many comparisons.
[8,10–12] However, in some cases these algorithms must still evaluate the underlying similarity measure a large number of times, motivating fast direct calculation of the 2D Tanimoto.

Liao, Wang, and Watson recently reported that graphics processing units (GPUs), a type of massively parallel specialized hardware, achieved 73–143× speedup on common 2D Tanimoto-based compound selection algorithms relative to the same methods running on a conventional CPU. [5] However, the reference CPU method used in their work was not properly optimized.

In this paper, we discuss methods for the optimal implementation of 2D similarity computations on modern CPUs. We combine architecture-specific fast implementations of the population count primitive and architecture-agnostic algorithms for reducing memory traffic that enable 20–40× speedup relative to traditional CPU methods and achieve 65% of the theoretical peak machine performance. We demonstrate the performance of our methods on two model problems: similarity matrix construction and leader clustering. Without using specialized hardware, we achieve performance that is at worst within 5× that of GPU-based code and that at best beats the GPU. We include implementations of our high-speed algorithms under a permissive open-source license.

OVERVIEW OF 2D SIMILARITY

"Two-dimensional" chemical similarity measures define the similarity between a pair of compounds in terms of substructural similarities in their chemical graphs. Typical similarity measures of this type (e.g., MDL keys and path-based fingerprints like Daylight fingerprints [13,14]) represent molecules as binary vectors of a user-defined length. In simple fingerprints, such as MDL keys, [13] each bit represents the presence or absence of a particular chemical feature.
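To make the bit-vector picture concrete, the following minimal sketch computes a Tanimoto coefficient between two fingerprints using only bitwise AND and population count. It assumes the standard definition T(A, B) = |A AND B| / (|A| + |B| - |A AND B|); the helper names `popcount` and `tanimoto`, the use of Python integers as bit vectors, the toy 8-bit values, and the choice to return 0 for two all-zero fingerprints are all illustrative assumptions, not the implementation described in this paper.

```python
def popcount(x: int) -> int:
    """Count the set bits in a non-negative integer."""
    return bin(x).count("1")


def tanimoto(fp_a: int, fp_b: int) -> float:
    """Tanimoto coefficient between two fingerprints stored as Python ints."""
    common = popcount(fp_a & fp_b)                    # bits set in both
    union = popcount(fp_a) + popcount(fp_b) - common  # bits set in either
    # Assumed convention: two all-zero fingerprints get similarity 0.
    return common / union if union else 0.0


# Toy 8-bit fingerprints (illustrative values, not real chemical keys).
fp_a = 0b10110100
fp_b = 0b10010110
print(tanimoto(fp_a, fp_b))  # 3 shared bits out of 5 distinct bits -> 0.6
```

Only the intersection count depends on both molecules; the per-fingerprint counts can be computed once and reused, which is why population count over the AND of two vectors dominates the cost of large pairwise calculations.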
Hashed fingerprints, such as the ECFP family, [14]

Received: May 25, 2011

ABSTRACT: Similarity measures based on the comparison of dense bit vectors of two-dimensional chemical features are a dominant method in chemical informatics. For large-scale problems, including compound selection and machine learning, computing the intersection between two dense bit vectors is the overwhelming bottleneck. We describe efficient implementations of this primitive as well as example applications using features of modern CPUs that allow 20–40× performance increases relative to typical code. Specifically, we describe fast methods for population count on modern x86 processors and cache-efficient matrix traversal and leader clustering algorithms that alleviate memory bandwidth bottlenecks in similarity matrix construction and clustering. The speed of our 2D comparison primitives is within a small factor of that obtained on GPUs and does not require specialized hardware.
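As a rough illustration of the leader clustering problem named above, the sketch below implements the commonly described greedy scheme: take the first unassigned compound as a leader, assign every unassigned compound whose Tanimoto similarity to it meets a threshold, and repeat. This is an assumption about the general algorithm family, not the cache-efficient variant this paper develops; the function names, the integer fingerprint encoding, and the toy data are hypothetical.

```python
def popcount(x: int) -> int:
    """Count the set bits in a non-negative integer."""
    return bin(x).count("1")


def tanimoto(fp_a: int, fp_b: int) -> float:
    """Tanimoto coefficient between two fingerprints stored as Python ints."""
    common = popcount(fp_a & fp_b)
    union = popcount(fp_a) + popcount(fp_b) - common
    return common / union if union else 0.0


def leader_cluster(fingerprints, threshold):
    """Greedy leader clustering; returns clusters as lists of indices.

    The first index in each cluster is its leader.
    """
    unassigned = list(range(len(fingerprints)))
    clusters = []
    while unassigned:
        leader = unassigned[0]
        # The leader always joins its own cluster, even if it is all zeros.
        members = [leader] + [
            i for i in unassigned[1:]
            if tanimoto(fingerprints[leader], fingerprints[i]) >= threshold
        ]
        clusters.append(members)
        taken = set(members)
        unassigned = [i for i in unassigned if i not in taken]
    return clusters


# Toy 4-bit fingerprints (illustrative values only).
fps = [0b1111, 0b1110, 0b0001, 0b0011, 0b1111]
print(leader_cluster(fps, threshold=0.7))  # [[0, 1, 4], [2], [3]]
```

Each pass over the unassigned compounds streams through the whole remaining fingerprint set to serve a single leader, which is the kind of memory traffic that cache-efficient variants aim to reduce.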