Evaluating the Jaccard-Tanimoto Index on Multi-core Architectures

Vipin Sachdeva¹, Douglas M. Freimuth², and Chris Mueller³

¹ IBM Future Technologies Design Center, Indianapolis, IN, vsachde@us.ibm.com
² IBM Watson Research Center, Hawthorne, NY, dmfreim@us.ibm.com
³ Pervasive Technologies Labs and University Information Technology Services, Indiana University, Bloomington, IN, chemuell@cs.indiana.edu

Abstract. The Jaccard/Tanimoto coefficient is an important workload, used in a large variety of problems including drug design fingerprinting, clustering analysis, similarity web searching and image segmentation. This paper evaluates the Jaccard coefficient on three platforms: the Cell Broadband Engine™ processor, the Intel® Xeon® dual-core platform, and the Nvidia® 8800 GTX GPU. We have developed a novel parallel algorithm for all-to-all Jaccard comparisons, specially suited to the Cell/B.E. architecture, that minimizes DMA transfers and reuses data in the local store. We show that our implementation on Cell/B.E. outperforms the implementations on comparable Intel platforms by 6-20X with full accuracy and by 10-50X in reduced accuracy mode, depending on the size of the data, and outperforms the Nvidia 8800 GTX by more than 60X. In addition to performance, we discuss in detail our efforts to optimize our workload on these architectures and explain how the avenues for optimization differ substantially from one architecture to another for our workload. Our work shows that the algorithms or kernels employed for the Jaccard coefficient calculation are heavily dependent on the traits of the target hardware.

1 Introduction

Recent years have seen a resurgence in the number of hardware choices available to programmers. Multi-core processor architectures, which have multiple processing elements on a single chip, are now the norm of the industry [1].
A vast number of hardware choices are now available to a high-performance computing programmer: general-purpose processors available from IBM, AMD and Intel have up to 8 cores; the Cell/B.E. architecture has 8 special vector cores called SPEs and a PPC core called the PPE; and, more recently, GPUs, which are capable of running hundreds of threads and are primarily meant for graphics processing tasks, are also being evaluated for high-performance computing. This has very important implications for many industries, which could now accelerate their workloads

G. Allen et al. (Eds.): ICCS 2009, Part I, LNCS 5544, pp. 944–953, 2009. © Springer-Verlag Berlin Heidelberg 2009