ClawHMMER: A Streaming HMMer-Search Implementation Daniel Reiter Horn Mike Houston Pat Hanrahan Stanford University Abstract The proliferation of biological sequence data has motivated the need for an extremely fast probabilistic sequence search. One method for performing this search involves evaluating the Viterbi probability of a hidden Markov model (HMM) of a desired sequence family for each sequence in a protein database. However, one of the difficulties with current im- plementations is the time required to search large databases. Many current and upcoming architectures offering large amounts of compute power are designed with data-parallel execution and streaming in mind. We present a streaming algorithm for evaluating an HMM’s Viterbi probability and refine it for the specific HMM used in biological sequence search. We implement our streaming algorithm in the Brook language, allowing us to execute the algorithm on graphics processors. We demonstrate that this streaming algorithm on graphics processors can outperform available CPU im- plementations. We also demonstrate this implementation running on a 16 node graphics cluster. Keywords: Bio Science, Data Parallel Computing, Stream Computing, Programmable Graphics Hardware, GPU Com- puting, Brook 1 Introduction Biological sequence data are becoming both more plenti- ful and more accessible to researchers around the world. Specifically, rich databases of protein and DNA sequence data are available at no cost on the Internet. For in- stance, millions of proteins are available in the NCBI Non- redundant protein database [2005] and at the Universal Pro- tein Resource [2005], and likewise the GenBank DNA data- base [NCB 2005] contains millions of genes. With this pro- liferation of data comes a large computational cost to query and reason about the relationships of a given family of se- quences. 
A normal string comparison is computationally simple. However, because of the randomness of evolution, proteins that share a purpose or structure may contain different amino-acid sequences, perhaps sharing only a common sequence pattern. BLAST [Altschul et al. 1990] uses dynamic programming to perform a fuzzy string match between two proteins, penalizing gaps in the match to evaluate a sequence similarity score. In practice, however, BLAST queries must be run several times before an operator identifies suitable gap-penalty values and gets an appropriate number of hits for the task at hand.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage, and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. SC|05 November 12-18, 2005, Seattle, Washington, USA. Copyright 2005 ACM 1-59593-061-2/05/0011...$5.00

To mitigate the problem of choosing an ad-hoc gap penalty for a given BLAST search, Krogh et al. [1994] proposed bringing the probabilistic techniques of hidden Markov models (HMMs) to bear on the problem of fuzzy protein sequence matching. HMMer [Eddy 2003a] is an open source implementation of hidden Markov algorithms for use with protein databases. One of its more widely used algorithms, hmmsearch, works as follows: a user provides an HMM modeling a desired protein family, and hmmsearch processes each protein sequence in a large database, evaluating the probability that the most likely path through the query HMM could generate that database protein sequence. This search requires a computationally intensive procedure known as the Viterbi [1967; 1973] algorithm. The search can take hours or even days, depending on the size of the database, the query model, and the processor used.
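The search described above can be illustrated with a minimal sketch, assuming a generic two-state HMM rather than the profile HMM that HMMer actually uses. The names `to_log`, `viterbi_log_prob`, `STATES`, `DATABASE`, and all probabilities below are hypothetical and chosen only for illustration. The sketch scores each database sequence by the log-probability of its most likely state path, working in log space so that long sequences do not underflow.

```python
import math

def to_log(table):
    """Convert a nested dict of nonzero probabilities to log-probabilities."""
    return {k: {j: math.log(p) for j, p in row.items()} for k, row in table.items()}

def viterbi_log_prob(obs, states, log_start, log_trans, log_emit):
    """Log-probability of the most likely state path emitting obs (non-empty)."""
    # best[s] = log-probability of the best path that ends in state s
    best = {s: log_start[s] + log_emit[s][obs[0]] for s in states}
    for symbol in obs[1:]:
        # Extend every best path by one step, keeping the best predecessor
        best = {
            s: max(best[p] + log_trans[p][s] for p in states) + log_emit[s][symbol]
            for s in states
        }
    return max(best.values())

# Hypothetical toy model (assumption): a "match" state that favors 'A'
# and a "background" state that emits 'A' and 'C' uniformly.
STATES = ("match", "background")
LOG_START = {s: math.log(0.5) for s in STATES}
LOG_TRANS = to_log({
    "match": {"match": 0.8, "background": 0.2},
    "background": {"match": 0.2, "background": 0.8},
})
LOG_EMIT = to_log({
    "match": {"A": 0.9, "C": 0.1},
    "background": {"A": 0.5, "C": 0.5},
})

# Hypothetical two-sequence "database"; each sequence is scored independently.
DATABASE = {"seq1": "AA", "seq2": "CC"}
scores = {
    name: viterbi_log_prob(seq, STATES, LOG_START, LOG_TRANS, LOG_EMIT)
    for name, seq in DATABASE.items()
}
```

In this toy setting `seq1` scores higher than `seq2` because it stays in the state that favors 'A'. Each database sequence is scored independently of the others, so the per-sequence work has no dependencies across sequences.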
Despite the lengthy execution time required to search a database, hmmsearch is widely used in the biology research community. For instance, Narukawa and Kadowaki [2004] built a model of a typical trans-membrane section of a protein and used hmmsearch to search for proteins that contain this trans-membrane domain. Staub et al. [2001] trained a model with a few significant Spin/SSTY protein homologues that were repeated, hence important, in vertebrates and used this model to find similar amino-acid sequences in the non-redundant database. Clark and Berg [1998] leveraged their knowledge of the function of transcription factor protein TRA-1 as controlling sexual development in the C. elegans worm and used hmmsearch to identify genes in the C. elegans genome that potentially bind with TRA-1. Additional applications of hmmsearch to molecular and cell biology can be found in [Fawcett et al. 2000; Sánchez-Pulido et al. 2004; Bhaya et al. 2002].

The utility of HMMer in biological research and the unwieldy query run times in its normal usage make hmmsearch an important candidate for acceleration. There has been a great deal of work on optimizing HMMer for traditional CPUs [Lindahl 2005; Cofer and SGI. 2002]. However, a new class of streaming processors, both currently available and forthcoming, requires different optimization strategies. For example, modern graphics processors (GPUs) have been shown to provide very attractive compute and bandwidth resources [Buck 2005].

In this paper we present ClawHMMer, a streaming implementation of hmmsearch running on commodity graphics processors and parallel rendering clusters. We discuss the transformation of the traditional algorithm into a streaming algorithm and explore optimizing the algorithm for GPUs and future streaming processors.
On the latest GPUs, our streaming implementation is on average three times as fast as a heavily optimized PowerPC G5 implementation and twenty-five times as fast as the standard Intel P4 implementation.