Exploiting Spatial Architectures for Edit Distance Algorithms

Jesmin Jahan Tithi, Stony Brook University, jtithi@cs.stonybrook.edu
Neal C. Crago, VSSAD, Intel Corporation, neal.c.crago@intel.com
Joel S. Emer, VSSAD, Intel Corporation / CSAIL, MIT, joel.emer@intel.com, emer@csail.mit.edu

Abstract—In this paper, we demonstrate the ability of spatial architectures to significantly improve both runtime performance and energy efficiency on edit distance, a broadly used dynamic programming algorithm. Spatial architectures are an emerging class of application accelerators that consist of a network of many small and efficient processing elements that can be exploited by a large domain of applications. In this paper, we utilize the dataflow characteristics and inherent pipeline parallelism within the edit distance algorithm to develop efficient and scalable implementations on a previously proposed spatial accelerator. We evaluate our edit distance implementations using a cycle-accurate performance and physical design model of a previously proposed triggered instruction-based spatial architecture in order to compare against real performance and power measurements on an x86 processor. We show that when chip area is normalized between the two platforms, it is possible to get more than a 50× runtime performance improvement and over 100× reduction in energy consumption compared to an optimized and vectorized x86 implementation. This dramatic improvement comes from leveraging the massive parallelism available in spatial architectures and from the dramatic reduction of expensive memory accesses through conversion to relatively inexpensive local communication.

I. INTRODUCTION

There is a continuing demand in many application domains for increased levels of performance and energy efficiency.
While the number of transistors is expected to continue to scale with Moore’s law for at least the next five years, the “power wall” has dramatically slowed single-core processor performance scaling. Recently, several accelerator architectures have emerged to further improve performance and energy efficiency over multi-core processors in specific application domains. These architectures are tailored to properties inherent in these domains, and they vary in programmability. Perhaps the best-known examples are fixed-function accelerators, which are tailored to single algorithms such as video decoding, and GPUs, which are fully programmable and target data-parallel and SIMT-amenable code.

Spatially programmed architectures are an emerging class of programmable accelerators that target application domains with workloads whose best-known implementation involves asynchronous actors performing different tasks while frequently communicating with neighboring actors. Such architectures are typically made up of hundreds of small processing elements (PEs) connected together via an on-chip network. When an algorithm is mapped onto a spatial architecture, the algorithm’s dataflow graph is broken into regions, which are connected by producer-consumer relationships. Input data is then streamed through this pipelined set of regions. The application domains that spatially programmed architectures target span a number of important areas such as signal processing, media codecs, cryptography, compression, pattern matching, and sorting.

Edit distance is a broad class of algorithms that find use in many important applications, spanning domains such as bioinformatics, data mining, text and data processing, natural language processing, and speech recognition. The edit distance problem determines the minimum number of “non-match” data edits to convert a source string or data object S to a target string or data object T.
The algorithm also keeps track of the specific data edits required to convert S to T, from a set of four possible data edits: insertion, deletion, match, or substitution. For example, if S = “computer” and T = “commute”, the minimum number of “non-match” edits to convert S to T is 2, and the edits are MMMSMMMD (where M = Match, S = Substitute, D = Delete).

The edit distance problem is ripe for acceleration, as the dynamic programming techniques typically used to solve the problem take O(nm) time and space, where m and n are the lengths of the strings S and T respectively. However, the nature of data dependences within the algorithm makes vectorization and parallelization non-trivial on modern CPUs and GPUs. On the other hand, these same data dependences have dataflow properties that map naturally onto spatial architectures using pipeline parallelism [1], [2]. Moreover, exploiting pipeline parallelism also enables the conversion of many memory references into much less expensive local communication, which further improves efficiency.

In this paper, we study the potential of spatial architectures to solve the edit distance problem. Our main goal is to show that edit distance and similar algorithms naturally map to spatial architectures, improving performance and energy consumption substantially over general-purpose processors. We first build intuition on why spatial architectures are a good fit for edit distance and similar algorithms with local dependencies. We then describe three different algorithmic implementations of edit distance tailored to spatial architectures. We evaluate these implementations by conducting a detailed experimental analysis of performance and energy consumption on a triggered instruction-based spatial architecture [3]. Finally, we compare the performance and energy consumption of our implementations to highly optimized and vectorized x86 implementations.
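
To make the edit distance formulation used in this introduction concrete, the following is a minimal Python sketch of the textbook O(nm) dynamic program with a traceback that recovers the edit string. It is an illustration only, not one of the spatial implementations evaluated in this paper. The function names edit_distance and wavefront_distance, the unit edit costs, and the tie-breaking order in the traceback are assumptions made for this example. The second function fills the same table one anti-diagonal at a time: every cell on an anti-diagonal depends only on cells from the two preceding anti-diagonals, which is one way to see the local, pipeline-friendly dependences discussed above.

```python
def _init_table(n, m):
    """Table d where d[i][j] will hold the cost of converting s[:i] to t[:j]."""
    d = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        d[i][0] = i  # delete all i characters of s[:i]
    for j in range(1, m + 1):
        d[0][j] = j  # insert all j characters of t[:j]
    return d


def edit_distance(s, t):
    """Return (distance, edits); edits uses M, S, D, I (unit-cost assumption)."""
    n, m = len(s), len(t)
    d = _init_table(n, m)
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            diag = d[i - 1][j - 1] + (s[i - 1] != t[j - 1])
            d[i][j] = min(diag, d[i - 1][j] + 1, d[i][j - 1] + 1)

    # Traceback from the bottom-right corner. Ties are broken in the order
    # diagonal, delete, insert (an arbitrary choice for this sketch).
    ops = []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and d[i][j] == d[i - 1][j - 1] + (s[i - 1] != t[j - 1]):
            ops.append("M" if s[i - 1] == t[j - 1] else "S")
            i -= 1
            j -= 1
        elif i > 0 and d[i][j] == d[i - 1][j] + 1:
            ops.append("D")
            i -= 1
        else:
            ops.append("I")
            j -= 1
    return d[n][m], "".join(reversed(ops))


def wavefront_distance(s, t):
    """Same recurrence, evaluated one anti-diagonal (i + j == k) at a time."""
    n, m = len(s), len(t)
    d = _init_table(n, m)
    for k in range(2, n + m + 1):
        # Cells on this anti-diagonal read only anti-diagonals k-1 and k-2,
        # so they are mutually independent and could be computed in parallel.
        for i in range(max(1, k - m), min(n, k - 1) + 1):
            j = k - i
            diag = d[i - 1][j - 1] + (s[i - 1] != t[j - 1])
            d[i][j] = min(diag, d[i - 1][j] + 1, d[i][j - 1] + 1)
    return d[n][m]


print(edit_distance("computer", "commute"))  # (2, 'MMMSMMMD')
```

For the example strings above, the sketch reproduces the distance of 2 and the edit string MMMSMMMD. When several optimal edit sequences exist, a different tie-breaking order in the traceback would return a different but equally short sequence.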