Column Scan Acceleration in Hybrid CPU-FPGA Systems

Nusrat Jahan Lisa, Annett Ungethüm, Dirk Habich, Wolfgang Lehner
Technische Universität Dresden, Database Systems Group, Dresden, Germany
{firstname.lastname}@tu-dresden.de

Nguyen Duy Anh Tuan, Akash Kumar
Technische Universität Dresden, Processor Design Group, Dresden, Germany
{firstname.lastname}@tu-dresden.de

ABSTRACT

Nowadays, in-memory column store database systems are state-of-the-art for analytical workloads. In these column stores, a full column scan is a fundamental operation, and optimizing this primitive is therefore crucial from a performance perspective. For this optimization, advances in hardware are always an interesting opportunity, but they also represent a major challenge. At the moment, hardware systems are increasingly changing from homogeneous CPU systems towards hybrid systems with different computing units. Based on that, we focus in this paper on column scan acceleration for hybrid hardware systems that incorporate a Field Programmable Gate Array (FPGA) and a CPU into a single system. The advantage of those hybrid systems is that the FPGA usually has direct access to the main memory of the CPU, avoiding the data copying that is necessary in other hybrid systems like CPU-GPU architectures. Thus, we present several FPGA designs for a recent column scan technique to fully offload the scan operation to the FPGA. In detail, we present our basic FPGA design and different optimization techniques. Then, we present selected results of our exhaustive evaluation showing the benefit of our FPGA acceleration. As we are going to show, we achieve a maximum speedup of a factor of 7 compared to a single-threaded CPU scan execution.

1. INTRODUCTION

In our data-driven world, efficient query processing is still an important aspect due to the ever-growing amount of data. In fact, the growth of data even outpaces Moore's law of digital circuit complexity [35].
Therefore, the architecture of database systems is constantly evolving, especially by adapting novel hardware features to satisfy response time and throughput demands [6, 20, 25, 30, 33]. For instance, the database architecture shifted from a disk-oriented to a main memory-oriented architecture to efficiently exploit the ever-increasing capacities of main memory [1, 22, 27, 37]. This in-memory database architecture is now state-of-the-art and characterized by the fact that all relevant data is completely stored and processed in main memory. Additionally, relational tables are organized by column rather than by row [1, 6, 8, 22, 37], and the traditional tuple-at-a-time query processing model was replaced by newer and adapted processing models like column-at-a-time or vector-at-a-time [1, 6, 22, 37, 48].

To further increase the performance of queries, in particular for analytical queries in these in-memory column stores, two key aspects play an important role. On the one hand, data compression is used to tackle the continuously increasing gap between the computing power of CPUs and memory bandwidth (also known as the memory wall [6]) [2, 5, 9, 21, 47]. Aside from reducing the amount of data, compressed data offers several advantages such as less time spent on load and store instructions, a better utilization of the cache hierarchy, and fewer misses in the translation lookaside buffer. On the other hand, in-memory column stores constantly adapt to novel hardware features like vectorization using Single Instruction Multiple Data (SIMD) extensions [34, 48], GPUs [20, 26], or non-volatile main memory [33].

From a hardware perspective, we currently observe a shift from homogeneous CPU systems towards hybrid systems with different computing units, mainly to overcome the physical limits of homogeneous systems [11, 29].
In particular, hybrid hardware systems incorporating a Field Programmable Gate Array (FPGA) and a CPU are emerging and are very interesting from a performance perspective. Generally, FPGAs are integrated circuits that are configurable after being manufactured. Thus, FPGAs can be used as a hardware extension to the database system in which some specialized functionality is efficiently implemented. Additionally, FPGAs usually have direct access to the main memory of the CPU in such hybrid systems. In contrast to other hybrid systems like CPU-GPU systems, this direct main memory access avoids the bottleneck of copying data between the different computing units [14, 26].

Our Contribution

A core primitive in in-memory column stores is a column scan [12, 31, 40], because analytical queries usually compute aggregations over full columns or large parts of columns. Thus, the optimization of this scan primitive is crucial from a performance perspective, and several software-based approaches have been proposed [12, 31, 40]. Some of these approaches are already tailored to hardware features such as SIMD vectorization [12, 40]. Generally, the task of a column scan is to compare each entry of a given