Exploiting Integrated GPUs for Network Packet Processing Workloads

Janet Tseng, Ren Wang, James Tsai, Saikrishna Edupuganti, Alexander W. Min, Stephen Junkins, and Tsung-Yuan Charlie Tai
Intel Labs & Intel Visual and Parallel Computing Group
Email: {janet.tseng, ren.wang, james.tsai, saikrishna.edupuganti, alexander.w.min, stephen.junkins, charlie.tai}@intel.com

Shinae Woo
Department of Computer Science, KAIST
Email: shinae@an.kaist.ac.kr

Abstract—Software-based network packet processing on standard high volume servers promises better flexibility, manageability, and scalability, and has thus gained tremendous momentum in recent years. Numerous research efforts have focused on boosting packet processing performance by offloading to discrete graphics processing units (GPUs). While integrated GPUs, residing on the same die as the CPU, offer many advanced features, such as on-chip-interconnect CPU-GPU communication and shared physical/virtual memory, their applicability to packet processing workloads has not been fully understood and exploited. In this paper, we conduct in-depth profiling and analysis to understand the integrated GPU's capabilities and performance potential for packet processing workloads. Based on that understanding, we introduce a GPU-accelerated network packet processing framework that fully utilizes the integrated GPU's massive parallel processing capability without batching large numbers of packets, which would cause significant processing delay. We implemented the proposed framework and evaluated its performance with several common, light-weight packet processing workloads on the Intel® Xeon® Processor E3-1200 v4 product family (codename Broadwell) with an integrated GT3e GPU. The results show that our GPU-accelerated packet processing framework improved throughput by 2-2.5x compared to an optimized CPU-only packet processing baseline.

I. INTRODUCTION

Recent years have witnessed the fast deployment of software-defined networking (SDN) and network functions virtualization (NFV) in data center environments and by telecommunication providers. This trend propels the need for high-speed software-based packet processing on standard high volume servers. However, the recent surge of network I/O bandwidth, with 100 Gbps Ethernet coming to market, has put pressure on the CPUs, caches, and system memories of multicore servers to sustain both packet processing and NFV services [1].

GPUs have emerged as a promising candidate for offloading network packet processing workloads from the CPU in order to achieve higher full-system performance and free more CPU cycles for other application services, e.g., packet forwarding [2], [3], [4], [5], Secure Sockets Layer (SSL) encryption/decryption [6], and regular expression matching in intrusion detection systems [7]. However, most existing approaches have focused on designing offloading architectures for discrete GPUs. Studies have shown [5], [8] that for discrete GPUs, large batches of packets need to be sent to the GPU to amortize the high CPU-GPU communication latency over the PCIe (PCI Express) bus. This can increase packet processing latency significantly, which is especially undesirable in high-speed data center or carrier networks. Meanwhile, proper system optimization for packet processing on the CPU side, such as group prefetching and software pipelining, has proven effective in improving CPU performance [8].

Recently, integrated GPUs, where the CPU and GPU are located on the same die and communicate through an on-chip interconnect instead of PCIe, have become more popular in modern server architectures. Integrated GPUs often offer the advantages, over conventional discrete GPUs, of reducing total system power and cost.
The additional computational resources provided by such integrated GPUs could be beneficial for network packet processing workloads.

This paper aims to fully exploit integrated GPUs for offloading packet processing workloads in modern server architectures. To this end, we first designed several micro-benchmarks to profile and understand the integrated GPU's capabilities relevant to packet processing, for example, computation capability, communication latency, and random memory access speed. Armed with that insight, we propose a new network packet processing architecture that takes advantage of both the CPU and the integrated GPU to achieve better system performance. Our proposed framework is carefully designed to incorporate several features that are feasible only with integrated GPUs, including (i) continuous threads, which eliminate the need to launch a kernel for every batch of packets, and (ii) a multi-buffering technique that further hides communication latency without requiring large packet batches.

We implemented the proposed architecture on a platform based on the Intel® Xeon® Processor E3-1200 v4 product family. We then characterized and evaluated the performance of Intel's integrated GPU for several network packet processing workloads. We focus on packet processing applications with relatively light computation, aiming to answer the fundamental question of whether integrated GPUs are beneficial for common, lightweight packet processing; it is well understood that heavier computational workloads benefit more from the GPU's massive parallel computation power [8]. Our experiments show that our carefully designed and optimized CPU-GPU packet processing framework effectively improves throughput over the CPU-only baseline by 2-2.5x, without batching large numbers of packets.

II. MOTIVATION AND BACKGROUND

Network packet processing workloads are inherently highly parallelizable at the packet or flow level.
Therefore, GPUs with their many execution units can be a good alternative compute resource for such workloads, with packets effectively distributed across hundreds or thousands of cores. Batching