GPU-based VP8 Encoding: Performance in Native and Virtualized Environments

Pietro Paglierani
Research & Development, Italtel S.p.A., Castelletto di Settimo Milanese (MI)
Email: pietro.paglierani@italtel.com

Giuliano Grossi, Federico Pedersini and Alessandro Petrini
Università degli Studi di Milano – Dipartimento di Informatica, Via Comelico 39, I-20135 Milano, Italy
Email: {grossi, pedersini, petrini}@di.unimi.it

Abstract—A key motivation behind the success of Cloud Computing is that virtualization allows significant energy and cost savings by sharing physical resources. Another source of savings in virtualized architectures is the use of h/w accelerators (e.g. GPUs, FPGAs). This paper analyzes the performance achieved by a computationally demanding task running on a commodity server when a GPU-based accelerator is adopted. In the analysis, the VP8 video encoder has been used, with its most intensive functional block (motion estimation) implemented in the GPU. A simple but effective model to predict the achieved CPU usage savings is provided and experimentally validated. The performance achieved with different numbers of simultaneous encoding sessions and used CPU cores is presented and discussed. The results show that the hybrid CPU-GPU implementation can provide computational time savings from 20% to 300%, without any quality degradation. The presented results have been obtained within the FP7 T-NOVA Project.

I. INTRODUCTION

In the current telecommunication scenario, the transmission of video content covers approximately 80% of the global network traffic [1], and this percentage is expected to increase in the very near future. Hence, Network Functions based on video data processing, such as high-resolution video transcoding, have assumed a role of primary relevance for network operators, because they allow the provision of high-quality, revenue-generating services.
However, video functions, and in particular high-resolution video coding (encoding/decoding), are not only among the most computationally intensive workloads, but must also often fulfill very strict real-time constraints (for instance, a maximum allowed latency of one video frame, i.e. 20-30 ms). Thus, these functions, which are more and more often performed on standard commodity servers according to the Network Function Virtualization paradigm [2], would greatly benefit from the support of HW accelerator resources, to offload the x86 CPU from the most intensive tasks [3]. Among the various types of HW accelerator resources now available for standard server architectures, General-Purpose GPUs represent the most cost-effective option, provided that the inherently parallel GPU architecture can be fully exploited. Also, GPU programming has become a simpler task, when compared to other h/w acceleration resources (e.g. FPGAs), owing to the high-level programming tools and languages made available by the GPU manufacturers, such as the open OpenCL standard or NVIDIA's proprietary CUDA [4]. However, gaining significant improvements in video encoding/decoding performance by entirely moving the encoding process to the GPU can be a very hard and time-consuming task. The encoding process, in fact, can hardly be parallelized, because most video coding schemes (among them, H.264 [5] and Google's VP8 [6]) are inherently sequential: the coding process strongly relies on the spatial and temporal correlation of successive frames. For this reason, encoding and decoding assume that, while processing a macroblock, all previous macroblocks have already been processed. This eliminates the possibility of exploiting massive data parallelism; processing many macroblocks in parallel is simply not feasible. In addition, all standard coding schemes have been conceived for optimal CPU processing (e.g.
macroblocks are small enough to fit within the innermost levels of the CPU cache); as a result, CPU implementations are already highly effective when compared to many parallel architectures.

In this paper, we analyze the impact on video encoding performance that can be achieved through the use of a GPU, by adopting a hybrid approach in which only some encoding functionalities are offloaded to the GPU, while the remaining encoding functional blocks are still implemented on the CPU. In this way, improved performance for the encoding process is expected, at a lower implementation cost. To this end, a version of the VP8 encoder with the motion estimation block implemented in the GPU has been developed, and compared to the available CPU-only implementation. A simple but effective model to predict the CPU usage savings obtained by this approach is provided and experimentally validated. Also, the performance achieved with different numbers of simultaneous encoding sessions and used CPU cores is presented and discussed. According to the experimental results, obtained from a test set of video sequences using typical encoding configurations, the hybrid CPU-GPU implementation provides significant average computational time savings, ranging from 20% to 300% when using from 1 up to 14 cores, respectively, with respect to the CPU-only solution. Moreover, on the basis of results from extensive tests, it turns out that the analyzed solutions provide the same objective quality, both in terms of the peak signal-to-noise ratio (PSNR) and the structural similarity index measure (SSIM).

The paper is organized as follows: in section II, the VP8 encoding standard is described, and a new strategy for reducing overall encoding time by using GPU devices

2016 International Conference on Telecommunications and Multimedia (TEMU) 978-1-4673-8409-4/16/$31.00 ©2016 IEEE
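To make concrete why motion estimation, unlike the sequential macroblock coding loop, is a good candidate for GPU offloading: in exhaustive block matching, every candidate displacement for a macroblock is evaluated independently, so each candidate can be assigned to a separate GPU thread. The following Python sketch of full-search block matching is purely illustrative (the function names and the 16×16 block / ±8 search-range parameters are assumptions for the example, not taken from the paper's actual implementation):

```python
import numpy as np

def sad(block, candidate):
    """Sum of absolute differences (SAD) between two equally sized blocks."""
    return int(np.abs(block.astype(np.int32) - candidate.astype(np.int32)).sum())

def full_search_me(ref, cur, bx, by, bsize=16, search=8):
    """Exhaustive block-matching motion estimation for one macroblock.

    Finds the displacement (dx, dy) in the reference frame `ref` that best
    matches the macroblock of the current frame `cur` whose top-left corner
    is at (bx, by).  Every candidate displacement is scored independently,
    which is the property that maps naturally onto GPU parallelism.
    """
    h, w = ref.shape
    block = cur[by:by + bsize, bx:bx + bsize]
    best_cost, best_mv = None, (0, 0)
    for dy in range(-search, search + 1):
        for dx in range(-search, search + 1):
            y, x = by + dy, bx + dx
            # Skip candidates that fall outside the reference frame.
            if 0 <= y and y + bsize <= h and 0 <= x and x + bsize <= w:
                cost = sad(block, ref[y:y + bsize, x:x + bsize])
                if best_cost is None or cost < best_cost:
                    best_cost, best_mv = cost, (dx, dy)
    return best_mv, best_cost
```

On a GPU, the two nested candidate loops would be flattened across threads, with each thread computing one SAD and a parallel reduction selecting the minimum.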
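For reference, the PSNR figure used in the objective-quality comparison above is the standard peak signal-to-noise ratio over the mean squared error between the original and reconstructed frames. A minimal computation, assuming 8-bit samples (the SSIM computation, which involves local means, variances and covariances, is omitted for brevity), might look like:

```python
import numpy as np

def psnr(orig, recon, peak=255.0):
    """Peak signal-to-noise ratio in dB between two same-sized frames.

    PSNR = 10 * log10(peak^2 / MSE); identical frames yield infinity.
    """
    mse = np.mean((orig.astype(np.float64) - recon.astype(np.float64)) ** 2)
    if mse == 0.0:
        return float('inf')
    return 10.0 * np.log10(peak * peak / mse)
```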