GPU-based VP8 Encoding: Performance in Native
and Virtualized Environments
Pietro Paglierani
Research & Development, Italtel S.p.A
Castelletto di Settimo Milanese (MI)
Email: pietro.paglierani@italtel.com
Giuliano Grossi, Federico Pedersini and Alessandro Petrini
Università degli Studi di Milano – Dipartimento di Informatica
Via Comelico 39, I-20135 Milano, Italy
Email: {grossi, pedersini, petrini}@di.unimi.it
Abstract—A key motivation behind the success of Cloud Com-
puting is that virtualization allows significant energy and cost
savings by sharing physical resources. Another source of savings
in virtualized architectures is the use of h/w accelerators (e.g.
GPUs, FPGAs). This paper analyzes the performance achieved
by a computationally demanding task running on a commodity
server when a GPU-based accelerator is adopted. In the analysis,
the VP8 video encoder has been used, with its most intensive
functional block (motion estimation) implemented in the GPU.
A simple but effective model to predict the achieved CPU
usage savings is provided, and experimentally validated. The
performance achieved with different numbers of simultaneous
encoding sessions and CPU cores used is presented and discussed.
The results show that the hybrid CPU-GPU implementation
can provide computational time savings from 20% to 300%,
without any quality degradation. The presented results have been
obtained within the FP7 T-NOVA Project.
I. INTRODUCTION
In the current telecommunication scenario, video content accounts
for approximately 80% of global network traffic [1], and this share
is expected to increase in the very near future. Hence, Network
Functions based on
video data processing, such as high resolution video transcod-
ing, have assumed a role of primary relevance for network
operators, because they allow the provision of high-quality,
revenue-generating services. However, video functions, and in
particular high resolution video coding (encoding/decoding),
not only are among the most computationally-intensive work-
loads, but they must also often fulfill very strict real-time
constraints (for instance, a maximum allowed latency of one
video frame, i.e. 20-30 ms). Thus, these functions, which are
increasingly performed on standard commodity servers according to
the Network Function Virtualization paradigm [2], would greatly
benefit from the support of HW accelerator resources, to offload
the x86 CPU from the most intensive tasks [3]. Among the various types
of HW accelerator resources now available for standard server
architectures, General-Purpose GPUs represent the most cost-effective
option, provided that the inherently parallel GPU architecture can be
fully exploited. Moreover, GPU programming has become a simpler
task than using other h/w acceleration resources (e.g.
FPGAs), owing to the high-level programming tools and
languages made available by GPU manufacturers, such
as the open OpenCL standard or NVIDIA's proprietary
CUDA [4]. However, gaining significant improvements in
video encoding/decoding performance by entirely moving
the encoding process to the GPU can be a very hard and
time consuming task. In fact, the encoding process can hardly be
parallelized, because most video coding schemes (among
them, H.264 [5] and Google's VP8 [6]) are inherently
sequential: the coding process strongly relies on the
spatial and temporal correlation within and between successive
frames. For this reason, encoding and decoding assume that, while
processing a macroblock, all previous macroblocks have already been
processed. This rules out massive data parallelism: simply
processing many macroblocks in parallel is not feasible. In
addition, all standard coding schemes have been conceived for
optimal CPU processing (e.g. macroblocks are small enough to fit
within the innermost levels of the CPU cache), so they are often
already highly efficient on the CPU compared to many parallel
architectures.
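The distinction above can be made concrete: while the entropy-coding loop is sequential, the motion estimation stage is naturally data-parallel, since every candidate displacement (and every macroblock) can be scored independently. The following toy Python sketch illustrates full-search block matching with the sum-of-absolute-differences (SAD) criterion; the frame size, block size and search range are illustrative choices only, not the parameters of the paper's actual VP8 encoder.

```python
# Toy full-search block-matching motion estimation (SAD criterion).
# Illustrative parameters only; VP8 uses 16x16 macroblocks.
W, H = 16, 16      # toy frame dimensions
B = 4              # block size
R = 4              # search range in pixels

def sad(cur, ref, bx, by, dx, dy):
    """SAD between the BxB block of `cur` at (bx, by) and the
    candidate block of `ref` displaced by (dx, dy)."""
    return sum(
        abs(cur[(by + y) * W + (bx + x)] -
            ref[(by + y + dy) * W + (bx + x + dx)])
        for y in range(B) for x in range(B)
    )

def full_search(cur, ref, bx, by):
    """Exhaustive motion search for one block. Each candidate
    displacement is evaluated independently, which is why this
    stage maps well onto a GPU: one thread (or thread block) per
    candidate or per macroblock."""
    best = None
    for dy in range(-R, R + 1):
        for dx in range(-R, R + 1):
            if not (0 <= bx + dx and 0 <= by + dy and
                    bx + dx + B <= W and by + dy + B <= H):
                continue  # candidate falls outside the reference frame
            s = sad(cur, ref, bx, by, dx, dy)
            if best is None or s < best[0]:
                best = (s, dx, dy)
    return best  # (SAD score, dx, dy)

# A bright 4x4 patch sits at (4, 4) in the reference frame and has
# moved to (6, 5) in the current frame; the search should recover
# the displacement (-2, -1) back to the reference position.
ref = [0] * (W * H)
cur = [0] * (W * H)
for y in range(B):
    for x in range(B):
        ref[(4 + y) * W + (4 + x)] = 200
        cur[(5 + y) * W + (6 + x)] = 200

score, mvx, mvy = full_search(cur, ref, 6, 5)
```

On a GPU, the two nested candidate loops in `full_search` (and the outer loop over macroblocks, omitted here) become the thread grid, since no candidate depends on another.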
In this paper, we analyze the video encoding performance
gains that can be achieved through the use of a GPU,
by adopting a hybrid approach, in which only some encoding
functionalities are offloaded to the GPU, while the remaining
encoding functional blocks are still implemented on the
CPU. In this way, improved encoding performance is expected
at a lower implementation cost. To this end, a version of the
VP8 encoder with the motion estimation block offloaded to the
GPU has been developed, and compared to the available CPU-only
implementation. A simple but effective model to predict the
CPU usage savings obtained by this approach is provided,
and experimentally validated. Also, the performance achieved
with different numbers of simultaneous encoding sessions and
CPU cores used is presented and discussed. According to
the experimental results, obtained from a test set of video
sequences by using typical encoding configurations, the hy-
brid CPU-GPU implementation provides significant average
computational time savings ranging from 20% to 300%, using
from 1 up to 14 cores, respectively, with respect to the CPU-
only solution. Moreover, on the basis of results from extensive
tests, it turns out that the analyzed solutions provide
the same objective quality, in terms of both the peak signal-to-
noise ratio (PSNR) and the structural similarity index measure
(SSIM). The paper is organized as follows: in Section II,
the VP8 encoding standard is described, and a new strategy
for reducing overall encoding time by using GPU devices
2016 International Conference on Telecommunications and Multimedia (TEMU)
978-1-4673-8409-4/16/$31.00 ©2016 IEEE