A High-Performance Software Graphics Pipeline Architecture for the GPU

MICHAEL KENZEL, BERNHARD KERBL, and DIETER SCHMALSTIEG, Graz University of Technology, Austria
MARKUS STEINBERGER, Graz University of Technology, Austria and Max Planck Institute for Informatics, Germany

Fig. 1. Various scenes rendered by our software graphics pipeline in real-time on a GPU. (a) A smooth triangulation of the water surface in an animated ocean scene is achieved via a custom pipeline extension that allows the mesh topology to dynamically adapt to the underlying heightfield. (b) Scene geometry captured from video games like this still frame from Total War: Shogun 2 is used to evaluate the performance of our approach on real-world triangle distributions. (c) Many techniques such as mipmapping rely on the ability to compute screen-space derivatives during fragment shading. Our pipeline architecture can support derivative estimation based on pixel quad shading, used here to render a textured model of a heart with trilinear filtering; lower mipmap levels are filled with a checkerboard pattern to visualize the effect. Total War: Shogun 2 screenshot courtesy of The Creative Assembly; used with permission.

In this paper, we present a real-time graphics pipeline implemented entirely in software on a modern GPU. As opposed to previous work, our approach features a fully-concurrent, multi-stage, streaming design with dynamic load balancing, capable of operating efficiently within bounded memory. We address issues such as primitive order, vertex reuse, and screen-space derivatives of dependent variables, which are essential to real-world applications, but have largely been ignored by comparable work in the past. The power of a software approach lies in the ability to tailor the graphics pipeline to any given application. In exploration of this potential, we design and implement four novel pipeline modifications.
Evaluation of the performance of our approach on more than 100 real-world scenes collected from video games shows rendering speeds within one order of magnitude of the hardware graphics pipeline as well as significant improvements over previous work, not only in terms of capabilities and performance, but also robustness.

CCS Concepts: • Computing methodologies → Rasterization; Graphics processors; Massively parallel algorithms;

Additional Key Words and Phrases: Software Rendering, GPU, Graphics Pipeline, Rasterization, CUDA

ACM Reference Format:
Michael Kenzel, Bernhard Kerbl, Dieter Schmalstieg, and Markus Steinberger. 2018. A High-Performance Software Graphics Pipeline Architecture for the GPU. ACM Trans. Graph. 37, 4, Article 140 (August 2018), 15 pages. https://doi.org/10.1145/3197517.3201374

Authors’ addresses: Michael Kenzel, michael.kenzel@icg.tugraz.at; Bernhard Kerbl, bernhard.kerbl@icg.tugraz.at; Dieter Schmalstieg, dieter.schmalstieg@icg.tugraz.at, Graz University of Technology, Institute of Computer Graphics and Vision, Infeldgasse 16, Graz, 8010, Austria; Markus Steinberger, markus.steinberger@icg.tugraz.at, Graz University of Technology, Institute of Computer Graphics and Vision, Infeldgasse 16, Graz, 8010, Austria, Max Planck Institute for Informatics, Saarland Informatics Campus Building E1 4, Saarbrücken, 66123, Germany.

© 2018 Copyright held by the owner/author(s). Publication rights licensed to ACM. This is the author’s version of the work. It is posted here for your personal use. Not for redistribution. The definitive Version of Record was published in ACM Transactions on Graphics, https://doi.org/10.1145/3197517.3201374.

1 INTRODUCTION

For a long time now, the hardware graphics pipeline has been the backbone of real-time rendering. However, while a hardware implementation can achieve high performance and power efficiency, flexibility is sacrificed.
Driven by the need to support an ever growing spectrum of ever more sophisticated applications, the graphics processing unit (GPU) evolved as a tight compromise between flexibility and performance. The graphics pipeline on a modern GPU is implemented by special-purpose hardware on top of a large, freely-programmable, massively-parallel processor. More and more programmable stages have been added over the years. However, the overall structure of the pipeline and the underlying rendering algorithm have essentially remained unchanged for decades.

While evolution of the graphics pipeline proceeds slowly, GPU compute power continues to increase exponentially. In addition to the graphics pipeline, modern application programming interfaces (APIs) such as Vulkan [Khronos 2016b], OpenGL [Khronos 2016a], or Direct3D [Blythe 2006], as well as specialized interfaces like CUDA [NVIDIA 2016] and OpenCL [Stone et al. 2010], also allow the GPU to be operated in compute mode, which exposes the programmable cores of the GPU as a massively-parallel general-purpose co-processor. Although the hardware graphics pipeline remains at the core of real-time rendering, cutting-edge graphics applications increasingly rely on compute mode to implement major parts of sophisticated graphics algorithms that would not easily map to the traditional graphics pipeline, such as tiled deferred rendering [Andersson 2009], geometry processing (cloth simulation) [Vaisse 2014], or texel shading [Hillesland and Yang 2016].

ACM Trans. Graph., Vol. 37, No. 4, Article 140. Publication date: August 2018.