2016 23º Encontro Português de Computação Gráfica e Interação (EPCGI)

Non-Homogeneous Grids for CPU-GPU Ray Tracing

Vasco Costa, João M. Pereira and Joaquim A. Jorge
INESC-ID / Instituto Superior Técnico, University of Lisbon
Lisbon, Portugal
Email: vasco.costa@tecnico.ulisboa.pt, jap@inesc-id.pt, jaj@inesc-id.pt

Abstract—Ray tracing is among the most resource-consuming methods for realistic image generation. Over the years, different acceleration structures have been proposed to reduce ray-object intersection queries, since these dominate execution time. Regular grids are one of the most popular structures due to their simplicity and effectiveness. However, regular grid implementations are plagued by two major issues: underwhelming performance on irregular scenes with unbalanced triangle density, and high memory consumption due to the many empty cells in sparsely populated scenes, typical of many game scenarios. We present a novel hybrid solution based on non-homogeneous rectilinear grids to improve ray tracing performance on uneven scene distributions. Additionally, we use hashing to eliminate empty cells. Non-homogeneous grids feature movable split planes along the three axes, unlike regular grids, where split planes must be equidistant. Our approach performs serial construction tasks such as compression on the CPU and offloads the remaining data-parallel tasks to the GPU. Using this acceleration structure we are able to render a wide range of scenes at high frame rates on commodity graphics hardware, from irregular-density, low polygon count models to regular-density, high polygon count scanned scenes, with rapid construction times and a small memory footprint. For some test cases, our approach nearly doubles the frame rate of a regular grid at a similar resolution, while featuring low build times.

Index Terms—Raytracing, GPU, spatial subdivision, rectilinear, grid.

I. INTRODUCTION

Stream computing provides increasingly abundant floating-point resources. These computing architectures can run lightweight threads and are typically connected to high-bandwidth memory interfaces. However, stream computing platforms also exhibit limitations: a comparatively low amount of available memory, a lower tolerance to branch divergence, and higher-latency memory access.

This means that highly hierarchical data structures, in particular trees, may incur higher penalties due to mispredictions than other, more regular, data structures such as arrays. It also means one is often better off doing more computations rather than accessing cached values in a lookup table, to avoid possible stalls. This is one reason for our interest in grid spatial subdivision structures for GPU ray tracing rather than tree-based approaches.

Another reason is that grids feature O(n) construction time for n primitives and O(1) time to access any given cell, whereas tree-based structures built with high-quality surface area heuristics [1], [2] feature superlinear O(n log n) construction times [3] and logarithmic O(log n) access times. This superlinear construction cost becomes untenable when attempting to render animated scenes, especially those featuring destructible geometry, or in tasks which require the rapid visualization of large models, featuring tens of millions of triangles, without long waiting times due to costly preprocessing methods. Hence our focus on grid acceleration structures, which do not suffer from these limitations.

Non-hierarchical grids have traditionally suffered from poor performance when rendering non-homogeneous scenes featuring polygon soups of irregular density. One example is the "teapot in a stadium" problem, which manifests itself when high polygon count objects lie inside lower polygon count boxes.
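To make the imbalance concrete, the following minimal sketch bins hypothetical triangle centroids along a single axis. It first uses the equidistant split planes of a regular grid, with the O(1) cell lookup mentioned above, and then movable planes placed at quantiles of the centroid distribution. The scene data, the function names, the quantile placement, and the binary-search lookup are all assumptions made for illustration; they are not the construction or traversal algorithms of this paper.

```python
import bisect
import random


def uniform_cell(x, lo, hi, cells):
    """O(1) cell lookup for a regular grid with equidistant split planes."""
    i = int((x - lo) / (hi - lo) * cells)
    return min(max(i, 0), cells - 1)  # clamp points lying on the upper bound


def equal_count_planes(coords, lo, hi, cells):
    """Split planes at quantiles of the coordinates (illustrative assumption)."""
    s = sorted(coords)
    inner = [s[(i * len(s)) // cells] for i in range(1, cells)]
    return [lo] + inner + [hi]


def planes_cell(x, planes):
    """Cell lookup for arbitrary sorted planes, here via binary search."""
    i = bisect.bisect_right(planes, x) - 1
    return min(max(i, 0), len(planes) - 2)


def histogram(coords, cell_of, cells):
    """Count how many centroids fall into each cell."""
    counts = [0] * cells
    for x in coords:
        counts[cell_of(x)] += 1
    return counts


# Hypothetical "teapot in a stadium" along one axis: 1000 centroids spread
# over [0, 100] (the stadium) and 9000 packed into [42, 44] (the teapot).
rng = random.Random(0)
stadium = [rng.uniform(0.0, 100.0) for _ in range(1000)]
teapot = [rng.uniform(42.0, 44.0) for _ in range(9000)]
coords = stadium + teapot

CELLS = 10
regular = histogram(coords, lambda x: uniform_cell(x, 0.0, 100.0, CELLS), CELLS)
planes = equal_count_planes(coords, 0.0, 100.0, CELLS)
relaxed = histogram(coords, lambda x: planes_cell(x, planes), CELLS)

print("regular grid counts:", regular)
print("movable plane counts:", relaxed)
```

With equidistant planes, the cell that contains the teapot receives at least 9000 of the 10000 centroids, so any ray crossing it faces a long list of intersection candidates. With quantile planes, every cell holds roughly the same number of centroids, at the price of a lookup that is no longer a single arithmetic expression in this simplified version.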
For non-adaptive regular grids such a scene will typically lead to an overabundance of polygons in a small set of cells, causing a drop in frame rate when rendering the scene at that spot. Our algorithm seeks to solve this problem by relaxing the placement of the split planes. To this end, rather than dividing the scene into equal-volume cells, as a regular grid would do, we divide the scene into cells with similar polygon counts, using a non-homogeneous grid (NHG), leading to better load balancing on average when rendering the scene.

Our contributions include novel algorithms to rapidly construct NHGs on a hybrid CPU-GPU platform, as well as an efficient algorithm for GPU NHG traversal. These algorithms allow rendering complex scenes in real time, outperforming state-of-the-art approaches in terms of memory consumption versus speed trade-offs.

The organization of this paper is as follows: we survey previous related work in detail, then we present our new algorithms for NHG construction and traversal on a stream computing architecture. Next, we describe the testing methods we adopted and present performance figures to compare our technique to the state of the art. Finally, we discuss results and present ideas for future work.

II. RELATED WORK

According to Whitted [4], up to 90% of the time spent while ray tracing a scene can be attributed to ray-object intersections. In order to attain real-time performance in ray tracing, some sort of acceleration scheme must be employed. Current ray tracing acceleration structure research is primarily focused on three types of partition structures: the bounding volume hierarchies (BVHs) introduced by Whitted, the kd-trees described by Kaplan [5], and the grids described by Fujimoto et al. [6]. BVHs and kd-trees are, respectively, n-ary object partitioning trees and binary space partitioning trees. Due to the recursive nature of these kinds of data structures, the traversal methods typically require the use of a stack.
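The role of the stack can be seen in a minimal sketch of iterative traversal over a binary BVH. The `Node` layout, the `hits_box` slab test, and the `candidates` function are hypothetical names chosen for illustration and are not taken from the cited works. The sketch also reports the largest stack size reached, since that is the per-ray storage a traversal kernel would need.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

Vec = Tuple[float, float, float]


@dataclass
class Node:
    lo: Vec                          # minimum corner of the bounding box
    hi: Vec                          # maximum corner of the bounding box
    left: Optional["Node"] = None
    right: Optional["Node"] = None
    prims: Tuple[int, ...] = ()      # primitive ids, used by leaves only


def hits_box(origin: Vec, direction: Vec, lo: Vec, hi: Vec) -> bool:
    """Slab test: does the ray (t >= 0) intersect the axis-aligned box?"""
    t_near, t_far = 0.0, float("inf")
    for o, d, a, b in zip(origin, direction, lo, hi):
        if d == 0.0:
            # Ray is parallel to this slab: it must start inside it.
            if o < a or o > b:
                return False
            continue
        t0, t1 = (a - o) / d, (b - o) / d
        if t0 > t1:
            t0, t1 = t1, t0
        t_near, t_far = max(t_near, t0), min(t_far, t1)
    return t_near <= t_far


def candidates(root: Node, origin: Vec, direction: Vec):
    """Return (primitive ids in the leaves the ray enters, peak stack size)."""
    found, peak = [], 0
    stack = [root]                   # explicit stack replaces recursion
    while stack:
        peak = max(peak, len(stack))
        node = stack.pop()
        if not hits_box(origin, direction, node.lo, node.hi):
            continue
        if node.left is None and node.right is None:
            found.extend(node.prims)
        else:
            # Push right first so that the left child is visited first.
            for child in (node.right, node.left):
                if child is not None:
                    stack.append(child)
    return found, peak


# Tiny hypothetical hierarchy: two leaves side by side along the x axis.
root = Node(
    (0.0, 0.0, 0.0), (4.0, 1.0, 1.0),
    left=Node((0.0, 0.0, 0.0), (2.0, 1.0, 1.0), prims=(0,)),
    right=Node((2.0, 0.0, 0.0), (4.0, 1.0, 1.0), prims=(1, 2)),
)
print(candidates(root, (3.0, 0.5, -1.0), (0.0, 0.0, 1.0)))
```

The peak stack size grows with the depth of the hierarchy, and every ray needs its own copy of that storage.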
This can be problematic on machines with no hardware stack support or a limited register file size, as was the case in earlier GPUs [7]. It is possible to avoid the need for a stack for backtracking by storing skip pointers [8] or ropes [9] in the tree, which is stored in linear array form. However, these techniques increase the amount of memory required to store the tree and lead to cache thrashing issues, so they are no longer of much use on current GPU architectures with hardware stack support. In order to ensure a high-quality layout for the partitioning structure, a surface area heuristic (SAH) [1], [2] is typically employed to compute the hierarchy in an automated fashion. Due to the tree-based nature of these acceleration structures, SAH schemes typically take

978-1-5090-5387-2/16/$31.00 2016 IEEE