Parallel Point Reprojection

Erik Reinhard, Peter Shirley, Charles Hansen
University of Utah
www.cs.utah.edu

Abstract

Improvements in hardware have recently made interactive ray tracing practical for some applications. However, when the scene complexity or rendering algorithm cost is high, the frame rate is too low in practice. Researchers have attempted to solve this problem by caching results from ray tracing and using these results in multiple frames via reprojection. However, the reprojection can become too slow when the number of samples that are reused is high, so previous systems have been limited to small images or a sparse set of computed pixels. To overcome this problem we introduce techniques to perform this reprojection in a scalable fashion on multiple processors.

CR Categories: I.3.7 [Computing Methodologies]: Computer Graphics—3D Graphics

Keywords: point reprojection, ray tracing

1 Introduction

Interactive Whitted-style ray tracing has recently become feasible on high-end parallel machines [5, 6]. However, such systems only maintain interactivity for relatively simple scenes or small image sizes. By reusing samples instead of relying on brute-force approaches, these limitations can be overcome.

There are several ways to reuse samples. All of them require interpolating between existing samples as the key part of the process. First, rays can be stored along with the color seen along them; the color of new rays can then be interpolated from existing rays [1, 4]. Alternatively, the points in 3D where rays strike surfaces can be stored and then woven together as displayable surfaces [7]. Finally, such points can be directly projected to the screen, and holes can be filled in using image-processing heuristics [8].

Another method to increase the interactivity of ray tracing is frameless rendering [2, 3, 6, 9]. Here, a master processor farms out single-pixel tasks to be traced by the slave processors.
The order in which pixels are selected is random or quasi-random. Whenever a renderer finishes tracing its pixel, it is displayed directly. As pixel updates are independent of the display, there is no concept of frames. During camera movements the display deteriorates somewhat, which is visually preferable to the slow frame rates of frame-based rendering approaches. Frameless rendering can therefore handle scenes of higher complexity than brute-force ray tracing, although no samples are reused.

The main thrust of this paper is the use of parallelism to increase data reuse, and thereby increase allowable scene complexity and image size, without affecting perceived update rates. We use the render cache of Walter et al. [8] and apply to it the concept of frameless rendering. By distributing this algorithm over many processors we are able to overcome the key bottleneck in the original render cache work. We demonstrate our system on a variety of scenes and image sizes that have been out of reach for previous systems.

[Figure 1: The serial render cache algorithm [8]. A front-end CPU loops over a cache of colored 3D points (project points, process image, display, request samples); many CPUs trace the requested rays and return new points.]

2 Background: the render cache

The basic idea of the render cache is to save samples in a 3D point cloud and reproject them when viewing parameters change [8]. New samples are requested all over the screen, with most samples concentrated near depth discontinuities. As new samples are added, old samples are eliminated from the point cloud. The basic process is illustrated in Figure 1.

The front-end CPU handles all tasks other than tracing rays. Its key data structure is the cache of colored 3D points. The front end continuously loops, first projecting all points in the cache into screen space. This produces an image with many holes, and the image is then processed to fill these holes in.
This filling-in process uses sample depths and heuristics to make the processed image look reasonable. The processed image is then displayed on the screen. Finally, the image is examined to find "good" rays to request in order to improve future images. These new rays are traced by the many CPUs in the "rendering farm". The current frame is completed after the front end receives the results and inserts them into the point cloud.

From a parallel-processing point of view, the render cache has the disadvantage of a single expensive display process that must feed a number of renderers with sample requests and is also responsible for point reprojection. The display process must insert new results into the point cloud, so the more renderers are used, the heavier the workload of the display process. Hence, the display process quickly becomes a bottleneck. In addition, the number of points in the point cloud is linear in image size, which means that the reprojection cost is also linear in image size.

The render cache was shown to work well on 256x256 images using an SGI Origin 2000 with 250 MHz R10k processors. At resolutions higher than 256x256, the front end has too many pixels to reproject to maintain fluidity.
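To make the reprojection step and its cost concrete, the following is a minimal sketch, not the authors' implementation: cached colored 3D points are projected through a simple pinhole camera into a z-buffered image, and any pixel no point lands on is a "hole" left for the image-processing pass. All names, the resolution, and the camera model are illustrative assumptions; the point is that the loop visits every cached point, so its cost grows linearly with the point cloud (and hence with image size), matching the bottleneck described above.

```python
# Hypothetical, simplified sketch of the render cache's reprojection
# step: project every cached 3D point into screen space with a
# z-buffer; pixels that receive no point are "holes".
import numpy as np

WIDTH, HEIGHT = 64, 64
FOCAL = 64.0  # assumed pinhole focal length, in pixels


def reproject(points, colors):
    """Project cached points; return the color image and a hole mask."""
    depth = np.full((HEIGHT, WIDTH), np.inf)
    image = np.zeros((HEIGHT, WIDTH, 3))
    filled = np.zeros((HEIGHT, WIDTH), dtype=bool)
    for (x, y, z), c in zip(points, colors):
        if z <= 0:  # point is behind the camera
            continue
        u = int(FOCAL * x / z + WIDTH / 2)   # perspective divide
        v = int(FOCAL * y / z + HEIGHT / 2)
        if 0 <= u < WIDTH and 0 <= v < HEIGHT and z < depth[v, u]:
            depth[v, u] = z  # nearest point wins (z-buffer test)
            image[v, u] = c
            filled[v, u] = True
    return image, filled


# A sparse point cloud fills only a fraction of the pixels; the rest
# are the holes the render cache fills in with depth heuristics.
rng = np.random.default_rng(0)
pts = rng.uniform([-1.0, -1.0, 2.0], [1.0, 1.0, 4.0], size=(500, 3))
cols = rng.uniform(0.0, 1.0, size=(500, 3))
img, filled = reproject(pts, cols)
print("filled pixels:", filled.sum(), "of", WIDTH * HEIGHT)
```

In the serial algorithm this per-point loop runs on the single front-end CPU every iteration, which is why the paper's contribution is to distribute it across processors.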