THE 2D DISCRETE WAVELET TRANSFORM ON PROGRAMMABLE GRAPHICS HARDWARE

Christian Tenllado, Roberto Lario, Manuel Prieto, and Francisco Tirado
Dpto. Arquitectura de Computadores y Automática
Universidad Complutense
Madrid, Spain
E-mail: {tenllado, rlario, mpmatias, tirado}@dacya.ucm.es

ABSTRACT

The growing popularity of the Discrete Wavelet Transform (DWT) has spurred efforts to tune it for all sorts of computer systems, from special purpose hardware for embedded systems to general purpose microprocessors and multiprocessors. In this paper we continue to investigate possibilities for the implementation of the DWT, focusing on state-of-the-art programmable graphics hardware. Current design trends have transformed these devices into powerful coprocessors with enough flexibility to perform intensive and complex floating-point calculations. This study concentrates on comparing the two most popular implementation alternatives, known as the lifting and filter-bank algorithms. The characteristics of the filter-bank version suggest a better mapping onto current graphics hardware, given that it presents a higher degree of parallelism. However, our experiments show that the lifting algorithm, which has lower computational demands, can be tailored to deliver the best results despite the data dependencies involved in this scheme, which make the exploitation of data parallelism more difficult.

KEY WORDS: image processing, programmable graphics processors.

1. Introduction

Application-specific designs have been extensively used during the last decade in order to meet the computational demands of computer graphics and media processing. However, the difficulties that arise in adapting specific designs to the evolution of applications have hastened their decline in favour of other architectures which feature programmable capabilities. At the other extreme of the design spectrum we find general purpose architectures.
However, they are also ill-suited to media demands, given that in these programmable architectures the cost of delivering instructions to the ALUs becomes a serious bottleneck under media workloads. Current GPUs seem to have taken the best from both worlds. From special purpose architectures they take control and communication structures that enable the effective use of many ALUs. From general purpose architectures they take enough flexibility to support a programming model.

In this paper we explore the use of this kind of programmable platform to compute the discrete wavelet transform (DWT). This exploration is of great practical interest given the growing importance of this tool in recent years. Since its introduction it has encouraged the use of multiresolution techniques. In addition, the efficiency and simplicity of its implementation have favoured the extensive use of this algorithm. For example, to cite a few topics related to computer graphics, it has been successfully applied in image compression, image fusion, global illumination, hierarchical modelling, volume rendering and processing [23].

The main goal of this research is to study how to explicitly adapt or tune the DWT computation to a stream-based programming model in order to take advantage of modern graphics hardware. In our opinion this study is not only of great practical interest, as mentioned above, but also provides certain insights into the potential benefits of these relatively new capabilities and how to take advantage of them.

The rest of this paper is organized as follows: Sections 2 and 3 describe the DWT and the target computing platform respectively. Section 4 outlines the proposed GPU-aware implementations. Performance results are reported in Section 5. Sections 6 and 7 end the paper with a related work summary and the main conclusions of this study.

2.
The Discrete Wavelet Transform

The DWT has traditionally been implemented using two different schemes, known as the filter bank and lifting algorithms. The filter bank version is based on a pair of Quadrature Mirror Filters. Figure 1 illustrates the 1D version of this approach: the discrete signal is convolved with a lowpass filter H(z) and a highpass filter G(z), whose outputs are then downsampled to obtain a coarse scale approximation (lower band, LP in Figure 1) and a detail signal (higher band, BP in Figure 1).

[Figure 1: Discrete wavelet transform. Block diagram: filters h and g, downsampling and upsampling by 2, outputs LP and BP.]

The full decomposition is obtained by iterating this process on the low-pass branch. Equation 1 shows the mathematical definition of one stage of this decomposition process.

\[
LP^{j}_{n} = \sum_{k} h_{k}\, s^{j-1}_{2n-k}, \qquad BP^{j}_{n} = \sum_{k} g_{k}\, s^{j-1}_{2n-k} \tag{1}
\]
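As a rough illustration of the filter bank stage described in this section, the following sketch convolves a 1D signal with a lowpass and a highpass filter, keeps every second output sample, and then iterates on the low-pass branch. Several details are assumptions made for the example and are not taken from the paper: the function names `analysis_stage` and `decompose`, the Haar filter pair, the periodic (wrap-around) boundary handling, and the plain convolution index `2n - k`.

```python
import math


def analysis_stage(s, h, g):
    """One filter bank stage: convolve with h and g, then downsample by 2.

    Computes LP[n] = sum_k h[k] * s[2n - k] and BP[n] = sum_k g[k] * s[2n - k].
    Indices wrap around (periodic extension), which is an assumption here.
    """
    length = len(s)
    n_out = length // 2

    def branch(f):
        return [
            sum(f[k] * s[(2 * n - k) % length] for k in range(len(f)))
            for n in range(n_out)
        ]

    return branch(h), branch(g)


def decompose(s, h, g, levels):
    """Iterate the stage on the low-pass branch.

    Returns the final coarse approximation and the list of detail
    signals, ordered from the finest level to the coarsest.
    """
    approx = list(s)
    details = []
    for _ in range(levels):
        approx, bp = analysis_stage(approx, h, g)
        details.append(bp)
    return approx, details


# Haar filter pair, chosen only because it is the shortest example.
C = 1.0 / math.sqrt(2.0)
HAAR_H = [C, C]    # lowpass
HAAR_G = [C, -C]   # highpass

if __name__ == "__main__":
    lp, bp = analysis_stage([4.0, 2.0, 6.0, 8.0], HAAR_H, HAAR_G)
    print("LP:", lp)
    print("BP:", bp)
```

Each output sample depends only on input samples, never on other outputs, so all `LP[n]` and `BP[n]` values of a stage could in principle be computed independently. This is the property behind the higher degree of parallelism attributed to the filter bank version.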