Acceleration of Variance of Color Differences-Based Demosaicing Using CUDA

Muhammad Ismail Faruqi*, Fumihiko Ino†, and Kenichi Hagihara‡
Graduate School of Information Science and Technology, Osaka University
1-5 Yamadaoka, Suita, Osaka 565-0871, Japan
Email: {i faruqi*, ino†, hagihara‡}@ist.osaka-u.ac.jp

Abstract—Image demosaicing algorithms reconstruct a full-color image from the incomplete color samples (RAW data) output by an image sensor overlaid with a Color Filter Array (CFA). Better demosaicing algorithms are superior in terms of acuity, dynamic range, signal-to-noise ratio, and artifact suppression, which makes them suitable for high-quality delivery such as theatrical broadcast. In this paper, we examine the feasibility of exploiting the Graphics Processing Unit (GPU) as an emerging accelerator to create an on-the-fly implementation of Variance of Color Differences (VCD) demosaicing, a state-of-the-art heuristic demosaicing algorithm developed to eliminate false-color artifacts in textured regions of images. Our contributions are 1) implementing the algorithm as several kernels to separate the bottleneck portion of the algorithm from the rest and to minimize idle threads, and 2) reducing I/O between shared and global memory during green channel interpolation by separating the input RAW data into four channels. We then compare the implementation featuring both acceleration methods with a single-kernel implementation. In our experiments, the proposed acceleration methods achieved a per-frame processing time of 343 ms on an NVIDIA GTX 480, which translates into 2.95 fps. The proposed methods also improved the kernel time and the effective memory bandwidth by a factor of 2.1 compared with the single-kernel counterpart.

Keywords—Parallel processing; image demosaicing; CUDA; GPU

I. INTRODUCTION

Image demosaicing [1] is an integral part of the color imaging pipeline.
It is the first step of the pipeline, in which the luminance data recorded at each photosite, known as RAW data, is expanded into RGB values. One way to evaluate the quality of a demosaicing algorithm is to measure how effectively it approximates the missing color values without introducing the artifact known as moiré. The less moiré a demosaicing algorithm introduces in the demosaiced images, the higher its quality. However, sophisticated demosaicing algorithms tend to be computationally expensive and impractical to implement on almost all digital cameras. Therefore, the demosaicing algorithms used in many digital cameras generally favor execution speed over quality, resulting in pictures with moiré. Hence, to give users access to higher-quality images, higher-end cameras offer a feature to record RAW data directly into storage without going through the entire imaging pipeline. Such data is then processed by an external processor capable of executing demosaicing in a shorter time.

On the other hand, users of video cameras equipped with a Bayer filter [2] have suffered from the long time and large storage space required to obtain high-quality files from their cameras. To process the files, users have to wait a long time until the demosaicing algorithm finishes processing the frames and then saves the demosaiced file into storage. Here lies the challenge this paper addresses: enabling users to quickly obtain high-quality demosaicing results from high-resolution video files without having to store large demosaiced results.

In this paper, we propose an acceleration of Variance of Color Differences (VCD) [3]-based demosaicing, a high-quality demosaicing algorithm specifically developed to combat moiré in textured regions of images, using Compute Unified Device Architecture (CUDA) [4].
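To make the notion of RAW data under a Bayer filter concrete, the following is a minimal CPU-side sketch in Python of splitting a mosaic into four per-color planes, in the spirit of the four-channel separation mentioned in the abstract. It is not the CUDA implementation described in this paper. The function name `split_bayer_rggb`, the RGGB site ordering, and the sample values are assumptions made for illustration only; the actual sensor layout may differ.

```python
def split_bayer_rggb(raw):
    """Split a Bayer mosaic (a list of equal-length rows) into four planes.

    Assumption (not stated in the paper): an RGGB layout, i.e. even rows
    are R G R G ... and odd rows are G B G B ...
    Each returned plane has half the height and half the width of `raw`.
    """
    if len(raw) % 2 or any(len(row) % 2 for row in raw):
        raise ValueError("mosaic height and width must be even")
    r = [row[0::2] for row in raw[0::2]]    # red sites
    g1 = [row[1::2] for row in raw[0::2]]   # green sites on red rows
    g2 = [row[0::2] for row in raw[1::2]]   # green sites on blue rows
    b = [row[1::2] for row in raw[1::2]]    # blue sites
    return r, g1, g2, b


# Hypothetical 4x4 mosaic; the values are arbitrary sample data.
raw = [
    [10, 20, 11, 21],
    [30, 40, 31, 41],
    [12, 22, 13, 23],
    [32, 42, 33, 43],
]
r, g1, g2, b = split_bayer_rggb(raw)
```

After the split, the samples of each color sit contiguously in their own plane, so a processing step that needs only some of the colors can read just those planes instead of striding through the interleaved mosaic.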
The objective of this implementation is to demosaic video RAW files on-the-fly as fast as possible, so that the video editing workflow is accelerated and the storage needed to hold demosaicing results is eliminated. To achieve this, we first introduce wavefront processing as the base method for parallelizing the algorithm. We then propose two implementation methods: 1) implementing the algorithm as multiple kernels to separate the bottleneck portion of the algorithm and to minimize idle threads, and 2) reducing input and output transfers between global and shared memory [4] in the green channel interpolation phase by separating the input RAW data into separate channels.

II. RELATED WORK

Several works have been dedicated to implementing demosaicing on the GPU. For example, McGuire [5] accelerated the Malvar-He-Cutler [6] image demosaicing algorithm to real-time speed using OpenGL. Fung et al. [7] show two examples of CUDA-based demosaicing based on the bilinear and Lanczos [8] methods. However, these algorithms have inferior moiré suppression compared to Chung's VCD algorithm [3]. The first commercial application known to deliver real-time preview and grading for 4K RAW was IRIDAS [10] Speedgrade XR which was launched