GPU Acceleration for Particle Filter based LDPC Decoding

Shuang Wang, Lijuan Cui, Samuel Cheng and Robert C. Huck
School of Electrical and Computer Engineering
University of Oklahoma, Tulsa, OK, 74135
Email: {shuangwang, lj.cui, samuel.cheng, rchuck}@ou.edu

Abstract—This paper presents a parallel belief propagation algorithm based on Particle Filtering (PF) for channel estimation and Low-Density Parity-Check (LDPC) decoding, implemented on the Compute Unified Device Architecture (CUDA). Compared with the traditional Belief Propagation (BP) algorithm with a fixed estimate of the noise power, the BP algorithm based on PF [1] not only gives a good real-time estimate of the channel noise but also achieves a lower decoding error rate. However, the PF algorithm increases the decoding complexity. As a hardware and software architecture for addressing and managing computations, CUDA offers data-parallel computing on the GPU, a highly multithreaded coprocessor with very high memory bandwidth. The parallel noise-adaptive decoding algorithm, based on CUDA, allows variable nodes, factor nodes, or particles to be updated simultaneously, thus providing an efficient and fast way to implement the decoder.

I. INTRODUCTION

Low-Density Parity-Check (LDPC) codes, a type of error-correcting code, were first proposed by Gallager in the early 1960s [2] and revived by MacKay and Neal in 1996 [3]. Since then, LDPC codes have attracted wide interest in the research community because they allow data transmission at rates close to the Shannon limit [3]. LDPC codes can be decoded with a powerful iterative algorithm known as the Belief Propagation (BP) algorithm [3]. For LDPC decoding over the Additive White Gaussian Noise (AWGN) channel, the estimated noise power is one of the most important parameters for achieving the best performance of the BP algorithm [4].
However, for an unknown or time-varying AWGN channel, it is very difficult to estimate the noise power without a pilot signal or feedback. To overcome this problem, we proposed in [1] a BP algorithm based on Particle Filtering (PF) for LDPC decoding over an AWGN channel. The proposed algorithm operates on a factor graph [5], which affords great flexibility in modeling systems. We showed that the proposed algorithm no longer depends on the initial estimate of the noise power σ² and offers a good real-time estimate of the channel noise power. For different code rates, our algorithm shows a lower decoding error rate than the standard BP algorithm. However, the PF algorithm dramatically increases the decoding complexity, especially when a large number of particles is used to estimate the channel noise power. A parallel decoding method that takes advantage of GPU architectures offers an efficient way to accelerate this procedure.

In just a few years, GPUs have evolved into flexible platforms for general computing [6]. Initially, GPUs were programmed in low-level languages [7], which restricted their use as computing workhorses. The release of Cg, a high-level programming language for the GPU, made it easier to apply GPUs to general-purpose computation [8]. However, Cg is not user-friendly enough, because it requires programmers to have fundamental knowledge of computer graphics. Now that NVIDIA has released CUDA [6], programmers can write code for both CPUs and GPUs in a similar way by using the CUDA instruction set. In many related fields [9]–[11], CUDA has proven to be an effective platform for compute-intensive, highly parallel workloads. In [9], fast 3D tracking of multiple faces using a particle filter based on CUDA was introduced.
A substantial speedup was observed, especially when a large number of particles was used, which makes the tracker far more suitable for real-time processing on a standard PC platform. In [10], an efficient CUDA-based implementation of the BP algorithm was described that sped up stereo image processing and motion tracking calculations. In [11], we showed that, for LDPC decoding based on the BP algorithm, CUDA offers a highly parallel architecture and a significant increase in performance over computation on a CPU.

In this paper, we propose a parallel BP algorithm based on PF for noise-adaptive LDPC decoding using the CUDA programming model. We show how the parallelism that arises naturally in message passing and particle updating can be exploited with the CUDA model, which allows particles, variable nodes, or factor nodes to be updated simultaneously. As a result, a significant increase in performance is observed when the number of particles and the size of the parity-check matrix are reasonably large.

This paper is structured as follows. In Section II, we explain the BP algorithm based on PF on a factor graph. The corresponding parallel algorithms are described in Section III. Finally, we present simulation results in Section IV and draw concluding remarks in Section V.
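To illustrate why particle updating parallelizes so naturally, here is a minimal sketch in plain Python, not CUDA. It is not the algorithm of [1]; it is a generic importance-weighting step under stated assumptions. Each particle is a candidate noise variance, the residuals (received value minus an assumed ±1 symbol) are taken as given, and the noise is zero-mean Gaussian. All function names are hypothetical. The key point is that each particle's weight depends only on that particle and the shared residuals, so the scoring loop could in principle be mapped to one GPU thread per particle.

```python
import math

def log_likelihood(sigma2, residuals):
    """Gaussian log-likelihood of the residuals for one candidate variance."""
    n = len(residuals)
    sq_sum = sum(r * r for r in residuals)
    return -0.5 * n * math.log(2.0 * math.pi * sigma2) - sq_sum / (2.0 * sigma2)

def particle_weights(particles, residuals):
    """Normalized weights, one per particle (candidate noise variance)."""
    # Each entry is computed independently of the others; this is the
    # loop that a GPU could run with one thread per particle.
    logs = [log_likelihood(s2, residuals) for s2 in particles]
    # Subtract the maximum before exponentiating to avoid underflow.
    peak = max(logs)
    unnormalized = [math.exp(v - peak) for v in logs]
    total = sum(unnormalized)
    return [u / total for u in unnormalized]

def estimate_sigma2(particles, weights):
    """Weighted mean of the particles as the noise-variance estimate."""
    return sum(p * w for p, w in zip(particles, weights))

def channel_llr(y, sigma2):
    """Channel log-likelihood ratio for a +/-1 symbol over AWGN (assumed mapping)."""
    return 2.0 * y / sigma2

if __name__ == "__main__":
    particles = [0.25, 0.5, 1.0, 2.0, 4.0]   # hypothetical candidate variances
    residuals = [1.0, -1.0] * 100             # toy residuals with mean square 1.0
    weights = particle_weights(particles, residuals)
    sigma2_hat = estimate_sigma2(particles, weights)
    print(sigma2_hat, channel_llr(0.5, sigma2_hat))
```

The last function shows why the estimate matters to the decoder: under the assumed ±1 mapping, the channel log-likelihood ratio fed into BP scales with 1/σ², so a poor noise estimate distorts every initial message.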