(IJACSA) International Journal of Advanced Computer Science and Applications, Vol. 8, No. 3, 2017

Area and Energy Efficient Viterbi Accelerator for Embedded Processor Datapaths

Abdul Rehman Buzdar*, Liguo Sun*, Muhammad Waqar Azhar‡, Muhammad Imran Khan†§, Rao Kashif†
*Department of Electronic Engineering and Information Science
†Micro/Nano Electronic System Integration R&D Center (MESIC)
University of Science and Technology of China (USTC), Hefei, China
‡Department of Computer Science and Engineering, Chalmers University of Technology, Gothenburg, Sweden
§Department of Electronics Engineering, University of Engineering and Technology Taxila, Pakistan

Abstract—The Viterbi algorithm is widely used in communication systems to efficiently decode convolutional codes. It is used in many applications, including cellular and satellite communication systems. Moreover, serializer-deserializers (SERDESs) with critical latency constraints also use hardware implementations of the Viterbi algorithm. We present the integration of a mixed hardware/software Viterbi accelerator unit with an embedded processor datapath to enhance the processor performance. We then investigate the performance of the Viterbi-accelerated embedded processor datapath in terms of execution time and energy efficiency. Our evaluation shows that the Viterbi-accelerated MicroBlaze soft-core embedded processor datapath is three times more cycle and energy efficient than a datapath lacking a Viterbi accelerator unit. This acceleration is achieved at the cost of some area overhead.

Keywords—Viterbi decoder; Codesign; FPGA; MicroBlaze; Embedded Processor

I. INTRODUCTION

Channel coding is used in wireless communication systems for reliable data transfer over noise-prone communication channels. Various forward error correction (FEC) schemes, e.g.
low-density parity-check (LDPC), Reed-Solomon, Viterbi and Turbo codes, are used to meet the growing need to improve spectrum efficiency [1], [2], [3], [4], [5]. In FEC schemes the data is encoded using convolutional encoding, and at the receiver end the decoding is done by Viterbi or turbo decoders [21-31]. The Viterbi decoder is suitable for wireless communication systems in which the transmitted signals are corrupted by additive white Gaussian noise [6]. The decoding process in FEC schemes is computationally intensive and power hungry. Handheld devices are battery powered, so they must be energy efficient. Customized hardware implementations of these FEC decoders are performance and power efficient but lack flexibility. Because wireless standards evolve over time, the hardware needs to be flexible. The Viterbi decoder can be implemented in software and executed on an embedded processor, but it will then require many clock cycles. It can be implemented more efficiently in dedicated hardware, which requires few clock cycles at the cost of flexibility. Today's high-speed communication systems require high data rates, which can only be delivered using dedicated hardware solutions. Different hardware modules such as USB, Ethernet, TCP/IP, CRC and CAN protocol are included in modern embedded processors [7], [8], [9] to speed up certain parts of applications in areas like signal processing, communication and control systems. In the same way, a Viterbi accelerator is useful in programmable systems that must compute a series of Viterbi decoding operations.

II. CONVOLUTIONAL ENCODING AND VITERBI DECODING

Convolutional encoding of data is implemented with a shift register having K - 1 memory elements and a cascaded network of exclusive-or gates. Here K is the constraint length, and the encoder has 2^(K-1) states. The shift register is a chain of flip-flops in which the output of the nth flip-flop is the input of the (n+1)th flip-flop.
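As an illustrative sketch only (not the authors' implementation), the shift-register structure described above can be modelled in a few lines of Python. The function name conv_encode, the constraint length K = 3 and the generator taps (1, 1, 1) and (1, 0, 1) for a rate-1/2 code are assumptions chosen for this example.

```python
# Illustrative model of a convolutional encoder (assumed parameters, see text).
K = 3              # constraint length (assumption for this sketch)
G1 = (1, 1, 1)     # generator taps, current input bit first (assumption)
G2 = (1, 0, 1)     # generator taps, current input bit first (assumption)

NUM_STATES = 1 << (K - 1)  # 2^(K-1) encoder states for K-1 memory elements


def conv_encode(bits, generators=(G1, G2), k=K):
    """Encode a list of 0/1 bits; emits len(generators) symbols per input bit."""
    state = [0] * (k - 1)          # K-1 flip-flops, initially cleared
    out = []
    for b in bits:
        window = [b] + state       # current input followed by the stored bits
        for g in generators:
            sym = 0
            for tap, value in zip(g, window):
                sym ^= tap & value  # modulo-2 addition of the tapped positions
            out.append(sym)
        state = [b] + state[:-1]   # shift by one; the oldest bit is discarded
    return out


print(conv_encode([1, 0, 1, 1]))   # [1, 1, 1, 0, 0, 0, 0, 1]
```

With these assumed taps, each input bit produces two output symbols, so the input 1 0 1 1 yields the symbol pairs 11, 10, 00, 01.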
On each shift, the data in every register moves to the next register and the value in the last register is discarded. The combinational logic consisting of exclusive-or gates performs modulo-2 addition. The encoder outputs n symbols, computed from the generator polynomials and the values in the shift register. Fig. 1 shows a convolutional encoder for K = 3, R = 1/2 and generator polynomials G1 = (1, 1, 1) and G2 = (1, 0, 1). The code rate is the ratio of the number of input bits to the number of output bits (R = m/n). Convolutional codes are efficient compared to block codes because every input bit has an impact on K successive output symbols [10]. The code complexity and the error correction capability both increase with K. The decoder complexity and memory requirements also increase with increasing K.

Figure 1: Convolutional Encoder general architecture.

A trellis diagram is used to visualize the state transitions of an encoder, as shown in Fig. 2. The black lines represent input bit 0 and the dotted lines represent input bit 1. The trellis path of the input sequence is represented by the red lines. The basic concept is that the valid path through the trellis diagram is