An edited version of this work was published in IEEE TRANS. ON CIRCUITS AND SYSTEMS–II, VOL. 62, NO. 9, SEPT. 2015. DOI: 10.1109/TCSII.2015.2435753

High-Throughput FPGA Implementation of QR Decomposition

Sergio D. Muñoz and Javier Hormigo

Abstract—This brief presents a hardware design that achieves high-throughput QR decomposition using the Givens Rotation Method. It uses a new two-dimensional systolic array architecture with pipelined processing elements based on the COordinate Rotation DIgital Computer (CORDIC) algorithm, which computes vector rotations through shifts and additions. This approach allows continuous computation of QR factorizations with simple hardware. A fixed-point FPGA architecture for 4 × 4 matrices has been optimized by balancing the number of CORDIC iterations against the final error. As a result, compared with previous FPGA proposals, our design achieves at least 50% more throughput and much lower resource utilization.

Index Terms—QR decomposition, systolic array, pipelined, FPGA, high-throughput, CORDIC

I. INTRODUCTION

Most advanced signal processing algorithms are based on algebraic matrix operations. Many examples are found in wireless communication, such as multiple-input multiple-output (MIMO) systems, beamforming, and multi-user detection and cancellation [1]. One useful operator for these matrix operations is QR factorization, especially for MIMO technologies [2], [3] and adaptive filtering [4]. Some of these applications require high-throughput QR decomposition, but only for small matrix sizes. Thus, many works have addressed the parallel hardware implementation of this operation for either ASIC or FPGA technologies. In this work, we focus on high-throughput computation for small matrices on FPGAs.

The Givens Rotation Method (and its variations) is probably the most widely used method for implementing QR decomposition in hardware, owing to its robust numerical properties and its easy parallelization [5].
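The shift-and-add rotations that CORDIC performs, and their use to zero one component of a vector as a Givens rotation does, can be illustrated with a minimal floating-point sketch. It is not the fixed-point hardware described in this brief: the function names and the choice of 16 iterations are assumptions made for illustration, and multiplications by 2^-i stand in for hardware shifts.

```python
import math


def cordic_vectoring(x, y, iterations=16):
    """Drive y toward zero with shift-and-add micro-rotations (x >= 0 assumed).

    Returns the rotated pair, still scaled by the CORDIC gain, and the
    list of rotation directions that were chosen.
    """
    directions = []
    for i in range(iterations):
        d = -1 if y > 0 else 1          # rotate toward the x-axis
        shift = 2.0 ** -i               # stands in for a right shift by i bits
        x, y = x - d * y * shift, y + d * x * shift
        directions.append(d)
    return x, y, directions


def cordic_rotate(x, y, directions):
    """Apply a recorded sequence of micro-rotations to another pair."""
    for i, d in enumerate(directions):
        shift = 2.0 ** -i
        x, y = x - d * y * shift, y + d * x * shift
    return x, y


def cordic_gain(iterations):
    """Constant scale factor accumulated by the micro-rotations."""
    gain = 1.0
    for i in range(iterations):
        gain *= math.sqrt(1.0 + 2.0 ** (-2 * i))
    return gain


if __name__ == "__main__":
    n = 16  # illustrative iteration count, not the value chosen in the paper
    k = cordic_gain(n)
    # Zero the second component of (3, 4); the first becomes the vector norm.
    xr, yr, dirs = cordic_vectoring(3.0, 4.0, n)
    print("rotated pair:", xr / k, yr / k)
    # Apply the same rotation to the remaining elements of the two rows.
    bx, by = cordic_rotate(1.0, 2.0, dirs)
    print("other pair:", bx / k, by / k)
```

In this sketch the first function plays the role of the element that determines the rotation, and the second reuses the recorded directions so that the other elements of the same two rows undergo the same rotation. Dividing by the gain is done once at the end here only to make the result easy to read.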
In the literature, several papers implement QR factorization on FPGAs using the Givens Rotation Method. Although serial approaches or linear systolic arrays may be used [6], the most common hardware implementation for achieving high throughput is a two-dimensional (2D) systolic array, as in [2], [7]–[11]. A 2D systolic array is a parallel grid structure in which processing elements (PEs) work in parallel and are locally interconnected. This systolic architecture exploits the different degrees of parallelism inherent in the Givens Rotation algorithm. Thus, these approaches have high throughput and relatively low latency, at the cost of considerable area consumption.

This work was supported in part by the Ministry of Education and Science of Spain and Junta of Andalucía under contracts TIN2013-42253-P and P07-TIC-02630, respectively. The authors are with the Department of Computer Architecture, Universidad de Málaga, Málaga E-29071 Spain (e-mail: smunoz@uma.es; fjhormigo@uma.es).

In this work, by combining several ideas, we have designed a new architecture that improves on previous high-throughput FPGA implementations. It is based on the CORDIC algorithm to simplify the hardware, pipelines the PEs to obtain better throughput, and uses a different schedule for performing the Givens rotations to reduce latency. As a result, the proposed architecture has very high throughput and low latency, with relatively low area consumption. It also has very simple control and communication logic.

The next sections of this brief are organized as follows. Section II reviews some important aspects of QR decomposition using Givens rotations, along with a brief review of previous works proposed in the literature. Section III presents the proposed architecture for achieving high throughput. In Section IV, the results of the FPGA implementation are studied and compared with previous works.
Finally, Section V provides the conclusions of this work.

II. GIVENS ALGORITHM AND PREVIOUS FPGA IMPLEMENTATIONS

A matrix $A_{m \times n}$ can be expressed as the product of two factors, i.e., $A = Q \cdot R$, in which the matrix $Q_{m \times m}$ is orthogonal and $R_{m \times n}$ is an upper triangular matrix [5]. The computation of these two factors is called QR decomposition or factorization.

The Givens Method achieves a QR factorization through unitary transformations, called Givens rotations, which selectively introduce zero elements [5]. A Givens rotation matrix is a rank-two correction to the identity matrix, in which the entries at the intersections of rows and columns $i$ and $j$ are replaced by values based on sines and cosines.

$$
\begin{bmatrix} \cos(\theta) & \sin(\theta) \\ -\sin(\theta) & \cos(\theta) \end{bmatrix}
\times
\begin{bmatrix} a_1 \\ a_2 \end{bmatrix}
=
\begin{bmatrix} a'_1 \\ 0 \end{bmatrix}
\tag{1}
$$

As an example, Eq. 1 shows a Givens rotation for a 2 × 1 matrix, where the resulting matrix has a newly introduced zero; this can be extended to any other matrix size. The rotation angle $\theta$ must be computed beforehand as $\theta = \arctan(a_2 / a_1)$. Alternatively, the cosine and sine values can be calculated directly by Eq. 2 and Eq. 3.

$$
\cos(\theta) = \frac{a_{i,k}}{\sqrt{a_{i,k}^2 + a_{j,k}^2}}
\tag{2}
$$

$$
\sin(\theta) = \frac{-a_{j,k}}{\sqrt{a_{i,k}^2 + a_{j,k}^2}}
\tag{3}
$$

Copyright © 2015 IEEE. Personal use of this material is permitted. However, permission to use this material for any other purposes must be obtained from the IEEE by sending an email to pubs-permissions@ieee.org

Accordingly, the Givens Method algorithm starts by zeroing the lower elements, from the first column to the last one, and, within each column, from the bottommost element to the