IEEE TRANSACTIONS ON COMPUTERS, VOL. 43, NO. 12, DECEMBER 1994 1429

Constant Time Inner Product and Matrix Computations on Permutation Network Processors

Ming-Bo Lin and A. Yavuz Oruç

Abstract-Inner product and matrix operations find extensive use in algebraic computations. In this brief contribution, we introduce a new parallel computation model, called a permutation network processor, to carry out these computations efficiently. Unlike traditional parallel computer architectures, computations on this model are carried out by composing permutations on permutation networks. We show that the sum of N algebraic numbers on this model can be computed in O(1) time using N processors. We further show that the inner product and matrix multiplication can both be computed on this model in O(1) time at costs of O(N) and O(N³), respectively, for N-element vectors and N × N matrices. These results compare well with the time and cost complexities of other high-level parallel computer models such as the PRAM and CRCW PRAM.

Index Terms-Complex inner product, complex matrix multiplication, permutation networks, real inner product, real matrix multiplication.

I. INTRODUCTION

Inner product and matrix operations form the core of the computations of vector and array processors and of signal and image processing algorithms. Traditional architectures for carrying out such operations are based on reducing vector computations to scalar operations such as binary addition and multiplication [13], [14]. As a result, much of the computation in vector and array processors is handled by conventional arithmetic circuits such as carry-lookahead adders and recoded and cellular array multipliers and dividers [4].
While these conventional circuits are optimized for speed and hardware, they still rely on a variety of building blocks such as adder, subtractor, and multiplier cells, which often lead to nonuniform arithmetic circuits for vector processors. In this brief contribution, we propose a new concept to carry out vector and matrix computations. Unlike the traditional architectures, this concept is based on coding not only the operands but also the operations over the operands, in such a way that a vector or matrix computation reduces to composing permutation maps. Each operand is coded into a permutation, and addition or multiplication of two operands is carried out by composing the permutations that correspond to these operands on a permutation network. As a result, both addition and multiplication are reduced to a single computation, i.e., that of composing permutations. In addition, any other computation involving addition, subtraction, and multiplication operations is also reduced to composing permutations. We show that, on this new computation model, called a permutation network processor, the sum of N n-bit numbers, the inner product of two vectors, each containing N n-bit elements, and the multiplication of two N × N matrices with n-bit entries can all be computed in O(1) steps.

Manuscript received July 31, 1992; revised June 5, 1993 and September 13, 1993. This work was supported in part by the Ministry of Education, Taipei, Taiwan, Republic of China, and in part by the Minta Martin Fund of the School of Engineering at the University of Maryland. M.-B. Lin is with the Electronic Engineering Department, National Taiwan Institute of Technology, 43, Keelung Road Section 4, Taipei, Taiwan. A. Y. Oruç is with the Electrical Engineering Department, Institute of Advanced Computer Studies, University of Maryland, College Park, MD 20742-3025 USA; e-mail: yavuz@eng.umd.edu. IEEE Log Number 9404362.
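The encoding idea above can be made concrete with a toy value-level sketch (this is an illustration of the principle, not the authors' hardware design): a residue a modulo a hypothetical modulus m is represented by the cyclic-shift permutation x → (x + a) mod m, so that composing the permutations for two operands yields the permutation for their sum, with no adder circuit involved.

```python
# Toy sketch of the encoding principle (not the paper's circuit design):
# a residue a mod m is encoded as the cyclic-shift permutation
# x -> (x + a) mod m; composing two such permutations encodes the sum.

def shift_perm(a, m):
    """Permutation of {0, ..., m-1} realizing x -> (x + a) mod m."""
    return [(x + a) % m for x in range(m)]

def compose(p, q):
    """Composition p o q: apply q first, then p."""
    return [p[q[x]] for x in range(len(p))]

def decode(p):
    """Recover the encoded value as the image of 0 under the permutation."""
    return p[0]

m = 11                # hypothetical modulus, chosen only for illustration
a, b = 7, 9
s = compose(shift_perm(a, m), shift_perm(b, m))
assert decode(s) == (a + b) % m   # (7 + 9) mod 11 = 5
```

Because cyclic shifts commute and compose associatively, any chain of additions reduces to one composition of permutations, which is the single primitive the model builds on.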
The first two computations require O(N) processors and the matrix multiplication requires O(N³) processors, where each processor handles an n-bit input and has O((n + lg N)²) bit-level cost and O(n + lg N) bit-level delay. We note that these results compare well with the complexities of the same computations on other models. For example, on a PRAM model [1], [5], all three computations take O(lg N) time with the same numbers of processors, where each processor has two O(n + lg N)-bit inputs and uses arithmetic circuits with O(n² + lg N) bit-level cost. On a cube-connected parallel computer, the same three computations also take O(lg N) time with the same numbers of processors and with the same processor bit-level complexity [1]. On the combining CRCW PRAM, the same three computations can all be done in O(1) time with the same numbers of processors, where each processor has two O(n + lg N)-bit inputs and O(n² + lg N) bit-level cost; in addition, this model must have a circuit to combine up to N concurrent writes. We also note that, even though the permutation network processor model stands on its own, it ties in with some earlier computation models reported in the literature. One such model, called a processing network, was given in [12], where a mesh of processing elements was used to compute certain algebraic formulas. The processing elements in this model can be programmed for arithmetic and routing functions whose combinations lead to various algebraic expressions on the mesh topology. The main difference between this model and the permutation network processor model is that the latter does not rely on an explicit use of adder or multiplier circuits; rather, it combines them using shift permutations. More recently, a new parallel computer model, called a reconfigurable bus system, has been introduced to solve a wide range of problems including sorting problems [15], graph problems [9], [16], and string problems [2].
All these problems have been shown to be solvable in O(1) time on the reconfigurable bus system model. As in the processing network model, processors in this model are connected by some fixed topology such as the mesh, and each processor can be programmed for data processing as well as routing functions. It is assumed that signals can be broadcast between processors in constant time regardless of how far the broadcast is carried [8], [15], [16]. The essence of this assumption is that once the processors are simultaneously programmed for some routing functions, the signals that pass through them encounter only a propagation delay short enough to be considered constant. The same assumption also holds for our model. Again, the main difference between this model and the permutation network processor is that the latter relies only on permutation maps, while the former allows its processors to perform both data processing and routing functions. Finally, we should note that all computations described in this brief contribution are carried out modulo N. In the case that N is not a power of 2 (which is typically the case because of coprimality constraints), the results must be converted to binary, and this exacts additional time and hardware cost. Also, if the operands are given in binary, they must be encoded before they can be computed on. Our complexity expressions do not include these additional encoding and decoding time and cost. The time and hardware complexities of the encoding and decoding steps are given in [6], [7] and will be published elsewhere.

0018-9340/94$04.00 © 1994 IEEE
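The modular-arithmetic caveat above can be illustrated at the value level (a minimal sketch under an assumed modulus, not the paper's network construction: the network composes all N permutations in constant depth, whereas this sketch folds them sequentially only to show the arithmetic). The sum of N residues falls out of composing their shift permutations, but the result is defined only modulo the chosen modulus, which is why a final conversion to binary is needed when the modulus is not a power of 2.

```python
# Value-level sketch of summing N operands by composing shift
# permutations; the modulus m = 13 is a hypothetical choice. The result
# is recovered only modulo m, illustrating why a separate decoding /
# binary-conversion step is required.
from functools import reduce

def shift_perm(a, m):
    """Permutation of {0, ..., m-1} realizing x -> (x + a) mod m."""
    return [(x + a) % m for x in range(m)]

def compose(p, q):
    """Composition p o q: apply q first, then p."""
    return [p[q[x]] for x in range(len(p))]

m = 13
values = [5, 9, 2, 12, 7]                                # N = 5 operands
total = reduce(compose, (shift_perm(v, m) for v in values))[0]
assert total == sum(values) % m                          # 35 mod 13 = 9
```

Note that `total` is 9, not 35: the true sum is recoverable only if the modulus exceeds the largest possible result, which is exactly the sizing constraint that the encoding and decoding steps referenced in [6], [7] must respect.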