Journal of VLSI Signal Processing, 4, 147-163 (1992)
© 1992 Kluwer Academic Publishers, Boston. Manufactured in The Netherlands.

VLSI Parallel Architecture for Kalman Filter: An Algorithm Specific Approach

MAGDY A. BAYOUMI
The Center of Advanced Computer Studies, University of Southwestern Louisiana, Lafayette, Louisiana 70504

PADMA RAO
Cirrus Logic, 3100 West Warren Ave., Fremont, CA 94538

BASSEM ALHALABI
The Center of Advanced Computer Studies, University of Southwestern Louisiana, Lafayette, Louisiana 70504

Received May 1, 1991; Revised November 10 16, 1991.

Abstract. An algorithm specific architecture for the Kalman filter is presented. It is based on systolic arrays. Parallelism has been exploited at both the algorithm and architecture levels. Faddeev's algorithm has been employed. The computation tasks involved, triangularization and nullification, are performed in parallel, which leads to a speedup of about 40%. Throughput has been increased by using bi-trapezoidal arrays. Data storage and skewing techniques have been employed that enable fast data transfer rates. A VLSI implementation of a prototype for matrices of size 4 × 4 is discussed.

1. Introduction

The continuing improvements in VLSI technology have made it possible to construct high performance and cost-effective Digital Signal Processing (DSP) architectures. DSP computation has special features and criteria, namely [1]:

1. Processing of large sets of data.
2. Multiple use of the same data.
3. Intensive computation using few types of operations.
4. Complex data communication of intermediate data.
5. Parallelism that can be extracted at the algorithmic level.

DSP processors can be classified as general purpose processors (GPP), application specific processors (APSP), or algorithm specific processors (ALSP). Selecting one of these design approaches is a trade-off between flexibility and performance.
ALSP is a compromise between the generality and average performance of GPP and the specialization and high performance of APSP. The scope of this paper is to develop an ALSP for the Kalman filter. The paper is composed of two parts: algorithmic analysis and VLSI architecture development.

The Kalman filter is considered an optimal linear estimator for the states of a dynamic system in the least-mean-square sense. It forms the basis for a large class of complex signal processing applications such as radar signal processing, target prediction and tracking, and flight estimation of aircraft stability. The Kalman filter algorithm updates the estimates of the states of a dynamic system based on prior estimates and observed measurements. The states to be estimated cannot be measured directly; only noisy measurements of them are available. Usually, the states of a dynamic system vary randomly (i.e., they are random processes). The Kalman filtering problem can be stated as the estimation of the states of randomly varying processes from noisy measurements. The knowledge available for estimation concerns the nature of the noise involved.

Unlike other filtering algorithms, the Kalman filter algorithm does not lend itself to a straightforward VLSI implementation because it requires many matrix operations. A typical estimate involves 10 matrix multiplications, 2 inversions, 4 additions, and 1 subtraction. If all of these are computed sequentially, the estimate takes 17 matrix operation time steps, each of order O(n^3), where n is the size of the matrix. Systolic arrays are good candidates for such intensive matrix computations; they have been proposed for Kalman filter realization [2]-[6], but complete VLSI architectures have not been reported. The techniques adopted for these systolic arrays are complex, as the data needs to be preprocessed before it is fed to the array.
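To make the sequential matrix workload concrete, the following is a minimal sketch of one predict/update cycle of a textbook Kalman filter, written in plain Python with small helper routines. The names kalman_step, F, H, Q, and R and the constant-velocity example are assumptions introduced here for illustration; this textbook form is not necessarily the formulation behind the operation counts quoted above, and its own operation count differs.

```python
def matmul(a, b):
    """Dense matrix product of nested-list matrices."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def transpose(a):
    return [list(row) for row in zip(*a)]

def add(a, b):
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]

def sub(a, b):
    return [[x - y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]

def inverse(a):
    """Gauss-Jordan inversion with partial pivoting (small matrices only)."""
    n = len(a)
    aug = [list(map(float, row)) + [float(i == j) for j in range(n)]
           for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("matrix is singular")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        p = aug[col][col]
        aug[col] = [v / p for v in aug[col]]
        for r in range(n):
            if r != col:
                f = aug[r][col]
                aug[r] = [v - f * w for v, w in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]

def kalman_step(x, P, z, F, H, Q, R):
    """One sequential predict/update cycle (hypothetical helper).

    x: state estimate (n x 1), P: error covariance (n x n),
    z: measurement (m x 1), F: state transition, H: measurement matrix,
    Q, R: process and measurement noise covariances (all assumed names).
    """
    # Predict the state and its covariance.
    x_pred = matmul(F, x)
    P_pred = add(matmul(matmul(F, P), transpose(F)), Q)
    # Compute the gain from the innovation covariance S.
    PHt = matmul(P_pred, transpose(H))
    S = add(matmul(H, PHt), R)
    K = matmul(PHt, inverse(S))
    # Correct the prediction with the measurement residual.
    residual = sub(z, matmul(H, x_pred))
    x_new = add(x_pred, matmul(K, residual))
    P_new = sub(P_pred, matmul(matmul(K, H), P_pred))
    return x_new, P_new

if __name__ == "__main__":
    # Assumed example: position/velocity state, position-only measurements.
    F = [[1.0, 1.0], [0.0, 1.0]]
    H = [[1.0, 0.0]]
    Q = [[0.01, 0.0], [0.0, 0.01]]
    R = [[0.25]]
    x, P = [[0.0], [0.0]], [[1.0, 0.0], [0.0, 1.0]]
    for measurement in (1.1, 1.9, 3.2):
        x, P = kalman_step(x, P, [[measurement]], F, H, Q, R)
        print(x)
```

Every statement in kalman_step depends on the result of an earlier one, so on a sequential machine each matrix product or inversion costs its own O(n^3) time step. That serial chain is the workload a parallel array architecture aims to shorten.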
The preprocessing required by these systolic arrays involves complex operations such as Cholesky decomposition and square rooting. Following such techniques in VLSI technology results in an enormous amount of silicon area, slow performance, and very complex control mechanisms.

One main aspect of this paper is an algorithmic analysis to extract parallelism at that level. We demonstrate