Stereo vision algorithm implementation in FPGA using census transform for effective resource optimization

Mario-Alberto Ibarra-Manzano 1,2, Dora-Luz Almanza-Ojeda 1,2, Michel Devy 1,2, Jean-Louis Boizard 1,2 and Jean-Yves Fourniols 1,2
1 CNRS; LAAS; 7 avenue du Colonel Roche, F-31077 Toulouse, France
2 Université de Toulouse; UPS, INSA, INP, ISAE; LAAS-CNRS: F-31077 Toulouse, France
Emails: {maibarra, dlalmanz, michel, boizard, fourniols}@laas.fr

Abstract—In this work, we present the implementation in a reconfigurable architecture of a dense stereo vision algorithm based on the census transform. Analyzing the census transform algorithm, we found that memory size and memory accesses could be reduced, which in turn reduces latency. Furthermore, the architecture uses resources efficiently because the binary operations and integer arithmetic used by the census transform map directly onto the FPGA. The final architecture constructs 130 dense disparity maps per second, one for each pair of stereo images. A performance comparison between our architecture and three other disparity map implementations shows that ours offers a better trade-off among performance, latency, logic elements and memory size. This optimization and resource saving make our architecture an interesting option for real-time stereo vision, which is widely used in autonomous navigation.

I. INTRODUCTION

The stereo vision process intends to reconstruct the 3D information of a scene from two different images. These images are captured by two cameras separated by a previously established distance called the baseline. However, in some cases, depth information is unreliable or insufficient, e.g. in mobile robotics or surface recovery for automatic 3D model acquisition. Furthermore, applications such as robotic platforms and auto-guided vehicles require real-time performance that cannot be achieved by conventional computers.
Several real-time stereo systems have recently been implemented in custom hardware. The parallelism provided by this hardware allows users to meet the performance constraints while minimising either the power or the cost per unit of the system. Nevertheless, the major problem with hardware implementations is the time and cost required by the design and fabrication stages and the non-recoverable costs of production. Thus, hardware-based stereo vision is more appropriate for systems that are either cost-insensitive, where the high development cost is tolerated, or produced in massive volumes, where the development cost is absorbed by the lower cost per unit of the hardware implementation.

Field-Programmable Gate Arrays (FPGAs) have enabled the creation of hardware designs in standard, high-volume parts, thereby amortizing the cost of mask sets and significantly reducing time-to-market for hardware solutions. However, engineering cost and design time for FPGA-based solutions still remain significantly higher than for software-based solutions. Designers must frequently iterate the design process in order to meet system performance requirements while simultaneously minimizing the required size of the FPGA. Each iteration of this process takes hours or days to complete [1].

In this paper, we present an optimized architecture for real-time stereo vision in autonomous robotic navigation. We are interested in stereo correlation approaches that emphasize reliable algorithms suitable for hardware implementation. Such approaches favor bit or integer arithmetic, which simplifies the final architecture. The next section describes the passive stereo vision algorithm based on the census transform, which is used to obtain depth information. The details of the hardware implementation are given in Section 3. In Section 4, we discuss the performance of our architecture and the tests carried out on the FPGA.
Conclusions and perspectives are presented at the end of this document.

II. OVERVIEW OF PASSIVE STEREO VISION

In computer vision, stereo vision intends to recover depth information from two images of the same scene. A pixel in one image corresponds to a pixel in the other if both pixels are projections along lines of sight of the same physical scene element. If the two images are spatially separated but simultaneous, then computing correspondence determines stereo depth [2]. There are two main approaches to stereo correlation: feature-based and area-based. In this work, we are interested in area-based approaches because they yield a dense solution, that is, high-density disparity maps. Furthermore, they have an extremely regular algorithmic structure, which lends itself to a convenient hardware architecture. The census transform is the area-based approach chosen to solve the stereo vision problem. The global algorithm of dense stereo vision based on the census transform is presented in Fig. 1. First of all, the left and right images are processed independently. In order to decrease the