62 2017 IEEE International Solid-State Circuits Conference ISSCC 2017 / SESSION 3 / DIGITAL PROCESSORS / 3.7 3.7 A 1920×1080 30fps 2.3TOPS/W Stereo-Depth Processor for Robust Autonomous Navigation Ziyun Li, Qing Dong, Mehdi Saligane, Benjamin Kempke, Shijia Yang, Zhengya Zhang, Ronald Dreslinski, Dennis Sylvester, David Blaauw, Hun Seok Kim University of Michigan, Ann Arbor, MI Precise depth estimation is a key kernel function to realizing autonomous navigation on micro-aerial vehicles (MAVs). The state-of-the-art semi-global matching (SGM) algorithm has become favored for its high accuracy. In particular, it effectively handles low texture regions due to its global optimization of the disparity between a left and right image over the entire frame. However, SGM involves massively parallel computation (~2TOP/s) and extremely high bandwidth memory access (38.6Tb/s) for 30fps HD resolution. This leads to ~20s runtime for an HD image pair on a 3GHz CPU [1] requiring ~386MB memory and >35W power consumption. Together, these factors place it well outside the realm of MAVs. Prior ASIC implementations have used either simpler local methods [2] or aggressively truncated global algorithms [3] that produce a depth map with significantly inferior quality or limited disparity range (32 or 64 pixels) and therefore fail to support standard automotive scene benchmarks [2-5]. In addition, due to the high memory requirement of SGM, prior methods [3-4] have used external DRAM to store intermediate computation, significantly reducing performance and efficiency. This paper presents a stereo vision processor that fully implements the SGM algorithm on a single chip. The design uses a new image-scanning stride to enable a deeply pipelined implementation with ultra-wide (1612b) custom SRAM for 1.64Tb/s on-chip access bandwidth. Our design is the first ASIC to report performance under the industrial standard KITTI benchmark that renders realistic automobile scenes. The proposed design supports 512-level depth resolution on full HD (1920×1080) resolution with real-time 30fps, consuming 836mW from a 0.75V supply in 40nm CMOS. We also integrate the stereo chip with a quadcopter and demonstrate its operation in real-time flight. Figure 3.7.1 (top right) visualizes the output difference between the local sum of absolute difference (SAD) algorithm and SGM, clearly illustrating the higher quality of SGM. To make a single-chip SGM implementation feasible, i.e. remove the need for external DRAM, we first observe that inter-pixel correlation diminishes when pixel pairs are more than 50 pixels apart. Hence, the proposed design processes the input image in units of 50×50 overlapping pixel blocks. Adjacent blocks are overlapped by 8 pixels to allow cost aggregation across block boundaries. This technique reduces the memory requirement for storing intermediate aggregation results by 95.4%. Fig. 3.7.1 shows a side-by-side comparison of this block-based SGM and the original SGM, which are almost identical. Fig. 3.7.1 also presents quantitative results evaluated on 194 KITTI test cases showing only 0.5% accuracy degradation. As shown in Fig. 3.7.2, the processor streams left and right image blocks into two on-chip interleaved image buffers (30Kb each). It then performs a 7×7 census transformation on each pixel using its surrounding pixels and compares each census-transformed pixel on the left image with census-transformed pixels on the right image at 128 different disparity locations. This produces 128 Hamming distances (6b each) for each pixel that represent the ‘local’ matching cost for the 128 disparities. The processor then aggregates the local matching costs (separately for each disparity) along 8 paths over the 50×50 block. By searching the sum of the aggregated cost for each disparity for the minimum value, the processor obtains the coarse (integer) SGM depth output. It then refines this depth precision by performing a quadratic fitting on three aggregated costs around the minimum using a look-up table to provide sub-pixel depth accuracy. Conventionally, SGM is implemented with a forward and a backward raster scan, with each scan performing aggregation along 4 paths (total of 8 paths). However, following this conventional raster scan order results in a data dependency where the previous pixel must complete its computation before the current pixel can be aggregated (Fig. 3.7.3, left). This dependency dominates the critical path, limiting the clock frequency and voltage scalability for low power operation. We therefore propose a dependency-resolving scan in which pixel processing proceeds diagonally (Fig. 3.7.3, right). When a pixel (F) is fetched into the pipeline, the aggregated costs of all previous pixels (light gray and dark gray) are already computed and stored in high-bandwidth custom SRAMs. This mechanism enables aggressive pipelining, yielding a 3× performance gain. As shown in the block diagram in Fig. 3.7.4 (top), our design leverages parallelism in cost aggregation by running 4 paths in parallel on 4 aggregation units, with each aggregation unit containing 128 processing elements and 512 selection units, resulting in 1.882TOP/s. Figure 3.7.4 (bottom) shows the proposed architecture of the customized compact high-bandwidth SRAM. In the proposed design, the row buffers are read and written simultaneously at 170MHz, and all 128 previous aggregated costs are accessed in a single cycle. This approach achieves the required memory bandwidth of 1.64Tb/s for the 3 row buffers accessed in parallel. This bandwidth would incur large chip area and power overhead if realized with compiled SRAMs. To provide an efficient area/power solution, we use a custom high-bandwidth SRAM that leverages the design’s highly parallelized structure in which each bank has only 50 words with a single word size of 403b (Fig. 3.7.4). All four banks in one SRAM are read and written concurrently, realizing a 1612b dual port access. To reduce leakage power in the 40nm technology, the custom 8T memory bitcell uses HVT transistors. Unlike conventional 8T cells, the read transistor stack is flipped such that the read transistor is not connected to RBL, reducing coupling between RWL and the short, low capacitance RBL. Skewed inverters are used in place of conventional sense amplifiers (1612 per SRAM), reducing sense amp overhead by 2.8×. Overall, each 80Kb SRAM consumes 6mW with 548.1Gb/s bandwidth. The vision processor is fabricated in 40nm GP CMOS. Fig. 3.7.5 shows the measurement setup and real-time demonstration platform mounted on a quadcopter. Real-time image streams captured by the stereo camera are rectified, block-partitioned by a Samsung Exynos-5422 processor on the ODRIOD-XU4 board, and then transmitted to the stereo processor through a USB3.0 interface. The processed real-time depth and confidence maps provide feedback to the Exynos processor through another USB3.0 channel. At 0.9V nominal voltage, the real-time VGA (HD) frame processing latency of the stereo processor is 4.1ms (26ms). In a KITTI automobile scene (Fig. 3.7.5, bottom) and a quadcopter scene ‘on-the-fly’ captured by our demonstration platform (Fig. 3.7.5, middle), large (>100 pixels) disparity frequently occurs, and the proposed processor is able to generate an accurate depth map over the entire image due to its 512 levels of resolution. Fig. 3.7.6 shows the voltage and frequency scaling of the chip and provides comparison with prior work. The proposed processor achieves 7% outlier accuracy on KITTI and an 8× improvement in disparity range compared with [2-5]. Note that [2-5] all lack a standard benchmark evaluation because of their limited depth range. Our system consumes 836mW to process 30fps full HD images at 0.0262nJ ‘normalized energy’ (an FoM proposed in [4] and defined in Fig. 3.7.6 top, left), marking a 5.8× improvement over listed prior work. Power reduces to 55mW for VGA images at 30fps, yielding 0.0117nJ normalized energy. Fig. 3.7.7 shows the die photo and a performance summary. Acknowledgements: We thank TSMC University Shuttle Program for chip fabrication. References: [1] H. Hirschmuller, et al., “Accurate and Efficient Stereo Processing by Semi- Global Matching and Mutual,” Computer Vision and Pattern Recognition, pp. 807-814, 2005. [2] M. Hariyama, et al., “VLSI Processor For Reliable Stereo Matching Based on Window-Parallel Logic-In-Memory Architecture,” IEEE Symp. VLSI Circuits, pp. 166-169, 2004. [3] K. Lee, et al., “A 502GOPS and 0.984mW Dual-Mode ADAS SoC with RNN- FIS Engine for Intention Prediction in Automotive Black-Box System,” ISSCC, pp. 256-257, 2016. [4] H-H. Chen, et al., “A 1920×1080 30fps 611mW Five-View Depth-Estimation Processor for Light-Field Applications,” ISSCC, pp. 422-423, 2015. [5] J. Park, et al., “A 30fps Stereo Matching Processor Based on Belief Propagation with Disparity-Parallel PE Array Architecture,” ISCAS, pp. 453-454, 2010. 978-1-5090-3758-2/17/$31.00 ©2017 IEEE