TerrainNet: Visual Modeling of Complex Terrain for High-speed, Off-road Navigation

Xiangyun Meng1, Nathan Hatch1, Alexander Lambert1, Anqi Li1, Nolan Wagener2, Matthew Schmittle1, JoonHo Lee1, Wentao Yuan1, Zoey Chen1, Samuel Deng1, Greg Okopal1, Dieter Fox1, Byron Boots1, Amirreza Shaban1

1University of Washington  2Georgia Institute of Technology

https://sites.google.com/view/visual-terrain-modeling

Abstract—Effective use of camera-based vision systems is essential for robust performance in autonomous off-road driving, particularly in the high-speed regime. Despite success in structured, on-road settings, current end-to-end approaches for scene prediction have yet to be successfully adapted for complex outdoor terrain. To this end, we present TerrainNet, a vision-based terrain perception system for semantic and geometric terrain prediction in aggressive, off-road navigation. The approach relies on several key insights and practical considerations for achieving reliable terrain modeling. The network includes a multi-headed output representation to capture the fine- and coarse-grained terrain features necessary for estimating traversability. Accurate depth estimation is achieved using self-supervised depth completion with multi-view RGB and stereo inputs. Requirements for real-time performance and fast inference are met using efficient, learned image feature projections. Furthermore, the model is trained on a large-scale, real-world off-road dataset collected across a variety of diverse outdoor environments. We show how TerrainNet can also be used for costmap prediction and provide a detailed framework for integration into a planning module. We demonstrate the performance of TerrainNet through extensive comparison to current state-of-the-art baselines for camera-only scene prediction.
Finally, we showcase the effectiveness of integrating TerrainNet within a complete autonomous-driving stack by conducting a real-world vehicle test in a challenging off-road scenario.

I. INTRODUCTION

Autonomous robot navigation in off-road environments has seen a wide range of applications, including search and rescue [50], agriculture [13], planetary exploration [48, 51], and defense [28]. Unlike indoor or on-road environments, where traversable and non-traversable areas are clearly separated, off-road terrain exhibits a wide range of traversability that requires a comprehensive understanding of the semantics and geometry of the terrain (Figure 1). Current off-road navigation systems typically rely on LiDAR to obtain a 3D point cloud of the environment for semantic and geometric analysis [19, 33, 44, 45, 46]. While LiDAR sensors provide accurate spatial information, the resulting point cloud is rather sparse, making it difficult to build a complete map of the environment. Though point cloud aggregation can build such a map, it faces challenges when the vehicle travels at high speeds [20]. Finally, since LiDAR emits laser pulses into the environment, dust and snow can interfere with the measurements, and outside observers can detect the vehicle from the emitted light.

Distribution Statement A. Approved for Public Release, Distribution Unlimited.

Fig. 1: High-speed driving in complex off-road environments requires joint reasoning about terrain semantics and geometry. Top row: a vehicle can drive at high speed on a dirt road but has to be more cautious in snow due to wheel slip. Bottom row: a vehicle needs to estimate terrain slopes and the size of vegetation for safe planning and control.

Cameras, on the other hand, provide a number of benefits over LiDAR: they provide high-resolution semantic and geometric information, offer stealth due to their passive sensing nature, are less affected by dust and snow, and are considerably cheaper.
Hence, a camera-only off-road terrain perception system can potentially reduce hardware cost, improve system robustness at high speeds, and open up new possibilities for off-road navigation under extreme weather conditions and where stealth is desired.

Perhaps unsurprisingly, similar motivations have spurred recent major efforts in camera-only perception for on-road navigation [11, 22, 31, 35, 55]. This task mainly focuses on Bird's Eye View (BEV) semantic segmentation to assess traffic conditions. One notable work is Lift-Splat-Shoot (LSS) [35]. The core of LSS consists of a "lift" operation that predicts a categorical distribution over depth for each pixel and a "splat" operation that fuses the image features and projects them into the BEV space. LSS and related works are entirely data-driven, so they can predict complete maps and are more robust to sensor noise and projection errors. However, their applicability to off-road perception faces several barriers. First and foremost, they only predict a ground semantic BEV map without any 3D

arXiv:2303.15771v3 [cs.RO] 29 May 2023
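As a rough illustration of the lift and splat operations described above, the following is a minimal NumPy sketch, not the authors' implementation: the tensor shapes, the sum-pooling into grid cells, and the precomputed BEV cell indices `bev_xy` (which in practice come from camera intrinsics and extrinsics) are simplifying assumptions.

```python
import numpy as np

def lift(features, depth_logits):
    """Lift: expand per-pixel image features into a frustum of features,
    weighted by a categorical (softmax) distribution over depth bins.
    features: (H, W, C); depth_logits: (H, W, D) -> frustum: (H, W, D, C)."""
    e = np.exp(depth_logits - depth_logits.max(axis=-1, keepdims=True))
    depth_prob = e / e.sum(axis=-1, keepdims=True)  # softmax over depth bins
    # Outer product of depth distribution and feature vector per pixel
    return depth_prob[..., None] * features[..., None, :]

def splat(frustum, bev_xy, grid_size):
    """Splat: sum-pool frustum features into a BEV grid, using precomputed
    (x, y) cell indices for every (pixel, depth-bin) point.
    frustum: (H, W, D, C); bev_xy: (H, W, D, 2) int -> bev: (G, G, C)."""
    H, W, D, C = frustum.shape
    bev = np.zeros((grid_size, grid_size, C))
    flat_feats = frustum.reshape(-1, C)
    flat_idx = bev_xy.reshape(-1, 2)
    for (x, y), f in zip(flat_idx, flat_feats):
        if 0 <= x < grid_size and 0 <= y < grid_size:  # drop out-of-map points
            bev[y, x] += f
    return bev
```

Because the depth distribution sums to one, summing the frustum over the depth axis recovers the original per-pixel features; the BEV map then accumulates feature mass only in cells that some (pixel, depth-bin) point projects into.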