Real-Time Localization and Dense Mapping in Underwater Environments from a Monocular Sequence

Alejo Concha*, Paulo Drews-Jr†‡, Mario Campos‡ and Javier Civera*
* I3A - Universidad de Zaragoza, Spain, alejocb,jcivera@unizar.es
† NAUTEC - Universidade Federal do Rio Grande, Brazil, paulodrews@furg.br
‡ VeRLab - Universidade Federal de Minas Gerais, Brazil, mario@dcc.ufmg.br

Abstract—In this paper we present an algorithm that estimates, in real time, a dense 3D reconstruction of an underwater scene and the vehicle pose, with a monocular image sequence as its only input. Our algorithm selects a set of keyframes from a seabed sequence and estimates a depth for every pixel from the information contained in the images using direct mapping methods. The procedure requires no extra sensing input or assumptions about the scene. Our experimental results on a pool and a seabed sequence show that such a minimal sensing configuration can achieve a high degree of accuracy.

I. INTRODUCTION

The accurate three-dimensional reconstruction of underwater scenes is an active area of research with three main areas of application: the autonomous navigation of underwater robots [1], the registration of natural seabed environments for later study [2], [3], and the inspection of underwater structures (e.g., marinas, ship hulls or pipelines) for assessment and maintenance [4], [5]. The basic algorithms come from the robotics field of SLAM (Simultaneous Localization and Mapping) [6], which aims to estimate the robot pose and a 3D geometric map of the scene from sensor data. The rapid attenuation of electromagnetic signals in the aquatic medium constrains the sensing possibilities of marine robots; e.g., neither GPS nor LIDAR can be used. [7], [8] are two recent surveys on underwater localization, mapping and navigation.
Sonar has been successfully used in structured, marina-like environments [1], [9], but it captures limited information about the environment with low accuracy.

In underwater images, light suffers absorption and scattering by the medium before it reaches the camera, generating an effect called haze. Haze is a serious issue because it reduces the overall contrast of the images and causes a color shift, directly degrading visibility. Despite these limitations, vision stands out as an important alternative in most applications due to its low cost, rich short-range information and high frame rate.

The existing research in underwater visual SLAM has predominantly used stereo cameras, from the early approach of [10] to more recent works showing large reconstructions, e.g., [11], [12], [13]. In most cases, visual sensors are fused with inertial measurements, Doppler velocities or depth pressure sensors [14], [15]. This requirement limits the applicability of the algorithms, as all of these sensors are only available on large and expensive vehicles. Another key limitation of these traditional methods is their use of feature-based reconstruction techniques [16], [17], meaning that they can only reconstruct a sparse set of salient image points. These methods are able to estimate the camera pose very accurately, but the sparseness of the estimated maps makes them inappropriate for autonomous robotic navigation. Dense reconstructions can be built on top of these sparse point clouds via triangulation [13]; the assumption there is that low-gradient areas between salient points are planar, leading to inaccurate results if the density of salient points in the image is low. The recent work [18] uses a region-growing algorithm to expand a feature-based reconstruction into a denser one. Both approaches are typically computationally expensive, making them unsuitable for online robot navigation.
Our main contribution is the use of direct monocular SLAM methods [19], [20] that achieve real-time, dense –one point per image pixel– 3D reconstructions from a monocular sequence as the only input. Notice that our proposal overcomes the two limitations mentioned in the paragraphs above: we use a minimal, low-cost sensor configuration of one camera, suitable for small vehicles; and we achieve dense, one-point-per-pixel 3D reconstructions without relying on any extra assumptions. This technique opens new opportunities for the exploration of benthic areas using cheap and small vehicles.

The rest of the paper is organized as follows. Section II describes a classification algorithm that rejects hazy image regions. Section III describes the direct SLAM algorithm. Section IV shows the experimental results and Section V gives the conclusions.

II. HAZE CLASSIFICATION

As we use a forward-looking camera that might be imaging scenes at a large depth, part of the image might be hazy and useless for a reconstruction algorithm. We use an SVM classification scheme to identify such hazy areas.

First, we segment the image I into a set of superpixels Ω = {S_1, ..., S_i, ...}. Superpixels are image regions of homogeneous color. In this work, we use the superpixel segmentation proposed in [21]. See Figures 1(a) and 1(b) for an example of an underwater image and its segmentation into superpixels.
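The segment-then-classify pipeline above can be sketched in a few lines. This is a minimal, hypothetical illustration, not the paper's implementation: a regular grid stands in for the superpixel segmentation of [21], a simple contrast threshold stands in for hand-labeled training data, and the per-superpixel features (mean color and intensity standard deviation) are illustrative choices rather than the paper's descriptors.

```python
import numpy as np
from sklearn.svm import SVC

def grid_superpixels(h, w, size):
    """Regular-grid stand-in for the superpixel segmentation of [21]."""
    rows = np.arange(h) // size
    cols = np.arange(w) // size
    n_cols = (w + size - 1) // size
    return rows[:, None] * n_cols + cols[None, :]

def superpixel_features(image, labels):
    """Hypothetical haze cues per superpixel: mean RGB and intensity std."""
    feats = []
    for sp in np.unique(labels):
        pixels = image[labels == sp]
        feats.append(np.r_[pixels.mean(axis=0), pixels.std()])
    return np.array(feats)

# Synthetic frame: left half flat and bright ("hazy"), right half textured.
rng = np.random.default_rng(0)
img = np.empty((60, 60, 3))
img[:, :30] = 0.7 + 0.01 * rng.standard_normal((60, 30, 3))
img[:, 30:] = rng.random((60, 30, 3))

labels = grid_superpixels(60, 60, 10)   # 6 x 6 = 36 "superpixels"
X = superpixel_features(img, labels)

# Stand-in training labels (hand annotations in practice): 1 = usable texture.
y = (X[:, -1] > 0.15).astype(int)
clf = SVC(kernel="rbf").fit(X, y)
pred = clf.predict(X)
print(pred.reshape(6, 6))  # low-contrast left-hand blocks flagged as hazy (0)
```

Superpixels classified as hazy would then simply be excluded from the depth estimation, so the mapping stage only consumes regions with enough contrast to be matched across keyframes.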