Wide RGB-D for Scaled Layout Reconstruction

Alejandro Perez-Yus, Gonzalo Lopez-Nicolas, Jose J. Guerrero

Perceiving the 3D information of a scene has always been one of the most important topics in computer vision and robotics. The advent of consumer RGB-D cameras has had a great positive impact on the field. Unfortunately, these devices usually have a field of view (FOV) too narrow for certain applications, so the camera must be moved to capture different views of the scene. Our goal is to reconstruct the structure of the scene, with scale, in a single shot. To achieve this, we propose to use a color camera with a wide FOV to extend the depth information, in a novel hybrid camera configuration composed of a depth camera and a fisheye camera (Fig. 1a). Once the cameras are calibrated [1], the system views over 180° of color information, and the central part of the image also has depth data (Fig. 1b). To our knowledge, this is the first time this configuration has been used, although the interest in such sensor pairing is clear in new devices such as Google Tango.

In particular, we propose to extend the 3D information in a single shot via spatial layout estimation. Our layout estimation method is based on line segments from the fisheye image, and provides scaled solutions rooted in the seed depth information. As a result, a final 3D scene reconstruction is provided (see Fig. 1c). The 3D room layout can be seamlessly merged with the original depth information to generate a 3D image in which the periphery provides an estimate of the spatial context for the central part of the image, where the depth is known with good certainty. The collaboration between cameras is bidirectional: the fisheye extends the scene layout to the periphery, while the depth information is used both to enhance the layout estimation algorithm and to scale the solution. A scheme of the whole algorithm is shown in Fig. 2.
In detail, the depth camera provides a region of the image with 3D data, from which an initial estimate of the Vanishing Points (VPs) and 3D planes can be recovered. We assume a Manhattan World [2], and the VPs are used to retrieve the scene orientation needed to generate layout proposals. The extracted 3D planes are used to find the floor and provide scale, which is impossible to obtain otherwise from a single shot with no prior knowledge of the scene. Having scale has many advantages for this type of method, which usually relies on many heuristics. For instance, when tuning parameters, their values can be grounded in reality with real measurement units. Depth information is also used to filter hypotheses and to reward line segments corresponding to planar intersections.

Fig. 1: (a) Fields of view of our proposed system, composed of a fisheye and an RGB-D camera. (b) The depth information in the center is extended to the periphery by combining it with the line segments that we use to extract the spatial layout of the scene. (c) At the end we obtain a 3D reconstruction of the scene with scale.

Instituto de Investigación en Ingeniería de Aragón (I3A), Universidad de Zaragoza, 50018, Zaragoza, Spain. alperez@unizar.es, gonlopez@unizar.es, josechu.guerrero@unizar.es
This work was supported by the projects DPI2014-61792-EXP and DPI2015-65962-R (MINECO/FEDER, UE), and the grant BES-2013-065834 (MINECO).

The line segments from the wide image are classified according to the three Manhattan directions. The horizontal lines are projected either to the floor plane or to the estimated ceiling plane to obtain the 3D position of the segment in the real world. Structural corner candidates are then sought by considering plausible, simple cases of line distribution. These corners are evaluated by our scoring function, so that layout hypotheses are proposed according to the probability of these corners occurring in the real world.
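The projection of a horizontal segment onto the floor plane can be illustrated with a plain ray-plane intersection. The following is only a minimal sketch under stated assumptions, not the authors' implementation: the camera center is taken as the origin, the viewing rays of the segment endpoints are assumed to be already available from the fisheye calibration (the camera model itself is not shown), and the floor is written as `normal . X + offset = 0` in metres, with `offset` assumed to come from the depth data. All function and variable names are hypothetical.

```python
import math
from typing import Optional, Tuple

Vec3 = Tuple[float, float, float]


def dot(a: Vec3, b: Vec3) -> float:
    """Dot product of two 3D vectors."""
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2]


def intersect_ray_with_plane(ray: Vec3, normal: Vec3, offset: float,
                             eps: float = 1e-9) -> Optional[Vec3]:
    """Intersect the ray X = t * ray (t > 0) with the plane normal . X + offset = 0.

    The camera center is assumed to be the origin. The ray does not need to
    be normalized. Returns None if the ray is (nearly) parallel to the plane
    or if the intersection lies behind the camera.
    """
    denom = dot(normal, ray)
    if abs(denom) < eps:
        return None  # ray parallel to the plane
    t = -offset / denom
    if t <= 0.0:
        return None  # plane is behind the camera along this ray
    return (t * ray[0], t * ray[1], t * ray[2])


def project_segment_to_plane(ray_a: Vec3, ray_b: Vec3, normal: Vec3,
                             offset: float) -> Optional[Tuple[Vec3, Vec3]]:
    """Project both endpoint rays of an image segment onto the same plane."""
    p_a = intersect_ray_with_plane(ray_a, normal, offset)
    p_b = intersect_ray_with_plane(ray_b, normal, offset)
    if p_a is None or p_b is None:
        return None
    return p_a, p_b


# Hypothetical setup: y axis pointing down, floor 1.5 m below the camera.
# Floor points satisfy y = 1.5, i.e. (0, -1, 0) . X + 1.5 = 0.
floor_normal: Vec3 = (0.0, -1.0, 0.0)
floor_offset = 1.5

segment = project_segment_to_plane((-1.0, 1.0, 2.0), (1.0, 1.0, 2.0),
                                   floor_normal, floor_offset)
if segment is not None:
    length_m = math.dist(segment[0], segment[1])
    print(segment, length_m)
```

Because the plane offset is expressed in metres, the recovered segment is metric as well: in this toy setup the two endpoints land 3 m apart on the floor. Under the same assumptions, a segment believed to lie on the ceiling would use the same call with the estimated ceiling plane offset instead.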
Then, layout hypotheses are generated based on geometrically coherent wall distributions that do not contradict the initial depth information and the observable segments. The algorithm is able to work even under heavy clutter thanks to the combination of lines from both floor and ceiling (made possible by the large FOV camera), and also thanks to our generation of Manhattan hypotheses, which can estimate hidden corners to complete the layout. For the evaluation stage we