Robust Camera Self-Calibration from Monocular Images of Manhattan Worlds

Horst Wildenauer, Allan Hanbury
Vienna University of Technology
Institute of Software Technology and Interactive Systems
horst.wildenauer@gmail.com, hanbury@ifs.tuwien.ac.at

Abstract

We focus on the detection of orthogonal vanishing points using line segments extracted from a single view, and on using these for camera self-calibration. Recent methods treat this problem as a two-stage process: vanishing points are extracted through line segment clustering, and likely orthogonal candidates are subsequently selected for calibration. Unfortunately, such an approach is easily distracted by the presence of clutter. Furthermore, geometric constraints imposed by the camera and scene orthogonality are not enforced during detection, leading to inaccurate results which are often inadmissible for calibration. To overcome these limitations, we present a RANSAC-based approach using a minimal solution for estimating three orthogonal vanishing points and the focal length from a set of four lines, aligned with either two or three orthogonal directions. In addition, we propose to refine the estimates using an efficient and robust Maximum Likelihood Estimator. Extensive experiments on standard datasets show that our contributions result in significant improvements over the state-of-the-art.

1. Introduction

In their seminal work, Coughlan & Yuille [2] pointed out that imagery of man-made environments can often be characterized by a predominance of orthogonal structures, coining the name Manhattan world. Such orthogonal structures provide invaluable cues about a camera’s orientation w.r.t. the world coordinate frame and its internal parameters (mostly the focal length). This information is usually extracted from three finite, mutually orthogonal vanishing points [1].
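To make the calibration cue concrete: under the common assumptions of square pixels, zero skew, and a known principal point, the orthogonality constraint between two finite vanishing points already determines the focal length (a standard result, see e.g. [1]; the function below is an illustrative sketch, not the method of this paper):

```python
import math

def focal_from_orthogonal_vps(vp1, vp2, principal_point):
    """Focal length from two finite, mutually orthogonal vanishing points.

    Assuming square pixels, zero skew, and principal point (cx, cy), the
    constraint v1^T * omega * v2 = 0 on the image of the absolute conic
    reduces to  f^2 = -((u1 - cx)(u2 - cx) + (v1 - cy)(v2 - cy)).
    Returns None when f^2 is non-positive (e.g. a near-infinite vanishing
    point makes the configuration degenerate).
    """
    cx, cy = principal_point
    f_sq = -((vp1[0] - cx) * (vp2[0] - cx) + (vp1[1] - cy) * (vp2[1] - cy))
    return math.sqrt(f_sq) if f_sq > 0 else None
```

For example, a camera with f = 500 and principal point (320, 240) maps the orthogonal directions (1, 0, 1) and (-1, 0, 1) to the vanishing points (820, 240) and (-180, 240), from which the function recovers f = 500.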
The detection of vanishing points has been applied to scene understanding and single view reconstruction of indoor scenes [6, 10], architecture reconstruction [17], detection of rectangular structures [9, 12], and multi-view stereo [4].

1.1. Related Work & Contributions

We concern ourselves with the detection of orthogonal vanishing points using sets of line segments extracted from a single, uncalibrated view. In [18], Rother suggests an exhaustive search over vanishing point hypotheses obtained from all possible line intersections to find dominant orthogonal directions and the most plausible camera parameters. In practice, Rother’s method suffers from high computational cost, which other methods try to reduce by a two-step process. First, vanishing points are estimated from concurrent line segments, either through iterative procedures [14, 15] or by simultaneous clustering [8, 20]. Then, the camera calibration is estimated from a plausible orthogonal vanishing point triplet [1, 7, 11].

Iterative methods perform RANSAC-based clustering of line segments, making use of line intersections to generate vanishing point hypotheses. After the vanishing point with maximal support is found, its consensus set is removed and the procedure is repeated in search of remaining vanishing points. This approach suffers from severe limitations: (a) Line segments are often compatible with more than one vanishing point. Depending on the spatial tolerance, they can either be wrongly fused into one cluster, or multiple detections of one vanishing point occur. (b) The vanishing points are in general not compliant with the constraints imposed by the camera and the scene orthogonality, causing inaccuracies or complete failure in calibration.

Recently, Tardif [20] attacked the first problem, using a robust clustering technique specifically designed for the treatment of multiple models.
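The greedy sequential scheme criticized above can be sketched as follows; the support criterion, the angular threshold, and all function names are illustrative choices for this sketch, not those of any particular published method:

```python
import random
import numpy as np

def line_coeffs(seg):
    """Homogeneous line through a segment's endpoints ((x1, y1), (x2, y2))."""
    (x1, y1), (x2, y2) = seg
    return np.cross([x1, y1, 1.0], [x2, y2, 1.0])

def consistent(seg, vp, thresh_deg=2.0):
    """A segment supports a vanishing point if its direction deviates from
    the midpoint-to-vp direction by less than thresh_deg degrees."""
    (x1, y1), (x2, y2) = seg
    d = np.array([x2 - x1, y2 - y1])
    mid = np.array([(x1 + x2) / 2.0, (y1 + y2) / 2.0])
    if abs(vp[2]) < 1e-12:                 # vanishing point at infinity:
        to_vp = np.array(vp[:2])           # compare against its direction
    else:
        to_vp = np.array(vp[:2]) / vp[2] - mid
    cosang = abs(d @ to_vp) / (np.linalg.norm(d) * np.linalg.norm(to_vp) + 1e-12)
    return cosang > np.cos(np.radians(thresh_deg))

def sequential_ransac_vps(segments, n_vps=3, iters=500, seed=0):
    """Greedy detection: hypothesize a vanishing point from the intersection
    of two sampled lines, keep the best-supported one, remove its consensus
    set, and repeat. No orthogonality or camera constraints are enforced,
    which is exactly the weakness discussed in the text."""
    rng = random.Random(seed)
    remaining = list(segments)
    vps = []
    for _ in range(n_vps):
        if len(remaining) < 2:
            break
        best_vp, best_inliers = None, []
        for _ in range(iters):
            a, b = rng.sample(remaining, 2)
            vp = np.cross(line_coeffs(a), line_coeffs(b))
            inliers = [s for s in remaining if consistent(s, vp)]
            if len(inliers) > len(best_inliers):
                best_vp, best_inliers = vp, inliers
        vps.append(best_vp)
        remaining = [s for s in remaining if s not in best_inliers]
    return vps
```

On synthetic segments from two pencils of lines, the sketch recovers both intersection points; on real imagery, its greedy consensus removal exhibits the fusion and multiple-detection problems listed as (a) above.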
Among others, Kosecka & Zhang [8] suggested simultaneous optimization of vanishing points using Expectation-Maximization. However, orthogonality and camera constraints are not enforced, and it is not clear whether their initialization based on line segment orientation finds all relevant vanishing points.

Our RANSAC-based approach addresses both limitations in a unified framework. We explicitly exploit orthogonality and camera constraints during hypothesis generation, and thereby make better use of the available data. After the RANSAC stage the quality of the results can be refined, for