Projective Factorization of Multiple Rigid-Body Motions

Ting Li    Vinutha Kallem    Dheeraj Singaraju    René Vidal
Center for Imaging Science and Department of Mechanical Engineering, Johns Hopkins University
308B Clark Hall, 3400 N Charles St., Baltimore MD 21218, USA
http://www.vision.jhu.edu/

Abstract

Given point correspondences in multiple perspective views of a scene containing multiple rigid-body motions, we present an algorithm for segmenting the correspondences according to the multiple motions. We exploit the fact that when the depths of the points are known, the point trajectories associated with a single motion live in a subspace of dimension at most four. Thus motion segmentation with known depths can be achieved by methods of subspace separation, such as GPCA or LSA. When the depths are unknown, we proceed iteratively. Given the segmentation, we compute the depths using standard techniques. Given the depths, we use GPCA or LSA to segment the scene into multiple motions. Experiments on the Hopkins155 database show that our method outperforms existing affine methods in terms of segmentation error and execution time. Our method achieves an error of 2.5% on the 155 sequences.

1. Introduction

The ability to extract scene geometry and motion is critical to many applications in computer vision, such as image-based rendering, 3D localization and mapping, mosaicing, etc. Often, only a video sequence of the scene is available, with no prior knowledge about its structure or motion. This has motivated the following problem in computer vision:

Given multiple images taken by a rigidly moving camera observing a static scene, recover the camera motion and scene structure from point correspondences in multiple views.

When the projection model is affine, this problem can be solved via direct factorization of the matrix of point correspondences W = MS into its motion and structure components M and S, respectively [12].
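The affine factorization above reduces to a truncated SVD. The following is a minimal sketch in the spirit of such factorization methods (the rank-4 choice and the symmetric splitting of the singular values between M and S are illustrative conventions, not the papers' exact formulations):

```python
import numpy as np

def affine_factorization(W, rank=4):
    """Factor a 2F x P trajectory matrix W into motion M (2F x rank)
    and structure S (rank x P) via a truncated SVD. Under the affine
    camera model, W has rank at most 4."""
    U, s, Vt = np.linalg.svd(W, full_matrices=False)
    sq = np.sqrt(s[:rank])
    M = U[:, :rank] * sq          # motion component, 2F x rank
    S = sq[:, None] * Vt[:rank]   # structure component, rank x P
    return M, S

# Synthetic rank-4 data: F frames, P points, no noise.
rng = np.random.default_rng(0)
F, P = 5, 20
W = rng.standard_normal((2 * F, 4)) @ rng.standard_normal((4, P))
M, S = affine_factorization(W)
print(np.allclose(M @ S, W))      # exact recovery for noiseless rank-4 W
```

The factorization is only unique up to an invertible 4 x 4 transformation; recovering a metric reconstruction requires additional constraints on M.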
In the case of perspective cameras, the depths λ of the point correspondences are not known, so the matrix of point correspondences W(λ) cannot be factorized directly. Algebraic methods proceed by algebraically eliminating the depths, solving for the camera motion using two-view and three-view geometry, and computing the depths using triangulation [4]. However, these methods have difficulties handling all views simultaneously.

The Sturm/Triggs (ST) algorithm [11] obtains an initial estimate of the depths λ using two-view geometry. The matrix W(λ), which contains the point correspondences in all views, is then factorized into the motion and structure of the scene. Simple iterative extensions to the ST algorithm (SIESTA) [14, 4] alternate between the estimation of motion and structure and the estimation of the depths. Unfortunately, without proper initialization, SIESTA can converge to a trivial solution where all the depths are zero. [8] proposed an extension of SIESTA that incorporates additional constraints into the optimization problem. However, [9] showed that in spite of such constraints, SIESTA can still result in trivial solutions. To overcome these issues, [9] proposed a provably convergent method called CIESTA, which uses regularization in the optimization to keep the estimated depths close to their correct values.

Over the past few years, there has been increasing interest in extending motion estimation methods to scenes with multiple motions. This requires one to group the points according to the different motions before applying standard motion estimation techniques to each group. In computer vision, this is known as the motion segmentation problem:

Given point trajectories corresponding to n objects undergoing n different rigid-body motions relative to the camera, cluster the trajectories according to the n motions.
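The ST/SIESTA-style alternation between factorization and depth estimation can be sketched as follows. This is a hypothetical, stripped-down illustration rather than the exact algorithms of [11, 14, 4]: the uniform depth initialization and the projection-based depth update are simplifying assumptions, and the renormalization step is the kind of scale fix used to avoid the trivial all-zero depths:

```python
import numpy as np

def siesta_sketch(x, n_iters=20):
    """Illustrative ST/SIESTA-style alternation (simplified sketch).
    x: F x P x 3 array of homogeneous image points.
    Step 1: rank-4 factorization of the rescaled matrix W(lambda),
            whose rows stack lambda_ij * x_ij for each frame.
    Step 2: depth update by projecting each point onto its current
            reconstruction, then renormalization of the depths."""
    F, P, _ = x.shape
    lam = np.ones((F, P))                              # crude depth initialization
    for _ in range(n_iters):
        W = (lam[:, :, None] * x).transpose(0, 2, 1).reshape(3 * F, P)
        U, s, Vt = np.linalg.svd(W, full_matrices=False)
        M = U[:, :4] * s[:4]                           # projective motion, 3F x 4
        S = Vt[:4]                                     # projective structure, 4 x P
        R = (M @ S).reshape(F, 3, P).transpose(0, 2, 1)
        # Least-squares depth update: project each point onto its reconstruction.
        lam = np.einsum('fpk,fpk->fp', R, x) / np.einsum('fpk,fpk->fp', x, x)
        lam /= np.linalg.norm(lam)                     # fix the overall depth scale
    return M, S, lam

# Consistent synthetic data: build x from an exactly rank-4 W with unit depths.
rng = np.random.default_rng(0)
F, P = 4, 15
W_true = rng.standard_normal((3 * F, 4)) @ rng.standard_normal((4, P))
x = W_true.reshape(F, 3, P).transpose(0, 2, 1)
M, S, lam = siesta_sketch(x)
W_fit = (lam[:, :, None] * x).transpose(0, 2, 1).reshape(3 * F, P)
print(np.allclose(M @ S, W_fit))                       # rank-4 fit of W(lambda)
```

On real data, the uniform initialization used here is exactly what makes SIESTA fragile; the two-view initialization of ST and the regularization of CIESTA address this.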
This problem has been addressed mostly under the assumption of an affine camera model, wherein the trajectories associated with each motion live in a linear subspace of dimension four or less [12]. This subspace constraint was used by Costeira and Kanade (CK) [1] to propose a multi-frame 3-D motion segmentation algorithm based on thresholding the entries of the so-called shape interaction matrix Q. This matrix is built from the singular value decomposition (SVD) of the matrix of point trajectories W, and has the property that Q_ij = 0 when points i and j correspond to independent motions. However, this thresholding process is very sensitive to noise [2, 5]. Kanatani scales the entries of Q using the geometric Akaike information criterion for linear [5] and affine [6] subspaces. Gear [2] uses bipartite graph matching to threshold the entries of the row echelon