IEEE TRANSACTIONS ON IMAGE PROCESSING, SPECIAL ISSUE ON DISTRIBUTED CAMERA NETWORKS, VOL. ?, NO. ?, ? 2010

3D Target-based Distributed Smart Camera Network Localization

John Kassebaum, Member, IEEE, Nirupama Bulusu, Member, IEEE, and Wu-chi Feng, Member, IEEE

Abstract—For distributed smart camera networks to perform vision-based tasks such as subject recognition and tracking, every camera's position and orientation relative to a single 3D coordinate frame must be accurately determined. In this paper, we present a new camera network localization solution that requires successively showing a 3D, feature point-rich target to all cameras in the network. Using the known geometry of the 3D target, cameras estimate and decompose projection matrices to compute their position and orientation relative to the coordinatization of the 3D target's feature points. As each 3D target position establishes a distinct coordinate frame, cameras that view more than one target position compute the translations and rotations relating the different positions' coordinate frames, then share these transforms with neighbors to realign all cameras to a single coordinate frame established by one chosen target position. Compared to previous localization solutions that use opportunistically found visual data, our solution is better suited to battery-powered, processing-constrained camera networks because it requires only pairwise view overlaps large enough to see the 3D target and detect its feature points, and requires communication only to determine simultaneous target viewings and to pass transform data. Finally, our solution gives camera positions in a 3D coordinate frame with meaningful units. We evaluate our algorithm in both real and simulated smart camera network deployments. In the real deployment, position error is less than 1" when the 3D target's feature points fill only 2.9% of the frame area.
Index Terms—Camera calibration, smart cameras, camera network localization.

I. INTRODUCTION

Distributed smart camera networks consist of multiple cameras whose visual data is collectively processed to perform a task. The area covered by a distributed smart camera network can be small, viewing only a table for 3D capture and reconstruction of an object; moderate, covering a room, perhaps in a health care facility; or very large, covering an office building, airport, or outdoor environment for documentation, surveillance, or security.

Localization of a smart camera network means determining all camera positions and orientations relative to a single 3D coordinate frame. Once localized, a distributed smart camera network can track a subject moving through the network by determining the subject's trajectory and triggering other cameras that are likely to soon view the subject. If the localization method provides camera positions in meaningful units such as feet or meters, the network can determine the actual size, depth, and position of detected subjects and objects, facilitating recognition and movement interpretation.

Due to obstructions in the deployment environment, such as walls or uneven terrain, hand-measuring camera positions and orientations is time-consuming and error-prone. GPS is not accurate enough for vision-based tasks, nor does it provide camera orientation information. It is possible to use a network's available visual data to accurately localize the network, but these techniques impose a deployment constraint: the network's vision graph—in which vertices are cameras and edges indicate some view overlap—must be connected. A connected vision graph not only implies that each camera's view overlaps at least one other camera's, but also that some cameras in the network, if not most, have separate view overlaps with two or more cameras.

Vision-based localization has been well studied.
The most recent solutions opportunistically search for robustly identifiable world features and correlate them between pairs of cameras with view overlaps [1], [2], [3]. The correlated features are used to estimate either the essential or the fundamental matrix for two cameras with overlapping views, which, when decomposed, provides the camera pair's relative position and orientation, the data needed for network localization [4], [5]. The appeal of essential and fundamental matrix estimation methods—that they require image data only—can also be considered a shortcoming, because they provide relative camera positions only up to an unknown scale factor, which varies for each pairwise localization. To adjust each pairwise localization to fit into a single network-wide coordinate frame, some solutions require triple-wise camera overlaps, implying the need for densely deployed networks. More recent solutions wave an LED-lit rod of known length through every camera's view, providing the means to establish a consistent scale [6], [7].

Our localization solution extends the advantage of the LED-lit rod by using a simple, feature point-rich 3D target of known geometry. A 3D target provides, in a single frame, all the feature points one camera needs to determine its position and orientation relative to the target. Figure 1 shows a 3D target we designed and used to localize a small network. It has 288 3D feature points, far more than are needed for accurate localization, and includes colored areas to facilitate detection and correlation of the feature points projected to an image. When a smart camera images the 3D target, it uses the well-known DLT method [4], [8] to estimate a projection matrix from the feature points' known 3D and detected 2D coordinates.
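As background for this estimation-and-decomposition step, the sketch below shows how DLT projection matrix estimation from known 3D–2D correspondences, and the subsequent split into intrinsics, orientation, and camera centre, can be carried out with numpy. This is a minimal illustration, not the implementation used in this work; the function names are hypothetical.

```python
import numpy as np

def estimate_projection_matrix(X, x):
    """DLT: estimate the 3x4 projection matrix P from n >= 6
    correspondences between 3D points X (n x 3) and their
    detected 2D image projections x (n x 2)."""
    A = []
    for (Xw, Yw, Zw), (u, v) in zip(X, x):
        Xh = [Xw, Yw, Zw, 1.0]               # homogeneous 3D point
        A.append([0, 0, 0, 0] + [-c for c in Xh] + [v * c for c in Xh])
        A.append(Xh + [0, 0, 0, 0] + [-u * c for c in Xh])
    # P is the right singular vector of A with the smallest singular value
    _, _, Vt = np.linalg.svd(np.asarray(A))
    return Vt[-1].reshape(3, 4)

def decompose_projection_matrix(P):
    """Split P ~ K [R | t] into intrinsics K, rotation R, and the
    camera centre C (the camera's position in the target's frame)."""
    M = P[:, :3]
    # RQ decomposition of M via QR of the row-reversed transpose
    rev = np.flipud(np.eye(3))
    Q, U = np.linalg.qr((rev @ M).T)
    K = rev @ U.T @ rev
    R = rev @ Q.T
    # force a positive diagonal on K (D is its own inverse)
    D = np.diag(np.sign(np.diag(K)))
    K, R = K @ D, D @ R
    if np.linalg.det(R) < 0:
        R = -R          # P is defined only up to sign; make R a proper rotation
    C = -np.linalg.solve(M, P[:, 3])   # camera centre: P [C; 1] = 0
    return K / K[2, 2], R, C
```

With exact, noise-free correspondences from at least six non-coplanar points, the recovered centre and rotation match the ground truth; real images typically call for point normalization and a nonlinear refinement on top of this linear estimate.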
It then decomposes the estimated projection matrix to extract its position and orientation relative to the 3D target’s coordinate frame, which is represented by the coordinatization of the 3D feature points. While the 3D target shown in Figure 1 is only one possible design, a 3D target is required for projection matrix estimation from a single frame. 2D planar