An Exploratory Analysis of Partner Action and Camera Control in a Video-Mediated Collaborative Task

Abhishek Ranjan
Department of Computer Science
University of Toronto
www.dgp.toronto.edu
aranjan@dgp.toronto.edu

Jeremy P. Birnholtz
Knowledge Media Design Institute
University of Toronto
www.kmdi.utoronto.ca
jeremy@kmdi.utoronto.ca

Ravin Balakrishnan
Department of Computer Science
University of Toronto
www.dgp.toronto.edu
ravin@dgp.toronto.edu

ABSTRACT
This paper reports on an exploratory experimental study of the relationships between physical movement and desired visual information in the performance of real-world, video-mediated collaborative tasks by geographically distributed groups. Twenty-three pairs of participants (one “helper” and one “worker”) linked only by video and audio completed a Lego construction task in one of three experimental conditions: a fixed scene camera, a helper-controlled pan-tilt-zoom camera, or a dedicated operator-controlled camera. Worker motion was tracked in 3-D space in all three conditions, as were all camera movements. Results suggest performance benefits for the operator-controlled condition, and the relationships between camera position/movement and worker action are explored to generate preliminary theoretical and design implications.

Categories and Subject Descriptors
H.5.3 [Group and Organization Interfaces]: Computer-Supported Cooperative Work

General Terms
Design, Experimentation, Human Factors.

Keywords
Camera control, computer-supported cooperative work, collaboration, video-mediated communication, video conferencing, motion tracking, computer vision, empirical studies.

1. INTRODUCTION
There is a range of settings in which a novice completing a complex real-world task may require expert assistance. Experts are not always physically proximate, however, so there is increasing interest in using collaboration technologies for tasks such as surgery in remotely located hospitals [2, 22], repair of equipment in remote locations (e.g., jet engines), operation of scientific equipment [7, 18], and others.

In developing technologies to support these tasks, there is growing evidence of the importance of providing the remote expert (the “helper”) with a video view of the workspace where the physical task is being performed by the “worker” [11, 19]. This shared visual context can then be used to facilitate the negotiation of “common ground” in the ongoing conversation between the helper and worker [6].

Providing this shared visual context, however, can be difficult when the task involves detailed manipulation and identification of objects while still requiring a higher-level overview of what is taking place. Fixed-view “scene cameras” provide a useful overview but little detail [11], while a camera mounted on the worker’s head provides greater detail but constrains the helper’s view to whatever the worker is focused on [10]. While it is possible to provide both detail and overview by allowing the user to control the camera or select between multiple shots, this has been shown to be potentially distracting, confusing, and time-consuming [10, 12].

An alternative approach, proposed by Ou et al. [26, 27], is to automate the provision of dynamic visual information by predicting what the helper will want to see. In this paper, we build on Ou et al.’s exploratory work by comparing three camera control conditions and by using high-quality motion tracking technology to track worker motion in three-dimensional space. We will show some evidence pointing to the utility of automated camera control, and identify patterns in behavior that can be used to develop design heuristics for future collaboration technologies.
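To make the automation idea concrete, the following minimal Python sketch (our illustration, not Ou et al.’s system or the apparatus used in this study) shows one way a tracked 3-D worker position might drive a pan-tilt camera. The camera interface, coordinate conventions, and dead-zone threshold are all hypothetical.

    import math

    # Illustrative only: re-aim a pan-tilt camera so that a tracked
    # point (e.g., the worker's dominant hand) stays framed.
    # Coordinates are assumed to be in the camera's own frame, with
    # z along the optical axis, x to the right, and y upward.

    def target_pan_tilt(x, y, z):
        """Pan/tilt angles (degrees) that would center the point (x, y, z)."""
        pan = math.degrees(math.atan2(x, z))
        tilt = math.degrees(math.atan2(y, math.hypot(x, z)))
        return pan, tilt

    def update_camera(camera, hand_pos, dead_zone_deg=3.0):
        """Re-aim only when the target drifts outside a small dead zone,
        to avoid the constant, distracting view changes that prior work
        associates with manually controlled cameras."""
        pan, tilt = target_pan_tilt(*hand_pos)
        if (abs(pan - camera.pan) > dead_zone_deg or
                abs(tilt - camera.tilt) > dead_zone_deg):
            camera.move_to(pan, tilt)  # hypothetical PTZ interface

A real controller would also need to smooth camera motion and choose a zoom level from task context; deciding when and how to do so is the kind of question the behavioral patterns reported later in this paper are meant to inform.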
2. BACKGROUND AND RELATED WORK
2.1 Providing Shared Visual Context
Shared visual context has been shown to play an important role in the completion of a range of collaborative tasks [4, 6, 17, 19]. In particular, these authors point out that a shared visual space facilitates the negotiation of common ground, or a level of shared understanding of what is being discussed in a conversation between two or more parties [5]. Fussell et al. [11] note that, in completing collaborative tasks, people rely on visual cues in the grounding process for monitoring task status, monitoring people’s actions, establishing a joint focus of attention, formulating messages, and monitoring their partner’s comprehension. Video systems necessarily constrain the range of cues available for these purposes as compared with a face-to-face environment, but have nonetheless been shown to be more useful than audio-only systems in completing collaborative tasks [19]. Moreover, these studies have found that there is typically not a strong need to use visual cues to monitor partner comprehension, though additional work suggests that this may differ if some component of the task requires face monitoring [25] or if users do not share linguistic common ground [32]. In most cases, however, video images of the shared workspace are