Impact of Gaze Analysis on the Design of a Caption Production Software

Claude Chapdelaine, Samuel Foucher and Langis Gagnon
R&D Department, Computer Research Institute of Montreal (CRIM), 550 Sherbrooke Street West, Suite 100, Montreal (Quebec) H3A 1B9
Claude.Chapdelaine@crim.ca

Abstract. Producing captions for the deaf and hearing impaired is a labor-intensive task. We implemented a software tool, named SmartCaption, to assist the caption production process with automatic visual detection techniques aimed at reducing the production workload. This paper presents the results of an eye-tracking analysis of facial regions of interest, conducted to understand the nature of the task: not only to measure the quantity of data but also to assess its importance to the end-user, the viewer. We also report on two interaction design approaches that were implemented and tested to cope with the inevitable outcomes of automatic detection, such as false recognitions and false alarms. These approaches were compared using a Keystroke-Level Model (KLM), showing that the adopted approach yielded a 43% gain in efficiency.

Keywords: Caption production, eye-tracking analysis, facial recognition, Keystroke-Level Model (KLM).

1 Introduction

Producing captions for the deaf and hearing impaired requires transcribing what is being said and interpreting the sounds being heard. The produced text must then be positioned and synchronized with the image. This is a very labor-intensive production task that is expensive and whose turn-around time can be a serious bottleneck. Nowadays, the process can be optimized by using automatic speech recognition (ASR) to reduce transcription time. Even so, positioning and synchronizing remain demanding tasks for which, up to now, no solution is available to assist captioners.
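To give a sense of how a Keystroke-Level Model comparison like the one mentioned in the abstract is computed, the sketch below sums the standard KLM operator times (Card, Moran and Newell's values for keystroke, pointing, homing, mental preparation and button press). The two interaction sequences are purely hypothetical illustrations, not the actual sequences analyzed in this paper, so the resulting gain differs from the 43% reported here.

```python
# Standard KLM operator times in seconds (Card, Moran & Newell):
# K = keystroke, P = point with mouse, H = home hands between devices,
# M = mental preparation, B = mouse button press.
OPERATOR_TIMES = {"K": 0.28, "P": 1.10, "H": 0.40, "M": 1.35, "B": 0.10}

def klm_time(sequence: str) -> float:
    """Sum the operator times for a sequence such as 'MPB'."""
    return sum(OPERATOR_TIMES[op] for op in sequence)

# Hypothetical example: manually positioning one caption
# (think, home to mouse, point, click, point again, click)
# versus confirming an automatic suggestion (think, one keystroke).
manual = klm_time("MHPBPB")   # 4.15 s
assisted = klm_time("MK")     # 1.63 s
gain = 1 - assisted / manual
print(f"manual={manual:.2f}s assisted={assisted:.2f}s gain={gain:.0%}")
```

Comparing candidate designs this way, before any implementation, is precisely what makes KLM useful for the interaction-design trade-offs discussed later in the paper.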
The goal of this project is to implement and evaluate the feasibility of automatic visual detection techniques (AVDT) to efficiently reduce the time required to position and synchronize text for off-line captioning. However, automatic recognition technologies must be carefully integrated to be usable. Indeed, the added ASR technology is effective only insofar as the time spent correcting its errors is significantly lower than the time needed to transcribe manually. The same holds when adding AVDT: missed detections, substitutions and false alarms have to be kept to a minimum. Since the current state of the art does not allow us to design software with perfect detection and recognition performance, the potential errors have to be taken into