View Independent Computer Lip-Reading

Yuxuan Lan, Barry-John Theobald and Richard Harvey
School of Computing Sciences
University of East Anglia
Norwich, UK
Email: {y.lan,b.theobald,r.w.harvey}@uea.ac.uk

Abstract—Computer lip-reading systems are usually designed to work using a full-frontal view of the face. However, many human experts prefer to lip-read using an angled view. In this paper we consider issues related to the best viewing angle for an automated lip-reading system. In particular, we seek answers to the following questions: 1) Do computers lip-read better using a frontal or a non-frontal view of the face? 2) What is the best viewing angle for a computer lip-reading system? 3) How can a computer lip-reading system be made to work independently of viewing angle? We investigate these issues using a purpose-built audio-visual dataset that contains simultaneous recordings of a speaker reciting continuous speech at five angles. We find that the system performs best on a non-frontal view, perhaps because lip gestures, such as lip protrusion and lip rounding, are more pronounced when viewed from an angle. We also describe a simple linear mapping that allows us to map any view of the face to the view that we find to be optimal. Hence we present a view-independent lip-reading system.

Keywords—visual speech recognition, computer lip-reading, feature mapping, view-independence

I. INTRODUCTION AND RELATED WORK

Human lip-reading has a long history but, as far as the quantitative literature goes, remains a rather esoteric subject. In our own experience of human lip-readers, they express a preference for lip-reading people at a slight angle, presumably because of the visible lip protrusion and rounding. Computer lip-reading, however, is dominated by frontal cameras, and off-axis computer lip-reading is much less common. An early work on this topic is [1], in which a system is trained on each of the frontal (0°) and full left profile (90°) views. Experimental results demonstrated that the frontal view is superior to the profile view and yields a lower word error rate. In [2] a different conclusion was drawn: the profile view outperforms the frontal view. However, in that experiment the profile view uses a different geometric feature to the frontal view, whereas in [1] the same type of visual feature was used across all views. In [3] a multi-view system was trained on features of the frontal (0°) and two profile views (full left profile 90° and full right profile -90°). The authors reported significantly better performance on the frontal view than on both profile angles. To improve the performance on the two profile views, in [4] a linear mapping was applied in the feature space to warp visual features, in this case modified 2D DCT features, from the profile to the frontal view. This enables a system that was trained solely on the frontal view to work with speech captured from a profile angle; a sketch of this style of mapping is given below. The most recent work on this topic is [5], where the authors adopted the approach in [4] and extended the system to other lateral angles, including 30° and 60°. Their comparison study showed that mapping in the feature space, again using the modified 2D DCT feature, is superior to mapping in the image space. It should be noted that [3], [4] and [5] all choose to map onto the 0° view without measuring whether it is the best view. In [5] it is noted that there is no significant difference between the 0° and 30° views, from which we infer that 0° might not be the optimal angle.
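To make the feature-space mapping idea concrete, the following is a minimal sketch of one plausible realisation: low-frequency 2D DCT coefficients are extracted from a grayscale mouth region, and a ridge-regularised least-squares linear map is fitted between feature vectors captured simultaneously from two camera angles. The function names (dct_features, fit_view_mapping, map_features), the feature dimensionality, and the regularisation term are illustrative assumptions; the exact "modified" DCT features and the mapping procedure used in [4] and [5] may differ.

    # Sketch of a linear feature-space view mapping (illustrative only).
    import numpy as np
    from scipy.fftpack import dct

    def dct_features(mouth_roi, keep=30):
        """Low-frequency 2D DCT coefficients of a grayscale mouth image."""
        coeffs = dct(dct(mouth_roi, axis=0, norm='ortho'), axis=1, norm='ortho')
        # Retain the top-left (low-frequency) block, flattened in raster order.
        k = int(np.ceil(np.sqrt(keep)))
        return coeffs[:k, :k].flatten()[:keep]

    def fit_view_mapping(X_src, X_ref, ridge=1e-3):
        """Least-squares linear map from source-view to reference-view features.

        X_src, X_ref: (n_frames, n_dims) arrays of feature vectors captured
        simultaneously (frame-paired) from two camera angles.
        Returns (W, b) such that X_ref is approximated by X_src @ W + b.
        """
        mu_src, mu_ref = X_src.mean(0), X_ref.mean(0)
        Xs, Xr = X_src - mu_src, X_ref - mu_ref
        # Ridge-regularised normal equations for numerical stability.
        W = np.linalg.solve(Xs.T @ Xs + ridge * np.eye(Xs.shape[1]), Xs.T @ Xr)
        b = mu_ref - mu_src @ W
        return W, b

    def map_features(X_src, W, b):
        """Warp source-view features into the reference-view feature space."""
        return X_src @ W + b

Under these assumptions, a recogniser trained on the reference view alone can decode speech captured from another calibrated angle by first passing its features through map_features, which is the essence of the view-independence strategy discussed in this paper.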
A further observation is that the work in [5] uses only a restricted vocabulary, which leads to the problem that the pose variation may exceed the word variation. We overcome these difficulties by setting up an experiment on a linguistically complex task that measures performance as a function of angle.

II. DATASET

Many audio-visual speech datasets, for example AVLetters [6], AVLetters2 [7], TULIPS1 [8], IBM ViaVoice™ [9] and GRID [10], were each designed to capture the face only in a full-frontal view, and most previous research into computerised lip-reading is dominated by this viewpoint. The dataset used in [1], [4] contains recordings of 38 subjects reciting connected digit strings from three camera angles: 0° and ±90°. However, because this dataset does not contain intermediate viewing angles it is unsuitable for use in our work, as we are interested in measuring how the performance of computer lip-reading changes as a function of viewing angle. In [3], the CUAVE dataset [11] was used to investigate automated lip-reading performance across pose. This dataset contains speech captured at 0° and ±90°, and also speech during which the subjects exhibit body and head movements. However, there are only a small number of recordings where the angle of the face to the camera is between 0° and ±90°, and these recordings capture continuous head and body movements that have not been calibrated. Thus the CUAVE dataset is unsuitable for our task. In [5] the LTS5 dataset was used, which contains recordings of 20 native French speakers reciting digits at 0°, 30°, 60°, and 90° viewing angles. A major limitation of this dataset is that it contains only isolated digits, and only a relatively small number of