View Independent Computer Lip-Reading
Yuxuan Lan, Barry-John Theobald and Richard Harvey
School of Computing Sciences
University of East Anglia
Norwich, UK
Email: {y.lan,b.theobald,r.w.harvey}@uea.ac.uk
Abstract—Computer lip-reading systems are usually designed to work using a full-frontal view of the face. However, many human experts prefer to lip-read using an angled view. In this paper we consider issues related to the
best viewing angle for an automated lip-reading system. In
particular, we seek answers to the following questions: 1) Do
computers lip-read better using a frontal or a non-frontal
view of the face? 2) What is the best viewing angle for a
computer lip-reading system? 3) How can a computer lip-reading system be made to work independently of viewing
angle? We investigate these issues using a purpose-built audio-visual dataset that contains simultaneous recordings of a
speaker reciting continuous speech at five angles.
We find that the system performs best on a non-frontal view,
perhaps because lip gestures, such as lip protrusion and lip rounding, are more pronounced when viewed from an angle.
We also describe a simple linear mapping that allows us to map
any view of the face to the view that we find to be optimal.
Hence we present a view-independent lip-reading system.
Keywords-visual speech recognition, computer lip-reading,
feature mapping, view-independence
I. INTRODUCTION AND RELATED WORK
Human lip-reading has a long history but, as far as the quantitative literature goes, remains a rather esoteric subject. In our own experience, human lip-readers express a preference for lip-reading people at a slight angle, presumably
because of the visible lip protrusion and rounding. Computer lip-reading, however, is dominated by frontal cameras, and off-axis computer lip-reading is much less common. An early
work on this topic is [1], in which a system is trained on each of the frontal (0°) and full left profile (90°) views. Experimental results demonstrated that the frontal view is superior to the profile view, yielding a lower word error rate. In [2] a
different conclusion was drawn, where the profile view outperforms the frontal view. However, in that experiment the profile view uses a different geometric feature from the frontal view, whereas in [1] the same type of visual feature was
used across all views. In [3] a multi-view system was trained
on features of frontal (0°) and two profile views (full left profile 90° and full right profile -90°). The authors reported significantly better performance on the frontal than on both profile angles. To improve the performance on the two
profile views, in [4] a linear mapping was applied in the
feature space to warp visual features, in this case modified
2D DCT features, from the profile to the frontal view. This
enables a system that was trained solely on the frontal view
to work with speech captured from a profile angle. The most
recent work on this topic is [5], where the authors adopted the approach in [4] and extended the system to other lateral angles, including 30° and 60°. Their comparison study showed that mapping in the feature space, again using modified 2D DCT features, is superior to mapping in the image space.
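As a concrete illustration of the visual features involved, a plain 2D DCT feature extractor might look like the sketch below. This is the generic form only: the coefficient count, the diagonal scan, and the function name are our own illustrative assumptions, not the "modified" DCT of [4], [5].

import numpy as np
from scipy.fftpack import dct

def dct2_features(mouth_roi, keep=35):
    """Return the `keep` lowest-frequency 2D DCT coefficients of a
    grey-scale mouth region-of-interest, scanned diagonal by diagonal."""
    # Separable 2D type-II DCT: transform rows, then columns.
    coeffs = dct(dct(mouth_roi, type=2, norm='ortho', axis=0),
                 type=2, norm='ortho', axis=1)
    # Diagonal (zig-zag style) scan: order by increasing row + column.
    h, w = coeffs.shape
    order = sorted(((r, c) for r in range(h) for c in range(w)),
                   key=lambda rc: (rc[0] + rc[1], rc[0]))
    return np.array([coeffs[r, c] for r, c in order[:keep]])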
It should be noted that [3], [4] and [5] all choose to map onto the 0° view without measuring whether it is the best view. In [5] it is noted that there is no significant difference between the 0° and 30° views, from which we infer that 0° might not be the optimal angle. A further observation is that the work in [5] uses only a restricted vocabulary, which leads to the problem that the pose variation may exceed the word variation. We overcome these difficulties by setting up an experiment on a linguistically complex task that measures performance as a function of viewing angle.
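To make the feature-space mapping concrete: given time-aligned features of the same utterances captured simultaneously from two views, the simplest choice is an affine transform fitted by least squares. The sketch below is our own minimal illustration (the function names and the bias term are assumptions; it is not the exact procedure of [4] or [5]).

import numpy as np

def fit_view_map(X_src, X_tgt):
    """Estimate a least-squares affine map A such that
    X_tgt ≈ [X_src | 1] @ A, from time-aligned feature matrices
    (one row per video frame, one column per feature dimension)."""
    Xs = np.hstack([X_src, np.ones((X_src.shape[0], 1))])  # bias column
    A, *_ = np.linalg.lstsq(Xs, X_tgt, rcond=None)
    return A

def map_view(x, A):
    """Warp one feature vector from the source view to the target view."""
    return np.hstack([x, 1.0]) @ A

Because each map is estimated per pair of views, a bank of such matrices, one per camera angle, would in principle suffice to warp any available view onto a single chosen target view.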
II. DATASET
Many audio-visual speech datasets, for example AVLetters [6], AVLetters2 [7], TULIPS1 [8], IBM ViaVoice™ [9], and GRID [10], were each designed to capture the face only in a full-frontal view. Most previous research into computerised lip-reading is dominated by this viewpoint. The
dataset used in [1], [4] contains recordings of 38 subjects
reciting connected digit strings from three camera angles: 0° and ±90°. However, because this dataset does not contain intermediate viewing angles, it is unsuitable for use in our
work as we are interested in measuring how the performance
of computer lip-reading changes as a function of viewing
angle. In [3], the CUAVE dataset [11] was used to investigate
automated lip-reading performance across pose. This dataset
contains speech captured at 0° and ±90°, and also speech during which the subjects exhibit body and head movements. However, there are only a small number of recordings where the angle of the face to the camera is between 0° and ±90°,
and these recordings capture continuous head and body
movements that have not been calibrated. Thus the CUAVE
dataset is unsuitable for our task. In [5] the LTS5 dataset
was used, which contains recordings of 20 native French
speakers reciting digits at 0°, 30°, 60°, and 90° viewing
angles. A major limitation of this dataset is that it contains
only isolated digits, and only a relatively small number of