Detection of Redundant Frame in Audio Visual Speech Recognition using Low Level Analysis

P L Yannawar¹, G R Manza², B W Gawali¹, S C Mehrotra¹
¹Department of Computer Science and IT, Dr. B A M University, Aurangabad (MS), India
²Planning and Statistics, Dr. B A M University, Aurangabad, India
(pravinyannawar@gmail.com)

Abstract – Audio-visual speech recognition (AVSR) is a technique that uses the image processing capabilities of lipreading to aid speech recognition systems in recognizing non-deterministic phones or in arbitrating among decisions of nearly equal probability, whereas lipreading is a technique for understanding speech by visually interpreting the movements of the lips, face, and tongue, together with information provided by context, language, and any residual hearing. This paper presents a method for identifying redundant frames in audio-visual data for audio-visual speech recognition by low-level analysis. The results presented in the paper demonstrate the necessity of selecting distinct frames for feature extraction so that the efficacy of the recognition system is improved.
Keywords – Low Level Analysis, Principal Component
Analysis, Euclidean Distance
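As a rough illustration only, the sketch below shows one way low-level analysis can flag redundant frames: consecutive grayscale frames are compared by Euclidean distance and near-duplicates are marked. The function name, the RMS normalization, and the threshold value are illustrative assumptions, not the exact procedure developed in this paper.

```python
import numpy as np

def find_redundant_frames(frames, threshold=5.0):
    """Flag frames that are nearly identical to their predecessor.

    frames: iterable of equal-sized 2-D grayscale arrays.
    threshold: hypothetical per-pixel RMS distance below which a
    frame is treated as redundant.
    """
    redundant = []
    prev = None
    for i, frame in enumerate(frames):
        cur = np.asarray(frame, dtype=np.float64)
        if prev is not None:
            # Euclidean (Frobenius) distance between consecutive
            # frames, normalized by frame size to an RMS difference.
            dist = np.linalg.norm(cur - prev) / np.sqrt(cur.size)
            if dist < threshold:
                redundant.append(i)
        prev = cur
    return redundant
```

Frames flagged this way carry little new articulatory information, so dropping them before feature extraction reduces the load on the recognizer.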
I. INTRODUCTION
Automatic speech recognition (ASR) systems have been designed for well-defined applications, such as dictation and medium-vocabulary transaction processing tasks, in relatively controlled environments. Researchers have observed that ASR performance remains far from human performance in a variety of tasks and conditions; indeed, ASR to date is very sensitive to variations in the environmental channel (non-stationary noise sources such as babble, reverberation in closed spaces such as cars, and multi-speaker environments) and in speaking style (such as whispered speech) [1].
Lipreading supplements the auditory channel as a visual source of speech information. It provides redundancy with the acoustic speech signal yet is less variable than the acoustic signal; the acoustic signal depends on lip, teeth, and tongue position to such an extent that significant phonetic information can be obtained from lip movement recognition alone [2][3]. The intimate relation between the auditory and visual sensory domains in human speech recognition is demonstrated by the McGurk effect [4][5], where the perceiver "hears" something other than what was said acoustically because of the influence of a conflicting visual stimulus. Current speech recognition technology may perform adequately in the absence of acoustic noise for moderate-size vocabularies, but even in the presence of moderate noise it fails except for very small vocabularies [6][7][8][9]. Humans, too, have difficulty distinguishing between some consonants when the acoustic signal is degraded.
However, to date all automatic speechreading studies have been limited to very small vocabulary tasks and, in most cases, to a very small number of speakers. In addition, the many diverse algorithms suggested in the literature for automatic speechreading are very difficult to compare, as they are hardly ever tested on common audio-visual databases. Furthermore, most such databases are of very short duration, which casts doubt on how well the reported results generalize to larger populations and tasks. There is no definitive answer to this yet, but researchers are concentrating more on speaker-independent audio-visual large-vocabulary continuous speech recognition systems [10].
Many methods have been proposed to enhance speech recognition systems by synchronizing visual information with speech. An early improvement on automatic lipreading incorporated dynamic time warping and vector quantization applied to alphabets and digits; recognition was restricted to isolated utterances and was speaker dependent [2]. Later, Christoph Bregler (1993) studied how recognition performance in automated speech perception can be significantly improved and introduced an extension to the existing Multi-State Time Delay Neural Network architecture to handle both modalities, i.e., acoustic and visual sensor input [11]. Similar work was done by Yuhas et al. (1993), who focused on neural networks for vowel recognition and worked with static images [12].
Paul Duchnowski et al. (1995) worked on movement-invariant automatic lipreading and speech recognition [13]; Juergen Luettin (1996) used active shape models and hidden Markov models for visual speech recognition [14]; K. L. Sum et al. (2001) proposed a new optimization procedure for extracting the point-based lip contour using an active shape model [16]; Caplier (2001) used active shape models and Kalman filtering in the spatiotemporal domain for noting visual deformations [17]; Ian Matthews et al. (2002) proposed a method for extracting visual features of lipreading for audio-visual speech recognition [18]; Xiaopeng Hong et al. (2006) used a PCA-based DCT feature extraction method for lipreading [19]; Takeshi Saitoh et al. (2008) analyzed an efficient lipreading method for various languages, focusing on a limited set of words from English, Japanese, Nepalese, Chinese, and Mongolian, where the English words and their translations into these languages were used in the experiment [20]; and Meng Li et al. (2008) proposed a novel motion-based lip feature extraction method for lipreading problems [21].
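For context on the PCA-based DCT features mentioned above, the following is a minimal sketch of one common reading of such a pipeline: keep a low-frequency block of 2-D DCT coefficients from the mouth region, then reduce dimensionality with PCA. The block size, component count, and function name are assumptions for illustration, not the exact method of [19].

```python
import numpy as np
from scipy.fft import dctn

def dct_pca_features(mouth_rois, n_dct=36, n_components=12):
    """Sketch of PCA-over-DCT lip features (illustrative only).

    mouth_rois: list of equal-sized 2-D grayscale mouth regions.
    n_dct: number of low-frequency DCT coefficients kept (the usual
           zig-zag scan is approximated by a square top-left block).
    n_components: feature dimensionality after PCA.
    """
    k = int(np.sqrt(n_dct))
    feats = []
    for roi in mouth_rois:
        c = dctn(np.asarray(roi, dtype=np.float64), norm="ortho")
        feats.append(c[:k, :k].ravel())  # keep low-frequency block
    X = np.vstack(feats)
    X -= X.mean(axis=0)                  # center before PCA
    # PCA via SVD of the centered feature matrix.
    _, _, vt = np.linalg.svd(X, full_matrices=False)
    return X @ vt[:n_components].T      # projected features
```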