Detection of Redundant Frame in Audio Visual Speech Recognition using Low Level Analysis

P L Yannawar¹, G R Manza², B W Gawali¹, S C Mehrotra¹
¹Department of Computer Science and IT, Dr. B A M University, Aurangabad (MS), India
²Planning and Statistics, Dr. B A M University, Aurangabad, India
(pravinyannawar@gmail.com)

Abstract – Audio-visual speech recognition (AVSR) is a technique that uses the image processing capabilities of lipreading to aid speech recognition systems in recognizing non-deterministic phones or in arbitrating among decisions of nearly equal probability, whereas lipreading is a technique for understanding speech by visually interpreting the movements of the lips, face, and tongue, together with information provided by context, language, and any residual hearing. This paper presents a method for identifying redundant frames in audio-visual data for audio-visual speech recognition by low-level analysis. The results presented in the paper demonstrate the necessity of selecting distinct frames for feature extraction so that the efficacy of the recognition system is improved.
Keywords – Low Level Analysis, Principal Component
Analysis, Euclidean Distance
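As a rough illustration only, the sketch below shows one way low-level analysis can flag redundant frames: consecutive grayscale frames are compared by Euclidean distance and near-duplicates are marked. The function name, the RMS normalization, and the threshold value are illustrative assumptions, not the exact procedure developed in this paper.

```python
import numpy as np

def find_redundant_frames(frames, threshold=5.0):
    """Flag frames that are nearly identical to their predecessor.

    frames: iterable of equal-sized 2-D grayscale arrays.
    threshold: hypothetical per-pixel RMS distance below which a
    frame is treated as redundant.
    """
    redundant = []
    prev = None
    for i, frame in enumerate(frames):
        cur = np.asarray(frame, dtype=np.float64)
        if prev is not None:
            # Euclidean (Frobenius) distance between consecutive
            # frames, normalized by frame size to an RMS difference.
            dist = np.linalg.norm(cur - prev) / np.sqrt(cur.size)
            if dist < threshold:
                redundant.append(i)
        prev = cur
    return redundant
```

Frames flagged this way carry little new articulatory information, so dropping them before feature extraction reduces the load on the recognizer.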
I. INTRODUCTION
Automatic speech recognition (ASR) systems have been designed for well-defined applications, such as dictation and medium-vocabulary transaction processing tasks, in relatively controlled environments. Researchers have observed that ASR performance remains far from human performance in a variety of tasks and conditions; indeed, ASR to date is very sensitive to variations in the environmental channel (non-stationary noise sources such as babble, reverberation in closed spaces such as cars, and multi-speaker environments) and in speaking style (such as whispered speech) [1].
Lipreading supplements the auditory channel as a visual source of speech information. It provides redundancy with the acoustic speech signal yet is less variable than the acoustic signal; the acoustic signal depends on lip, teeth, and tongue position to such an extent that significant phonetic information can be obtained from lip movement recognition alone [2][3]. The intimate relation between the auditory and visual sensory domains in human speech recognition is demonstrated by the McGurk effect [4][5], where the perceiver "hears" something other than what was said acoustically because of the influence of a conflicting visual stimulus. Current speech recognition technology may perform adequately in the absence of acoustic noise for moderate-size vocabularies, but even in the presence of moderate noise it fails except for very small vocabularies [6][7][8][9]. Humans, too, have difficulty distinguishing between some consonants when the acoustic signal is degraded.
However, to date all automatic speechreading studies have been limited to very small vocabulary tasks and, in most cases, to a very small number of speakers. In addition, the many diverse algorithms suggested in the literature for automatic speechreading are very difficult to compare, as they are hardly ever tested on common audio-visual databases. Furthermore, most such databases are of very short duration, which casts doubt on how well the reported results generalize to larger populations and tasks. There is no definitive answer to this yet, but researchers are concentrating more on speaker-independent audio-visual large-vocabulary continuous speech recognition systems [10].
Many methods have been proposed to enhance speech recognition systems by synchronizing visual information with speech. An early improvement on automatic lipreading incorporated dynamic time warping and vector quantization applied to alphabets and digits; recognition was restricted to isolated utterances and was speaker dependent [2]. Later, Christoph Bregler (1993) studied how recognition performance in automated speech perception can be significantly improved and introduced an extension to the existing Multi-State Time Delay Neural Network architecture to handle both modalities, i.e., acoustic and visual sensor input [11]. Similar work was done by Yuhas et al. (1993), who focused on neural networks for vowel recognition and worked with static images [12].
Paul Duchnowski et al. (1995) worked on movement-invariant automatic lipreading and speech recognition [13]; Juergen Luettin (1996) used active shape models and hidden Markov models for visual speech recognition [14]; K. L. Sum et al. (2001) proposed a new optimization procedure for extracting the point-based lip contour using an active shape model [16]; Caplier (2001) used active shape models and Kalman filtering in the spatiotemporal domain for noting visual deformations [17]; Ian Matthews et al. (2002) proposed a method for extracting visual features of lipreading for audio-visual speech recognition [18]; Xiaopeng Hong et al. (2006) used a PCA-based DCT feature extraction method for lipreading [19]; Takeshi Saitoh et al. (2008) analyzed an efficient lipreading method for various languages, focusing on a limited set of words from English, Japanese, Nepalese, Chinese, and Mongolian, where the English words and their translations into these languages were used in the experiment [20]; and Meng Li et al. (2008) proposed a novel motion-based lip feature extraction method for lipreading problems [21].
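For context on the PCA-based DCT features mentioned above, the following is a minimal sketch of one common reading of such a pipeline: keep a low-frequency block of 2-D DCT coefficients from the mouth region, then reduce dimensionality with PCA. The block size, component count, and function name are assumptions for illustration, not the exact method of [19].

```python
import numpy as np
from scipy.fft import dctn

def dct_pca_features(mouth_rois, n_dct=36, n_components=12):
    """Sketch of PCA-over-DCT lip features (illustrative only).

    mouth_rois: list of equal-sized 2-D grayscale mouth regions.
    n_dct: number of low-frequency DCT coefficients kept (the usual
           zig-zag scan is approximated by a square top-left block).
    n_components: feature dimensionality after PCA.
    """
    k = int(np.sqrt(n_dct))
    feats = []
    for roi in mouth_rois:
        c = dctn(np.asarray(roi, dtype=np.float64), norm="ortho")
        feats.append(c[:k, :k].ravel())  # keep low-frequency block
    X = np.vstack(feats)
    X -= X.mean(axis=0)                  # center before PCA
    # PCA via SVD of the centered feature matrix.
    _, _, vt = np.linalg.svd(X, full_matrices=False)
    return X @ vt[:n_components].T      # projected features
```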