A Lip Reading Model Using CNN With Batch Normalization

Saquib Nadeem Hashmi, Department of CSE, saquibnadeemhashmi@gmail.com
Harsh Gupta, Department of CSE, hgupta.110695@gmail.com
Aparajita Nanda, Department of CSE, aparajita.nanda@jiit.ac.in
Dhruv Mittal, Department of CSE, Dhruvmittal1996@gmail.com
Kaushtubh Kumar, Department of CSE, kaustubhkmr@gmail.com
Sarishty Gupta, Department of CSE, Sarishty.gupta@jiit.ac.in

Abstract - The goal of lip reading is to decode and analyze the lip movements of a speaker for a spoken word or phrase. Variations in speaking speed and intensity, and identical lip sequences for distinct characters, are the challenging aspects of lip reading. In this paper we present a lip reading model for audio-less video data with variable-length frame sequences. First, we extract the lip region from each face image in the video sequence and concatenate the regions to form a single image. Next, we design a twelve-layer Convolutional Neural Network (CNN) with two layers of batch normalization to train the model and extract the visual features end-to-end. Batch normalization helps to reduce the internal and external variances in attributes such as the speaker's accent, lighting, image-frame quality, speaking pace, and speaking posture. We validate the performance of our model on the standard audio-less video MIRACLE-VC1 dataset and compare it with an existing model that uses a CNN of 16 or more layers. A training accuracy of 96% and a validation accuracy of 52.9% have been attained with the proposed lip reading model.

Keywords—Lip reading, CNN, Batchnorm, Variances

I. INTRODUCTION

Visual lip reading is the process of decoding text from a speaker's lip movements. It plays a crucial role in communication and speech understanding where audio is difficult to recognize. People with hearing disabilities can benefit considerably from such a hearing aid.
However, machine lip reading faces several challenges, such as variations in speaking speed, pronunciation, and intensity, and identical lip sequences for different characters. In addition, detecting the speaker's face, locating the lip region, and extracting spatiotemporal features from the sequence of frames complicate the problem. To cope with these issues, lip reading systems are evaluated with a limited number of speakers and phrases. A lip reading dataset therefore comprises a set of image sequences (videos) of a person speaking a word or phrase, and the aim is to classify the spoken word or phrase from the image sequences. Variation in the length of the image sequences is another issue, since it yields wide variation in the number of features and affects accuracy. Studies have shown that human lip-reading performance increases for longer words, indicating the importance of features that capture temporal context in an ambiguous communication channel. The involvement of artificial intelligence makes lip reading more automated and efficient; existing approaches therefore range from traditional visual feature extraction and learning to deep learning based techniques. In this paper, we present a lip reading model that maps a variable-length sequence of video frames to text using deep neural networks (Fig. 1). The contributions of this model are: i) forming a concatenated, stretched image by extracting the lip region from each face image in the video sequence; ii) designing a twelve-layer Convolutional Neural Network with two layers of batch normalization to train the model and extract the visual features end-to-end. Batch normalization helps to reduce internal and external variances, which in turn reduces the effect of variations in the speaker's accent, lighting, image quality, speaking pace, and speaking posture.
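The two contributions above can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: it assumes the lip crops have already been extracted by a face/landmark detector, and it shows only the per-feature computation that a batch-normalization layer performs, not the full twelve-layer CNN. The function names and shapes are illustrative assumptions.

```python
import numpy as np

def concatenate_lip_frames(frames):
    """Join per-frame lip crops side by side into one 'stretched' image.

    frames: array of shape (num_frames, H, W), the lip region cropped
            from each face image in the video sequence (assumed given).
    returns: array of shape (H, num_frames * W).
    """
    return np.concatenate(list(frames), axis=1)

def batch_norm(x, gamma=1.0, beta=0.0, eps=1e-5):
    """Forward pass of batch normalization (training-mode sketch).

    x: array of shape (batch, features). Each feature is normalized to
    zero mean and unit variance across the batch, then scaled by gamma
    and shifted by beta (learnable in a real layer).
    """
    mean = x.mean(axis=0)
    var = x.var(axis=0)
    x_hat = (x - mean) / np.sqrt(var + eps)
    return gamma * x_hat + beta
```

For example, ten 26x26 lip crops concatenate into a single 26x260 image of fixed height, and applying `batch_norm` to a batch of activations removes per-batch shifts in mean and scale, which is how the layer dampens nuisance variation such as lighting or speaker pace.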
Moreover, we validate our model on a standard dataset and compare it with an existing approach. The paper is organized as follows. Related work is described in Section II, Section III details the proposed model, and Section IV presents the experimental analysis, followed by the conclusion in Section V.

II. RELATED WORK

In this section we briefly describe existing work on lip reading. Earlier approaches concentrate on traditional extraction and learning of visual features, whereas recent developments focus on deep learning based techniques that extract those features end-to-end. Gergen et al. [1] develop a lip-reading model that utilizes speaker-dependent training on an