Lip-Synchronization for Dubbed Instructional Videos

Abhishek Jha*1, Vikram Voleti*1, Vinay P. Namboodiri2, C. V. Jawahar1
1 Center for Visual Information Technology, KCIS, IIIT Hyderabad, India
2 Department of Computer Science and Engineering, IIT Kanpur, India
{abhishek.jha@research, jawahar@}.iiit.ac.in, vikram.voleti@gmail.com, vinaypn@iitk.ac.in

Abstract

Online instructional video lectures such as MOOCs are often limited by the linguistic constraints of different demographics. Students from backgrounds that are non-native to the accent or language of the instructor often find it difficult to comprehend the full lecture, which leads to lower retention rates in the courses. Simple audio dubbing in the accent or language of the student makes the video appear unnatural. In this paper, we propose two lip-synchronization methods: one for audio dubbed in the non-native accent of the student, and another for audio in the foreign language of the student. We describe an automated pipeline to synchronize the lip movements of the instructor with the audio in both cases. With the help of a user-based study, we verify that our method is preferred over unsynchronized videos.

1. Introduction

Online instructional videos, especially Massive Open Online Courses (MOOCs), are prime examples of how education can help skill development beyond the boundaries of conventional classrooms. Yet the retention rates in these courses can be as low as 10%. One of the major reasons for this is the cultural gap between the linguistics of the student and the instructor. Students from different parts of the world often find it difficult to understand the accent and language of the instructors, owing to their unfamiliarity with it. This results in slow learning curves as well as dropouts from such online courses. Subtitles in different languages do not help enough, since they divert the attention of the student.
A quick-fix solution would be to dub instructional videos in the accent or language of the student. However, dubbing without lip synchronization makes the video appear unnatural.

In this paper, we propose 'Visual Dubbing' for synchronizing lip motion in instructional videos according to the language they are dubbed in. Our main ideas and contributions are two-fold: 1) we propose an English-to-non-native-English approach to dubbing online educational tutorial videos, originally in English, into a non-native English accent such as an Indian or French accent; 2) we propose an English-to-Foreign-Language approach to dubbing videos such that the lip movements warp to match the audio in the new language. Lastly, through a user-based study, we show how the generated lip motion, or 'Visual Dubbing', makes the instructional video more engaging.

* These authors contributed equally to this work.

Figure 1: (top) Dynamic Programming to non-native English accent; (bottom) Visual Dubbing to other language.

Some of the recent work in this area focuses on synthesizing photo-realistic lip motions and facial expressions. Face2Face [4] morphs the facial landmarks of a person based on those of another actor, but it requires a human in the loop, which can be expensive and error-prone. Most similar to our work are [3, 2], which use speech audio represented as MFCC features [3] and text [2] to train an LSTM that produces a sequence of lip landmark points. The lip landmarks are then used to generate a mouth texture, which is finally merged with the face in the original frame. Our work differs from [3, 2] in that our method synchronizes lip motion across two different languages, in contrast to English-to-English only. Hence, our challenges include learning higher-level viseme-phoneme relations across two different languages.
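The audio-to-landmark pipeline described above (MFCC frames in, lip landmark coordinates out, as in [3]) can be sketched minimally with a single hand-rolled LSTM cell. This is an illustrative, untrained sketch, not the authors' implementation: the dimensions (13 MFCC coefficients, 20 lip landmarks giving 40 (x, y) values) and the 75-frame sequence length are assumptions chosen for the example.

```python
import numpy as np

# Assumed dimensions: 13 MFCC coefficients per audio frame,
# 20 lip landmarks -> 40 (x, y) output values per video frame.
MFCC_DIM, HIDDEN_DIM, LANDMARK_DIM = 13, 64, 40

rng = np.random.default_rng(0)

def init_lstm(in_dim, hid_dim):
    """Randomly initialised single-layer LSTM parameters (untrained sketch)."""
    return {
        "W": rng.normal(0.0, 0.1, (4 * hid_dim, in_dim + hid_dim)),
        "b": np.zeros(4 * hid_dim),
    }

def lstm_step(params, x, h, c):
    """One LSTM step: the four gates are computed from concatenated [x, h]."""
    z = params["W"] @ np.concatenate([x, h]) + params["b"]
    i, f, g, o = np.split(z, 4)
    sigmoid = lambda v: 1.0 / (1.0 + np.exp(-v))
    c = sigmoid(f) * c + sigmoid(i) * np.tanh(g)
    h = sigmoid(o) * np.tanh(c)
    return h, c

def audio_to_landmarks(mfcc_seq, lstm, W_out):
    """Map a (T, 13) MFCC sequence to a (T, 40) lip-landmark sequence."""
    h, c = np.zeros(HIDDEN_DIM), np.zeros(HIDDEN_DIM)
    out = []
    for x in mfcc_seq:
        h, c = lstm_step(lstm, x, h, c)
        out.append(W_out @ h)          # linear readout to landmark coords
    return np.stack(out)

lstm = init_lstm(MFCC_DIM, HIDDEN_DIM)
W_out = rng.normal(0.0, 0.1, (LANDMARK_DIM, HIDDEN_DIM))
mfcc = rng.normal(size=(75, MFCC_DIM))   # e.g. 3 s of audio at 25 fps
landmarks = audio_to_landmarks(mfcc, lstm, W_out)
print(landmarks.shape)                   # (75, 40)
```

In a real system the LSTM and readout weights would be trained on paired (audio, landmark) data, and the predicted landmarks would then drive mouth-texture generation as described above.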
2. Method

Instructional videos provide a controlled framework for this problem, since the speakers usually deliver scripted dialogues in good lighting, facing the camera. The challenge is to model the lip movements given the dubbed audio, and to generate new lip movements for the same speaker.
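Figure 1 labels the accent case "Dynamic Programming": a natural reading is that frames of the original and dubbed audio are aligned by a dynamic-programming procedure such as dynamic time warping (DTW). The sketch below is an assumption-laden illustration of that idea, not the paper's stated algorithm; it aligns two feature sequences and returns frame-index correspondences.

```python
import numpy as np

def dtw_path(a, b):
    """Dynamic-time-warping alignment between two feature sequences.

    a: (T1, D) original-audio features, b: (T2, D) dubbed-audio features.
    Returns monotone (i, j) index pairs mapping original frames to dubbed frames.
    """
    T1, T2 = len(a), len(b)
    cost = np.full((T1 + 1, T2 + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, T1 + 1):
        for j in range(1, T2 + 1):
            d = np.linalg.norm(a[i - 1] - b[j - 1])   # local frame distance
            cost[i, j] = d + min(cost[i - 1, j], cost[i, j - 1], cost[i - 1, j - 1])
    # Backtrack from (T1, T2) to (0, 0) along the cheapest predecessors.
    path, i, j = [], T1, T2
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        step = np.argmin([cost[i - 1, j - 1], cost[i - 1, j], cost[i, j - 1]])
        if step == 0:
            i, j = i - 1, j - 1
        elif step == 1:
            i -= 1
        else:
            j -= 1
    return path[::-1]

# Toy example: the "dubbed" track is a time-stretched copy of the original.
orig = np.sin(np.linspace(0, 6, 40))[:, None]
dub = np.sin(np.linspace(0, 6, 60))[:, None]
path = dtw_path(orig, dub)
print(path[0], path[-1])   # (0, 0) (39, 59)
```

Given such an alignment, each original video frame can be retimed or its lip landmarks retargeted to the corresponding dubbed-audio frame.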