Automatic visual augmentation for concatenation based synthesized articulatory videos from real-time MRI data for spoken language training

Chandana S, Chiranjeevi Yarra¹, Ritu Aggarwal, Sanjeev Kumar Mittal, Kausthubha N K, Raseena K T, Astha Singh, Prasanta Kumar Ghosh²
Electrical Engineering, Indian Institute of Science (IISc), Bangalore-560012, India
{¹chiranjeeviy, ²prasantg}@iisc.ac.in

Authors thank Pratiksha Trust for their support.

Abstract

For the benefit of spoken language training, concatenation based articulatory video synthesis has been proposed in the past to overcome the limitations in articulatory data recording. For this, real-time magnetic resonance imaging (rt-MRI) video image-frames (IFs) containing articulatory movements have been used. These IFs require visual augmentation for better understanding. In this work, we propose an augmentation method using pixel intensities in the regions enclosed by the articulatory boundaries obtained from air-tissue boundaries (ATBs). Since the pixel intensities reflect the muscle movements in the articulators, the augmented IFs could provide realistic articulatory movements when colored accordingly. However, manual ATB annotation is time consuming; hence, we propose to synthesize ATBs using the ATBs from a few selected frames that have been used in synthesizing the articulatory videos. We augment a set of synthesized articulatory videos for 50 words obtained from the MRI-TIMIT database. A subjective evaluation of the quality of the augmented videos with twenty-one subjects suggests that the videos are visually more appealing than the respective synthesized rt-MRI videos, with a rating of 3.75 out of 5, where a score of 5 (1) indicates that the augmented video quality is excellent (poor).

1. Introduction

The pronunciation of second language (L2) learners, especially those learning English, is often affected by several factors [1–3] that are influenced by their nativity. This happens mainly because the articulatory movements while speaking English are dominated by the articulatory constraints of the speaker's native language [4]. In order to overcome these constraints, a video that shows correct articulation is used as feedback to L2 learners in applications like computer assisted language learning (CALL). Several studies have shown that visualizing correct articulatory movements (from native speakers, referred to as experts) helps in pronunciation training [5–10]. In most cases, for the training, experts' articulatory movements are captured using real-time motion capture techniques simultaneously with their audio [6, 11–13]. Further, the articulatory movements, referred to as an articulatory video, are combined with augmented reality and the experts' audio to obtain a final video, referred to as an augmented articulatory video (AA-video) [6, 8, 14–16].

In the existing works, the AA-videos have been constructed using one or more combinations of articulatory data from electro-magnetic articulography (EMA), computed tomography (CT), ultrasound imaging and real-time magnetic resonance imaging (rt-MRI) [7–10, 14, 16, 17]. In constructing the AA-videos, most of the existing works have used an expert from whom both audio and articulatory motion have been recorded. Hence, these techniques have a limitation in that they cannot use an arbitrary expert's audio when direct articulatory measurement is not available.
In addition, the data acquisition methods used in all of these techniques require specialized equipment, which is time consuming and expensive [18]. However, in the recent past, Desai et al. have proposed a concatenation based synthesis approach to obtain an articulatory video for an expert's audio that does not have simultaneous articulatory recordings [19]. In their work, they have used rt-MRI videos containing image frames (IFs) of pharyngeal structures in gray scale. We observe that the articulators in those structures do not have a realistic appearance; hence, the synthesized videos are less self-explanatory to L2 learners. However, we hypothesize that an augmented reality can be added automatically to those videos. Thus, an AA-video can be obtained for the audio of an expert for whom direct articulatory measurement is not available.

In this work, we add augmented reality to the articulators in each IF belonging to the synthesized articulatory videos. For this, we propose to use pixel values in the IF regions enclosed by the air-tissue boundaries (ATBs, the blue and green colored contours shown in Figure 1b) that constitute the articulators [20]. Instead of using the ATBs of all the IFs in the synthesized videos, we consider the ATBs of only a few IFs from the repository used in the concatenation based approach [19]. This results in a smaller number of IFs for ATB annotation, thereby requiring less time. In order to obtain ATBs for all the IFs, we propose an ATB synthesis approach in line with the concatenation based articulatory video synthesis approach. Further, using these ATBs, we apply a knowledge based coloring approach to those structures, for which we propose a set of rules. We evaluate the AA-video quality subjectively using a set of 21 evaluators and 50 words randomly chosen from the MRI-TIMIT data [21]. The average quality rating is found to be 3.75 out of 5 when the evaluators rate the AA-video quality with respect to the corresponding synthesized articulatory video.

Figure 1: Exemplary rt-MRI IF indicating a) anatomical regions, b) ATBs C1(t) and C2(t) and the respective enclosed regions R1(t) and R2(t), and c) sub-regions RLL-J(t), REG-T(t), RUL(t) and RP-V(t) and the respective boundaries CLL-J(t), CEG-T(t), CUL(t) and CP-V(t).

2. Database

MRI-TIMIT [21] is a phonetically rich database comprising rt-MRI videos, i.e., rt-MRI data with synchronized audio. The rt-MRI data is primarily an IF sequence of the mid-sagittal view (containing pharyngeal structures) of a speaker speaking an utterance. The rt-MRI data was captured at a frame rate of 23.18 frames per second with an image resolution of 68×68 pixels in gray scale. The data was collected from two male and two female speakers of American English speaking 460 TIMIT sentences.
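To make the augmentation idea from Section 1 concrete on such 68×68 gray-scale IFs, the following is a minimal sketch of intensity-modulated coloring of ATB-enclosed regions. It is illustrative only: the region names, the RGB values and the tinting rule are assumptions and do not reproduce the paper's actual knowledge based rule set; NumPy and OpenCV are assumed to be available.

```python
import numpy as np
import cv2  # used only to rasterize the closed ATB contours into region masks

# Hypothetical per-region colors (RGB); placeholders, not the paper's rules.
REGION_COLORS = {
    "lower_lip_jaw":     (204, 102, 102),  # corresponds to R_LL-J in Figure 1c
    "tongue_epiglottis": (229, 153, 102),  # R_EG-T
    "upper_lip":         (204, 102, 102),  # R_UL
    "palate_velum":      (153, 178, 229),  # R_P-V
}

def augment_frame(frame_gray, atb_regions):
    """Tint each ATB-enclosed region of a gray-scale rt-MRI IF.

    frame_gray  : (H, W) uint8 array, e.g. a 68x68 MRI-TIMIT frame.
    atb_regions : dict mapping a region name to an (N, 2) array of (x, y)
                  points forming a closed air-tissue boundary.
    Returns an (H, W, 3) uint8 RGB frame in which the pixel intensity inside
    each region scales the region's color, so the underlying muscle movement
    remains visible after coloring.
    """
    h, w = frame_gray.shape
    out = np.repeat(frame_gray[:, :, None], 3, axis=2).astype(np.float32)
    intensity = frame_gray.astype(np.float32) / 255.0
    for name, contour in atb_regions.items():
        color = np.array(REGION_COLORS.get(name, (128, 128, 128)), np.float32)
        mask = np.zeros((h, w), np.uint8)
        cv2.fillPoly(mask, [np.asarray(contour, np.int32).reshape(-1, 1, 2)], 1)
        inside = mask.astype(bool)
        out[inside] = intensity[inside, None] * color  # intensity-modulated tint
    return np.clip(out, 0, 255).astype(np.uint8)
```

In a full pipeline, such a routine would be applied to every IF of a synthesized video, using the synthesized ATBs rather than manual annotations, so that the coloring tracks the articulators frame by frame.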