978-1-4673-8564-0/15/$31.00 ©2015 IEEE
Proc. 5th National Conference on Computer Vision, Pattern Recognition, Image Processing and Graphics 2015 (NCVPRIPG 2015), Patna, India, Dec. 16-19, 2015, paper ID 88.

Place of Articulation from Direct Imaging for Validation of Its Estimation from Speech Analysis for Use in Speech Training

K. S. Nataraj and Prem C. Pandey
Department of Electrical Engineering
Indian Institute of Technology Bombay
Mumbai 400076, India
Email: {natarajks, pcpandey}@ee.iitb.ac.in

Abstract—Place of articulation obtained by analysis of the speech signal is useful for visual feedback of articulatory efforts in speech training of hearing-impaired children and for improving pronunciation by second-language learners. Its estimation by direct imaging of the oral cavity is needed for validating the estimation from the speech signal. For such applications, an automated technique is presented for estimating the place of articulation by graphical processing of the upper and lower contours of the oral cavity image. It iteratively estimates the axial curve as an axis of symmetry of the oral cavity, such that the curve approximately bisects the normals to it. The distance between the contours along the normal to the axial curve gives the oral cavity opening, and the position of the smallest opening provides the place of articulation. The values estimated using the automated technique closely matched those obtained by manual marking of the visually estimated place of maximum constriction for the oral cavity images of vowels, stops, and fricatives from the XRMB and MRI databases.

Keywords—axial curve; oral cavity opening; place of articulation; speech training

I. INTRODUCTION

Hearing-impaired children have difficulty in acquiring the ability to control the articulators involved in speech production, due to lack of auditory feedback. Speech training aids can provide non-auditory feedback to help the process of speech acquisition.
Visual feedback of articulatory efforts has been found to be useful in improving vowel articulation by hearing-impaired children [1], [2]. It can also help second-language learners in improving pronunciation [3].

Place of articulation, i.e., the place of maximum constriction in the oral cavity, is the most significant information for speech training. It can be estimated using imaging techniques, acoustic measurements, and analysis of the speech signal. However, only the estimate obtained by analysis of the speech signal can be used for speech training. In the commonly used methods, analysis of the speech signal is based on linear predictive coding, with the oral cavity modeled as a lossless acoustic tube with plane wave propagation [2], [4], [5]. Although the lower and upper contours of the oral cavity have varying curvatures, the cavity is modeled as a tube with equal-length segments of varying cross-sectional areas.

Place of articulation estimated by speech analysis needs to be validated with reference to the value obtained from direct imaging of the oral cavity during speech production. Direct imaging provides the upper and lower contours of the oral cavity. The irregular shapes of the contours cause difficulties in locating the maximum constriction and in finding its distance from the lips in a consistent manner. As a solution to this problem, an automated technique involving graphical processing of the contours is presented. The acoustic wave propagation is assumed to be along an axial curve between the two contours, with the normal to the curve representing the wave front. The axial curve is iteratively estimated as an axis of symmetry, approximately bisecting the normals to it. The segment of the normal to the axial curve between the two contours provides an estimate of the oral cavity opening. The values of the oral cavity opening as a function of distance from the lips are used for estimating the place of articulation.
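
The last two steps described above (measuring the opening along the normals and locating the smallest opening) can be illustrated with a minimal Python sketch. It is not the implementation used in this work: it assumes that the axial curve has already been estimated and is given as a polyline starting at the lips, that the two contours are also polylines, and that NumPy is available. The iterative estimation of the axial curve is not reproduced here, and the function names `opening_profile` and `place_of_articulation` are hypothetical.

```python
import numpy as np


def _hit(p, n, contour, eps=1e-9):
    """Intersection of the line p + t*n with a polyline, nearest to p.

    Returns None if the line does not cross any segment of the polyline.
    """
    best_t = None
    for a, b in zip(contour[:-1], contour[1:]):
        d = b - a
        denom = n[0] * d[1] - n[1] * d[0]      # cross(n, d)
        if abs(denom) < 1e-12:
            continue                            # normal parallel to segment
        ap = a - p
        t = (ap[0] * d[1] - ap[1] * d[0]) / denom   # position along the normal
        s = (ap[0] * n[1] - ap[1] * n[0]) / denom   # position along the segment
        if -eps <= s <= 1 + eps and (best_t is None or abs(t) < abs(best_t)):
            best_t = t
    return None if best_t is None else p + best_t * n


def opening_profile(axis, upper, lower):
    """Opening along the normals of an (assumed given) axial curve.

    axis, upper, lower: arrays of shape (N, 2), (M, 2), (K, 2).
    Returns (distance from the first axis point, opening); the opening is
    NaN where a normal misses either contour.
    """
    axis = np.asarray(axis, dtype=float)
    upper = np.asarray(upper, dtype=float)
    lower = np.asarray(lower, dtype=float)

    # Unit tangents by finite differences, rotated by 90 degrees for normals.
    tangents = np.gradient(axis, axis=0)
    tangents /= np.linalg.norm(tangents, axis=1, keepdims=True)
    normals = np.stack([-tangents[:, 1], tangents[:, 0]], axis=1)

    # Arc length along the axial curve, taken as the distance from the lips.
    steps = np.linalg.norm(np.diff(axis, axis=0), axis=1)
    dist = np.concatenate([[0.0], np.cumsum(steps)])

    opening = np.full(len(axis), np.nan)
    for i, (p, n) in enumerate(zip(axis, normals)):
        u = _hit(p, n, upper)
        l = _hit(p, n, lower)
        if u is not None and l is not None:
            opening[i] = np.linalg.norm(u - l)
    return dist, opening


def place_of_articulation(dist, opening):
    """Distance from the lips and size of the smallest opening.

    Raises ValueError (from np.nanargmin) if every opening is NaN.
    """
    i = int(np.nanargmin(opening))
    return dist[i], opening[i]
```

For example, with a straight axial curve along the x-axis and two contours placed symmetrically about it, `place_of_articulation(*opening_profile(axis, upper, lower))` returns the arc length at which the contours are closest together, along with the opening at that point. Taking the intersection nearest to the axial curve is a design choice of this sketch for contours that a normal may cross more than once.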
The second section provides a review of the direct imaging techniques and some of the earlier methods for estimation of place of articulation. The proposed automated technique is presented in the third section. The test results are presented in the fourth section, followed by the conclusion in the last section.

II. PLACE OF ARTICULATION BY DIRECT IMAGING

Commonly used direct imaging techniques to capture the movement of the articulators in the mid-sagittal plane (side view) during speech production are ultrasound imaging, electropalatography (EPG), X-ray microbeam (XRMB), electromagnetic articulography (EMA), and magnetic resonance imaging (MRI) [6]–[11]. Speech production databases have been developed using the last three techniques.

The XRMB technique [6] uses narrow X-ray beams for recording articulatory movements in the mid-sagittal plane by tracking gold pellets glued to the articulators. It permits simultaneous recording of the speech signal. The database [7], developed at the University of Wisconsin, provides articulatory plots consisting of four pellet points (T1-T4) on the tongue and one each on the upper lip (UL), lower lip (LL), and incisor (MNi), at 160 frames/s, along with audio recordings for vowels, vowel-consonant-vowel syllables, sentences, and paragraphs, from 25 male and 22 female speakers.

The research is supported by the National Program on Perception Engineering Phase II, sponsored by the Department of Electronics & Information Technology, MCIT, Government of India.