Automatic Analysis of Singleton and Geminate Consonant Articulation Using Real-time Magnetic Resonance Imaging

Christina Hagedorn¹, Michael Proctor¹,², Louis Goldstein¹

¹ Department of Linguistics, University of Southern California, USA
² Viterbi School of Engineering, University of Southern California, USA
chagedor@usc.edu

Abstract

We explore robust methods of automatically quantifying constriction location, constriction degree and gestural kinematics of Italian short and long consonants using direct image analysis techniques applied to rtMRI data. Articulatory kinematics are estimated from correlated regional changes in pixel intensity. We demonstrate that these methods are capable of quantifying differences in constriction duration exhibited by short and long Italian consonants for labial, coronal and dorsal segments, and differences in constriction degree for labial and coronal consonants. No difference in constriction location is observed for geminates and singletons, while systematic differences in constriction location are observed between (i) coronal oral stops and coronal sonorants and (ii) dorsal stops flanked by vowels differing in backness.

Index Terms: speech production, real-time MRI, consonant articulation, Italian, geminates, articulatory phonology.

1. Introduction

Studying speech production using real time magnetic resonance imaging (rtMRI) offers advantages over other methods of articulometry. Electro-magnetic articulography [17] and X-ray microbeam [18] provide high temporal and spatial resolution, but only provide information about specific flesh points on the vocal tract, and do not allow precise measurement of constriction location. rtMRI safely allows for the entire vocal tract to be examined at once and provides dynamic information about all components of the vocal tract.
This study explores robust, automatic methods of (i) determining constriction location, (ii) estimating constriction degree, and (iii) estimating gestural kinematics based on detected constriction location. Rather than attempting (noisy) segmentation of images along air-tissue boundaries, analyses are performed directly on time functions of pixel intensities [5, 9]. These methods of direct image analysis of MRI data are especially applicable to the study of stop consonant articulation.

It is well established that the production of singleton and geminate consonants in standard Italian differs both temporally and spatially [1, 2, 3, 4, 8]. Not only are geminates produced with longer constriction duration than singletons, but it has been observed for Italian coronals using electropalatography (EPG) that more linguo-palatal contact occurs in the production of geminate consonants than in singletons [3]. Further, it has been hypothesized based on EPG data (but not firmly established) that Italian coronal geminate and singleton consonants, and coronal sonorant and stop consonants, differ with respect to whether they are produced apically or laminally [3]. Furthermore, there is a lack of data concerning the spatial and kinematic aspects of dorsal consonant production, due to the physiological limitations of EPG and electro-magnetic articulography (factors that do not affect rtMRI).

The aim of this study is to reexamine these claims using rtMRI data, and to shed more light on aspects of Italian consonant articulation that are less well understood due to limitations of other methods of articulometry.

2. Data Acquisition

An adult male speaker of standard Italian as spoken in Rome was imaged while producing lexical items contrasting singleton and geminate stops, affricates and sonorants (p/pp, m/mm, t/tt, d/dd, l/ll, n/nn, tʃ/ttʃ, dʒ/ddʒ, k/kk, g/gg) using a custom MRI protocol [7].
The subject, lying supine, repeated phrases containing one member of a minimal (or near-minimal) pair, e.g. [pata]-[patta], five times, with the tokens of a given consonant dispersed in random order. Tokens were designed to elicit target consonants in multiple vowel contexts; carrier phrases were designed to minimize consonantal coarticulation effects on the consonants of interest.

A 13-interleaf spiral gradient echo pulse sequence was used (TR = 6.164 msec, FOV = 200 × 200 mm, flip angle = 15°). Scan slice thickness was 5 mm, located midsagittally; image resolution in the sagittal plane was 68 × 68 pixels (2.9 × 2.9 mm). New image data were acquired at a rate of 18.52 frames/second, and reconstructed as 33.8 frames/second video using a sliding window technique. More details about the rtMRI acquisition can be found in [15].

3. Results

3.1. Constriction Location

To automatically locate the primary constriction target for each segment of speech, our approach is to find the image pixel in the approximate region of constriction that changes in intensity most systematically as the constriction is formed and released. Two methods were tested for defining this criterion (refer to companion paper [9] for details). The search space within each frame is limited to a set of pixels lying on the palate (dorsals), alveolar ridge (coronals) and upper lip (labials), in addition to a set number of pixels below those points, corresponding to the midsagittal airway.

Copyright 2011 ISCA. INTERSPEECH 2011, 28-31 August 2011, Florence, Italy.
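
As a rough illustration of the pixel search described in Section 3.1, the sketch below picks, from a restricted set of candidate pixels, the one whose intensity varies most over a sequence of frames. This is only a minimal sketch under stated assumptions: the two selection criteria actually tested are given in the companion paper [9] and are not reproduced here, so the intensity range (maximum minus minimum) is used as a stand-in score. The names `intensity_range`, `locate_constriction` and `make_frame`, and the toy 3 × 3 video, are hypothetical and serve only as an example.

```python
from typing import List, Sequence, Tuple

Frame = Sequence[Sequence[float]]  # one image: rows of pixel intensities
Pixel = Tuple[int, int]            # (row, column) index into a frame


def intensity_range(frames: Sequence[Frame], pixel: Pixel) -> float:
    """Return max minus min intensity of one pixel across all frames."""
    row, col = pixel
    series = [frame[row][col] for frame in frames]
    return max(series) - min(series)


def locate_constriction(frames: Sequence[Frame],
                        candidates: Sequence[Pixel]) -> Pixel:
    """Return the candidate pixel whose intensity varies most over time.

    ASSUMPTION: intensity range is a stand-in score, not the criterion
    used in the paper (see [9] for the methods actually tested).
    """
    if not frames or not candidates:
        raise ValueError("need at least one frame and one candidate pixel")
    return max(candidates, key=lambda p: intensity_range(frames, p))


def make_frame(center: float) -> List[List[float]]:
    """Build a toy 3x3 frame; only the centre pixel changes over time."""
    return [[0.9, 0.9, 0.9],      # static bright row (e.g. palate tissue)
            [0.1, center, 0.1],   # airway row; centre brightens on closure
            [0.5, 0.5, 0.5]]      # static mid-intensity row


# Toy video: tissue enters the centre pixel and leaves again.
frames = [make_frame(v) for v in (0.1, 0.4, 0.9, 0.9, 0.4, 0.1)]

# Hypothetical search region: the three pixels of the airway row.
search_region = [(1, 0), (1, 1), (1, 2)]

best = locate_constriction(frames, search_region)
print(best)
```

Restricting `candidates` to a small region mirrors the limited search space described above (palate, alveolar ridge or upper lip, plus pixels in the airway below), which keeps the search from being drawn to unrelated movement elsewhere in the image.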