Speech Inversion by dynamic time warping method

Robert Wielgat
Polytechnic Institute, Division of Electronics and Telecommunications
State Higher Vocational School in Tarnów
Tarnów, Poland
rwielgat@poczta.onet.pl

Anita Lorenc
Department of Speech Therapy and Applied Linguistics
Maria Curie-Sklodowska University
Lublin, Poland
trochymiuk@gmail.com

Abstract: Electromagnetic articulography (EMA) is a precise method for assessing the speech articulators, carried out with sensors placed mainly on the tongue. Various methods are being developed to avoid the need for EMA sensors. One of them is speech inversion. This paper describes preliminary research on speech inversion based on the dynamic time warping (DTW) method. Mel-frequency cepstral coefficients (MFCC) were chosen to parametrize the acoustic speech signal. RMS errors of the evaluation are presented and discussed.

Keywords: speech inversion; dynamic time warping; mel-frequency cepstral coefficients.

I. INTRODUCTION

Electromagnetic articulography is a method for assessing the speech articulators, usually for tracking tongue movement during the pronunciation of the phonemes under consideration [1], [2]. The method is very precise and effective, but it is unpleasant for the speaker because of its invasiveness. Various methods have been developed for non-invasive assessment of tongue movement. One of them is acoustic-to-articulatory speech inversion, which estimates the tongue movement trajectory from the acoustic signal or from video images [3], [4]. Commonly used speech inversion methods are those based on hidden Markov models (HMMs) [5], [6], [7]. Acoustic-to-articulatory speech inversion by HMM estimates articulatory movement with an accuracy of about 1 to 2 mm on average. Although the HMM approach is relatively effective, it requires a significant amount of data to train the HMMs properly.
The research described in this paper is preliminary, and only a small amount of speech material is available at the moment; therefore another approach to speech inversion has been chosen. The method presented here is dynamic time warping (DTW), which is based on the dynamic programming paradigm. The modeling accuracy of DTW can be higher than that of HMM, particularly for small training sets. To use DTW, a set of several patterns has been recorded. Patterns are sequences of observations (feature vectors) extracted from the speech audio signal. Each observation is associated with simultaneously recorded articulatory signals, namely the positions of sensors placed on the tongue and lips. The examined word is compared with the pattern of the same class, and the optimal alignment between them is calculated. Afterwards, the sensor positions associated with the observations of the pattern are assigned to the aligned observations of the examined word. MFCC has been used as the signal parametrization method.

II. METHODS

To perform speech inversion, a set of audio and articulograph signals must be recorded. The audio speech signals represent words, which are converted into patterns (sequences of observations) by the MFCC method. The articulograph signals are recorded by an electromagnetic articulograph (EMA) and consist of sensor positions over time, assigned to the appropriate observations. A complete description of the acquisition system can be found in [9]. The set of patterns with associated EMA sensor positions is divided into a training set and a testing set. Sensor positions for the words from the testing set are estimated by the DTW method. An RMS error is then calculated between the estimated and measured sensor positions.

A. Electromagnetic Articulography

An AG 500 electromagnetic articulograph has been used in the research.
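As a brief aside, the DTW-based estimation and the RMS evaluation outlined in the Methods overview above could be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names (`dtw_path`, `invert_by_dtw`, `rms_error`), the Euclidean local distance, the symmetric step pattern, the averaging of sensor positions when several pattern frames align with one test frame, and the pooling of all coordinates into a single RMS value are hypothetical choices made only for the example.

```python
import math

def dtw_path(pattern, test):
    """Return the optimal alignment as (pattern_index, test_index) pairs.

    Assumptions: Euclidean local distance and the basic symmetric step
    pattern (diagonal, vertical, horizontal); the paper does not specify these.
    """
    n, m = len(pattern), len(test)
    inf = float("inf")
    # acc[i][j] = minimal accumulated cost of aligning the first i pattern
    # frames with the first j test frames.
    acc = [[inf] * (m + 1) for _ in range(n + 1)]
    acc[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = math.dist(pattern[i - 1], test[j - 1])
            acc[i][j] = d + min(acc[i - 1][j - 1], acc[i - 1][j], acc[i][j - 1])
    # Backtrack from the end of both sequences to recover the path.
    path = []
    i, j = n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        moves = [
            (acc[i - 1][j - 1], i - 1, j - 1),  # diagonal step
            (acc[i - 1][j], i - 1, j),          # advance in the pattern only
            (acc[i][j - 1], i, j - 1),          # advance in the test word only
        ]
        _, i, j = min(moves, key=lambda t: t[0])
    path.reverse()
    return path

def invert_by_dtw(pattern_feats, pattern_pos, test_feats):
    """Assign pattern sensor positions to the aligned test observations.

    Assumption: if several pattern frames align with one test frame,
    their sensor positions are averaged.
    """
    dim = len(pattern_pos[0])
    sums = [[0.0] * dim for _ in test_feats]
    counts = [0] * len(test_feats)
    for pi, ti in dtw_path(pattern_feats, test_feats):
        counts[ti] += 1
        for k in range(dim):
            sums[ti][k] += pattern_pos[pi][k]
    return [[s / c for s in row] for row, c in zip(sums, counts)]

def rms_error(estimated, measured):
    """RMS error pooled over all frames and coordinates (an assumed definition)."""
    sq = [(e - m) ** 2
          for e_row, m_row in zip(estimated, measured)
          for e, m in zip(e_row, m_row)]
    return math.sqrt(sum(sq) / len(sq))
```

In such a sketch, `pattern_feats` and `test_feats` would hold one MFCC vector per frame, and `pattern_pos` one vector of sensor coordinates per pattern frame; the estimate returned by `invert_by_dtw` could then be compared with the measured EMA positions of the test word via `rms_error`.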
The main parts of the EMA system are six large coils that produce an alternating electromagnetic field with three frequency components. The field induces alternating currents in smaller sensor coils fixed to the speech articulators, mainly the tongue and lips. The induced currents are analyzed by FFT, and the energy of each frequency component in the sensor coil is calculated. From the energy values, the position coordinates and orientation of the sensor are estimated. Twelve EMA sensors were positioned at previously selected points on the speaker’s mobile articulators and fixed with non-toxic tissue glue (cf. Fig. 1). Three sensors served to enable subsequent correction of undesirable head movements that occurred during testing: they were placed on the mastoid processes behind the ears and in the depression between the forehead and the nose. These places were selected because the sensors there could not move relative to one another during testing. Two sensors tracking lip movements (LL, UL) were attached on the central facial axis. Another four sensors were placed on the medial part of the tongue: one on the tip (TT), one in the postdorsal area (TB), and two at equal intervals between the outermost sensors (TF, TD). The next sensor was glued onto the left side of the tongue’s upper surface between