A phonetic-level analysis of different input features for articulatory inversion

Abdolreza Sabzi Shahrebabaki 1, Negar Olfati 1, Ali Shariq Imran 1, Sabato Marco Siniscalchi 2, Torbjørn Svendsen 1

1 Department of Electronic Systems, NTNU
2 Department of Telematics, Kore University of Enna

{abdolreza.sabzi, negar.olfati, ali.imran, torbjorn.svendsen}@ntnu.no, marco.siniscalchi@unikore.it

Abstract

The challenge of articulatory inversion is to determine the temporal movement of the articulators from the speech waveform, or from acoustic-phonetic knowledge, e.g., derived from information about the linguistic content of the utterance. The actual positions of the articulators are typically obtained from measured data, in our case position measurements obtained using electromagnetic articulography (EMA). In this paper, we investigate the impact on the articulatory inversion problem of using features derived from the acoustic waveform relative to using linguistic features related to the time-aligned phone sequence of the utterance. Filterbank energies (FBE) are used as acoustic features, while phoneme identities and (binary) phonetic attributes are used as linguistic features. Experiments are performed on a speech corpus with synchronously recorded EMA measurements, employing a bidirectional long short-term memory (BLSTM) network that estimates the articulators' positions. Acoustic FBE features performed better for vowel sounds, whereas phonetic features attained better results for nasal and fricative sounds, except for /h/. Further improvements were obtained by combining FBE and linguistic features, which led to an average relative RMSE reduction of 9.8% and a 3% relative improvement of the Pearson correlation coefficient.

Index Terms: articulatory inversion, language learning, bidirectional long short-term memory, attributes, HPRC database

1. Introduction

Acoustic-to-articulatory inversion (AAI) is a challenging problem due to the many-to-one mapping in which different articulator positions may produce a similar sound. This many-to-one mapping makes AAI a highly non-linear problem. The objective of AAI is to estimate the vocal tract shape, represented by the articulator positions, from the uttered speech. AAI can be useful in many speech-based applications, in particular speech synthesis [1], automatic speech recognition (ASR) [2, 3, 4], and second language learning [5, 6]. Over the years, researchers have addressed this problem with various machine learning techniques, including codebooks [7], Gaussian mixture models (GMMs) [8], hidden Markov models (HMMs) [9], mixture density networks [10], deep neural networks (DNNs) [11, 12, 13], and deep recurrent neural networks (RNNs) [14, 15, 16].

Exploiting RNNs for the AAI task has demonstrated better results than DNNs [14, 16], because the temporal dynamics are better captured through the memory elements of the recurrent architectures. Acoustic features are commonly employed at the input of AAI systems [7, 8, 9, 10], but linguistic features have also been used successfully in recent years, either as stand-alone features [17] or together with acoustic features [15]. Moreover, linguistic features represented in a bottleneck form extracted from a phone classifier were used in [16]. Although leveraging knowledge from the linguistic content together with acoustic features has been shown to improve AAI systems, a deeper analysis explaining why this redundant information makes the system perform better is missing. We believe that a better understanding of this performance improvement would be helpful for specific tasks where the linguistic features are available from text, e.g., language learning.
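The two linguistic feature types discussed above, phoneme identities and binary phonetic attributes, can be illustrated with a toy per-frame encoding. The phone inventory and attribute map below are illustrative placeholders, not the paper's actual feature set, and the 40-dimensional FBE frame is an arbitrary example size:

```python
import numpy as np

# Hypothetical, tiny phone inventory and attribute map (illustrative only;
# the paper uses a full English phone set and attribute inventory).
PHONES = ["aa", "iy", "m", "s", "h"]
ATTRIBUTES = ["vowel", "nasal", "fricative", "voiced"]
PHONE_ATTRIBUTES = {
    "aa": {"vowel", "voiced"},
    "iy": {"vowel", "voiced"},
    "m":  {"nasal", "voiced"},
    "s":  {"fricative"},
    "h":  {"fricative"},
}

def phone_one_hot(phone: str) -> np.ndarray:
    """One-hot phoneme-identity vector for one frame."""
    vec = np.zeros(len(PHONES))
    vec[PHONES.index(phone)] = 1.0
    return vec

def attribute_vector(phone: str) -> np.ndarray:
    """Binary phonetic-attribute vector for one frame."""
    return np.array([1.0 if a in PHONE_ATTRIBUTES[phone] else 0.0
                     for a in ATTRIBUTES])

def frame_features(fbe: np.ndarray, phone: str) -> np.ndarray:
    """Concatenate acoustic FBEs with linguistic features for one frame."""
    return np.concatenate([fbe, phone_one_hot(phone), attribute_vector(phone)])

# Example: a 40-dim filterbank-energy frame time-aligned to phone /m/
frame = frame_features(np.random.randn(40), "m")
print(frame.shape)  # (49,) = 40 FBE + 5 phone one-hot + 4 attributes
```

Feeding such concatenated frames to the network corresponds to the combined acoustic-plus-linguistic input condition; dropping either slice yields the FBE-only or linguistic-only conditions.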
This motivates us to compare the state-of-the-art methods in [16, 17] and to carry out additional analyses of the acoustic and linguistic features within phoneme boundaries, which can later be employed in pronunciation scoring. That is, we focus the evaluation on time intervals corresponding to a single phoneme instead of analyzing the whole EMA trajectory of the uttered speech.

The rest of the paper is structured as follows. Section 2 presents deep BLSTM recurrent neural networks. Section 3 describes the "Haskins Production Rate Comparison" database (HPRC) [18], the feature representations, and the performance measures used in this study. Section 4 presents the results, and Section 5 concludes the paper.

2. Deep BLSTM recurrent neural network

Recurrent neural networks (RNNs) have been utilized in many areas of speech technology, including speech recognition [19], language modeling [20], and articulatory inversion [14, 15, 16]. They are able to estimate the output samples of dynamical systems [21], conditioned on previous samples. When processing can be non-causal, i.e., when both past and future input samples are accessible, we can employ a bidirectional RNN, which processes the past samples in a forward layer and the future samples in a backward layer, as shown in Fig. 1. The diamonds in the figure indicate the strategy for merging the outputs of the forward and backward layers, e.g., summation or concatenation. LSTM is a variant of the RNN with a specific memory cell architecture for updating the hidden layers. This memory cell is formulated as follows:

Figure 1: A bidirectional RNN.

Post-print version of the accepted paper at the INTERSPEECH 2019 conference in Graz, Austria, September 15–19, 2019. DOI: http://dx.doi.org/10.21437/Interspeech.2019-2526
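The bidirectional processing and merge step described above can be sketched as a minimal NumPy forward pass. A plain Elman-style recurrence stands in for the LSTM cell here purely for brevity; weights and dimensions are arbitrary toy values, not the paper's configuration:

```python
import numpy as np

def rnn_pass(x, W_in, W_rec, reverse=False):
    """One directional pass of a simple (Elman-style) RNN.
    x: (T, D) input sequence; returns (T, H) hidden states."""
    T, H = x.shape[0], W_rec.shape[0]
    h = np.zeros(H)
    out = np.zeros((T, H))
    steps = range(T - 1, -1, -1) if reverse else range(T)
    for t in steps:
        h = np.tanh(x[t] @ W_in + h @ W_rec)
        out[t] = h
    return out

def bidirectional_rnn(x, W_in_f, W_rec_f, W_in_b, W_rec_b, merge="concat"):
    """Forward layer sees past context, backward layer sees future context;
    the per-frame merge corresponds to the diamonds in Fig. 1."""
    fwd = rnn_pass(x, W_in_f, W_rec_f, reverse=False)
    bwd = rnn_pass(x, W_in_b, W_rec_b, reverse=True)
    if merge == "sum":
        return fwd + bwd
    return np.concatenate([fwd, bwd], axis=-1)  # default: concatenation

# Toy dimensions: T=6 frames, D=8 input features, H=4 hidden units
rng = np.random.default_rng(0)
T, D, H = 6, 8, 4
x = rng.standard_normal((T, D))
params = [rng.standard_normal(s) * 0.1 for s in [(D, H), (H, H), (D, H), (H, H)]]
y = bidirectional_rnn(x, *params, merge="concat")
print(y.shape)  # (6, 8): H forward + H backward states per frame
```

With concatenation the merged state doubles in width; with summation it keeps the hidden size, trading capacity for fewer downstream parameters.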