Better acoustic normalization in subject independent acoustic-to-articulatory inversion: benefit to recognition

Amber Afshan 1, Prasanta Kumar Ghosh 2

1 Department of Electrical Engineering, University of California, Los Angeles, USA
2 Department of Electrical Engineering, Indian Institute of Science (IISc), Bangalore 560012, India
amberafshan0107@gmail.com, prasantg@ee.iisc.ernet.in

Abstract

In subject independent acoustic-to-articulatory inversion (SII), the training and test subjects are in general different, whereas subject dependent inversion (SDI) uses the same subject for training and testing. Thus, acoustic normalization is used in SII to compensate for the mismatch between the training and test subjects. We show that a better acoustic normalization not only results in better articulatory estimates using SII, but also improves broad class phonetic recognition accuracy when the articulatory features estimated from SII are used for recognition. Recognition experiments using male and female subjects from the MOCHA-TIMIT corpus also show that there is no significant difference between the recognition accuracy using the articulatory features obtained by the best acoustic normalization in SII and that obtained using SDI as well as directly measured articulatory features.

Index Terms: broad class phonetic recognition, acoustic-to-articulatory inversion, subject independent inversion, acoustic normalization

1. Introduction

The kinematics of speech articulators (e.g., lips, jaw, tongue, velum) recorded during speech production are known to provide cues for automatic speech recognition (ASR) [1, 2]. These articulatory features are also known to provide information complementary to the acoustic features obtained from the speech signal [3]. Unlike recording the speech signal, recording articulatory kinematics is not convenient in practice, which hinders the use of directly measured articulatory data for ASR. In the absence of directly measured articulatory features, estimating them from the speech signal becomes a plausible option. The task of estimating articulatory features from an acoustic representation is known as acoustic-to-articulatory inversion (AAI) [4]. AAI can be of two types: 1) subject-dependent inversion (SDI), where acoustic-articulatory data from the test subject is available for training the AAI algorithm [5, 6, 7, 8, 9, 10, 11, 12, 13, 14], and 2) subject-independent inversion (SII) [15, 16, 17], where the test subject can in general be different from the training subject. SII is more challenging than SDI due to the mismatch between the training and test subjects. At the same time, SII is more appropriate than SDI when the estimated articulatory features are to be used for ASR on an arbitrary test subject, because acoustic or articulatory data from the test subject may not be available a priori. Thus, in this work we perform speech recognition using articulatory features estimated using SII.

Acoustic normalization is used to compensate for the mismatch between the training and test subjects' acoustics in SII [16, 17]. This is done by constructing a probability feature vector, i.e., by transforming the acoustic features of the training and test subjects onto a generic acoustic space (GAS) consisting of a large pool of acoustic features from multiple speakers [16]. The GAS need not include the acoustic data of the training and test subjects of SII. Unlike their raw acoustic feature vectors, the probability feature vectors of two subjects are directly comparable.
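To make the normalization concrete, the following is a minimal sketch of one plausible realization, assuming each acoustic unit in the GAS is modeled by a Gaussian mixture model (GMM) trained on pooled multi-speaker frames; the function names, the GMM-per-unit modeling choice, and the uniform unit priors are illustrative assumptions, not details taken from [16, 17].

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# Illustrative sketch (not the authors' exact method): one GMM per
# acoustic unit in the generic acoustic space (GAS), each fit on an
# (N_i x D) array of pooled multi-speaker frames for that unit
# (e.g., a phone or an HMM state).

def train_gas_models(gas_frames_by_unit, n_components=8):
    """Fit one diagonal-covariance GMM per GAS acoustic unit."""
    models = {}
    for unit, frames in gas_frames_by_unit.items():
        gmm = GaussianMixture(n_components=n_components,
                              covariance_type='diag')
        gmm.fit(frames)
        models[unit] = gmm
    return models

def probability_features(frames, models):
    """Map raw acoustic frames (T x D) to probability feature
    vectors (T x K): per-frame posteriors over the K GAS units."""
    units = sorted(models)
    # Per-frame log-likelihood of each frame under each unit's GMM.
    loglik = np.stack([models[u].score_samples(frames)
                       for u in units], axis=1)      # (T, K)
    # Convert to posteriors, assuming uniform priors over units.
    loglik -= loglik.max(axis=1, keepdims=True)      # numerical stability
    post = np.exp(loglik)
    return post / post.sum(axis=1, keepdims=True)
```

Because the frames of both subjects are mapped to posteriors over the same shared set of units, the resulting probability feature vectors live in a common space and are comparable across speakers even when the raw acoustic features are not.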
It has been shown that the acoustic normalization in SII can be improved by appropriately choosing the acoustic units in the GAS [17]. For example, phonetic units are found to be more effective for normalization than acoustic units obtained by unsupervised clustering. Similarly, when the states of a phonetic hidden Markov model (HMM) are used as the acoustic units, the acoustic normalization is even better than that using phonetic units [17]. Better acoustic normalization, in turn, results in better estimates of the articulatory features.

Although the effect of different acoustic normalizations in SII on the quality of the estimated articulatory features has been studied [17], it is not clear how ASR performance changes when the articulatory features estimated using different acoustic normalizations in SII are used for recognition. In this work, we study the effect of different acoustic normalizations on broad class phonetic recognition accuracy, where recognition is performed using the estimated articulatory features. The goal is to compare the amount of phonetic information present in the articulatory features obtained using different acoustic normalization techniques. We also compare the recognition accuracies obtained with articulatory features estimated from SII against those estimated from SDI as well as directly measured articulatory features.

The broad class phonetic recognition experiments reveal that better acoustic normalization leads to better recognition accuracy. This suggests that when the estimated articulatory features better match the original ones, they also provide more discrimination among broad phonetic classes. It is also found that, on average, the recognition accuracy obtained with articulatory features estimated using the best acoustic normalization technique in SII is better than that using SDI. Interestingly, the recognition accuracy using articulatory features from SII is found to be similar to that using directly measured articulatory features. All these findings indicate the potential of articulatory features estimated using SII for phonetic recognition.

We begin with a description of the dataset and the acoustic and articulatory features. In Section 3, we briefly describe the different acoustic normalization techniques compared in this work. The recognition experiments and results are discussed in Section 4. Conclusions and future work are summarized in Section 5.

2. Dataset and features

For the recognition experiments and AAI in our work, we have used the Multichannel Articulatory (MOCHA) database [18]. This