MULTI-VIEW AUDIO-ARTICULATORY FEATURES FOR PHONETIC RECOGNITION ON RTMRI-TIMIT DATABASE

Ioannis K. Douros, Athanasios Katsamanis, Petros Maragos
School of Electrical and Computer Engineering
National Technical University of Athens, Athens 15773, Greece
ioandouros@gmail.com, nkatsam@cs.ntua.gr, maragos@cs.ntua.gr

ABSTRACT

In this paper, we investigate the use of articulatory information, and more specifically real-time Magnetic Resonance Imaging (rtMRI) data of the vocal tract, to improve speech recognition performance. For our experiments, we use data from the rtMRI-TIMIT database. First, Scale Invariant Feature Transform (SIFT) features are extracted for each video frame. The SIFT descriptors of each frame are then aggregated into a single histogram per image using the Bag of Visual Words methodology. Since this kind of articulatory information is difficult to acquire in typical speech recognition setups, we consider it available only during the training phase. We therefore adopt a multi-view approach, applying Canonical Correlation Analysis (CCA) to the visual and audio data. Using the transformation matrix acquired during the training stage, we transform both train and test audio data to produce MFCC-articulatory features, which form the input to the recognition system. Experimental results demonstrate improvements in phone recognition over the audio-based baseline.

Index Terms— SIFT features, Canonical Correlation Analysis, Bag of Visual Words, multi-view, rtMRI-TIMIT

1. INTRODUCTION

Speech recognition systems, by harnessing the power of deep neural networks, have achieved significant performance gains in recent years. However, there is still room for improvement, especially when the acoustic conditions are not ideal, for example in the presence of background noise or reverberation.
To overcome these problems, various approaches have been proposed, quite a few of which rely on the successful exploitation of another modality, e.g., facial information, that may be available in parallel with the audio during speech production. For example, visual features from the face, such as Discrete Cosine Transform (DCT), Discrete Wavelet Transform (DWT), and Active Appearance Model coefficients, have been combined with audio features in audiovisual recognition setups to lower recognition error [1, 2]. There is also great interest in articulatory information, in the form of, e.g., Electromagnetic Articulography (EMA), X-ray Microbeam (XRMB), and real-time MRI data of the vocal tract, and in how it could benefit speech technologies [3]. In this direction, we particularly focus on rtMRI data of speech production and use them to improve speech recognition performance.

Our proposed scheme is based on the multi-view approach. The main idea is to employ different kinds of measurements (views), gathered at the same time for the same task, and to use one of the views to train effective transformations of the other. Usually two views are used, but this is not mandatory. For speech recognition, popular views are audio paired with visual or articulatory features. Another option is to use the labels themselves, but in practice this is not very common. In contrast to multi-modal setups, a multi-view setup can handle data in which one of the two views is available only during the training phase. Usually, CCA is used to learn the transformation. Such a setup was first used in [4] for speaker recognition. Similar setups have been applied successfully to speech recognition, e.g., in [5] on the XRMB database. In this paper, we adapt this technique to the rtMRI-TIMIT [6] dataset.
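The multi-view idea above can be sketched in a few lines of numpy. This is a minimal illustration, not the authors' implementation: the function name, the toy dimensions, and the regularization constant are our own assumptions. A regularized CCA is solved via an SVD of the whitened cross-covariance of the two views, and only the audio-side projection is kept, since the articulatory view is unavailable at test time.

```python
import numpy as np

def cca_audio_projection(X, Y, k, reg=1e-3):
    """Learn a CCA projection for the audio view X (n x dx) against the
    articulatory view Y (n x dy). Returns the audio-side projection matrix
    and the audio mean, which is all that is needed at test time."""
    Xc, Yc = X - X.mean(0), Y - Y.mean(0)
    n = X.shape[0]
    Cxx = Xc.T @ Xc / n + reg * np.eye(X.shape[1])   # regularized covariances
    Cyy = Yc.T @ Yc / n + reg * np.eye(Y.shape[1])
    Cxy = Xc.T @ Yc / n

    def inv_sqrt(C):
        # inverse matrix square root via eigendecomposition (C is symmetric PD)
        w, V = np.linalg.eigh(C)
        return V @ np.diag(1.0 / np.sqrt(w)) @ V.T

    Wx, Wy = inv_sqrt(Cxx), inv_sqrt(Cyy)
    U, s, Vt = np.linalg.svd(Wx @ Cxy @ Wy)          # canonical directions
    A = Wx @ U[:, :k]                                # audio-side projection
    return A, X.mean(0)

# toy usage: 200 frames, 13-dim "MFCCs", 50-dim "articulatory histograms"
rng = np.random.default_rng(0)
X = rng.standard_normal((200, 13))
Y = rng.standard_normal((200, 50))
A, mu = cca_audio_projection(X, Y, k=5)
# augmented MFCC-articulatory features for the recognizer
feats = np.hstack([X, (X - mu) @ A])
print(feats.shape)
```

At test time only `mu` and `A` are reused, so no articulatory data is required beyond training, which is the point of the multi-view setup.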
Although the MRI image quality is not very good, we expect to improve audio-based speech recognition results, as the view of the entire vocal tract available in this dataset is expected to provide (to some degree) complementary information to the audio stream. The rtMRI-TIMIT database has also been used for phone classification in [7], but the classification in that case is only broad and requires human intervention: a mask must be placed on each speaker's midsagittal view by manually locating the speaker's nose at the start of each utterance. To the best of our knowledge, there is no previous work on the rtMRI-TIMIT database for phone recognition that requires no human involvement.

In our study, SIFT features are used to describe each video frame. By applying the Bag of Visual Words technique, we transform those descriptors into one histogram per image. We extract MFCCs, which, together with the visual-articulatory histograms, form the two views of our experiment. Finally, we employ the multi-view setup using CCA. Experimental results demonstrate improvements in phone recognition over the audio-based baseline.
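The descriptor-to-histogram step can be sketched as follows. This is a hedged, numpy-only illustration under our own assumptions (the function name, codebook size, and random data are hypothetical): each SIFT-like descriptor of a frame is assigned to its nearest visual word from a precomputed codebook (in practice, k-means centers learned on training descriptors), and the normalized word counts form the frame's histogram.

```python
import numpy as np

def bovw_histogram(descriptors, codebook):
    """Bag of Visual Words: map local descriptors of one frame to a single
    normalized histogram over the codebook.
    descriptors: (n, d) SIFT-like vectors; codebook: (V, d) visual words."""
    # squared Euclidean distance from every descriptor to every word
    d2 = ((descriptors[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
    words = d2.argmin(1)                  # nearest-word index per descriptor
    hist = np.bincount(words, minlength=len(codebook)).astype(float)
    return hist / hist.sum()              # L1-normalize so frames are comparable

# toy usage: 30 fake 128-dim descriptors, a 16-word codebook
rng = np.random.default_rng(0)
codebook = rng.standard_normal((16, 128))   # in practice: k-means centers
desc = rng.standard_normal((30, 128))
h = bovw_histogram(desc, codebook)
print(h.shape)
```

One such fixed-length histogram per video frame is what serves as the articulatory view, regardless of how many SIFT keypoints the frame produced.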