PARSING SPEECH INTO ARTICULATORY EVENTS

Kadri Hacioglu, Bryan Pellom and Wayne Ward
Center for Spoken Language Research, University of Colorado at Boulder
E-mail: {hacioglu,pellom,whw}@cslr.colorado.edu

ABSTRACT

In this paper, the state of the speech production process is defined by a number of categorical articulatory features. We describe a detector that outputs a stream (a sequence of classes) for each articulatory feature given the Mel-frequency cepstral coefficient (MFCC) representation of the input speech. The detector consists of a bank of recurrent neural network (RNN) classifiers, a variable-depth lattice generator, and a Viterbi decoder. A bank of classifiers has previously been used for articulatory feature detection by many researchers. However, we extend their work, first by creating variable-depth lattices for each feature and then by combining them into product lattices for rescoring with the Viterbi algorithm. During the rescoring we incorporate language and duration constraints along with the posterior probabilities of classes provided by the RNN classifiers. We present results for place and manner features on TIMIT data and compare them to a baseline system. We report performance improvements at both the frame and segment levels.

1. INTRODUCTION

The linear symbolic representation of speech at the lowest symbolic level using phonemes is very common in state-of-the-art speech recognizers. This is known as the "beads-on-a-string" representation. The drawbacks of this representation have been reported in [1-3]. A natural extension of this representation is the "beads-on-multiple-strings" representation, which suggests a nonlinear, multidimensional symbolic representation. In the latter, the first challenging issue is deciding on the nature of the features ("strings") and the type of classes ("beads") for each feature. The second issue is the accurate detection of the classes along each dimension.
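The "beads-on-multiple-strings" idea can be made concrete with a small sketch: each frame (or segment) of speech carries one class per articulatory feature, so a word becomes a sequence of such feature vectors. The feature inventory below (manner, place, voicing and their class lists) is purely illustrative and is not the inventory used in the paper.

```python
# Illustrative sketch of an articulatory feature vector (state).
# The features and class inventories here are hypothetical examples,
# not the exact inventory used in the paper.

FEATURES = {
    "manner": ["vowel", "stop", "fricative", "nasal", "silence"],
    "place": ["labial", "alveolar", "velar", "none"],
    "voicing": ["voiced", "unvoiced"],
}

def feature_vector(manner, place, voicing):
    """Build one articulatory state, validating each class against its inventory."""
    state = {"manner": manner, "place": place, "voicing": voicing}
    for feat, cls in state.items():
        if cls not in FEATURES[feat]:
            raise ValueError(f"unknown class {cls!r} for feature {feat!r}")
    return state

# A word as a sequence of articulatory states, e.g. a rough sketch of /ba/:
word = [
    feature_vector("stop", "labial", "voiced"),   # /b/: voiced labial closure
    feature_vector("vowel", "none", "voiced"),    # /a/: open vowel
]
```

Reading along one key across the sequence (e.g. all the "manner" values) recovers a single feature stream, the "string" on which the class "beads" sit.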
Many different symbolic feature representations and ways of detecting the feature classes have been reported in [4-14]. According to the "beads-on-multiple-strings" approach, a segment of speech is classified into a number of broad classes in multiple dimensions. We associate the dimensions with articulatory features and the classes with their values. In doing so, a representation of the speech frame, or segment, is obtained as an articulatory feature vector (or state). In turn, a word can be represented by a sequence of feature vectors. Our goal is the accurate detection of the feature streams from the input speech for subsequent word recognition. We propose a detector that is a bank of recurrent neural networks (RNNs) followed by a product-lattice rescoring unit. The outputs of the RNNs are the posterior probabilities of the classes for each articulatory feature given the acoustic representation in MFCCs.

Fig. 1. A speech recognition framework.

RNNs have been extensively used for articulatory feature detection in [7, 14]. We extend their work by generating lattices of feature classes for each feature stream. These lattices can be rescored either independently or jointly for better performance. The posterior probabilities, along with language and duration constraints within and across the feature streams, are used during rescoring. The final output of the system is a sequence of classes at the frame level for each feature stream, tagged with posterior probabilities as indicators of the reliability of the information or evidence passed to higher levels. Articulatory events, or segments of speech, are created by concatenating the repeating classes.
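The two post-classifier steps just described can be sketched for a single feature stream: Viterbi decoding over the per-frame class posteriors under a class-bigram ("language") constraint, followed by merging runs of identical frame labels into articulatory events. This is a minimal illustration under assumed toy probabilities, not the paper's product-lattice implementation, and it omits the duration model and cross-stream constraints.

```python
import math

# Toy class inventory for one feature stream (illustrative only).
CLASSES = ["stop", "vowel", "silence"]

def viterbi(posteriors, trans):
    """Best frame-level class path for one stream.

    posteriors: list of dicts, posteriors[t][c] = P(class c | frame t) from an RNN.
    trans:      dict, trans[(a, b)] = bigram probability P(b | a), the
                "language" constraint on class sequences.
    """
    n = len(posteriors)
    # delta[t][c]: best log-score of any path ending in class c at frame t.
    delta = [{c: math.log(posteriors[0][c]) for c in CLASSES}]
    back = []
    for t in range(1, n):
        scores, ptrs = {}, {}
        for c in CLASSES:
            prev = max(CLASSES, key=lambda p: delta[-1][p] + math.log(trans[(p, c)]))
            scores[c] = (delta[-1][prev] + math.log(trans[(prev, c)])
                         + math.log(posteriors[t][c]))
            ptrs[c] = prev
        delta.append(scores)
        back.append(ptrs)
    # Trace back the best path from the best final class.
    path = [max(CLASSES, key=lambda c: delta[-1][c])]
    for ptrs in reversed(back):
        path.append(ptrs[path[-1]])
    return path[::-1]

def to_events(frame_classes):
    """Merge runs of repeated frame labels into (class, start_frame, length) events."""
    events = []
    for i, c in enumerate(frame_classes):
        if events and events[-1][0] == c:
            events[-1] = (c, events[-1][1], events[-1][2] + 1)
        else:
            events.append((c, i, 1))
    return events
```

For example, a decoded path ["stop", "stop", "vowel"] collapses to the events [("stop", 0, 2), ("vowel", 2, 1)], i.e. a two-frame stop followed by a one-frame vowel.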
We present results that show improved performance at both the frame and segment levels.

The framework on which this work is based is in strong agreement with the detection-based framework for speech recognition and understanding [17]. Also, the lattice rescoring strongly overlaps with the notion of event lattices advocated in [18, 19].

The paper is organized as follows. In Section 2, we present the framework for our ongoing research toward a system that uses articulatory features in speech recognition. We discuss the articulatory feature representation that we are currently considering and an implementation of a detector for it. Experimental results are presented in Section 3. Conclusions are drawn in the final section.

2. ARTICULATORY FEATURE BASED APPROACH

2.1. Recognition Framework

In this section we describe a framework for our ongoing research toward a speech recognition system based on articulatory features. The framework is illustrated in Figure 1. In the standard top-