Phoneme Alignment Based on Discriminative Learning

Joseph Keshet, Hebrew University, jkeshet@cs.huji.ac.il
Shai Shalev-Shwartz, Hebrew University, shais@cs.huji.ac.il
Yoram Singer, Google Inc., singer@google.com
Dan Chazan, IBM Haifa Labs, chazan@il.ibm.com

Abstract

We propose a new paradigm for aligning a phoneme sequence of a speech utterance with its acoustic signal counterpart. In contrast to common HMM-based approaches, our method employs a discriminative learning procedure in which the learning phase is tightly coupled with the alignment task at hand. The alignment function we devise is based on mapping the input acoustic-symbolic representations of the speech utterance, along with the target alignment, into an abstract vector space. We suggest a specific mapping into this abstract vector space which utilizes standard speech features (e.g. spectral distances) as well as confidence outputs of a frame-wise phoneme classifier. Building on techniques used in large-margin methods for predicting whole sequences, our alignment function distills to a classifier in the abstract vector space which separates correct alignments from incorrect ones. We describe a simple iterative algorithm for learning the alignment function and discuss its formal properties. Experiments with the TIMIT corpus show that our method outperforms current state-of-the-art approaches.

1. Introduction

Phoneme alignment is the task of properly positioning a sequence of phonemes in relation to a corresponding continuous speech signal. This problem is also referred to as phoneme segmentation. An accurate and fast alignment procedure is a necessary tool for developing speech recognition and text-to-speech systems. Most previous work on phoneme alignment has focused on generative models of the speech signal using hidden Markov models (HMMs); see, for example, [1, 6, 14] and the references therein.
Despite their popularity, HMM-based approaches have several drawbacks, such as convergence of the EM procedure to local maxima and overfitting effects due to the large number of parameters.

In this paper we propose an alternative approach for phoneme alignment that builds upon recent work on discriminative supervised learning. The advantage of discriminative learning algorithms stems from the fact that the objective function used during the learning phase is tightly coupled with the decision task one needs to perform. In addition, there is both theoretical and empirical evidence that discriminative learning algorithms are likely to outperform generative models for the same task (cf. [15, 4]). One of the best known discriminative learning algorithms is the support vector machine (SVM), which has been successfully applied in speech applications [11, 7, 9]. The classical SVM algorithm is designed for simple decision tasks such as binary classification and regression. Hence, its use in speech systems so far has also been restricted to simple decision tasks such as phoneme classification. The phoneme alignment problem is more involved, since we need to predict a sequence of phoneme start times rather than a single number. The main challenge of this paper is to extend the notion of discriminative learning to the complex task of phoneme alignment.

Our proposed method is based on recent advances in kernel machines and large-margin classifiers for sequences [13, 12], which in turn build on the pioneering work of Vapnik and colleagues [15, 4]. The alignment function we devise is based on mapping the speech signal and its phoneme representation, along with the target alignment, into an abstract vector space. Building on techniques used for learning SVMs, our alignment function distills to a classifier in this vector space which is aimed at separating correct alignments from incorrect ones.
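To make the separation idea concrete, the following toy sketch (our illustration, not the algorithm described in this paper; the feature map φ, its dimension, and the perceptron-style update are placeholder assumptions) shows how a linear classifier in the abstract vector space can be nudged so that the correct alignment scores higher than a competing, incorrect one:

```python
import numpy as np

def score(w, phi):
    """Linear score of an alignment's feature-map vector phi under weights w."""
    return float(np.dot(w, phi))

def perceptron_step(w, phi_correct, phi_predicted, lr=1.0):
    """Push the correct alignment's score above the competing one's
    by moving w toward phi_correct and away from phi_predicted."""
    return w + lr * (phi_correct - phi_predicted)

# Hypothetical 4-dimensional feature maps of two candidate alignments
# for the same utterance (values are made up for illustration).
w = np.zeros(4)
phi_y = np.array([1.0, 0.0, 2.0, 1.0])     # correct alignment
phi_yhat = np.array([0.0, 1.0, 1.0, 1.0])  # incorrect competing alignment

w = perceptron_step(w, phi_y, phi_yhat)
print(score(w, phi_y) > score(w, phi_yhat))  # True: correct now scores higher
```

Large-margin methods refine this idea by requiring the correct alignment to win by a margin rather than merely to win, but the geometric picture is the same.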
We describe a simple iterative algorithm for learning the alignment function and discuss its formal properties. Experiments with the TIMIT corpus show that our method outperforms the best performing HMM-based approach [1].

This paper is organized as follows. In Sec. 2 we formally introduce the phoneme alignment problem. Our specific learning method is then described in Sec. 3. Next, we present experimental results in Sec. 4. Finally, concluding remarks and future directions are discussed in Sec. 5.

2. Problem Setting

In this section we formally describe the alignment problem. We denote scalars using lower-case Latin letters (e.g. x) and vectors using bold-face letters (e.g. x). A sequence of elements is designated by a bar (x̄) and its length is denoted |x̄|.

In the alignment problem, we are given a speech utterance along with a phonetic representation of the utterance. Our goal is to generate an alignment between the speech signal and the phonetic representation. Mel-frequency cepstral coefficients (MFCCs), along with their first and second derivatives, are extracted from the speech signal in the standard way, based on the ETSI standard for distributed speech recognition. We denote the domain of the acoustic feature vectors by X ⊂ ℝ^d. The acoustic feature representation of a speech signal is therefore a sequence of vectors x̄ = (x_1, …, x_T), where x_t ∈ X for all 1 ≤ t ≤ T. A phonetic representation of an utterance is defined as a string of phoneme symbols. Formally, we denote each phoneme by p ∈ P, where P is the set of 48 American English phoneme symbols proposed by [8]. A phonetic representation of a speech utterance therefore consists of a sequence of phoneme values p̄ = (p_1, …, p_k). Note that the number of phonemes clearly varies from one utterance to another, and thus k is not fixed. We denote by P* (and similarly X*) the set of all finite-length sequences over P.
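The frame-level acoustic vectors x_t can be assembled from MFCCs and their derivatives. The sketch below illustrates the standard regression-based "delta" computation on a toy MFCC matrix (our illustration only: the window length, edge padding, and toy values are assumptions, and the ETSI front-end used in the paper differs in detail):

```python
import numpy as np

def deltas(feats, window=2):
    """Delta (time-derivative) features via the standard regression formula.
    feats: (T, d) matrix of frame-level features, one row per frame."""
    T, _ = feats.shape
    # Pad by repeating the edge frames so every frame has a full window.
    padded = np.vstack([feats[:1]] * window + [feats] + [feats[-1:]] * window)
    denom = 2 * sum(n * n for n in range(1, window + 1))
    out = np.zeros_like(feats, dtype=float)
    for n in range(1, window + 1):
        out += n * (padded[window + n : window + n + T]
                    - padded[window - n : window - n + T])
    return out / denom

# Toy "MFCC" matrix: 5 frames, 3 cepstral coefficients.
mfcc = np.arange(15, dtype=float).reshape(5, 3)
d1 = deltas(mfcc)   # first derivatives
d2 = deltas(d1)     # second derivatives ("delta-deltas")

# Each acoustic feature vector x_t concatenates statics and derivatives.
x_bar = np.hstack([mfcc, d1, d2])
print(x_bar.shape)  # (5, 9): T = 5 frames, d = 9 features per frame
```

Stacking the statics with both derivative orders yields the sequence x̄ = (x_1, …, x_T) of d-dimensional vectors used throughout the paper.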
In summary, an alignment input is a pair (x̄, p̄), where x̄ is an acoustic representation of the speech signal and p̄ is a phonetic representation of the same signal. An alignment between the acoustic and phonetic representations of a spoken utterance is a sequence of start-times ȳ = (y_1, …, y_k), where y_i ∈ ℕ is the start-time (measured as a frame number) of phoneme i in the acoustic sig-