Towards Robust Phoneme Classification With Hybrid Features
Jibran Yousafzai, Zoran Cvetković and Peter Sollich
Department of Electronic Engineering and Department of Mathematics
King's College London

Abstract—In this paper, we investigate the robustness of phoneme classification to additive noise using hybrid features with support vector machines (SVMs). In particular, cepstral features are combined with short-term energy features of acoustic waveform segments to form a hybrid representation. The energy features are taken into account separately in the SVM kernel, and a simple subtraction method allows them to be adapted effectively in noise. This hybrid representation contributes significantly to the robustness of phoneme classification and narrows the performance gap to the ideal baseline of classifiers trained under matched noise conditions.

Index Terms—Hybrid features, Phoneme classification, Robustness, Support vector machines

I. INTRODUCTION

The accuracy of automatic speech recognition (ASR) systems degrades rapidly in adverse acoustical environments. While language and context modelling are essential for reducing many errors in speech recognition, accurate recognition of phonemes and the related problem of classification of isolated phonetic units is a major step towards achieving robust recognition of continuous speech [1, 2]. Indeed, phoneme classification has been the subject of several recent studies [3–6]. State-of-the-art ASR systems use cepstral features, normally some variant of Mel-frequency cepstral coefficients (MFCC) or perceptual linear prediction (PLP) [7], as their front end for processing speech signals. These representations are derived from short-term magnitude spectra followed by non-linear transformations that model the processing of the human auditory system and allow for more accurate modelling when data is limited.
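As a rough illustration of the cepstral front end described above, the following sketch computes MFCC-like coefficients for a single windowed frame: magnitude spectrum, mel filterbank, log compression, and a DCT. The frame length, filterbank size and number of coefficients are illustrative assumptions, not the configuration used in this paper; real front ends also apply pre-emphasis, windowing, liftering and delta features.

```python
import numpy as np

def mfcc_frame(frame, sample_rate=16000, n_mel=23, n_ceps=13):
    """Toy MFCC computation for one frame: magnitude spectrum ->
    mel filterbank -> log compression -> DCT-II decorrelation."""
    # Short-term magnitude spectrum
    spectrum = np.abs(np.fft.rfft(frame))
    n_bins = spectrum.size

    # Triangular mel filterbank (mel scale: 2595 * log10(1 + f/700))
    def hz_to_mel(f): return 2595.0 * np.log10(1.0 + f / 700.0)
    def mel_to_hz(m): return 700.0 * (10.0 ** (m / 2595.0) - 1.0)
    mel_pts = np.linspace(0.0, hz_to_mel(sample_rate / 2), n_mel + 2)
    bin_pts = np.floor((n_bins - 1) * mel_to_hz(mel_pts)
                       / (sample_rate / 2)).astype(int)
    fbank = np.zeros((n_mel, n_bins))
    for i in range(n_mel):
        l, c, r = bin_pts[i], bin_pts[i + 1], bin_pts[i + 2]
        for b in range(l, c):
            fbank[i, b] = (b - l) / max(c - l, 1)   # rising slope
        for b in range(c, r):
            fbank[i, b] = (r - b) / max(r - c, 1)   # falling slope

    # Log filterbank energies, then DCT-II (the non-linear steps)
    log_e = np.log(fbank @ spectrum + 1e-10)
    n = np.arange(n_mel)
    ceps = np.array([np.sum(log_e * np.cos(np.pi * k * (n + 0.5) / n_mel))
                     for k in range(n_ceps)])
    return ceps
```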
However, due to the nonlinear processing involved in feature extraction, even small amounts of additive noise may cause significant departures from the distributions learned on noiseless data, and a large amount of training data is required to retrain the system for a new environment. To make cepstral representations of speech less sensitive to noise, several techniques such as cepstral mean and variance normalization (CMVN) [8] and multi-condition/multi-style training [9, 10] have been proposed to reduce explicitly the effects of noise on spectral representations, with the aim of approaching the optimal performance achieved when training and testing conditions are matched [11]. State-of-the-art feature compensation methods for the cepstral representation of speech include the ETSI advanced front end (AFE) [12] and vector Taylor series (VTS) [13, 14].

In this work, we propose that a set of hybrid features, formed by combining standard cepstral features (MFCC) with short-term (local) energy features of acoustic waveform segments, can contribute to the robustness of phoneme classification in noise. This is motivated by the fact that the local energy features can be adapted effectively in noise by exploiting the approximate orthogonality of clean speech and noise. Note that this work focuses on the task of phoneme classification with the hybrid features in the presence of additive noise, although we believe the results also have implications for the construction of continuous speech recognition systems.

The SVM approach to classification of phonemes using error-correcting output codes (ECOC) [15] is reviewed briefly in Section II. Section III presents the proposed hybrid features and their adaptation in the presence of noise. The experimental setup is discussed in Section IV, and classification results in the presence of noise are reported in Section V. Finally, Section VI draws some conclusions.
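The energy-adaptation idea just described can be sketched as follows: because clean speech and additive noise are approximately orthogonal, the energy of a noisy segment is roughly the sum of the clean and noise energies, so subtracting an estimate of the noise energy approximately recovers the clean energy. This is a minimal sketch of the principle, not the exact compensation method used in the paper; the flooring constant and the log-energy convention are illustrative assumptions.

```python
import numpy as np

def local_log_energy(frames):
    """Per-segment log energy of waveform frames (rows of `frames`)."""
    return np.log(np.sum(frames ** 2, axis=1) + 1e-10)

def adapt_energy_features(noisy_frames, noise_energy_est):
    """Sketch of energy subtraction: E[noisy] ~= E[clean] + E[noise]
    by approximate orthogonality of speech and noise, so subtracting a
    noise-energy estimate recovers approximate clean energies.
    Flooring at a small positive value avoids the log of a negative."""
    noisy_energy = np.sum(noisy_frames ** 2, axis=1)
    clean_est = np.maximum(noisy_energy - noise_energy_est, 1e-10)
    return np.log(clean_est)
```

For exactly orthogonal signals (e.g. a sine and a cosine over an integer number of periods) the subtraction recovers the clean log energy exactly; for real speech and noise the recovery is only approximate.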
II. CLASSIFICATION METHOD

An SVM [16] binary classifier estimates decision surfaces separating two classes of data. In the simplest case these are linear, but most pattern recognition problems require nonlinear decision boundaries. These are constructed using kernels in place of dot products, implicitly mapping data points to high-dimensional feature vectors. A kernel-based decision function which classifies an input vector x is expressed as

h(x) = \sum_i \alpha_i y_i K(x, x_i) + b,   (1)

where K is a kernel function, x_i, y_i = \pm 1 and \alpha_i are, respectively, the i-th training sample, its class label and its Lagrange multiplier, and b is the classifier bias determined by the training algorithm. Two commonly used kernels are the polynomial and radial basis function (RBF) kernels, given by

K_p(x, x_i) = (1 + \langle x, x_i \rangle)^{\Theta},   (2)
K_r(x, x_i) = e^{-\Gamma \| x - x_i \|^2}.   (3)

Comparable performance is achieved with both kernels; results are reported for the polynomial kernel throughout this study.

SVMs are binary classifiers trained to distinguish between two groups of classes. For multiclass classification, they can be combined via predefined discrete error-correcting output codes (ECOC) [15]. To summarize the procedure briefly, N binary classifiers are trained to distinguish between M classes using a coding matrix W_{M \times N} with elements w_{mn} \in \{-1, 0, +1\}. Classifier n is trained on data of classes m for which w_{mn} \neq 0, with sgn(w_{mn}) as the class label; it has no knowledge of classes m = 1, ..., M for which w_{mn} = 0. The class m predicted for a test input x is then the one that maximizes the confidence \rho_m(x) = \sum_{n=1}^{N} \chi(w_{mn} h_n(x)), where \chi is some loss function and h_n(x) is the output of the n-th classifier. The error-correcting capability of a code is determined by the minimum Hamming distance between pairs of code words [15]. Therefore, classification performance benefits from using error-correcting codes with larger Hamming distances between their rows.
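To make the decoding step concrete, the sketch below implements the decision function of Eq. (1) with the polynomial kernel of Eq. (2), and loss-based ECOC decoding. For simplicity it assumes the identity loss \chi(z) = z, whereas the paper leaves \chi as a generic loss function; the function names and the toy coding matrix in the usage note are illustrative.

```python
import numpy as np

def poly_kernel(x, xi, theta=3):
    """Polynomial kernel K_p(x, x_i) = (1 + <x, x_i>)^Theta, Eq. (2)."""
    return (1.0 + np.dot(x, xi)) ** theta

def svm_decision(x, support_vecs, labels, alphas, bias, kernel=poly_kernel):
    """Binary decision h(x) = sum_i alpha_i y_i K(x, x_i) + b, Eq. (1)."""
    return sum(a * y * kernel(x, xi)
               for a, y, xi in zip(alphas, labels, support_vecs)) + bias

def ecoc_predict(W, h):
    """Loss-based ECOC decoding: pick the class m maximizing
    rho_m = sum_n chi(w_mn * h_n), here with the identity loss
    chi(z) = z; entries w_mn = 0 contribute nothing, as required."""
    W = np.asarray(W, dtype=float)   # M x N coding matrix in {-1, 0, +1}
    h = np.asarray(h, dtype=float)   # N binary classifier outputs
    return int(np.argmax(W @ h))
```

For example, with a 3-class one-vs-all code `W = [[1, -1, -1], [-1, 1, -1], [-1, -1, 1]]` and classifier outputs `h = [0.2, -1.3, 2.1]`, decoding selects class 2, since only the third classifier votes strongly for its own class.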
However, one must also take into account the accuracy of the individual binary classifiers and the computational cost associated with such a code. In our previous work [17] on phoneme classification on a subset of the TIMIT database, a code formed by combining the one-vs-one (pairwise) and one-vs-all codes was used, as this achieved better classification performance than either code individually. A similar technique that implicitly combined the two
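The combination of one-vs-one and one-vs-all codes mentioned above can be sketched as a simple concatenation of coding matrices. This is an illustrative construction under the w_{mn} \in \{-1, 0, +1\} convention of Section II; the exact code used in [17] may differ in details such as column ordering.

```python
import numpy as np
from itertools import combinations

def one_vs_all_code(M):
    """M columns; in column m, class m is +1 and every other class -1."""
    W = -np.ones((M, M))
    np.fill_diagonal(W, 1.0)
    return W

def one_vs_one_code(M):
    """M*(M-1)/2 columns; the column for pair (i, j) marks class i as +1,
    class j as -1, and leaves the remaining classes 0 (ignored by that
    binary classifier)."""
    pairs = list(combinations(range(M), 2))
    W = np.zeros((M, len(pairs)))
    for n, (i, j) in enumerate(pairs):
        W[i, n], W[j, n] = 1.0, -1.0
    return W

def combined_code(M):
    """Concatenate the pairwise and one-vs-all codes into one matrix,
    so decoding sums the confidences from both sets of classifiers."""
    return np.hstack([one_vs_one_code(M), one_vs_all_code(M)])
```

For M = 4 classes this yields a 4 x 10 coding matrix: six pairwise columns followed by four one-vs-all columns.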