Towards Robust Phoneme Classification With Hybrid Features
Jibran Yousafzai, Zoran Cvetković and Peter Sollich
Department of Electronic Engineering and Department of Mathematics
King's College London

Abstract—In this paper, we investigate the robustness of phoneme classification to additive noise using hybrid features with support vector machines (SVMs). In particular, cepstral features are combined with short-term energy features of acoustic waveform segments to form a hybrid representation. The energy features are taken into account separately in the SVM kernel, and a simple subtraction method allows them to be adapted effectively in noise. This hybrid representation contributes significantly to the robustness of phoneme classification and narrows the performance gap to the ideal baseline of classifiers trained under matched noise conditions.

Index Terms—Hybrid features, Phoneme classification, Robustness, Support vector machines

I. INTRODUCTION

The accuracy of automatic speech recognition (ASR) systems degrades rapidly in adverse acoustical environments. While language and context modelling are essential for reducing many errors in speech recognition, accurate recognition of phonemes and the related problem of classification of isolated phonetic units is a major step towards achieving robust recognition of continuous speech [1, 2]. Indeed, phoneme classification has been the subject of several recent studies [3–6]. State-of-the-art ASR systems use cepstral features, normally some variant of Mel-frequency cepstral coefficients (MFCC) or perceptual linear prediction (PLP) [7], as their front end for processing speech signals. These representations are derived from short-term magnitude spectra followed by non-linear transformations that model the processing of the human auditory system and allow for more accurate modelling when data is limited.
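As a rough illustration of the cepstral front end described above, the following sketch computes MFCC-like coefficients for a single windowed frame: magnitude spectrum, mel filterbank, log compression, and a DCT. The frame length, filterbank size and number of coefficients are illustrative assumptions, not the configuration used in this paper; real front ends also apply pre-emphasis, windowing, liftering and delta features.

```python
import numpy as np

def mfcc_frame(frame, sample_rate=16000, n_mel=23, n_ceps=13):
    """Toy MFCC computation for one frame: magnitude spectrum ->
    mel filterbank -> log compression -> DCT-II decorrelation."""
    # Short-term magnitude spectrum
    spectrum = np.abs(np.fft.rfft(frame))
    n_bins = spectrum.size

    # Triangular mel filterbank (mel scale: 2595 * log10(1 + f/700))
    def hz_to_mel(f): return 2595.0 * np.log10(1.0 + f / 700.0)
    def mel_to_hz(m): return 700.0 * (10.0 ** (m / 2595.0) - 1.0)
    mel_pts = np.linspace(0.0, hz_to_mel(sample_rate / 2), n_mel + 2)
    bin_pts = np.floor((n_bins - 1) * mel_to_hz(mel_pts)
                       / (sample_rate / 2)).astype(int)
    fbank = np.zeros((n_mel, n_bins))
    for i in range(n_mel):
        l, c, r = bin_pts[i], bin_pts[i + 1], bin_pts[i + 2]
        for b in range(l, c):
            fbank[i, b] = (b - l) / max(c - l, 1)   # rising slope
        for b in range(c, r):
            fbank[i, b] = (r - b) / max(r - c, 1)   # falling slope

    # Log filterbank energies, then DCT-II (the non-linear steps)
    log_e = np.log(fbank @ spectrum + 1e-10)
    n = np.arange(n_mel)
    ceps = np.array([np.sum(log_e * np.cos(np.pi * k * (n + 0.5) / n_mel))
                     for k in range(n_ceps)])
    return ceps
```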
However, due to the nonlinear processing involved in feature extraction, even small amounts of additive noise may cause significant departures from the distributions learned on noiseless data, and a large amount of training data is required to retrain the system for a new environment. To make cepstral representations of speech less sensitive to noise, several techniques such as cepstral mean and variance normalization (CMVN) [8] and multi-condition/multi-style training [9, 10] have been proposed to reduce explicitly the effects of noise on spectral representations, with the aim of approaching the optimal performance achieved when training and testing conditions are matched [11]. State-of-the-art feature compensation methods for the cepstral representation of speech include the ETSI advanced front end (AFE) [12] and vector Taylor series (VTS) [13, 14].

In this work, we propose that a set of hybrid features, formed by combining standard cepstral features (MFCC) with short-term (local) energy features of acoustic waveform segments, can contribute to the robustness of phoneme classification in noise. This is motivated by the fact that the local energy features can be adapted effectively in noise by exploiting the approximate orthogonality of clean speech and noise. Note that this work focuses on the task of phoneme classification with the hybrid features in the presence of additive noise, although we believe the results also have implications for the construction of continuous speech recognition systems.

The SVM approach to classification of phonemes using error-correcting output codes (ECOC) [15] is reviewed briefly in Section II. Section III presents the proposed hybrid features and their adaptation in the presence of noise. The experimental setup is discussed in Section IV, and classification results in the presence of noise are reported in Section V. Finally, Section VI draws some conclusions.
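The energy-adaptation idea just described can be sketched as follows: because clean speech and additive noise are approximately orthogonal, the energy of a noisy segment is roughly the sum of the clean and noise energies, so subtracting an estimate of the noise energy approximately recovers the clean energy. This is a minimal sketch of the principle, not the exact compensation method used in the paper; the flooring constant and the log-energy convention are illustrative assumptions.

```python
import numpy as np

def local_log_energy(frames):
    """Per-segment log energy of waveform frames (rows of `frames`)."""
    return np.log(np.sum(frames ** 2, axis=1) + 1e-10)

def adapt_energy_features(noisy_frames, noise_energy_est):
    """Sketch of energy subtraction: E[noisy] ~= E[clean] + E[noise]
    by approximate orthogonality of speech and noise, so subtracting a
    noise-energy estimate recovers approximate clean energies.
    Flooring at a small positive value avoids the log of a negative."""
    noisy_energy = np.sum(noisy_frames ** 2, axis=1)
    clean_est = np.maximum(noisy_energy - noise_energy_est, 1e-10)
    return np.log(clean_est)
```

For exactly orthogonal signals (e.g. a sine and a cosine over an integer number of periods) the subtraction recovers the clean log energy exactly; for real speech and noise the recovery is only approximate.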
II. CLASSIFICATION METHOD

An SVM [16] binary classifier estimates decision surfaces separating two classes of data. In the simplest case these are linear, but most pattern recognition problems require nonlinear decision boundaries. These are constructed using kernels in place of dot products, implicitly mapping data points to high-dimensional feature vectors. A kernel-based decision function which classifies an input vector x is expressed as

h(x) = \sum_i \alpha_i y_i K(x, x_i) + b,   (1)

where K is a kernel function, x_i, y_i = \pm 1 and \alpha_i are, respectively, the i-th training sample, its class label and its Lagrange multiplier, and b is the classifier bias determined by the training algorithm. Two commonly used kernels are the polynomial and radial basis function (RBF) kernels, given by

K_p(x, x_i) = (1 + \langle x, x_i \rangle)^{\Theta},   (2)
K_r(x, x_i) = e^{-\Gamma \| x - x_i \|^2}.   (3)

Comparable performance is achieved with both kernels; results are reported for the polynomial kernel throughout this study.

SVMs are binary classifiers trained to distinguish between two groups of classes. For multiclass classification, they can be combined via predefined discrete error-correcting output codes (ECOC) [15]. To summarize the procedure briefly, N binary classifiers are trained to distinguish between M classes using a coding matrix W_{M \times N} with elements w_{mn} \in \{-1, 0, +1\}. Classifier n is trained on data of classes m for which w_{mn} \neq 0, with sgn(w_{mn}) as the class label; it has no knowledge of classes m = 1, ..., M for which w_{mn} = 0. The class m predicted for a test input x is then the one that maximizes the confidence \rho_m(x) = \sum_{n=1}^{N} \chi(w_{mn} h_n(x)), where \chi is some loss function and h_n(x) is the output of the n-th classifier. The error-correcting capability of a code is determined by the minimum Hamming distance between pairs of code words [15]. Therefore, classification performance benefits from using error-correcting codes with larger Hamming distances between their rows.
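To make the decoding step concrete, the sketch below implements the decision function of Eq. (1) with the polynomial kernel of Eq. (2), and loss-based ECOC decoding. For simplicity it assumes the identity loss \chi(z) = z, whereas the paper leaves \chi as a generic loss function; the function names and the toy coding matrix in the usage note are illustrative.

```python
import numpy as np

def poly_kernel(x, xi, theta=3):
    """Polynomial kernel K_p(x, x_i) = (1 + <x, x_i>)^Theta, Eq. (2)."""
    return (1.0 + np.dot(x, xi)) ** theta

def svm_decision(x, support_vecs, labels, alphas, bias, kernel=poly_kernel):
    """Binary decision h(x) = sum_i alpha_i y_i K(x, x_i) + b, Eq. (1)."""
    return sum(a * y * kernel(x, xi)
               for a, y, xi in zip(alphas, labels, support_vecs)) + bias

def ecoc_predict(W, h):
    """Loss-based ECOC decoding: pick the class m maximizing
    rho_m = sum_n chi(w_mn * h_n), here with the identity loss
    chi(z) = z; entries w_mn = 0 contribute nothing, as required."""
    W = np.asarray(W, dtype=float)   # M x N coding matrix in {-1, 0, +1}
    h = np.asarray(h, dtype=float)   # N binary classifier outputs
    return int(np.argmax(W @ h))
```

For example, with a 3-class one-vs-all code `W = [[1, -1, -1], [-1, 1, -1], [-1, -1, 1]]` and classifier outputs `h = [0.2, -1.3, 2.1]`, decoding selects class 2, since only the third classifier votes strongly for its own class.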
However, one must also take into account the accuracy of the individual binary classifiers and the computational cost associated with such a code. In our previous work [17] on phoneme classification on a subset of the TIMIT database, a code formed by combining the one-vs-one (pairwise) and one-vs-all codes was used, as this achieved better classification performance than either code individually. A similar technique that implicitly combined the two
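The combination of one-vs-one and one-vs-all codes mentioned above can be sketched as a simple concatenation of coding matrices. This is an illustrative construction under the w_{mn} \in \{-1, 0, +1\} convention of Section II; the exact code used in [17] may differ in details such as column ordering.

```python
import numpy as np
from itertools import combinations

def one_vs_all_code(M):
    """M columns; in column m, class m is +1 and every other class -1."""
    W = -np.ones((M, M))
    np.fill_diagonal(W, 1.0)
    return W

def one_vs_one_code(M):
    """M*(M-1)/2 columns; the column for pair (i, j) marks class i as +1,
    class j as -1, and leaves the remaining classes 0 (ignored by that
    binary classifier)."""
    pairs = list(combinations(range(M), 2))
    W = np.zeros((M, len(pairs)))
    for n, (i, j) in enumerate(pairs):
        W[i, n], W[j, n] = 1.0, -1.0
    return W

def combined_code(M):
    """Concatenate the pairwise and one-vs-all codes into one matrix,
    so decoding sums the confidences from both sets of classifiers."""
    return np.hstack([one_vs_one_code(M), one_vs_all_code(M)])
```

For M = 4 classes this yields a 4 x 10 coding matrix: six pairwise columns followed by four one-vs-all columns.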