SABR: Sparse, Anchor-Based Representation of the Speech Signal

Christopher Liberatore, Sandesh Aryal, Zelun Wang, Seth Polsley, Ricardo Gutierrez-Osuna
Department of Computer Science and Engineering, Texas A&M University, United States
{cliberatore, saryal, wang14359, spolsley, rgutier}@tamu.edu

Abstract

We present SABR (Sparse, Anchor-Based Representation), an analysis technique to decompose the speech signal into speaker-dependent and speaker-independent components. Given a collection of utterances from a particular speaker, SABR uses the centroid of each phoneme as an acoustic "anchor," then applies Lasso regularization to represent each speech frame as a sparse, non-negative combination of the anchors. We illustrate the performance of the method on a speaker-independent phoneme recognition task and a voice conversion task. Using a linear classifier, SABR weights achieve significantly higher phoneme recognition rates than Mel frequency cepstral coefficients. SABR weights can also be used directly to perform accent conversion without the need to train a speaker-to-speaker regression model.

Index Terms: speech analysis, voice conversion, speaker-independent representation, auditory phonetics, sparse coding

1. Introduction

Across multiple speech problems, there is a need to separate linguistic information from speaker-dependent cues in the speech signal. For example, in automatic speech recognition (ASR), speaker variability is viewed as unwanted noise in the signal of interest (i.e., linguistic content), whereas in voice conversion one seeks to modify speaker-dependent cues while retaining the linguistic content of the utterance. Unfortunately, separating these sources of information in the speech signal is a challenging task, mainly due to their complex interaction in the spectral domain [1]. Several techniques have been developed to remove physiological influences in speech.
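As a concrete sketch of the anchor construction described above, the following code computes per-phoneme centroids from labeled frames. The data, function name, and feature dimensionality are our own illustrative assumptions, not details taken from the paper:

```python
import numpy as np

def compute_anchors(frames, labels):
    """Return a (dim x n_phonemes) matrix whose columns are per-phoneme
    centroids (the acoustic "anchors"), plus the phoneme ordering."""
    phonemes = sorted(set(labels))
    mask = np.asarray(labels)
    anchors = np.stack(
        [frames[mask == p].mean(axis=0) for p in phonemes],
        axis=1,
    )
    return anchors, phonemes

# Synthetic example: 100 frames of 13-dim features, two phoneme labels.
rng = np.random.default_rng(0)
frames = rng.normal(size=(100, 13))
labels = ["aa" if i < 50 else "iy" for i in range(100)]
A, phones = compute_anchors(frames, labels)
print(A.shape)  # (13, 2): one 13-dim anchor per phoneme
```

In practice the anchors would be computed per speaker from phoneme-aligned training utterances, giving each speaker their own anchor matrix.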
In the classical source-filter model [2], the speech signal is decomposed into a source excitation, which captures the speaker's glottal characteristics, and a spectral envelope. Though the spectral envelope captures (primarily) the phonetic content of the utterance, it also contains speaker-dependent information (e.g., vocal tract length). If speech recognition is the goal, vocal tract length normalization [3, 4] and speaker adaptation [5, 6] can be very effective in removing speaker dependencies from the spectral envelope, but these techniques cannot be used for source separation.

This paper presents SABR (Sparse, Anchor-Based Representation), an analysis technique that decomposes the speech signal into a set of speaker-dependent acoustic anchors and a complementary set of speaker-independent interpolation weights. Specifically, SABR uses Lasso regression [7] to approximate each acoustic frame as a sparse, non-negative linear combination of acoustic anchors. As we will show, by selecting the phoneme centroids of each speaker as anchors, the resulting weights become speaker-independent. We illustrate the ability of the model to separate speaker and linguistic information on two independent problems. First, we show that SABR weights outperform conventional spectral features (MFCCs) on a speaker-independent phoneme discrimination problem. Second, we show that, by combining SABR weights derived from a source speaker with acoustic anchors from a target speaker, our technique can be used as a low-cost voice conversion method, one that does not require training a specific model for each source-target pair.

The rest of the paper is organized as follows. Section 2 reviews recent work on speech representations and their applications. Section 3 presents the SABR model and describes how its components can be used for voice conversion and speech recognition.
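The decomposition and the weight-reuse idea can be sketched as follows, assuming scikit-learn's Lasso as the solver (the paper names Lasso regression with a non-negativity constraint, which maps to `Lasso(positive=True)`; the anchor matrices, regularization weight, and dimensions here are synthetic stand-ins):

```python
import numpy as np
from sklearn.linear_model import Lasso

def sabr_weights(frame, anchors, alpha=0.01):
    """Sparse, non-negative weights w such that frame ~= anchors @ w."""
    lasso = Lasso(alpha=alpha, positive=True, fit_intercept=False,
                  max_iter=5000)
    lasso.fit(anchors, frame)  # columns of `anchors` act as the dictionary
    return lasso.coef_

rng = np.random.default_rng(1)
src_anchors = rng.normal(size=(13, 40))  # source speaker: 40 phoneme anchors
tgt_anchors = rng.normal(size=(13, 40))  # target speaker: matching anchor set

# A source frame lying on two anchors, to mimic coarticulated speech.
w_true = np.zeros(40)
w_true[[3, 17]] = [0.6, 0.4]
frame = src_anchors @ w_true

w = sabr_weights(frame, src_anchors)   # speaker-independent weights
converted = tgt_anchors @ w            # voice conversion: reuse the weights
print(np.count_nonzero(w > 1e-3))      # only a few anchors stay active
```

The last line is the voice-conversion step described above: because the weights are (approximately) speaker-independent, swapping in the target speaker's anchor matrix resynthesizes the frame in the target's acoustic space without any source-target regression model.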
Section 4 provides details on the corpus and acoustic features used to evaluate the model, whereas Section 5 presents experimental results on phonetic classification and voice conversion (subjective and objective comparisons). The article concludes by discussing the implications of the results, future improvements to the method, and its potential application to other speech areas.

2. Literature review

In speech recognition, a recent approach to removing unwanted speaker-specific variation is to map acoustics into an articulatory feature space. As an example, Frankel et al. [8] trained multi-layer perceptrons to estimate phonological articulatory features (e.g., place, manner, nasality) from the PLP cepstrum. When they combined the estimated articulatory features with acoustic features, the word error rate dropped from 67.7% to 59.7% in a speaker-independent phoneme classification task. Similarly, Arora and Livescu [9] used canonical correlation analysis (CCA) of simultaneous acoustic and articulatory recordings to capture the common factor (i.e., linguistic content) in the two views. The authors learned CCA transforms from a group of speakers and used them to extract linguistic features from acoustics in a speaker-independent fashion. CCA features improved accuracy by 10-23% in a speaker-independent phoneme recognition task.

Articulatory features have also been used as speaker-independent representations for speech synthesis and voice conversion, but this requires building a speaker-specific mapping between articulatory features and acoustics. For example, Bollepali et al. [10] developed a speaker-specific encoder (i.e., articulatory inversion) to map acoustic features to phonological articulatory features and a decoder (i.e., forward mapping) for the reverse direction; they then used the source speaker's encoder to estimate articulatory features from source utterances and the target speaker's decoder to map those features back to the target's acoustics.
Subjective tests indicated the method was successful in matching the target speaker's voice identity. In contrast, the anchor-based representation proposed in this paper obviates

Copyright 2015 ISCA. INTERSPEECH 2015, September 6-10, 2015, Dresden, Germany. 10.21437/Interspeech.2015-213