LOW-COMPLEXITY AUTOMATIC SPEAKER RECOGNITION IN THE COMPRESSED GSM AMR DOMAIN

M. Petracca, A. Servetti, J.C. De Martin¹
Dipartimento di Automatica e Informatica / IEIIT-CNR¹
Politecnico di Torino
Corso Duca degli Abruzzi, 24, I-10129 Torino, Italy
E-mail: [matteo.petracca|servetti|demartin]@polito.it

ABSTRACT

This paper presents an experimental implementation of a low-complexity speaker recognition algorithm working in the compressed speech domain. The goal is to perform speaker modeling and identification without decoding the speech bitstream to extract speaker-dependent features, thus saving important system resources, for instance, in mobile devices. The compressed bitstream values of the widely used GSM AMR speech coding standard are studied to identify statistics enabling fair recognition after a few seconds of speech. Using Euclidean distance measures on elementary statistical values such as the coefficient of variation and skewness of nine standard GSM AMR parameters delivers recognition accuracies close to 100% after about 20 seconds of active speech for a database of 14 speakers recorded in a normal room environment.

1. INTRODUCTION

Automatic speaker recognition (ASR) has been a research topic for many years, during which various types of approaches, with increasing levels of complexity and performance, have been studied. The general approach to ASR consists of three steps: speech data acquisition, feature extraction, and pattern matching. Feature extraction maps speech intervals to a multidimensional feature space. Speaker identification is then performed by comparing this sequence of feature vectors to the available speaker models by pattern matching. State-of-the-art speaker recognition systems typically use mel-frequency cepstral coefficients (MFCC) and Gaussian mixture models (GMM) for speaker modeling [1].
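As a rough illustration of the feature-extraction and pattern-matching steps applied to the statistics named in the abstract, the following minimal Python sketch computes the coefficient of variation and skewness of each parameter track and picks the enrolled speaker at the smallest Euclidean distance. It is a sketch under stated assumptions, not the implementation evaluated in this paper: the function names, the use of population (rather than sample) moments, the unweighted distance, and the two toy parameter tracks are all hypothetical, and real input would be the per-frame values of the nine GSM AMR parameters.

```python
import math


def coefficient_of_variation(values):
    """Population standard deviation divided by the mean (mean assumed nonzero)."""
    n = len(values)
    mean = sum(values) / n
    var = sum((v - mean) ** 2 for v in values) / n
    return math.sqrt(var) / mean


def skewness(values):
    """Third standardized moment, population form (variance assumed nonzero)."""
    n = len(values)
    mean = sum(values) / n
    var = sum((v - mean) ** 2 for v in values) / n
    third = sum((v - mean) ** 3 for v in values) / n
    return third / var ** 1.5


def feature_vector(tracks):
    """Map per-frame parameter tracks to [CoV, skewness] pairs, one pair per track."""
    feats = []
    for track in tracks:
        feats.append(coefficient_of_variation(track))
        feats.append(skewness(track))
    return feats


def identify(test_vector, models):
    """Return the enrolled speaker whose model is nearest in Euclidean distance."""
    return min(models, key=lambda name: math.dist(test_vector, models[name]))


# Toy data (hypothetical): two parameter tracks per speaker, five frames each.
models = {
    "speaker_a": feature_vector([[10, 12, 11, 13, 30], [5, 6, 5, 7, 6]]),
    "speaker_b": feature_vector([[20, 21, 19, 20, 2], [8, 3, 9, 2, 8]]),
}
test = feature_vector([[11, 12, 10, 14, 28], [5, 7, 6, 6, 5]])
print(identify(test, models))  # speaker_a
```

Each speaker model here is a single short vector (two numbers per parameter), which is why this kind of approach needs far less storage than a mixture model; any scaling or weighting of the individual statistics before the distance is computed is left out of the sketch.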
In recent years, due to the widespread use of digital speech communication systems, there has been increasing interest in the performance of recognition systems operating on coded speech. The effect of speech coding on speaker and language recognition tasks has been investigated for several coders and a wide range of bit rates [2]. The typical approach consists of extracting the speech features from the decoded speech signal. This paper, instead, explores the possibility of working directly in the compressed speech domain so that no decoding is needed, thus lowering the processing and memory requirements with respect to the standard approach.

The work was supported in part by the Motorola PCS Research Center in Torino, Italy.

We investigate the recognition accuracy achievable using medium-term statistical analysis of the coded bitstream to produce a feature set and a speaker model useful for speaker recognition. We focus on limiting the complexity of our model to the second- and third-order statistics of a few parameters, thus requiring just a fraction of the memory storage and processing power needed by systems based on GMMs. The proposed system is therefore targeted at applications that can afford to identify speakers only after a few seconds of active speech.

The structure of this paper is as follows. In Section 2 we investigate the statistical properties of coded speech parameters as speaker-dependent features. The recognition experiments and performance results are presented in Section 3 for a speech corpus with fourteen male and female speakers. Conclusions follow in Section 4.

2. SPEAKER-DEPENDENT INFORMATION IN THE GSM BITSTREAM

When a person speaks, he or she produces a set of signal features that characterize both the identity of the utterance and that of the speaker. In the literature there have been several studies on the choice of acoustic features in speaker recognition tasks [3].
Average fundamental frequency has been found to be a useful discriminating feature, as have gain measurements and long-term speech spectra. All these features are physically based distinguishing characteristics related to the human speech production system.

0-7803-9332-5/05/$20.00 ©2005 IEEE

In the approach under investigation, the feature space