LOW-COMPLEXITY AUTOMATIC SPEAKER RECOGNITION
IN THE COMPRESSED GSM AMR DOMAIN
M. Petracca, A. Servetti, J.C. De Martin
Dipartimento di Automatica e Informatica / IEIIT-CNR
Politecnico di Torino
Corso Duca degli Abruzzi, 24 — I-10129 Torino, Italy
E-mail: [matteo.petracca|servetti|demartin]@polito.it
ABSTRACT
This paper presents an experimental implementation of a
low-complexity speaker recognition algorithm that works in
the compressed speech domain. The goal is to extract
speaker-dependent features, and thus to perform speaker
modeling and identification, without decoding the speech
bitstream, saving valuable system resources in, for
instance, mobile devices. The compressed bitstream values
of the widely used GSM AMR speech coding standard are
studied to identify statistics that yield acceptable
recognition accuracy after a few seconds of speech. Using
Euclidean distance measures on elementary statistical
values, such as the coefficient of variation and skewness
of nine standard GSM AMR parameters, yields recognition
accuracy close to 100% after about 20 seconds of active
speech for a database of 14 speakers recorded in a normal
room environment.
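The matching scheme summarized above can be sketched as follows: each speaker is represented by the coefficient of variation and the skewness of every bitstream parameter track, and an unknown speaker is assigned to the nearest enrolled model in Euclidean distance. This is a minimal sketch, not the authors' implementation; the use of population (biased) moment estimators, the zero-mean and constant-track guards, and the helper names `feature_vector` and `identify` are assumptions. The paper uses nine AMR parameters; the usage example below uses two hypothetical tracks per speaker for brevity.

```python
import math
from typing import Dict, List, Sequence

def coefficient_of_variation(x: Sequence[float]) -> float:
    """Population standard deviation divided by the mean (0.0 if the mean is 0)."""
    n = len(x)
    mean = sum(x) / n
    std = math.sqrt(sum((v - mean) ** 2 for v in x) / n)
    return std / mean if mean != 0 else 0.0

def skewness(x: Sequence[float]) -> float:
    """Population skewness: third central moment over variance**1.5 (0.0 if constant)."""
    n = len(x)
    mean = sum(x) / n
    m2 = sum((v - mean) ** 2 for v in x) / n
    m3 = sum((v - mean) ** 3 for v in x) / n
    return m3 / m2 ** 1.5 if m2 > 0 else 0.0

def feature_vector(tracks: Sequence[Sequence[float]]) -> List[float]:
    """Concatenate [CV, skewness] for each bitstream parameter track (hypothetical helper)."""
    vec: List[float] = []
    for track in tracks:
        vec.extend([coefficient_of_variation(track), skewness(track)])
    return vec

def identify(tracks: Sequence[Sequence[float]], models: Dict[str, List[float]]) -> str:
    """Return the enrolled speaker whose model vector is nearest in Euclidean distance."""
    probe = feature_vector(tracks)
    return min(models, key=lambda name: math.dist(probe, models[name]))

# Usage with two hypothetical speakers and two hypothetical parameter tracks each.
models = {
    "speaker_a": feature_vector([[1, 2, 3, 4], [5, 5, 6, 9]]),
    "speaker_b": feature_vector([[10, 1, 1, 1], [2, 8, 8, 8]]),
}
print(identify([[1, 2, 3, 5], [5, 6, 6, 9]], models))
```

Because a model is just a short vector of numbers, enrollment amounts to storing `feature_vector` for each speaker, and identification costs one distance computation per enrolled speaker.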
1. INTRODUCTION
Automatic speaker recognition (ASR) has been a research
topic for many years, during which approaches of
increasing complexity and performance have been studied.
The general approach to ASR consists of three steps:
speech data acquisition, feature extraction, and pattern
matching. Feature extraction maps speech intervals to a
multidimensional feature space. Speaker identification is
then performed by comparing the resulting sequence of
feature vectors to the available speaker models through
pattern matching. State-of-the-art speaker recognition
systems typically use mel-frequency cepstral coefficients
(MFCC) as features and Gaussian mixture models (GMM) for
speaker modeling [1].
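For contrast with the compressed-domain approach, the pattern-matching step of the conventional GMM baseline can be sketched as scoring a sequence of feature vectors against each speaker's diagonal-covariance GMM and picking the highest average log-likelihood. This is a minimal sketch under stated assumptions: MFCC extraction and EM training are not shown, the feature vectors and model parameters are taken as given, and the helper names are hypothetical.

```python
import math
from typing import Dict, List, Sequence, Tuple

# One GMM component: (weight, mean vector, variance vector), diagonal covariance.
Component = Tuple[float, List[float], List[float]]

def log_gaussian_diag(x: Sequence[float], mean: Sequence[float], var: Sequence[float]) -> float:
    """Log density of x under a Gaussian with diagonal covariance."""
    return -0.5 * sum(
        math.log(2 * math.pi * v) + (xi - m) ** 2 / v
        for xi, m, v in zip(x, mean, var)
    )

def gmm_log_likelihood(frame: Sequence[float], gmm: Sequence[Component]) -> float:
    """log sum_k w_k N(frame; mu_k, Sigma_k), evaluated with log-sum-exp for stability."""
    terms = [math.log(w) + log_gaussian_diag(frame, m, v) for w, m, v in gmm]
    top = max(terms)
    return top + math.log(sum(math.exp(t - top) for t in terms))

def identify_gmm(frames: Sequence[Sequence[float]],
                 models: Dict[str, Sequence[Component]]) -> str:
    """Pick the speaker whose GMM gives the highest average frame log-likelihood."""
    def score(name: str) -> float:
        return sum(gmm_log_likelihood(f, models[name]) for f in frames) / len(frames)
    return max(models, key=score)

# Usage with two hypothetical single-component, 2-dimensional speaker models.
models = {
    "a": [(1.0, [0.0, 0.0], [1.0, 1.0])],
    "b": [(1.0, [5.0, 5.0], [1.0, 1.0])],
}
print(identify_gmm([[0.1, -0.2], [0.3, 0.0]], models))
```

Every frame must be scored against every component of every model, which is the per-frame cost that a handful of bitstream statistics avoids.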
In recent years, due to the widespread use of digital
speech communication systems, there has been increasing
interest in the performance of recognition systems operating
on coded speech. The effect of speech coding on speaker and
language recognition tasks has been investigated for several
coders and a wide range of bit rates [2]. The typical
approach extracts the speech features from the decoded
speech signal. This paper, instead, explores the possibility
of working directly in the compressed domain, so that no
decoding is needed, thus lowering the processing and memory
requirements with respect to the standard approach.
The work was supported in part by the Motorola PCS Research Center
in Torino, Italy.
We investigate the recognition accuracy achievable using
medium-term statistical analysis of the coded bitstream to
produce a feature set and a speaker model for speaker
recognition. We limit the complexity of our model to
second- and third-order statistics of a few parameters,
thus requiring only a fraction of the memory and
processing power needed by GMM-based systems. Because
these statistics must be accumulated over several seconds,
the proposed system is targeted at applications that can
tolerate identifying the speaker only after a few seconds
of active speech.
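The memory claim can be made concrete with a short count of stored values per speaker. Following the abstract, the statistics model holds one coefficient of variation and one skewness for each of nine AMR parameters. The GMM size below assumes a 32-component, 13-dimensional diagonal-covariance model; that configuration is hypothetical and chosen only for illustration, not taken from this paper.

```python
def stats_model_size(num_params: int = 9, stats_per_param: int = 2) -> int:
    """Values stored per speaker: one CV and one skewness per bitstream parameter."""
    return num_params * stats_per_param

def diag_gmm_model_size(num_components: int, dim: int) -> int:
    """Values stored per speaker for a diagonal-covariance GMM:
    one weight plus a mean vector and a variance vector per component."""
    return num_components * (1 + 2 * dim)

# Hypothetical GMM configuration, for illustration only.
gmm = diag_gmm_model_size(num_components=32, dim=13)
stats = stats_model_size()
print(stats, gmm, gmm / stats)
```

Under these assumptions the statistics model stores 18 values against 864 for the GMM, a factor of 48; the ratio changes with the assumed number of components and feature dimension.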
The structure of this paper is as follows. In Section 2 we
investigate the statistical properties of coded speech
parameters as speaker-dependent features. The recognition
experiments and performance results are presented in
Section 3 for a speech corpus of fourteen male and female
speakers. Conclusions follow in Section 4.
2. SPEAKER-DEPENDENT INFORMATION IN
THE GSM BITSTREAM
When a person speaks, he or she produces a set of signal
features that characterize both what is being said and who
is saying it. Several studies in the literature have
addressed the choice of acoustic features for speaker
recognition tasks [3]. Average fundamental frequency has
been found to be a useful discriminating feature, as have
gain measurements and long-term speech spectra. All these
features are physically based distinguishing
characteristics related to the human speech production
system.
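As an illustration of how one of these cues can surface in a CELP bitstream without synthesizing speech, the adaptive-codebook (pitch) lag transmitted by AMR is a pitch period expressed in samples at the 8 kHz narrowband sampling rate, so the fundamental frequency is roughly 8000 divided by the lag. The sketch below assumes the lag values (possibly fractional) have already been recovered from the bitstream indices, and that a per-subframe voicing flag is available from some hypothetical source such as a gain threshold; neither step is specified in this paper.

```python
import math
from typing import Sequence

SAMPLE_RATE_HZ = 8000  # GSM AMR narrowband speech is sampled at 8 kHz

def lag_to_f0(lag_samples: float) -> float:
    """Convert a pitch lag in samples to an approximate fundamental frequency in Hz."""
    return SAMPLE_RATE_HZ / lag_samples

def average_f0(lags: Sequence[float], voiced: Sequence[bool]) -> float:
    """Mean F0 over subframes flagged as voiced (hypothetical flags); NaN if none are."""
    f0s = [lag_to_f0(lag) for lag, v in zip(lags, voiced) if v]
    return sum(f0s) / len(f0s) if f0s else float("nan")

print(lag_to_f0(80), average_f0([80, 40, 100], [True, True, False]))
```

Restricting the average to voiced subframes matters because the coder transmits a lag in unvoiced segments too, where it carries no pitch information.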
In the approach under investigation, the feature space
0-7803-9332-5/05/$20.00 ©2005 IEEE