Journal of Information Security, 2012, 3, 335-340
doi:10.4236/jis.2012.34041 Published Online October 2012 (http://www.SciRP.org/journal/jis)
Text Independent Automatic Speaker Recognition System
Using Mel-Frequency Cepstrum Coefficient and Gaussian
Mixture Models
Alfredo Maesa (1), Fabio Garzia (1,2), Michele Scarpiniti (1), Roberto Cusani (1)
(1) Department of Information, Electronics and Telecommunications Engineering, University of Rome, Rome, Italy
(2) Wessex Institute of Technology, Southampton, UK
Email: fabio.garzia@uniroma1.it
Received July 21, 2012; revised August 14, 2012; accepted September 3, 2012
ABSTRACT
The aim of this paper is to report the accuracy and timing results of a text-independent automatic speaker recognition
(ASR) system, based on Mel-Frequency Cepstrum Coefficients (MFCC) and Gaussian Mixture Models (GMM), intended for
the development of a security access control gate. 450 speakers were randomly selected from the Voxforge.org audio
database; their utterances were enhanced using spectral subtraction, MFCC were then extracted, and these coefficients
were statistically modeled by GMM in order to build each speaker profile. For each speaker, two different speech files
were used: the first to build the profile database, the second to test the system performance. The accuracy achieved
by the proposed approach is greater than 96%, and the time spent for a single test run, implemented in Matlab,
is about 2 seconds on a common PC.
Keywords: Automatic Speaker Recognition; Access Control; Voice Recognition; Biometrics
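As an illustration of the identification scheme summarized in the abstract (one GMM profile per speaker, with a test utterance assigned to the profile that scores it best), the following minimal Python sketch may help. It is not the authors' Matlab implementation. The diagonal covariances, the two mixture components, the fixed number of EM iterations, the variance floor and the speaker names are all assumptions made for illustration, and the feature vectors are synthetic Gaussian stand-ins rather than MFCC extracted from real speech.

```python
import math
import random


def log_gauss_diag(x, mean, var):
    """Log-density of a diagonal-covariance Gaussian evaluated at vector x."""
    return -0.5 * sum(
        math.log(2.0 * math.pi * v) + (xi - m) ** 2 / v
        for xi, m, v in zip(x, mean, var)
    )


def logsumexp(values):
    """Numerically stable log(sum(exp(v))) for a list of floats."""
    top = max(values)
    return top + math.log(sum(math.exp(v - top) for v in values))


def component_logs(x, gmm):
    """Per-component values of log(weight) + log N(x | mean, var)."""
    weights, means, variances = gmm
    return [
        math.log(w) + log_gauss_diag(x, m, v)
        for w, m, v in zip(weights, means, variances)
    ]


def avg_log_likelihood(frames, gmm):
    """Average per-frame log-likelihood of the frames under the GMM."""
    return sum(logsumexp(component_logs(x, gmm)) for x in frames) / len(frames)


def fit_gmm(frames, k=2, iters=25, var_floor=1e-3, seed=0):
    """Fit a diagonal-covariance GMM with a fixed number of EM iterations."""
    rng = random.Random(seed)
    n, dim = len(frames), len(frames[0])
    # Initialization: random frames as means, global variance, equal weights.
    means = [list(x) for x in rng.sample(frames, k)]
    g_mean = [sum(x[d] for x in frames) / n for d in range(dim)]
    g_var = [
        max(sum((x[d] - g_mean[d]) ** 2 for x in frames) / n, var_floor)
        for d in range(dim)
    ]
    variances = [list(g_var) for _ in range(k)]
    weights = [1.0 / k] * k
    for _ in range(iters):
        # E-step: posterior responsibility of each component for each frame.
        resp = []
        for x in frames:
            logs = component_logs(x, (weights, means, variances))
            norm = logsumexp(logs)
            resp.append([math.exp(v - norm) for v in logs])
        # M-step: re-estimate weights, means and variances.
        for j in range(k):
            nj = sum(r[j] for r in resp) + 1e-12  # guard against empty components
            weights[j] = nj / n
            means[j] = [
                sum(r[j] * x[d] for r, x in zip(resp, frames)) / nj
                for d in range(dim)
            ]
            variances[j] = [
                max(
                    sum(r[j] * (x[d] - means[j][d]) ** 2
                        for r, x in zip(resp, frames)) / nj,
                    var_floor,
                )
                for d in range(dim)
            ]
    return weights, means, variances


def synthetic_frames(center, n, rng, spread=1.0):
    """Synthetic stand-in for a sequence of feature vectors (NOT real MFCC)."""
    return [[rng.gauss(c, spread) for c in center] for _ in range(n)]


rng = random.Random(42)
# Hypothetical speakers, each represented by a different feature-space center.
centers = {
    "speaker_a": [0.0, 0.0, 0.0],
    "speaker_b": [3.0, -2.0, 1.0],
    "speaker_c": [-3.0, 2.0, 4.0],
}
# Enrollment: the first "utterance" of each speaker builds the profile database.
profiles = {name: fit_gmm(synthetic_frames(c, 200, rng)) for name, c in centers.items()}
# Test: a second, independent "utterance" from one speaker is scored on all profiles.
test_frames = synthetic_frames(centers["speaker_b"], 100, rng)
scores = {name: avg_log_likelihood(test_frames, gmm) for name, gmm in profiles.items()}
identified = max(scores, key=scores.get)
print(identified, scores)
```

In a real system the synthetic frames would be replaced by MFCC vectors computed from enhanced speech. The number of mixture components and the covariance structure would also need tuning, since the values used here are placeholders.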
1. Introduction
In recent decades, interest in security systems has grown
steadily. These systems are very useful since they allow
security to be managed very efficiently, reducing the
need for human resources. Most of them implement an
access control system [1-4]. In particular, a large number
of research efforts have been directed at the speaker
recognition problem [5-15]. In many strategic places it is
of vital importance to verify the identity of the people
involved. A simple way to verify a person's identity is to
analyze his or her voice. Voice-based recognition systems
are biometric systems that allow access control in a
very fast and minimally intrusive way, requiring little
cooperation from the people being identified.
The human voice is peculiar to each person, owing to
the anatomy of the phonation apparatus. The vocal
tract consists of three main cavities: the oral cavity, the
nasal cavity and the pharyngeal cavity [16]. The nasal
cavity is essentially bony, hence static in time; furthermore,
it can be isolated by the soft palate. The oral
cavity is formed by the bony structure of the palate and
the soft palate; its shape can be altered significantly
by the movement of the jaw, lips and tongue. The pharyngeal
cavity extends to the bottom of the throat and can
be compressed by retracting the base of the tongue towards
the wall of the pharynx. In its lower part it ends with the
vocal cords: a pair of fleshy membranes traversed by
the air coming from the lungs. During the production of a
sound, the space between the membranes (glottis) can be
completely open or partially closed.
Owing to the peculiarity of the voice formation apparatus,
it is possible to recognize a particular individual from
his or her voice. In addition, this operation can be performed
automatically [13-15]. In the literature, this problem is
referred to as Automatic Speaker Recognition (ASR) [17],
and it is widely discussed by the research community
[13-15].
Speaker recognition is classified as a hybrid biometric
recognition approach, as it has two components: a
physical one, related to the anatomy of the vocal apparatus,
and a behavioral one, related to the mood of
the speaker at the moment of recording [15].
There are several approaches to ASR, based on features,
vector quantization, score normalization, pattern matching,
etc., but most of them are text dependent [6,7,9-11,
13,14].
In this paper, we propose a text-independent ASR system
based on Mel-Frequency Cepstrum Coefficients (MFCC)
[18,19] and Gaussian Mixture Models (GMM) [20-22].
The model parameters are estimated by maximum
likelihood using the Expectation-Maximization (EM)
algorithm [23,24]. The novel com-
Copyright © 2012 SciRes. JIS