Journal of Information Security, 2012, 3, 335-340
doi:10.4236/jis.2012.34041 Published Online October 2012 (http://www.SciRP.org/journal/jis)

Text Independent Automatic Speaker Recognition System Using Mel-Frequency Cepstrum Coefficient and Gaussian Mixture Models

Alfredo Maesa 1, Fabio Garzia 1,2, Michele Scarpiniti 1, Roberto Cusani 1
1 Department of Information, Electronics and Telecommunications Engineering, University of Rome, Rome, Italy
2 Wessex Institute of Technology, Southampton, UK
Email: fabio.garzia@uniroma1.it

Received July 21, 2012; revised August 14, 2012; accepted September 3, 2012

ABSTRACT

The aim of this paper is to show the accuracy and time results of a text independent automatic speaker recognition (ASR) system, based on Mel-Frequency Cepstrum Coefficients (MFCC) and Gaussian Mixture Models (GMM), in order to develop a security access control gate. 450 speakers were randomly selected from the Voxforge.org audio database; their utterances were enhanced using spectral subtraction, MFCC were then extracted, and these coefficients were statistically analyzed by GMM in order to build each speaker profile. For each speaker two different speech files were used: the first one to build the profile database, the second one to test the system performance. The accuracy achieved by the proposed approach is greater than 96% and the time spent for a single test run, implemented in Matlab, is about 2 seconds on a common PC.

Keywords: Automatic Speaker Recognition; Access Control; Voice Recognition; Biometrics

1. Introduction

In recent decades, interest in security systems has grown steadily. These systems are very useful since they allow security to be managed very efficiently, reducing the need for human resources. Most of them implement an access control system [1-4]. In particular, a large number of research efforts have been directed at the speaker recognition problem [5-15].
In fact, at many strategic sites it is of vital importance to verify the identity of the people involved. A simple way to verify a person's identity is to analyze his or her voice. Voice-based recognition systems are biometric systems that allow access control in a fast and minimally intrusive way, requiring little cooperation from the people involved.

The human voice is peculiar to each person, and this is due to the anatomical apparatus of phonation. The vocal tract consists of three main cavities: the oral cavity, the nasal cavity and the pharyngeal cavity [16]. The nasal cavity is essentially bony, hence static in time; furthermore, it can be isolated by the soft palate. The oral cavity is formed by the bony structure of the palate and by the soft palate; its conformation can be altered significantly by the movement of the jaw, lips and tongue. The pharyngeal cavity extends to the bottom of the throat and can be compressed by retracting the base of the tongue towards the wall of the pharynx. In its lower part it ends with the vocal cords: a pair of fleshy membranes traversed by the air coming from the lungs. During the production of a sound, the space between the membranes (the glottis) can be completely open or partially closed.

Due to the peculiarity of the voice formation apparatus, it is possible to recognize a particular individual from his or her voice. In addition, this operation can be carried out automatically [13-15]. In the literature, this problem is referred to as Automatic Speaker Recognition (ASR) [17], and it is widely discussed by the research community [13-15].

Speaker recognition is classified as a hybrid biometric recognition approach, as it has two components: a physical one, related to the anatomy of the vocal apparatus, and a behavioral one, related to the mood of the speaker at the moment of recording [15].
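As a minimal sketch of the decision step in such an automatic system, the following Python fragment scores a set of feature vectors against per-speaker GMM profiles and picks the best match. It is an illustration under stated assumptions, not the Matlab implementation evaluated in this paper: the function names (`diag_gauss_logpdf`, `gmm_loglik`, `identify`), the diagonal-covariance form, the average log-likelihood score and the toy two-dimensional "features" are all hypothetical, and the profiles are assumed to have been trained already.

```python
import math

def diag_gauss_logpdf(x, mean, var):
    """Log-density of a diagonal-covariance Gaussian at feature vector x."""
    return sum(
        -0.5 * (math.log(2.0 * math.pi * v) + (xi - m) ** 2 / v)
        for xi, m, v in zip(x, mean, var)
    )

def gmm_loglik(x, gmm):
    """Log-likelihood of one feature vector under a GMM.

    gmm is a list of (weight, mean, var) components; the log-sum-exp
    form keeps the mixture sum numerically stable.
    """
    terms = [math.log(w) + diag_gauss_logpdf(x, m, v) for w, m, v in gmm]
    peak = max(terms)
    return peak + math.log(sum(math.exp(t - peak) for t in terms))

def identify(frames, profiles):
    """Return the enrolled speaker whose GMM best explains the frames."""
    scores = {
        name: sum(gmm_loglik(x, gmm) for x in frames) / len(frames)
        for name, gmm in profiles.items()
    }
    return max(scores, key=scores.get), scores

# Hypothetical pre-trained profiles: two components each, made-up numbers
# standing in for GMMs fitted to real MFCC vectors.
profiles = {
    "speaker_a": [(0.5, [0.0, 0.0], [1.0, 1.0]), (0.5, [2.0, 2.0], [1.0, 1.0])],
    "speaker_b": [(0.5, [6.0, 6.0], [1.0, 1.0]), (0.5, [8.0, 8.0], [1.0, 1.0])],
}
test_frames = [[0.2, -0.1], [1.8, 2.1], [0.5, 0.4]]  # toy test utterance
best, scores = identify(test_frames, profiles)
print(best)  # prints "speaker_a"
```

Because every frame is scored independently of what is being said, a scheme of this kind does not depend on the spoken text; a real access gate would also need a rejection threshold for speakers who are not enrolled, which this sketch omits.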
There are several approaches to ASR based on features, vector quantization, score normalization, pattern matching, etc., but most of them are text dependent [6,7,9-11,13,14]. In this paper, we propose a text independent ASR system based on Mel-Frequency Cepstrum Coefficients (MFCC) [18,19] and Gaussian Mixture Models (GMM) [20-22]. The model parameters are then estimated by maximum likelihood, making use of the Expectation-Maximization (EM) algorithm [23,24]. The novel com-
Copyright © 2012 SciRes. JIS