Speaker identification using phonetic segmentation and normalized relative delays of source harmonics∗

Diana Mendes 1, Aníbal Ferreira 1

1 Universidade do Porto - Faculdade de Engenharia, Rua Dr. Roberto Frias, s/n, 4200-465, Porto, Portugal

Correspondence should be addressed to Aníbal Ferreira (ajf@fe.up.pt)

ABSTRACT

Current state-of-the-art speaker identification systems achieve high performance in reasonably well controlled conditions. However, some scenarios still pose significant challenges, particularly in audio forensics, where voice records are typically just a few seconds long and are severely affected by distortion, interferences, and abnormal speaking attitudes. In this paper we take inspiration from the concept of minutiae in fingerprinting and try to extract localized, phase-related singularities from the speech signal denoting glottal source idiosyncratic information. First, we perform MFCC+GMM experiments in order to find the most effective phonetic segmentation of the speech signal for speaker modelling and discrimination. Secondly, we rely on this phonetic segmentation and, in addition to MFCC features, we extract Normalized Relative Delays (NRDs) obtained from the phase of spectral harmonics. We use a Nearest Neighbour generalized classifier for speaker modelling and identification. Our results indicate that by combining a careful phonetic segmentation with the inclusion of phase-related information, performance in speaker identification may increase significantly.

1. INTRODUCTION

During the last few decades the practical importance of biometric systems [1] has increased significantly, and today we can find mature technology namely in the area of image analysis involving fingerprinting [2] and iris pattern recognition. However, biometric systems based on voice analysis are not widely deployed [3].
Although a significant amount of research work has been carried out to improve the performance of current voice identification or verification systems [4], the reality shows that these systems are highly dependent on signal acquisition conditions, namely the microphone, the acoustics of the environment, signal alterations due to the communication channel, and spurious interferences due to the overlap of multiple acoustic events, including multiple voice signals. For these reasons, practical applications of voice-based biometry can be found mainly in contexts where those factors are reasonably well controlled, such as home-banking.

Typically, in the area of audio forensics, voice data is highly corrupted with noise and interferences, is produced under altered emotional or behavioural conditions and, especially, voice records are of very short duration, such as just two or three seconds long. In these cases, statistical voice modelling such as Gaussian Mixture Models (GMMs) simply cannot be applied, because the amount of voice data is clearly insufficient to produce representative GMM models. As an alternative, we take inspiration from the concept of minutiae in fingerprint recognition [2] and look for opportunities to identify single occurrences or singularities in the signal that might denote the unique sound signature of a specific voice, whether or not audible to a human. The importance and even competitiveness of phase-related features in speaker identification has been shown recently by at least two independent research studies [5, 6].

∗ This work was supported by the Portuguese Foundation for Science and Technology, an agency of the Portuguese Ministry for Education and Science, under research project PTDC/SAU-BEB/14995/2008. URL: http://gnomo.fe.up.pt/∼voicestudies/artts/, last accessed on March 31st 2012.
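To make the baseline concrete: the scoring side of a GMM speaker-identification system can be sketched as below. This is a minimal NumPy illustration, not the system used in the paper — it assumes diagonal-covariance GMMs, omits model training (typically done with EM on MFCC frames), and all function and variable names are illustrative. Identification simply picks the enrolled model under which the test frames have the highest average log-likelihood, which is exactly why very short records are problematic: the average is taken over too few frames to be reliable.

```python
import numpy as np

def diag_gmm_loglik(X, weights, means, variances):
    """Average log-likelihood of feature frames X (T x D) under a
    diagonal-covariance GMM with K components:
    weights (K,), means (K, D), variances (K, D)."""
    T, D = X.shape
    # squared Mahalanobis distance per frame/component pair -> (T, K, D)
    diff2 = (X[:, None, :] - means[None, :, :]) ** 2 / variances[None, :, :]
    # log of the Gaussian normalization constant per component -> (K,)
    log_norm = -0.5 * (D * np.log(2 * np.pi) + np.log(variances).sum(axis=1))
    # per-frame, per-component log-density -> (T, K)
    log_comp = log_norm[None, :] - 0.5 * diff2.sum(axis=2)
    log_w = np.log(weights)[None, :]
    # log-sum-exp over components (numerically stable), averaged over frames
    m = (log_comp + log_w).max(axis=1, keepdims=True)
    ll = m[:, 0] + np.log(np.exp(log_comp + log_w - m).sum(axis=1))
    return ll.mean()

def identify(X, speaker_models):
    """Return the enrolled speaker whose GMM scores X highest.
    speaker_models: dict name -> (weights, means, variances)."""
    return max(speaker_models,
               key=lambda s: diag_gmm_loglik(X, *speaker_models[s]))
```

For example, with two single-component models centred at the origin and at (5, 5), frames near the origin are attributed to the first speaker.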
These studies highlight that carefully chosen phase-related features exhibit a discrimination capability which is compa-

AES 46th INTERNATIONAL CONFERENCE, Denver, USA, 2012 June 14–16
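The NRD features named in the abstract can be sketched as follows. This assumes a common formulation of Normalized Relative Delays — the phase of harmonic k minus k times the fundamental's phase, divided by 2π and wrapped to [0, 1) — which may differ in detail from the paper's exact definition; the frame, f0, and harmonic-picking logic here are illustrative. Note that subtracting k times the fundamental phase cancels any linear-phase term, so the NRDs are invariant to a time shift of the analysis frame, which is what makes them usable as a glottal source signature.

```python
import numpy as np

def normalized_relative_delays(frame, f0, sr, n_harmonics=8):
    """Estimate NRDs of the first harmonics of a voiced frame.

    Assumed formulation (illustrative, not taken from the paper text):
    NRD_k = wrap((phi_k - k * phi_1) / (2*pi)) into [0, 1),
    where phi_k is the spectral phase at harmonic k of f0.
    """
    n = len(frame)
    spectrum = np.fft.rfft(frame * np.hanning(n))
    freqs = np.fft.rfftfreq(n, d=1.0 / sr)

    # phase at the bin closest to each harmonic frequency k * f0
    phases = []
    for k in range(1, n_harmonics + 1):
        bin_k = int(np.argmin(np.abs(freqs - k * f0)))
        phases.append(np.angle(spectrum[bin_k]))
    phases = np.array(phases)

    # remove k times the fundamental phase, normalize by 2*pi, wrap to [0, 1)
    nrd = (phases - np.arange(1, n_harmonics + 1) * phases[0]) / (2 * np.pi)
    return np.mod(nrd, 1.0)
```

By construction the first NRD is always zero; the remaining values summarize the relative timing of the higher harmonics with respect to the fundamental.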