Speaker identification using phonetic segmentation and normalized relative delays of source harmonics∗

Diana Mendes 1, Aníbal Ferreira 1

1 Universidade do Porto - Faculdade de Engenharia, Rua Dr. Roberto Frias, s/n, 4200-465, Porto, Portugal

Correspondence should be addressed to Aníbal Ferreira (ajf@fe.up.pt)

ABSTRACT

Current state-of-the-art speaker identification systems achieve high performance in reasonably well controlled conditions. However, some scenarios still pose significant challenges, particularly in audio forensics, where voice records are typically just a few seconds long and are severely affected by distortion, interferences, and abnormal speaking attitudes. In this paper we take inspiration from the concept of minutiae in fingerprinting and try to extract localized, phase-related singularities from the speech signal denoting glottal source idiosyncratic information. First, we perform MFCC+GMM experiments in order to find the most effective phonetic segmentation of the speech signal for speaker modelling and discrimination. Secondly, we rely on this phonetic segmentation and, in addition to MFCC features, we extract Normalized Relative Delays (NRDs) obtained from the phase of spectral harmonics. We use a Nearest Neighbour generalized classifier for speaker modelling and identification. Our results indicate that by combining a careful phonetic segmentation with the inclusion of phase-related information, performance in speaker identification may increase significantly.

1. INTRODUCTION

During the last few decades the practical importance of biometric systems [1] has increased significantly, and today we can find mature technology namely in the area of image analysis involving fingerprinting [2] and iris pattern recognition. However, biometric systems based on voice analysis are not widely deployed [3].
Although a significant amount of research work has been carried out to improve the performance of current voice identification or verification systems [4], the reality shows that these systems are highly dependent on signal acquisition conditions, namely the microphone, the acoustics of the environment, signal alterations due to the communication channel, and spurious interferences due to the overlap of multiple acoustic events, including multiple voice signals. For these reasons, practical applications of voice-based biometry can be found mainly in contexts where those factors are reasonably well controlled, such as home-banking.

Typically, in the area of audio forensics, voice data is highly corrupted with noise and interferences, is produced under altered emotional or behavioural conditions and, especially, voice records are of very short duration, such as just two or three seconds long. In these cases, statistical voice modelling such as Gaussian Mixture Models (GMMs) simply cannot be applied, because the amount of voice data is clearly insufficient to produce representative GMM models. As an alternative, we take inspiration from the concept of minutiae in fingerprint recognition [2] and look for opportunities to identify single occurrences or singularities in the signal that might denote the unique sound signature of a specific voice, whether or not audible to a human. The importance and even competitiveness of phase-related features in speaker identification has been shown recently by at least two independent research studies [5, 6].

∗ This work was supported by the Portuguese Foundation for Science and Technology, an agency of the Portuguese Ministry for Education and Science, under research project PTDC/SAU-BEB/14995/2008. URL: http://gnomo.fe.up.pt/∼voicestudies/artts/, last accessed on March 31st 2012.
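To make the baseline concrete: the scoring side of a GMM speaker-identification system can be sketched as below. This is a minimal NumPy illustration, not the system used in the paper — it assumes diagonal-covariance GMMs, omits model training (typically done with EM on MFCC frames), and all function and variable names are illustrative. Identification simply picks the enrolled model under which the test frames have the highest average log-likelihood, which is exactly why very short records are problematic: the average is taken over too few frames to be reliable.

```python
import numpy as np

def diag_gmm_loglik(X, weights, means, variances):
    """Average log-likelihood of feature frames X (T x D) under a
    diagonal-covariance GMM with K components:
    weights (K,), means (K, D), variances (K, D)."""
    T, D = X.shape
    # squared Mahalanobis distance per frame/component pair -> (T, K, D)
    diff2 = (X[:, None, :] - means[None, :, :]) ** 2 / variances[None, :, :]
    # log of the Gaussian normalization constant per component -> (K,)
    log_norm = -0.5 * (D * np.log(2 * np.pi) + np.log(variances).sum(axis=1))
    # per-frame, per-component log-density -> (T, K)
    log_comp = log_norm[None, :] - 0.5 * diff2.sum(axis=2)
    log_w = np.log(weights)[None, :]
    # log-sum-exp over components (numerically stable), averaged over frames
    m = (log_comp + log_w).max(axis=1, keepdims=True)
    ll = m[:, 0] + np.log(np.exp(log_comp + log_w - m).sum(axis=1))
    return ll.mean()

def identify(X, speaker_models):
    """Return the enrolled speaker whose GMM scores X highest.
    speaker_models: dict name -> (weights, means, variances)."""
    return max(speaker_models,
               key=lambda s: diag_gmm_loglik(X, *speaker_models[s]))
```

For example, with two single-component models centred at the origin and at (5, 5), frames near the origin are attributed to the first speaker.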
These studies highlight that carefully chosen phase-related features exhibit a discrimination capability which is compa-

AES 46th INTERNATIONAL CONFERENCE, Denver, USA, 2012 June 14–16
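The NRD features named in the abstract can be sketched as follows. This assumes a common formulation of Normalized Relative Delays — the phase of harmonic k minus k times the fundamental's phase, divided by 2π and wrapped to [0, 1) — which may differ in detail from the paper's exact definition; the frame, f0, and harmonic-picking logic here are illustrative. Note that subtracting k times the fundamental phase cancels any linear-phase term, so the NRDs are invariant to a time shift of the analysis frame, which is what makes them usable as a glottal source signature.

```python
import numpy as np

def normalized_relative_delays(frame, f0, sr, n_harmonics=8):
    """Estimate NRDs of the first harmonics of a voiced frame.

    Assumed formulation (illustrative, not taken from the paper text):
    NRD_k = wrap((phi_k - k * phi_1) / (2*pi)) into [0, 1),
    where phi_k is the spectral phase at harmonic k of f0.
    """
    n = len(frame)
    spectrum = np.fft.rfft(frame * np.hanning(n))
    freqs = np.fft.rfftfreq(n, d=1.0 / sr)

    # phase at the bin closest to each harmonic frequency k * f0
    phases = []
    for k in range(1, n_harmonics + 1):
        bin_k = int(np.argmin(np.abs(freqs - k * f0)))
        phases.append(np.angle(spectrum[bin_k]))
    phases = np.array(phases)

    # remove k times the fundamental phase, normalize by 2*pi, wrap to [0, 1)
    nrd = (phases - np.arange(1, n_harmonics + 1) * phases[0]) / (2 * np.pi)
    return np.mod(nrd, 1.0)
```

By construction the first NRD is always zero; the remaining values summarize the relative timing of the higher harmonics with respect to the fundamental.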