CENATAV Voice-Group Systems for Albayzin 2018 Speaker Diarization Evaluation Campaign

Edward L. Campbell Hernández, Gabriel Hernández Sierra, José R. Calvo de Lara
Voice Group, Advanced Technologies Application Center, CENATAV, Havana, Cuba
ecampbell@cenatav.co.cu, gsierra@cenatav.co.cu, jcalvo@cenatav.co.cu

Abstract

The environment in which a voice signal is recorded is usually not ideal, so improving the representation of the speaker characteristic space requires a robust algorithm that keeps the representation stable in the presence of noise. This paper proposes a diarization system focused on robust feature extraction techniques. The presented features (Mean Hilbert Envelope Coefficients, Medium Duration Modulation Coefficients and Power Normalized Cepstral Coefficients) have not been used in previous Albayzin Challenges. These robust techniques share a common characteristic: the use of a Gammatone filter-bank to divide the voice signal into sub-bands, as an alternative to the classical triangular filter-bank used in Mel Frequency Cepstral Coefficients. The experimental results show a more stable Diarization Error Rate for the robust features than for the classic ones.

Index Terms: Speaker Diarization, Robust feature extraction, Mean Hilbert Envelope Coefficients, Albayzin 2018 SDC

1. Introduction

This is the first participation of the CENATAV Voice Group in the Albayzin Challenges; we take part in the Speaker Diarization Challenge (SDC) task with a diarization system focused on robust feature extraction. A speaker diarization system identifies "who spoke when" in an audio stream, a problem that has been of interest to the scientific community since the last century, with the emergence of the first works on speaker segmentation and clustering [1][2].
Diarization can also serve as a stage that enriches and improves the results of other systems. For example, a rich transcription system uses diarization to add information about who is speaking to the speech transcription, and a speaker recognition system uses it when the test signal contains several speakers, since diarization allows finding the segments of the test signal with only one speaker [3]. One issue for diarization is the environment in which the speech is recorded, because noise is a natural condition in real applications. The proposed system focuses on robust feature extraction techniques in order to improve the results in such applications. Robust techniques such as Mean Hilbert Envelope Coefficients (MHEC), Medium Duration Modulation Coefficients (MDMC) and Power Normalized Cepstral Coefficients (PNCC) are analysed.

The system was mainly developed with the S4D tool [4] and has the following structure: robust feature extraction, segmentation (Gaussian divergence and Bayesian Information Criterion), speech activity detection (Support Vector Machine), clustering (Hierarchical Agglomerative Clustering) and, as the last stage, re-segmentation (Viterbi algorithm). The system is described in the next sections.

2. Robust Feature Extraction

A feature is robust when its effectiveness is stable in both controlled and uncontrolled environments (noise, reverberation, etc.), the latter being the most common in practice [5], so a feature with this characteristic is relevant in real applications. This section provides a brief description of several robust feature extraction techniques. These techniques use a gammatone filter-bank (see Fig. 1), whose design is based on Patterson's ear model [6]; the impulse response of channel i is defined by Equation 1.

Figure 1: Gammatone filter-bank of 40 dimensions

h_i(t) = γ · t^(τ−1) · cos(2π · fc_i · t + θ) · exp(−2π · erb_i · t),   (1)

where:
• γ: amplitude.
• τ: filter order.
• erb_i: equivalent rectangular bandwidth of channel i.
• fc_i: centre frequency of channel i.
• θ: phase.

The following gammatone filter parameters were set in the proposed system, following Glasberg and Moore's recommendation [6]:

• fc_i = −(EarQ · minB) + (fmax + EarQ · minB) · exp(−(i · 0.5)/EarQ)
• erb_i = ((fc_i / EarQ)^τ + minB^τ)^(1/τ)
• EarQ = 9.26449
• minB = 24.7

IberSPEECH 2018, 21-23 November 2018, Barcelona, Spain. doi: 10.21437/IberSPEECH.2018-47
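The filter-bank defined by Equation 1 and the Glasberg-Moore parameters can be sketched as follows. This is a minimal illustration, not the authors' implementation: the channel-spacing formula used here is the log-ERB spacing from Slaney's well-known realization of the Glasberg-Moore design (the printed fc_i expression is ambiguous in the proceedings), and the sampling rate, window duration, filter order τ = 4 and frequency range are assumed values.

```python
import numpy as np

# Glasberg-Moore constants, using the paper's names (EarQ, minB).
EAR_Q = 9.26449
MIN_BW = 24.7  # Hz

def erb(fc, order=1.0):
    """Equivalent rectangular bandwidth: ((fc/EarQ)^tau + minB^tau)^(1/tau)."""
    return ((fc / EAR_Q) ** order + MIN_BW ** order) ** (1.0 / order)

def centre_frequencies(n_channels=40, f_min=50.0, f_max=8000.0):
    """Centre frequencies on a log-ERB scale (assumed spacing, after Slaney).

    For channel i = 1..N:
        fc_i = -(EarQ*minB) + (f_max + EarQ*minB) * exp(-i*step),
    so fc decreases from just below f_max down to exactly f_min.
    """
    i = np.arange(1, n_channels + 1)
    step = (np.log(f_max + EAR_Q * MIN_BW)
            - np.log(f_min + EAR_Q * MIN_BW)) / n_channels
    return -(EAR_Q * MIN_BW) + (f_max + EAR_Q * MIN_BW) * np.exp(-i * step)

def gammatone_impulse_response(fc, fs=16000, duration=0.025,
                               gamma=1.0, tau=4, theta=0.0):
    """Impulse response of one gammatone channel (Equation 1)."""
    t = np.arange(int(duration * fs)) / fs
    return (gamma * t ** (tau - 1)
            * np.cos(2.0 * np.pi * fc * t + theta)
            * np.exp(-2.0 * np.pi * erb(fc) * t))
```

Filtering a frame with each channel's impulse response (by convolution) yields the sub-band decomposition that MHEC, MDMC and PNCC build on.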