CENATAV Voice-Group Systems for Albayzin 2018 Speaker Diarization Evaluation Campaign

Edward L. Campbell Hernández, Gabriel Hernández Sierra, José R. Calvo de Lara
Voice Group, Advanced Technologies Application Center, CENATAV, Havana, Cuba
ecampbell@cenatav.co.cu, gsierra@cenatav.co.cu, jcalvo@cenatav.co.cu

Abstract

The environment in which a voice signal is recorded is usually not ideal, so improving the representation of the speaker characteristic space requires a robust algorithm that keeps the representation stable in the presence of noise. This paper proposes a diarization system focused on robust feature extraction techniques. The presented features (Mean Hilbert Envelope Coefficients, Medium Duration Modulation Coefficients and Power Normalized Cepstral Coefficients) have not been used in previous Albayzin Challenges. These robust techniques share a common characteristic: the use of a Gammatone filter-bank to divide the voice signal into sub-bands, as an alternative to the classical triangular filter-bank used in Mel Frequency Cepstral Coefficients. The experimental results show a more stable Diarization Error Rate for the robust features than for the classic ones.

Index Terms: Speaker Diarization, Robust feature extraction, Mean Hilbert Envelope Coefficients, Albayzin 2018 SDC

1. Introduction

This is the first participation of the CENATAV Voice Group in the Albayzin Challenges; we take part in the Speaker Diarization Challenge (SDC) task with a diarization system focused on robust feature extraction. A speaker diarization system identifies "who spoke when" in an audio stream, a problem that has been of interest to the scientific community since the last century, with the emergence of the first works on speaker segmentation and clustering [1][2].
Diarization can also serve as a stage that enriches and improves the results of other systems. For example, a rich transcription system uses diarization to add information about who is speaking to the speech transcription, and a speaker recognition system uses it when the test signal contains several speakers, since diarization allows finding the segments of the test signal with only one speaker [3]. One issue for diarization is the environment in which the speech is recorded, because noise is a natural condition in real applications. The proposed system focuses on robust feature extraction techniques in order to improve the results in such applications. Robust techniques such as Mean Hilbert Envelope Coefficients (MHEC), Medium Duration Modulation Coefficients (MDMC) and Power Normalized Cepstral Coefficients (PNCC) are analysed.

The system was mainly developed with the S4D tool [4] and has the following structure: robust feature extraction, segmentation (Gaussian divergence and Bayesian Information Criterion), speech activity detection (Support Vector Machine), clustering (Hierarchical Agglomerative Clustering) and, as the last stage, re-segmentation (Viterbi algorithm). The system is described in the next sections.

2. Robust Feature Extraction

A feature is robust when its effectiveness is stable in both controlled and uncontrolled environments (noise, reverberation, etc.), the latter being the most common in practice [5], so a feature with this characteristic is relevant in real applications. This section provides a brief description of several robust feature extraction techniques. These techniques use a gammatone filter-bank (see Fig. 1), whose design is based on Patterson's ear model [6]; the impulse response of channel i is defined by Equation 1.

Figure 1: Gammatone filter-bank of 40 dimensions

h_i(t) = γ · t^(τ−1) · cos(2π · fc_i · t + θ) · exp(−2π · erb_i · t),   (1)

where:
• γ: amplitude.
• τ: filter order.
• erb_i: equivalent rectangular bandwidth of channel i.
• fc_i: centre frequency of channel i.
• θ: phase.

The following gammatone filter parameters were set in the proposed system, following Glasberg and Moore's recommendation [6]:

• fc_i = −(EarQ · minB) + (fmax + EarQ · minB) · exp(−(i · 0.5)/EarQ)
• erb_i = ((fc_i / EarQ)^τ + minB^τ)^(1/τ)
• EarQ = 9.26449
• minB = 24.7

IberSPEECH 2018, 21-23 November 2018, Barcelona, Spain. doi: 10.21437/IberSPEECH.2018-47
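The filter-bank defined by Equation 1 and the Glasberg-Moore parameters can be sketched as follows. This is a minimal illustration, not the authors' implementation: the channel-spacing formula used here is the log-ERB spacing from Slaney's well-known realization of the Glasberg-Moore design (the printed fc_i expression is ambiguous in the proceedings), and the sampling rate, window duration, filter order τ = 4 and frequency range are assumed values.

```python
import numpy as np

# Glasberg-Moore constants, using the paper's names (EarQ, minB).
EAR_Q = 9.26449
MIN_BW = 24.7  # Hz

def erb(fc, order=1.0):
    """Equivalent rectangular bandwidth: ((fc/EarQ)^tau + minB^tau)^(1/tau)."""
    return ((fc / EAR_Q) ** order + MIN_BW ** order) ** (1.0 / order)

def centre_frequencies(n_channels=40, f_min=50.0, f_max=8000.0):
    """Centre frequencies on a log-ERB scale (assumed spacing, after Slaney).

    For channel i = 1..N:
        fc_i = -(EarQ*minB) + (f_max + EarQ*minB) * exp(-i*step),
    so fc decreases from just below f_max down to exactly f_min.
    """
    i = np.arange(1, n_channels + 1)
    step = (np.log(f_max + EAR_Q * MIN_BW)
            - np.log(f_min + EAR_Q * MIN_BW)) / n_channels
    return -(EAR_Q * MIN_BW) + (f_max + EAR_Q * MIN_BW) * np.exp(-i * step)

def gammatone_impulse_response(fc, fs=16000, duration=0.025,
                               gamma=1.0, tau=4, theta=0.0):
    """Impulse response of one gammatone channel (Equation 1)."""
    t = np.arange(int(duration * fs)) / fs
    return (gamma * t ** (tau - 1)
            * np.cos(2.0 * np.pi * fc * t + theta)
            * np.exp(-2.0 * np.pi * erb(fc) * t))
```

Filtering a frame with each channel's impulse response (by convolution) yields the sub-band decomposition that MHEC, MDMC and PNCC build on.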