THE “SIVA” SPEECH DATABASE FOR SPEAKER VERIFICATION: DESCRIPTION AND EVALUATION

Mauro Falcone, Alessandra Gallo
Fondazione Ugo Bordoni
Via Baldassarre Castiglione 59, 00142 Roma

ABSTRACT

This paper describes and characterizes the Italian speech database SIVA. After a brief review of the available corpora designed for the speaker verification task, we introduce the “Speaker Identification and Verification Archives: SIVA”, a database that currently consists of more than two thousand calls collected over the public switched telephone network. A detailed description of the speech material, a proposal for an acoustic characterization, and the performance obtained with a reference speaker verification system are presented and discussed.

1. INTRODUCTION

In recent years the importance of corpora for speech technology has been firmly recognized in basic science as well as in research and development. Several organizations now distribute speech and lexicon databases: the “Linguistic Data Consortium” (LDC), at the University of Pennsylvania, is the best known and distributes almost every American public database; the “European Language Resources Association” (ELRA), founded last year, is set to become the reference point for European countries.

Figure 1: The standard speaker verification system may be divided into three main blocks: in order, the speech parametrisation block, the pattern matching algorithm block, and the decision strategy block. These blocks may be controlled by a dialogue manager that handles the input and output of the whole system.

Speaker verification (SV) is the problem of verifying whether or not a given utterance was pronounced by the declared authorized speaker. The simplest scheme of a speaker verification system is shown in Figure 1. Authorized speakers are commonly called “users”, while speakers who try to force the system, i.e.
to mimic another person’s voice, are called “impostors”. The taxonomy and exact terminology of speaker recognition (by which we mean every possible task, including verification, identification, monitoring, etc.) can be quite intricate, and in any case there is no definitive agreement on them. A detailed description is given in [1]. The present work concerns only the speaker verification problem in its common understanding, i.e. as defined above.

2. SPEECH DATABASES FOR SPEAKER VERIFICATION

Generally speaking, a speech database for speaker verification assessment and evaluation should contain many repetitions of each user’s voice and few (as few as one) repetitions of each impostor’s voice. It should also contain many impostors and, purely for practical reasons, a limited number of users. In addition, the speaker population should be balanced in gender, age, social background, etc. Of course, these are only broad guidelines following a general-purpose approach [2]; specific tasks and solutions may require ad hoc database designs.

2.1. Available speech databases for SV

Because speaker recognition has until now been regarded as a marginal field of speech technology, few public databases exist on this topic. Interest in SV is now growing among both service providers and end users, and for this reason some databases [3] have been produced for speaker verification purposes in the last few years. The available ones, i.e. the databases used in the most important experiments, are listed here.

TIMIT (and NTIMIT). This is certainly the best-known database. Although it was designed for speech recognition, it has also been widely used in speaker recognition [4]. Its telephone version, NTIMIT, comes with a detailed technical description. To our knowledge this is the only case of an acoustic description, and it is devoted to describing the transformation of the original database into a telephone-quality speech database.

KING.
It is the first database designed for speaker verification. It is also known for the “great divide”, an effect related to variations in the acquisition equipment. The effect is described in terms of system performance rather than in relation to the characteristics of the speech signal, which would of course be a more reasonable and interesting description. It contains monologues by 51 male speakers, recorded in 10 short sessions per speaker, each of 1 minute duration.
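
The general guidelines of Section 2 (many repetitions per user, many impostors) can be made concrete by counting the verification trials that a corpus of a given size supports. The sketch below is purely illustrative. The protocol it assumes (every speaker acts in turn as a user, some sessions are reserved for enrolment, and every other speaker contributes a fixed number of sessions as impostor attempts) and the names `CorpusLayout` and `count_trials` are hypothetical and are not specified for KING or for any other database listed here; only the figures of 51 speakers and 10 sessions are taken from the description above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CorpusLayout:
    """Size of a corpus: number of speakers and sessions per speaker."""
    speakers: int
    sessions_per_speaker: int


def count_trials(layout, enrol_sessions, impostor_sessions=1):
    """Count genuine and impostor trials under an ASSUMED protocol.

    Assumptions (not taken from the paper):
      * every speaker is used in turn as a "user";
      * `enrol_sessions` sessions per user are held out for enrolment,
        and the remaining sessions are genuine test trials;
      * every other speaker attacks each user with `impostor_sessions`
        of his or her own sessions.
    """
    if not 0 < enrol_sessions < layout.sessions_per_speaker:
        raise ValueError("enrol_sessions must leave at least one test session")
    if not 0 < impostor_sessions <= layout.sessions_per_speaker:
        raise ValueError("impostor_sessions exceeds the sessions available")

    genuine = layout.speakers * (layout.sessions_per_speaker - enrol_sessions)
    impostor = layout.speakers * (layout.speakers - 1) * impostor_sessions
    return genuine, impostor


# A corpus with the size reported for KING: 51 speakers, 10 sessions each.
king_like = CorpusLayout(speakers=51, sessions_per_speaker=10)
genuine, impostor = count_trials(king_like, enrol_sessions=5)
print(f"genuine trials: {genuine}, impostor trials: {impostor}")
```

Under these assumptions, impostor trials grow with the square of the number of speakers, while genuine trials grow only with the number of sessions per user. This is one way to see why the guidelines ask for many repetitions of each user’s voice but accept few repetitions per impostor.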