EFFECTS OF DEVICE MISMATCH, LANGUAGE MISMATCH AND ENVIRONMENTAL MISMATCH ON SPEAKER VERIFICATION

Bin Ma 1, Helen M. Meng 1 and Man-Wai Mak 2
1 Dept. of Systems Engineering and Engineering Management, The Chinese University of Hong Kong
2 Dept. of Electronic and Information Engineering, The Hong Kong Polytechnic University
1 {bma, hmmeng}@se.cuhk.edu.hk, 2 enmwmak@polyu.edu.hk

ABSTRACT

Device, language and environmental mismatch adversely affect speaker verification (SV) performance. We investigate such effects empirically based on the M3 (multi-biometric, multilingual and multi-device) Corpus [1]. Device mismatch (among a 3G phone, a Pocket PC and a desktop PC plug-in microphone) brings a relative performance degradation of 523%; language mismatch (between English and Cantonese) brings 284%; and environmental mismatch (between an office environment and a recording studio) brings 109%. In particular, verification with wide-band models on narrow-band test data outperforms verification with narrow-band models on wide-band test data. The 3G phone's SV performance is generally low, but remains stable across environments. Additionally, durational variations within two-second utterances may cause a relative change of 633% in SV performance.

Index Terms— Speaker verification, biometrics corpus, M3 speaker verification evaluation

1. INTRODUCTION

Speaker verification is the process of authenticating a speaker's claimed identity based on his/her input utterances. This technology plays a key role in secure computing for human-centric computer interfaces. In real-time applications, the proliferation of mobile, handheld devices presents challenges for speaker verification. For example, mobile use means that speaker verification needs to handle a variety of environmental conditions. Also, different audio input devices (e.g., microphones on PDAs or cellphones) may induce significant variations in the quality of captured speech.
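The mismatch effects above are reported as relative changes in the equal error rate (EER), the operating point at which the false rejection rate equals the false acceptance rate. As a minimal sketch of how these quantities are computed (the score values and function names are illustrative, not taken from the paper):

```python
# Minimal sketch: equal error rate (EER) from verification scores, and the
# relative-degradation metric used to report mismatch effects.
# Score values and function names below are illustrative assumptions.

def error_rates(genuine, impostor, threshold):
    """False rejection and false acceptance rates at a given threshold."""
    frr = sum(s < threshold for s in genuine) / len(genuine)
    far = sum(s >= threshold for s in impostor) / len(impostor)
    return frr, far

def eer(genuine, impostor):
    """Sweep candidate thresholds; return the error rate where FRR ~ FAR."""
    best, best_gap = None, float("inf")
    for t in sorted(genuine + impostor):
        frr, far = error_rates(genuine, impostor, t)
        if abs(frr - far) < best_gap:
            best, best_gap = (frr + far) / 2, abs(frr - far)
    return best

def relative_degradation(eer_matched, eer_mismatched):
    """Relative EER increase (%) caused by a train/test mismatch."""
    return 100.0 * (eer_mismatched - eer_matched) / eer_matched
```

For example, an EER rising from 2.0% under matched conditions to 12.46% under a device mismatch would be reported as `relative_degradation(2.0, 12.46)`, i.e. a 523% relative degradation.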
Some techniques, such as feature mapping [2], speaker model synthesis [3] and handset normalization [4], have been proposed to alleviate this problem. The language uttered may also affect SV performance, as demonstrated in our previous work [5]. The length of testing utterance segments is another factor affecting SV performance. In particular, it has been shown that the EER of an SV system is exponentially related to the length of the test segment [6]. The current study attempts to quantify such effects based on SV experiments with the M3 speech data, which contains multilingual, multi-device data collected for mobile use, as will be elaborated later.

2. THE SPEECH DATA OF M3 CORPUS

The M3 corpus is designed to support research in multi-biometric technologies for pervasive computing using mobile devices. Three kinds of biometrics, three devices and three languages are included in M3. Our research focuses on the speech data in M3. A brief introduction to the M3 speech data is presented in this section.

2.1. Speech data collection setup

During data collection, the multilingual speech data are captured from multiple devices under two recording conditions: an open laboratory and a recording room. The devices include a Pocket PC (PPC), a 3G phone and a desktop PC plug-in microphone. Details are listed in Table 1. The speech data across devices are recorded simultaneously.

Device                            | Configuration                          | Format
----------------------------------|----------------------------------------|-------
Pocket PC                         | Model: HP iPAQ H2200 series            | wav
                                  | Audio: 22 kHz, 16-bit mono             |
3G phone                          | Model: NEC C616                        | wav
                                  | Audio: 8 kHz, 16-bit mono              |
Desktop PC plug-in microphone     | Config: Pentium 3 996 MHz, 512M        | wav
                                  | Microphone: Shure BG 1.1 cardioid      |
                                  | Audio: 16 kHz, 16-bit mono             |

Table 1. Recording devices used in the M3 corpus, together with information on system configurations and data formats.

2.2. Speaker description

We invited subjects from the college community (ages ranging from 20 to 30) to attend the three sessions of M3 data collection, with at least three-week intervals between sessions.
The subjects speak English as well as Cantonese and/or Mandarin. We have 32 subjects (23 males and 9 females) who completed all three sessions. They form the enrolled speaker set. Another 108 subjects were later invited to provide a single session of data. They form the independent speaker set.

2.3. Utterance design

We designed a series of text prompts to elicit from the subjects speech utterances that are appropriate for two purposes. First, the spoken utterances cover both English and Chinese

IV-301    1-4244-0728-1/07/$20.00 ©2007 IEEE    ICASSP 2007