EFFECTS OF DEVICE MISMATCH, LANGUAGE MISMATCH AND ENVIRONMENTAL
MISMATCH ON SPEAKER VERIFICATION
Bin Ma¹, Helen M. Meng¹ and Man-Wai Mak²
¹Dept. of Systems Engineering and Engineering Management, The Chinese University of Hong Kong
²Dept. of Electronic and Information Engineering, The Hong Kong Polytechnic University
¹{bma, hmmeng}@se.cuhk.edu.hk, ²enmwmak@polyu.edu.hk
ABSTRACT
Device, language and environmental mismatch adversely affect speaker verification (SV) performance. We investigate such effects empirically based on the M3 (multi-biometric, multilingual and multi-device) Corpus [1]. Device mismatch (among a 3G phone, a Pocket PC and a desktop PC plug-in microphone) brings a relative performance degradation of 523%; language mismatch (between English and Cantonese) brings 284%; and environmental mismatch (between an office environment and a recording studio) brings 109%. In particular, verification with wide-band models on narrow-band test data outperforms narrow-band models on wide-band test data. The 3G phone's SV performance is generally low, but remains stable across environments. Additionally, durational variations within two-second utterances may cause a relative change of 633% in SV performance.
Index Terms— Speaker verification, biometrics corpus,
M3 speaker verification evaluation
1. INTRODUCTION
Speaker verification is the process of authenticating the
speaker’s claimed identity based on his/her input utterances.
This technology plays a key role in securing computing for
human-centric computer interfaces. In real-time applications, the proliferation of mobile, handheld devices presents challenges for speaker verification. For example, mobile use means that speaker verification must handle a variety of environmental conditions. Also, different audio input devices (e.g., microphones on PDAs or cell phones) may induce significant variations in the quality of the captured speech. Techniques such as feature mapping [2], speaker model synthesis [3] and handset normalization [4] have been proposed to alleviate this problem. The language spoken may also affect SV performance, as demonstrated in our previous work [5]. The length of the test utterance segments is another factor affecting SV performance. In particular, it has been shown that the EER of an SV system is exponentially related to the length of the test segment [6]. The current study attempts to quantify such effects based on SV experiments with the M3 speech data, which contains multilingual, multi-device speech data collected for mobile use, as will be elaborated later.
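The relative degradation figures quoted above (e.g., 523%) are most naturally read as the relative increase in equal error rate (EER) under mismatched conditions; this reading is our assumption, since the metric is not defined explicitly in this excerpt. A minimal sketch:

```python
def relative_degradation(eer_matched: float, eer_mismatched: float) -> float:
    """Relative performance degradation in percent, assuming it is
    defined as the relative increase in EER under mismatch."""
    return 100.0 * (eer_mismatched - eer_matched) / eer_matched

# Illustrative numbers only: an EER rising from 5% (matched) to
# 31.15% (mismatched) corresponds to a 523% relative degradation.
print(relative_degradation(5.0, 31.15))
```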
2. THE SPEECH DATA OF M3 CORPUS
The M3 corpus is designed to support research in multi-biometric technologies for pervasive computing using mobile devices. Three kinds of biometrics, three devices, as well as three languages, are included in M3. Our research focuses on the speech data in M3. A brief introduction to the M3 speech data is presented in this section.
2.1. Speech data collection setup
During data collection, the multilingual speech data were captured on multiple devices under two recording conditions: an open laboratory and a recording room. The devices include a Pocket PC (PPC), a 3G phone and a desktop PC plug-in microphone. Details are listed in Table 1. The speech data across devices were recorded simultaneously.
Device                          Configuration                        Format
Pocket PC                       Model: HP iPAQ H2200 series          22 kHz, 16-bit mono wav
3G phone                        Model: NEC C616                      8 kHz, 16-bit mono wav
Desktop PC plug-in microphone   PC: Pentium 3 996 MHz, 512 MB;       16 kHz, 16-bit mono wav
                                Mic: Shure BG 1.1 cardioid
Table 1. Recording devices used in the M3 corpus, together with information on system configurations and data formats.
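Because the three devices record at different sampling rates, cross-device experiments typically resample all recordings to a common rate before feature extraction. The sketch below is our illustration, not a description of the paper's actual preprocessing: it downsamples wide-band audio to the 3G phone's 8 kHz narrow-band rate with scipy's polyphase resampler, which applies the required anti-aliasing low-pass filter.

```python
import numpy as np
from scipy.signal import resample_poly

def to_narrowband(signal: np.ndarray, src_rate: int, dst_rate: int = 8000) -> np.ndarray:
    """Resample a recording to the 3G phone's 8 kHz rate using
    polyphase filtering (includes anti-aliasing low-pass)."""
    g = np.gcd(src_rate, dst_rate)
    return resample_poly(signal, dst_rate // g, src_rate // g)

# One second of 16 kHz desktop-microphone audio becomes 8000 samples.
x = np.zeros(16000)
y = to_narrowband(x, 16000)
```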
2.2. Speaker description
We invited subjects from the college community (ages 20 to 30) to attend the three sessions of M3 data collection, with at least three-week intervals between sessions. The subjects speak English as well as Cantonese and/or Mandarin. 32 subjects (23 males and 9 females) completed all three sessions; they form the enrolled speaker set. Another 108 subjects were later invited to provide a single session of data; they form the independent speaker set.
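One plausible way to use such a split (our assumption, not a statement of the paper's actual trial design) is to pair each enrolled model with its own data for genuine trials and with the independent speakers for impostor trials:

```python
from itertools import product

def make_trials(enrolled, independent):
    """Build (model, test_speaker, is_genuine) trials: each enrolled
    model is tested against its own data (genuine) and against every
    independent speaker (impostor)."""
    genuine = [(s, s, True) for s in enrolled]
    impostor = [(m, i, False) for m, i in product(enrolled, independent)]
    return genuine + impostor

trials = make_trials([f"E{k}" for k in range(32)], [f"I{k}" for k in range(108)])
# 32 genuine trials and 32 * 108 = 3456 impostor trials
```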
2.3. Utterance design
We designed a series of text prompts to elicit speech utterances from the subjects that are appropriate for two purposes. First, the spoken utterances cover both English and Chinese
1-4244-0728-1/07/$20.00 ©2007 IEEE    ICASSP 2007