Integrating Speaker and Speech Recognizers: Automatic Identity Claim Capture for Speaker Verification

Larry Heck, Dominique Genoud
Nuance Communications, Menlo Park, CA 94025, USA
heck@nuance.com, genoud@nuance.com

Abstract

This paper presents a novel approach to the integration of a speech and speaker recognizer for the purpose of automatically capturing the identity claim of a user. The approach integrates the speaker recognition score into the search process of the speech recognizer, resulting in a best hypothesis that jointly optimizes the probability of the word sequence and the speaker. This facilitates the use of a natural speech-based interface, where the identity claim can be ambiguous and relatively difficult to recognize (e.g., names). This paper presents a theoretical framework for the integration of speech and speaker recognition systems. In addition, experimental results are presented that show a 35% reduction in the NL-error rate on an over-the-telephone speech recognition task, where the test set consists of users from a US city of one million people identifying themselves by simply speaking their names.

1. Introduction

One of the most challenging tasks for commercial speaker verification systems is to design a natural, convenient interface for capturing the identity claim of a user. For telephone applications, many current systems rely on a DTMF-based approach, where the user claims their identity by entering their account number through the telephone's touch-tone keypad. After the identity claim is established, the system verifies the claim by asking the user to speak a phrase (e.g., a password or a random phrase) and then scores this utterance against a speaker model of the user created in a previous enrollment session.
While a DTMF-based approach to capturing the identity claim of a user has been widely adopted, we have observed in our trials and deployments that a majority of people prefer a more natural, speech-based interface that allows them to simply speak their identity claim. Automatic speech recognition systems can be used to capture the identity claim, but recognition performance is often poor over large populations. This is particularly true for personal names and, in some cases (e.g., high-noise environments), telephone numbers. In addition, the identity claim is often not unique over large populations (e.g., John Smith), which introduces the further complication of resolving this ambiguity without requiring the user to provide additional information.

Typical speech-based approaches to capturing the identity claim have relied solely on general speech recognition technology. However, given that the identity claim is spoken by the individual associated with the claim, speaker recognition/verification technology could be utilized to determine whether the voice of the talker matches the speaker model associated with the identity claim. Somewhat related work exists in [4, 5], where the approaches attempted to use speaker recognizers to effectively "quantize" the speaker into speech recognition acoustic models built on similar-sounding speakers. However, those approaches focused on reducing speaker mismatch in the speech recognizer, not on using the speaker recognizer to improve the performance of identity claim capture.

This paper presents a novel approach to the integration of a speech and speaker recognizer for the purpose of automatically capturing the identity claim of a user. Section 2 formulates the problem of identity claim capture in the more general context of integrating a speech and speaker recognition system.
Sections 3 and 4 briefly describe the speaker and speech recognition systems used in this study. Section 5 presents the computational impact of integrating a speech and speaker recognizer for identity claim capture. Finally, experiments on both digit- and name-based identity claims over large populations are presented in Section 6.

2. Mathematical Formulation

The problem of capturing the identity claim of a user through a speech utterance can be expressed in the more general context of integrating a speech and speaker recognition system. The goal is to find the word sequence W and the speaker S with the highest joint probability among all possible word sequences and speakers, conditioned on a feature vector sequence X = x_1, ..., x_T:

(W*, S*) = argmax_{W, S} P(W, S | X).

Every identity claim in the dictionary is mapped to a sequence of HMMs, which themselves consist of states q, such that every word is equiva-

2001: A Speaker Odyssey, The Speaker Recognition Workshop, Crete, Greece, June 18-22, 2001. ISCA Archive: http://www.isca-speech.org/archive
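To give a concrete feel for the joint optimization, one simple approximation is to rescore an N-best list of identity-claim hypotheses with the speaker-verification score for the speaker model enrolled under each candidate. The following is a minimal sketch only: the names, all numeric scores, and the log-linear combination weight are invented for illustration and are not taken from the paper, which integrates the speaker score directly into the recognizer's search rather than via post-hoc rescoring.

```python
import math

# Hypothetical N-best identity-claim hypotheses from a speech recognizer,
# each with a made-up log-probability log P(W | X).
nbest = {
    "jon smythe": math.log(0.45),
    "john smith": math.log(0.40),
    "joan smit":  math.log(0.15),
}

# Hypothetical speaker-verification log-likelihoods log P(X | S_W), scoring
# the utterance against the speaker model enrolled under each candidate.
speaker_scores = {
    "jon smythe": -5.0,
    "john smith": -1.2,
    "joan smit":  -6.1,
}

def joint_rescore(nbest, speaker_scores, weight=1.0):
    """Return the claim maximizing log P(W | X) + weight * log P(X | S_W),
    a log-linear stand-in for the joint argmax over words and speakers."""
    return max(nbest, key=lambda w: nbest[w] + weight * speaker_scores[w])

# With weight=0 the decision is ASR-only and picks "jon smythe"; including
# the speaker evidence flips the jointly best claim to "john smith".
best = joint_rescore(nbest, speaker_scores)
```

In this toy setting the speaker score resolves exactly the kind of ambiguity described in the introduction: acoustically confusable names that the recognizer alone cannot reliably separate.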