Approaching Human Listener Accuracy with Modern Speaker Verification

Ville Hautamäki, Tomi Kinnunen, Mohaddeseh Nosratighods, Kong-Aik Lee, Bin Ma, and Haizhou Li

Institute for Infocomm Research (I²R), A*STAR, Singapore
School of Computing, University of Eastern Finland, Joensuu, Finland
School of Electrical Engineering and Telecommunications, University of New South Wales, Australia
{vishv,kalee,mabin,hli}@i2r.a-star.edu.sg, tkinnu@cs.joensuu.fi, hadis@unsw.edu.au

Abstract

Being able to recognize people from their voice is a natural ability that we take for granted. Recent advances have shown significant improvement in automatic speaker recognition performance. Besides being able to process large amounts of data in a fraction of the time required by humans, automatic systems are now able to deal with diverse channel effects. The goal of this paper is to examine how a state-of-the-art automatic system performs in comparison with human listeners, and to investigate a strategy for a human-assisted form of automatic speaker recognition, which is useful in forensic investigation. We set up an experimental protocol using data from the NIST SRE 2008 core set. A total of 36 listeners participated in the listening experiments from three sites, namely Australia, Finland and Singapore. The state-of-the-art automatic system achieved a 20% error rate, whereas fusion of human listeners achieved 22%.

1. Introduction

It is a long-held belief that while computers are faster at processing large amounts of data, they cannot outperform human accuracy in real-world pattern recognition tasks. Human beings are outstanding at recognizing spoken words (speech content) under varying conditions, including background noise, transmission channels, reverberation and the presence of other interfering speakers. The reason is that humans rely on several different levels of information in the speech signal to recognize others from voice alone. These cues might be a certain usage of words, speaking habits or a unique style in a person's laughter. It is complicated to extract the speaking habits or style of a person automatically. Therefore, automatic systems mostly rely on low-level spectral features to discriminate speakers. However, spectral features are susceptible to environmental and intra-speaker variation and, compared to human-based detection systems, they are usually less robust under severely mismatched conditions. While the best-performing automatic speech recognition (ASR) systems can already handle some of these conditions quite well, it remains a great engineering challenge to make the systems robust under all those conditions [1].

What about the speaker and language recognition accuracy of human beings? It can be argued that the speech content (words) and the affective cues (emotions and attitudes) are the most important information for social communication between human beings. But what would be the advantage of being able to recognize different speakers and languages? It can be hypothesized that, at best, the speaker and language cues are of secondary importance. It is of great scientific interest, then, to know whether automatic methods can outperform human beings in speaker and language recognition tasks. When developing new speaker and language recognition methods, should we take the human being as our benchmark?
Such questions are also of great importance for forensic audio analysis, where a mixture of automatic and semi-automatic methods and aural recognition is commonly used [1, 2].

A few studies have compared human and machine performance in the speaker [3, 1, 4] and language [5] recognition tasks. In this paper we focus on the speaker recognition task (Table 1). One of the most extensive comparisons between aural and automatic systems was conducted a decade ago [3]. In that study, human speaker recognition performance was compared to three automatic systems on the NIST 1998 speaker recognition evaluation (SRE) data. The average human equal error rate (EER) over all trials was 23%. The accuracy improved to 12% after combining all the listeners' verification scores by averaging. In matched channel conditions, the human mean and the best automatic system both gave an EER of 8%. However, in the channel-mismatched condition, the human mean was 14% EER, whereas machine accuracy degraded to 24% EER, supporting the assumption that humans are more robust under signal distortions. It should be noted, however, that while the human average was good, there was large variance between the individual listeners. It is also noteworthy that the listening experiment in [3] was done in a controlled laboratory environment where the listeners needed to make decisions within short intervals.

More recently, in [1] human speaker recognition performance was compared against an automatic system in a forensic setting. Unlike in [3], where the listening was strictly controlled, the subjects could listen to the material as long as they wanted. The material included forensic recordings (French Polyphone-IPSC02 corpus) from 10 speakers under three different conditions. There were as many as 90 listeners, each listening to 25 verification trials. The accuracy was compared to a Gaussian mixture model (GMM) recognizer using perceptual linear prediction (PLP) features. The conclusions were similar to [3]: under channel mismatch, the human listening pool outperformed the automatic system. Interestingly, it was found that the automatic system outperformed humans in the matched channel conditions.

In another study [4], focusing mainly on speech disguise but also comparing average human accuracy to a more modern Gaussian mixture model–universal background model (GMM-UBM) [6] system, the authors used a self-collected corpus with 32 speakers recorded in four different sessions. In two or more of the sessions the speakers were asked to disguise their voices so as not to sound like themselves. The listener pool included 25 listeners and, similar to [3], the listening was done under a controlled set-up where the listeners could not play with the samples.
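For readers less familiar with the detection metrics used throughout these comparisons, the following sketch illustrates how an EER is computed from raw verification scores, and how a listener pool can be fused by simple score averaging as in [3]. The scores, listener count and trial counts below are made-up toy values, not data from any of the cited studies.

```python
import numpy as np

def eer(scores, labels):
    """Equal error rate: sweep a decision threshold over the scores and
    return the error rate where the false acceptance rate (impostor
    trials scored above threshold) and the false rejection rate (target
    trials scored below it) are closest to equal."""
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=bool)        # True = target trial
    best_gap, best_eer = np.inf, 1.0
    for t in np.sort(scores):
        far = np.mean(scores[~labels] >= t)        # false acceptance rate
        frr = np.mean(scores[labels] < t)          # false rejection rate
        if abs(far - frr) < best_gap:
            best_gap, best_eer = abs(far - frr), (far + frr) / 2.0
    return best_eer

rng = np.random.default_rng(0)
n_trials, n_listeners = 200, 10
labels = rng.random(n_trials) < 0.5                # target vs. non-target trials

# Toy listener scores: each listener produces a noisy score that is
# higher on average for target trials (entirely synthetic data).
listener_scores = labels.astype(float) + rng.normal(0.0, 1.0,
                                                    size=(n_listeners, n_trials))

individual = [eer(s, labels) for s in listener_scores]
fused = eer(listener_scores.mean(axis=0), labels)  # fusion by score averaging

print(f"mean individual EER:  {np.mean(individual):.3f}")
print(f"fused (averaged) EER: {fused:.3f}")
```

Because the listeners' errors are partly independent, averaging their scores cancels noise and yields a lower EER than the typical individual listener, which is the same mechanism behind the improvement from 23% to 12% reported in [3].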
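The GMM-UBM approach [6] mentioned above scores each trial as a log-likelihood ratio between a target-speaker model derived from a universal background model and the UBM itself. Below is a minimal sketch of that scoring idea using scikit-learn's GaussianMixture; the feature dimensionality, mixture size and the random vectors standing in for real PLP or MFCC features are all illustrative assumptions, and warm-starting EM from the UBM parameters is a simplification of the full MAP adaptation recipe in [6].

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(1)
dim, n_mix = 12, 8                                 # illustrative sizes

# Stand-ins for spectral feature vectors (PLP/MFCC frames in practice).
background = rng.normal(size=(2000, dim))          # pooled non-target speech
enrol = rng.normal(loc=0.5, size=(300, dim))       # target enrolment data
test = rng.normal(loc=0.5, size=(100, dim))        # test utterance

# 1) Train the universal background model (UBM) on background speech.
ubm = GaussianMixture(n_components=n_mix, covariance_type="diag",
                      random_state=0).fit(background)

# 2) Derive the target model from the UBM by warm-starting EM from the
#    UBM parameters (a stand-in here for MAP adaptation of the means).
target = GaussianMixture(n_components=n_mix, covariance_type="diag",
                         weights_init=ubm.weights_, means_init=ubm.means_,
                         precisions_init=ubm.precisions_, max_iter=3,
                         random_state=0).fit(enrol)

# 3) Verification score: average per-frame log-likelihood ratio,
#    positive when the target model explains the test frames better.
llr = target.score(test) - ubm.score(test)         # score() = mean log-likelihood
print(f"log-likelihood ratio score: {llr:.3f}")
```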