On calibration of language recognition scores

Niko Brümmer
Spescom Datavoice, Stellenbosch, South Africa
nbrummer@za.spescom.com

David A. van Leeuwen
TNO Human Factors, Soesterberg, the Netherlands
david.vanleeuwen@tno.nl

Abstract

Recent publications have examined the topic of calibration of confidence scores in the field of (binary-hypothesis) speaker detection. We extend this topic to the case of multiple-hypothesis language recognition. We analyze the structure of multiple-hypothesis recognition problems to show that any such problem subsumes a multitude of derived sub-problems and that therefore the calibrations of all of these problems are interrelated. We propose a simple global calibration metric that can be generally applied to a multiple-hypothesis problem and then demonstrate experimentally on some NIST-LRE-'05 data how this relates to the calibration of some of the derived binary-hypothesis sub-problems.

1. Introduction: What is calibration?

There has been much recent interest in the topic of calibration of speaker detection confidence measures [1, 2, 8, 9, 10, 6]. This paper extends that topic to the case of language recognition. Calibration in language recognition is qualitatively different, because in language recognition there are multiple hypotheses instead of just two.

The issue of calibration of language recognition scores has been addressed in the NIST Language Recognition Evaluations (LREs) [5] via the pooling (over all target languages) of one-against-the-rest detection scores. The calibration of these pooled scores was then analyzed with the same tools (DET-curves, EER and 'min CDET') that are familiar from the NIST Speaker Recognition Evaluations (SREs) [6]. However, in discussions and presentations at the December 2005 LRE Workshop, it became clear that there are some problems associated with the analysis of pooled scores. Briefly, all of these analysis methods assume the use of a single decision threshold, but there cannot be a single threshold that is valid for the pooled scores. This paper is intended as a constructive response to this analysis problem. In summary, we propose two alternative calibration analyses. One is simply to keep scores for different targets separate and to analyze them separately. The other involves a global calibration transformation of the relative likelihoods of all the languages.

We introduce the topic by giving an intuitive definition of calibration. The purpose of speech processing technology is to extract relevant information from speech. If the technology is good, then this information should enable the user to derive benefit from employing it. In general, the 'better' the quality of this information, the more benefit can be derived.

There are many different ways to measure the quality of information. Indirect measurements judge the benefit derived from using the information in specific applications. The best-known indirect measure of information is to employ the information to make recognition decisions (such as who is speaking, what is being said, or in what language) and then to estimate error-rates. It is also possible to directly measure the empirical amount of information, in bits of Shannon entropy, that a given speech technology delivers to the user in a set of supervised recognition trials. In fact, as we have pointed out in previous work, there is a very direct relationship between error-rates and information: the information delivered to the user can be expressed as a total error-rate, obtained by integrating the average error-rate of a recognizer over a wide range of operating points [2].
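As a concrete illustration of such a direct measurement, the sketch below computes an empirical cross-entropy, in bits, over a set of supervised trials, assuming the recognizer outputs posterior probabilities over the N languages. This is a generic sketch of the idea only; the paper's specific metric (its equation 17, which appears later in the paper) is not reproduced here, and the function name and toy data are ours.

```python
import numpy as np

def empirical_cross_entropy_bits(posteriors, labels):
    """Average logarithmic cost, in bits, over supervised trials.

    posteriors: (T, N) array; row t is the recognizer's posterior
                probability distribution over the N languages for trial t.
    labels:     length-T array of true-language indices.

    A recognizer that always outputs the uniform distribution scores
    log2(N) bits; values below that indicate that information is
    actually being delivered to the user.
    """
    posteriors = np.asarray(posteriors, dtype=float)
    labels = np.asarray(labels, dtype=int)
    # Cost of trial t is -log2 of the posterior assigned to the true language.
    costs = -np.log2(posteriors[np.arange(len(labels)), labels])
    return costs.mean()

# Toy usage: 3 languages, 2 trials; compare against log2(3) ~ 1.585 bits.
post = [[0.7, 0.2, 0.1],
        [0.1, 0.8, 0.1]]
print(empirical_cross_entropy_bits(post, [0, 1]))  # ~0.418 bits
```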
In this paper, we shall perform an analysis of the information flow through a language recognition system. Most importantly, we want to be able to measure the amount of information that is delivered to the user by the recognizer. This measurement is not so difficult: all you need is a supervised NIST evaluation database and equation 17.

But having achieved this, we want to take the analysis further, to help us improve the information delivery. We want to decompose our information measure into two components. These components address two important issues:

Content: The information must be there.¹ The result delivered to the user must actually contain the information that we are interested in. In [2] we used the term discrimination for a (direct) measure of the information content. Other authors use the term refinement [3, 8, 4]. Well-known indirect measures of information content include error-rate measures such as equal-error-rate (EER) and cost-based measures such as 'min CDET', as used in the NIST Speaker Recognition Evaluations [6].

Form: The information must be in a standard form that is easily interpretable by the user. The user should be able to employ the information directly, in standard ways, without needing further knowledge specific to the properties of the recognizer. Even if the information is present, if the user misunderstands it, it cannot be employed to the user's benefit. This quality of the form of the result delivered to the user is termed calibration [3, 2, 8, 4].

In summary, the decomposition we want to perform is: information delivered to user = information present - information lost via misinterpretation. A sketch of this accounting follows below.

¹ If the information is not there in the first place, no further interpretation of the result can extract more information. This can be formally expressed via the data processing inequality.
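To make the decomposition concrete, here is a toy sketch in which 'information lost via misinterpretation' is estimated as the cost saved by a hypothetical one-parameter recalibration: a simple rescaling of the log-scores, fitted on the supervised trials themselves. The paper's actual analysis uses its own, more general calibration transformation, so this is only an illustration of the accounting, not the authors' method; all names and toy data are ours.

```python
import numpy as np
from scipy.optimize import minimize_scalar

def cross_entropy_bits(log_scores, labels):
    """Average -log2(posterior of the true language), where posteriors are
    obtained by normalizing the given unnormalized log-scores."""
    log_scores = np.asarray(log_scores, dtype=float)
    log_post = log_scores - np.logaddexp.reduce(log_scores, axis=1, keepdims=True)
    return -log_post[np.arange(len(labels)), np.asarray(labels)].mean() / np.log(2)

def decompose_loss(log_scores, labels):
    """Split the total logarithmic cost (in bits) into two parts:
       refinement loss : cost remaining after the best scaling of the scores,
                         i.e. after this toy recalibration has done its work
                         (a measure of missing information content);
       calibration loss: the extra cost the raw scores incur on top of that
                         (information lost via misinterpretation).
    """
    log_scores = np.asarray(log_scores, dtype=float)
    actual = cross_entropy_bits(log_scores, labels)
    best = minimize_scalar(lambda a: cross_entropy_bits(a * log_scores, labels),
                           bounds=(1e-3, 1e3), method='bounded')
    refinement_loss = best.fun
    # Clamp tiny negatives that can arise from optimizer tolerance.
    return refinement_loss, max(0.0, actual - refinement_loss)

# Toy usage: over-confident scores (well-calibrated posteriors, cubed),
# including one erroneous trial so that over-confidence is penalized.
scores = 3.0 * np.log([[0.8, 0.1, 0.1],
                       [0.1, 0.8, 0.1],
                       [0.2, 0.7, 0.1],
                       [0.4, 0.5, 0.1]])
refined, cal_loss = decompose_loss(scores, [0, 1, 1, 0])
print(refined, cal_loss)  # small positive calibration loss for these scores
```

The design point of the sketch is that both terms are measured in the same units (bits per trial), so the total cost of the raw scores is exactly their sum, mirroring the decomposition stated above.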