FAST INCREMENTAL CLUSTERING OF GAUSSIAN MIXTURE SPEAKER MODELS FOR SCALING UP RETRIEVAL IN ON-LINE BROADCAST

J. E. Rougui, M. Rziza, D. Aboutajdine *
GSCM, Faculté des Sciences Rabat
4, Av Ibn Battouta B.P. 1014 RP - Rabat - Morocco
rougui@lina.univ-nantes.fr {rziza, aboutaj}@fsr.ac.ma

M. Gelgon, J. Martinez †
Polytechnic school of Nantes university
LINA FRE CNRS 2729
BP 50609 - 44306 Nantes Cedex 03 - France
{lastname}@polytech.univ-nantes.fr

ABSTRACT

In this paper, we introduce a hierarchical classification approach in the incremental framework of speaker indexing. In a first phase, speaker-homogeneous segments are generated incrementally. We then propose a hierarchical classification approach applied within the speaker indexing framework. This approach benefits from the efficiency of the Gaussian mixture model (GMM) merge algorithm and from the high accuracy of the updated Gaussian mixture models referenced by the speaker tree index. An adaptive threshold algorithm reduces the cost of searching the speaker GMMs in the balanced binary tree that forms the speaker index, so that the search complexity becomes logarithmic.

1. CONTEXT AND GOAL

The present paper is a contribution to the field of audio document indexing and retrieval. The growing amount of multimedia documents in digital form calls for techniques that, in content organization and retrieval, offer performance both in computational cost and in relevance of the answers provided to the user. In many cases, a trade-off is sought between these two qualities.

Our work considers the case of spoken radio archives, in which a continuous flow of speech is fed into the indexing system. News and discussion programs are quite typical of our scope, since they involve several speakers. The goal of our proposal is to provide technology that scales up to long spoken documents involving very many speakers.
User applications built on top of this would enable fast speaker-identity-based querying or browsing. As a side remark, it is quite easy to automatically discard occasional short jingles, thanks to their acoustic properties, leaving quite clean speech sections.

The present work is set in the framework of probabilistic modeling and statistical decision criteria, which is common and effective for speaker recognition. A classical solution to the above-mentioned task would consist in partitioning the flow into speaker-homogeneous segments, then grouping segments originating from a single speaker. The latter operation enables valuable forms of browsing for the user ('go directly to all occurrences of such or such speaker in all radio archives') and also benefits the accuracy of the system, in the sense that gathering more acoustic data from a single speaker enables refining his acoustic representation. We choose to represent speaker acoustic characteristics by a probability density over a mel-cepstral multi-dimensional space.

* This work has been conducted under the Franco-Moroccan program on NTIC directed by INRIA.
† LINA-GRIM, Management and summarization of multimedia data.

In a usual implementation of this task, comparing the acoustic representation of a speaker segment (notably, the one that has just flowed into the system) to all registered speaker models has a computational cost that rises linearly with the number of speakers. The focus of this paper is to put forward a new technique for organizing the set of speakers so as to obtain sub-linear cost, by avoiding exhaustive evaluation over the set of registered speakers. In this process, the trade-off is of course to reduce computational cost significantly while losing only little reliability. The task relates tightly to the very classical issue of indexing structures for multi-dimensional data.
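To make the linear cost of the usual implementation concrete, the following is a minimal sketch of exhaustive matching, not the method proposed in this paper. It assumes diagonal-covariance GMMs and an average per-frame log-likelihood as the matching score; the paper does not specify the criterion at this point, and the function names, the toy models and the feature dimension are all hypothetical.

```python
import numpy as np


def gmm_avg_loglik(x, weights, means, variances):
    """Average per-frame log-likelihood of frames x (T, D) under a diagonal GMM.

    weights: (K,), means: (K, D), variances: (K, D). Names are illustrative.
    """
    diff = x[:, None, :] - means[None, :, :]                           # (T, K, D)
    log_norm = -0.5 * np.sum(np.log(2.0 * np.pi * variances), axis=1)  # (K,)
    log_comp = log_norm - 0.5 * np.sum(diff ** 2 / variances, axis=2)  # (T, K)
    log_mix = np.log(weights) + log_comp                               # (T, K)
    # Log-sum-exp over components, for numerical stability
    m = log_mix.max(axis=1, keepdims=True)
    frame_ll = m[:, 0] + np.log(np.exp(log_mix - m).sum(axis=1))
    return float(frame_ll.mean())


def exhaustive_match(x, speaker_models):
    """Score a segment against every registered model: N evaluations for N speakers."""
    scores = [gmm_avg_loglik(x, *model) for model in speaker_models]
    best = int(np.argmax(scores))
    return best, scores


rng = np.random.default_rng(0)
dim = 4  # toy dimension; real mel-cepstral vectors are typically larger

# Three toy single-component "speakers" with well-separated means (assumed data)
speaker_models = [
    (np.array([1.0]), np.full((1, dim), c), np.ones((1, dim)))
    for c in (-5.0, 0.0, 5.0)
]

# A synthetic segment drawn near the third speaker's mean
segment = rng.normal(loc=5.0, scale=1.0, size=(200, dim))
best, scores = exhaustive_match(segment, speaker_models)
```

Every incoming segment triggers one likelihood evaluation per registered speaker, which is the cost that an index over the speaker models aims to avoid. On this toy data the third model (index 2) is selected.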
The database community has put forward a considerable number of contributions based on a variety of tree structures. The particularity of the current problem arises from the nature of the entities to index, namely probability distributions, for which classical indexing structures are inappropriate.

2. HIERARCHICAL CLUSTERING OF GMMS SPEAKERS

2.1. Principles and existing work

As the incoming audio stream produces new speakers, speaker-homogeneous segments have to be matched to the corresponding speaker model (if the speaker is already enrolled). An exhaustive comparison of the acoustic representation of a speaker segment (notably, the one that has just flowed into the system) with all registered speaker models has a computational cost that rises at least linearly with the number of speakers (it can be more expensive if the matching resorts to the history of feature vectors). The focus of this section is to disclose a new technique to organize the set of speakers to

V 521    142440469X/06/$20.00 ©2006 IEEE    ICASSP 2006