FAST INCREMENTAL CLUSTERING OF GAUSSIAN MIXTURE SPEAKER MODELS FOR
SCALING UP RETRIEVAL IN ON-LINE BROADCAST
J. E. Rougui, M. Rziza, D. Aboutajdine
*
GSCM, Faculté des Sciences Rabat
4, Av Ibn Battouta B.P. 1014 RP
-Rabat- Morocco
rougui@lina.univ-nantes.fr
{rziza, aboutaj}@fsr.ac.ma
M. Gelgon, J. Martinez
†
Polytechnic school of Nantes university
LINA FRE CNRS 2729
BP 50609 - 44306 Nantes Cedex 03 - France
{lastname}@polytech.univ-nantes.fr
ABSTRACT
In this paper, we introduce a hierarchical classification
approach within the incremental framework of speaker indexing.
In a first phase, speaker-homogeneous segments are generated
incrementally. In a second phase, a hierarchical classification
approach is applied to these segments within the speaker
indexing framework. This approach combines the efficiency of
a Gaussian mixture model (GMM) merge algorithm with the high
accuracy of updating the Gaussian mixture models referenced
by the speaker tree index. An adaptive threshold algorithm
reduces the cost of searching the speaker GMMs in the balanced
binary tree that forms the speaker index, so that the search
complexity becomes logarithmic.
1. CONTEXT AND GOAL
The present paper is a contribution to the field of audio doc-
ument indexing and retrieval. The growing amount of multi-
media documents in digital form calls for techniques that, in
content organization and retrieval, offer performance both in
computational cost and relevance of the answers provided to
the user. In many cases, a trade-off is sought between these
two qualities.
Our work considers the case of spoken radio archives, in
which a continuous flow of speech is introduced into the in-
dexing system. News and discussion programs are quite typi-
cal of our scope, since they involve several speakers. The goal
of our proposal is to provide technology that scales up to long
spoken documents involving very many speakers. User
applications built on top of this would enable fast speaker
identity-based querying or browsing. As a side remark, it is
quite easy to automatically discard occasional short jingles,
thanks to their acoustic properties, leaving quite clean
speech sections.
* This work has been conducted under the Franco-Moroccan
program on NTIC directed by INRIA.
† LINA-GRIM, Management and summarization of multimedia data.
The present work is set in the framework of probabilistic
modeling and statistical decision criteria, which is common
and effective for speaker recognition. A classical solution to
the above-mentioned task consists in partitioning the flow
into speaker-homogeneous segments, then grouping the segments
originating from a single speaker. The latter operation
enables valuable forms of browsing for the user ('go directly
to all occurrences of such or such speaker in all radio
archives') and also benefits the accuracy of the system, in
the sense that gathering more acoustic data from a single
speaker enables refining that speaker's acoustic
representation. We choose to represent speaker acoustic
characteristics by a probability density over a
multi-dimensional mel-cepstral space.
In a usual implementation of this task, comparing the acoustic
representation of a speaker segment (notably, the one that has
just flowed into the system) to all registered speaker models
has a computational cost that rises linearly with the number
of speakers. The focus of this paper is to put forward a new
technique for organizing the set of speakers so as to obtain
sub-linear cost, by avoiding exhaustive evaluation over the
set of registered speakers. In this process, the trade-off is
of course to gain significantly in computational cost while
losing only a little reliability. The task relates tightly to
the very classical issue of indexing structures for
multi-dimensional data. The database community has put forward
a considerable number of contributions based on a variety of
tree structures.
The particularity of the current problem arises from the nature
of the entities to index, namely probability distributions, for
which classical indexing structures are inappropriate.
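To make the linear cost of the exhaustive baseline concrete, the
following minimal Python sketch scores one incoming segment
against every registered speaker model. It is an illustration
under stated assumptions, not the method of this paper: the
diagonal-covariance GMMs, the (weight, mean, variance) layout,
the function names and the toy 2-D features standing in for
mel-cepstral vectors are all hypothetical.

```python
import math

def log_gaussian_diag(x, mean, var):
    """Log density of a diagonal-covariance Gaussian at point x."""
    return sum(
        -0.5 * (math.log(2.0 * math.pi * v) + (xi - m) ** 2 / v)
        for xi, m, v in zip(x, mean, var)
    )

def gmm_avg_log_likelihood(frames, gmm):
    """Average per-frame log-likelihood of `frames` under `gmm`.

    `gmm` is a list of (weight, mean, var) components; this layout
    is an assumption made for the sketch.
    """
    total = 0.0
    for x in frames:
        logs = [math.log(w) + log_gaussian_diag(x, m, v) for w, m, v in gmm]
        top = max(logs)  # log-sum-exp for numerical stability
        total += top + math.log(sum(math.exp(l - top) for l in logs))
    return total / len(frames)

def exhaustive_match(frames, speaker_models):
    """Score the segment against every registered model.

    Returns the best-scoring speaker and the number of model
    evaluations, which equals the number of registered speakers.
    """
    scores = {name: gmm_avg_log_likelihood(frames, gmm)
              for name, gmm in speaker_models.items()}
    best = max(scores, key=scores.get)
    return best, len(scores)

# Toy registry of three speakers (hypothetical values).
speaker_models = {
    "spk_a": [(0.5, [0.0, 0.0], [1.0, 1.0]), (0.5, [1.0, 1.0], [1.0, 1.0])],
    "spk_b": [(1.0, [5.0, 5.0], [1.0, 1.0])],
    "spk_c": [(1.0, [-5.0, 5.0], [1.0, 1.0])],
}
segment = [[4.8, 5.1], [5.2, 4.9], [5.0, 5.3]]
best, n_evals = exhaustive_match(segment, speaker_models)
```

Every call to exhaustive_match evaluates all registered models, so
its cost grows in proportion to the size of the registry; this is
the exhaustive evaluation that an index over the speaker models
is meant to avoid.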
2. HIERARCHICAL CLUSTERING OF SPEAKER GMMS
2.1. Principles and existing work
As the incoming audio stream introduces new speakers,
speaker-homogeneous segments have to be matched to the
corresponding speaker model (if the speaker is already
enrolled). An exhaustive comparison of the acoustic
representation of a speaker segment (notably, the one that has
just flowed into the system) to all registered speaker models
has a computational cost that rises at least linearly with the
number of speakers (it can be more expensive if the matching
resorts to the history of feature vectors). The focus of this
section is to disclose a new technique to organize the set of
speakers to
V 521 142440469X/06/$20.00 ©2006 IEEE ICASSP 2006