Exploring Session Variability and Template Aging in Speaker Verification for Fixed Phrase Short Utterances

Rohan Kumar Das, Sarfaraz Jelil and S. R. Mahadeva Prasanna

Department of Electronics and Electrical Engineering, Indian Institute of Technology Guwahati, Guwahati-781039, India
{rohankd, sarfaraz, prasanna}@iitg.ernet.in

Abstract

This work highlights the impact of session variability and template aging on speaker verification (SV) using fixed phrase short utterances from the RedDots database. The utterances were collected over a period of one year and comprise a large number of sessions per speaker. Session variation has been found to have a direct influence on SV performance, and its significance is even greater for fixed phrase short utterances, as only a very small amount of speech data is available for speaker modeling and testing. Similarly, in a practical deployable SV system that sees large session variation over a period of time, template aging of the speakers may affect SV performance. This work addresses some issues related to session variability and template aging that arise in data with large session variability and that, if considered, can be used to improve the performance of an SV system.

Index Terms: speaker verification, session variability, template aging

1. Introduction

The current achievements in the field of speaker verification (SV) have found widespread use in various application-oriented services. These services mainly focus on short utterances for recognizing speakers because of the time constraint involved, which makes deployment feasible. Fixed phrase short utterances form the basis of the short utterance case, as little time is needed for training and testing.
However, when such systems are deployed for regular use over a long period of time, session variability and template aging may degrade recognition performance.

Several works have addressed the issue of session variability. In [1], the authors explicitly model session variability by generating a session dependent factor in a low dimensional subspace. The efficacy of this approach is demonstrated on the NIST database in a text-independent framework, which clearly shows the significance of session variation for SV performance. Another way to handle session variation is to apply session compensation techniques that reduce its effect. There are different approaches to session compensation, among them joint factor analysis (JFA), linear discriminant analysis (LDA) and nuisance attribute projection (NAP) [2, 3, 4]. These approaches are found to help SV performance. The authors in [5] compare different session variability compensation approaches in SV. The work reported in [6] proposes an approach based on maximum-likelihood linear regression (MLLR) adaptation that uses transforms for multiple recognition models and phone classes for session variability normalization, which improves SV performance. Thus, the impact of session variation is found to be crucial for SV performance.

The aging phenomenon in different biometrics is an important aspect of cutting edge technologies when they are viewed from the standpoint of a practical deployable system [7]. For speech biometric based systems, the aging effect of speaker models has not been addressed to a large extent. The studies of [8], carried out on data from 22 speakers collected over three sessions with gaps of 1-2 months, show that a time lapse before the test session degrades performance to an extent.
In [9], the authors study long term aging data from 18 speakers spanning 30-60 years and show that, as the speaker templates age, the genuine scores are affected more severely than the impostor scores. The work in [10] reports that the error rate doubles when the train and test sessions are more than a month apart. In [11], the author explores the aging effect on data collected over an interval of four years and reports that performance degrades progressively more for trials with a larger time interval from training.

The limited exploration of template aging is mainly due to the lack of databases with large session variation from a sizeable population of speakers. The data recently made available as part of the RedDots project has opened the door to exploring template aging for fixed phrase short utterances [12]. In the current work, the effect of session variability is addressed by creating a speaker model from three templates with session variation (the first, middle and last sessions) and then testing with the remaining templates of the RedDots dataset. Creating speaker models from data with large session variation is expected to perform better than the baseline, because session variability is taken into account during speaker modeling. Further, template aging studies are conducted with speaker models created in two ways: the first uses the first three sessions and the second uses the last three sessions. It is hypothesized that there may be a significant difference in speaker characteristics between the first three and the last three sessions, which are collected over a span of one year, and that this difference can be critical for a practical deployed system.
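The three enrollment choices described above can be illustrated with a minimal sketch. It is not the authors' implementation: the function name `split_sessions`, the strategy labels, the use of `n // 2` as the "middle" session, and the decision to test on all remaining sessions in every case are assumptions made only for illustration.

```python
from typing import List, Sequence, Tuple


def split_sessions(sessions: Sequence[str], strategy: str) -> Tuple[List[str], List[str]]:
    """Split one speaker's sessions into enrollment and test sets.

    `sessions` must already be in chronological order. The strategy
    names are hypothetical labels for the three setups in the text.
    """
    n = len(sessions)
    if n < 4:
        raise ValueError("need at least four sessions: three to enrol, one to test")

    if strategy == "first_middle_last":
        # Session variability setup; n // 2 as the middle is an assumption.
        enrol_idx = [0, n // 2, n - 1]
    elif strategy == "first_three":
        # Template aging setup 1: enrol on the earliest sessions.
        enrol_idx = [0, 1, 2]
    elif strategy == "last_three":
        # Template aging setup 2: enrol on the most recent sessions.
        enrol_idx = [n - 3, n - 2, n - 1]
    else:
        raise ValueError(f"unknown strategy: {strategy}")

    chosen = set(enrol_idx)
    enrol = [sessions[i] for i in enrol_idx]
    # Assumption: every session not used for enrollment is used for testing.
    test = [s for i, s in enumerate(sessions) if i not in chosen]
    return enrol, test


if __name__ == "__main__":
    # Hypothetical session identifiers for one speaker, oldest first.
    speaker_sessions = [f"s{i:02d}" for i in range(1, 8)]
    for name in ("first_middle_last", "first_three", "last_three"):
        enrol, test = split_sessions(speaker_sessions, name)
        print(name, enrol, test)
```

Comparing verification scores between the `first_three` and `last_three` splits on the same speakers is one way to expose a template aging effect, since the two models differ only in how old the enrollment data is relative to the test sessions.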
The novelty of this work lies in addressing the effect of session variability and template aging, to some extent, with analysis. This knowledge can be utilized in a practical SV framework deployed for regular use.

Copyright © 2016 ISCA. INTERSPEECH 2016, September 8–12, 2016, San Francisco, USA. http://dx.doi.org/10.21437/Interspeech.2016-1001

The remainder of the paper is organized as follows: Section 2 explains the development of the baseline SV system for the RedDots challenge. In Section 3 the proposed framework