Research Article
A Russian Keyword Spotting System Based on Large Vocabulary
Continuous Speech Recognition and Linguistic Knowledge
Valentin Smirnov,¹ Dmitry Ignatov,¹ Michael Gusev,¹ Mais Farkhadov,² Natalia Rumyantseva,³ and Mukhabbat Farkhadova³

¹Speech Drive LLC, Saint Petersburg, Russia
²V.A. Trapeznikov Institute of Control Sciences of RAS, Moscow, Russia
³RUDN University, Moscow, Russia
Correspondence should be addressed to Mais Farkhadov; mais.farhadov@gmail.com
Received 11 July 2016; Revised 27 October 2016; Accepted 14 November 2016
Academic Editor: Alexey Karpov
Copyright © 2016 Valentin Smirnov et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
This paper describes the key concepts of a keyword spotting system for Russian based on large vocabulary continuous speech recognition. Key algorithms and system settings are described, including the pronunciation variation algorithm, and experimental results on real-life telecom data are provided, together with a description of the system architecture and the user interface. The system is based on the CMU Sphinx open-source speech recognition platform and on linguistic models and algorithms developed by Speech Drive LLC. The effective combination of baseline statistical methods, real-world training data, and the intensive use of linguistic knowledge produced results of a quality suitable for industrial use.
1. Introduction
The need to understand business trends, ensure public security, and improve the quality of customer service has driven the sustained development of speech analytics systems, which transform speech data into a measurable and searchable index of words, phrases, and paralinguistic markers. Keyword spotting technology makes up a substantial part of such systems. Modern keyword spotting engines usually rely on one of three approaches, namely, phonetic lattice search [1, 2], word-based models [3, 4], and large vocabulary continuous speech recognition (LVCSR) [5]. While each approach has its pros and cons [6], the latter has become prominent due to the public availability of baseline algorithms, cheaper hardware for the intensive computations that LVCSR requires, and, most importantly, the high quality of its results.
Most recently, a number of innovative approaches to spoken term detection have been offered, such as recognition system combination and score normalization, reporting a 20% increase in spoken term detection quality (measured as actual term-weighted value, ATWV) [7, 8]. The application of deep neural networks in LVCSR is also achieving wide adoption [9]. Thanks to the IARPA Babel program, aimed at building systems that can be rapidly applied to any human language in order to provide effective search capability for analysts to efficiently process massive amounts of real-world recorded speech [10], extensive research has been conducted in recent years on spoken term detection for low-resource languages. For example, [11] describes an approach to keyword spotting in Cantonese based on large vocabulary speech recognition and shows positive results from applying neural networks to recognition lattice rescoring. Reference [12] provides an extensive description of modern methods used to build a keyword spotting system for 10 low-resource languages, with primary focus on Assamese, Bengali, Haitian Creole, Lao, and Zulu. Deep neural network acoustic models are used both as feature extractors for a GMM-based HMM system and to compute state posteriors, which are converted into scaled likelihoods by normalizing by the state priors. Data augmentation via multilingual bottleneck features is offered (the topic is also covered in [13]). Finally, language-independent and unsupervised acoustic models are trained
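The posterior-to-likelihood conversion mentioned above is the standard trick in hybrid DNN-HMM systems: the network estimates state posteriors p(s|x), and dividing by the state priors p(s) yields scaled likelihoods p(x|s)/p(x) that the HMM decoder can use in place of GMM likelihoods. A minimal sketch of that step (the function name and example values are ours, not from the cited systems):

```python
import numpy as np

def posteriors_to_scaled_log_likelihoods(posteriors, priors, floor=1e-8):
    """Convert per-frame DNN state posteriors p(s|x) into log scaled
    likelihoods log(p(s|x) / p(s)), flooring to avoid log(0)."""
    posteriors = np.maximum(np.asarray(posteriors, dtype=float), floor)
    priors = np.maximum(np.asarray(priors, dtype=float), floor)
    return np.log(posteriors) - np.log(priors)

# One frame over three HMM states: a rare state (prior 0.1) with posterior 0.1
# scores the same as chance, while a common state (prior 0.5) needs a higher
# posterior to earn a positive score.
frame_posteriors = [0.6, 0.3, 0.1]
state_priors = [0.5, 0.4, 0.1]
print(posteriors_to_scaled_log_likelihoods(frame_posteriors, state_priors))
```

In decoding, these per-frame scores simply replace the GMM emission log likelihoods; the common scale factor 1/p(x) is constant within a frame and does not affect the best path.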
Hindawi Publishing Corporation, Journal of Electrical and Computer Engineering, Volume 2016, Article ID 4062786, 9 pages, http://dx.doi.org/10.1155/2016/4062786