Pattern Recognition Vol. 10, pp. 73-84.
Pergamon Press Ltd. 1978. Printed in Great Britain.
© Pattern Recognition Society
0031-3203/78/0401-0073 $02.00/0
AN ISOLATED-WORD RECOGNIZER BASED ON
GRAMMAR-CONTROLLED CLASSIFICATION PROCESSES
S. RIVOIRA and P. TORASSO
C.E.N.S. - C.N.R., Politecnico di Torino, Istituto Scienza dell'Informazione,
Università di Torino, Torino, Italy
(Received 25 October 1977; received for publication 16 December 1977)
Abstract - A recognizer of isolated words spoken in the Italian language is presented. Each level of
recognition (segmentation, phonemic classification and lexical recognition) is controlled by the rules of
appropriate grammars whose symbols are fuzzy linguistic variables. The recognition strategy depends on the
lexical redundancy of the protocol and is based on a classification of speech units into broad phonetic classes,
possibly followed by a classification into more detailed classes if some ambiguities remain.
Speech recognition    Speech analysis    Syntactic pattern recognition
Fuzzy languages and automata    Fuzzy questionnaires    Finite transducers
INTRODUCTION
Many recognition systems for isolated words have
been developed in recent years (see refs. (1,2) for a survey).
Most of them use the classical pattern recognition
methods for the recognition strategy, which involves
comparing the vector of feature representations of the
incoming word with the vector of prototype reference
patterns for each word in the lexicon.
The performance of such systems depends on many
design decisions, such as:
(1) the choice of the parameters, which affects accuracy,
response time and reference-pattern storage
requirements;
(2) the choice of reference patterns capturing variations
in pronunciation for single or multiple
speakers;
(3) the choice of similarity measures between two
utterances and of search strategies suitable for
reducing the computation time when the vocabulary
is large.
The parameters generally used are LPC coefficients
or energies in fixed frequency bands obtained by band-pass
filtering or Fourier transformation. In a recent
work, White(3) showed that similar performance can
be obtained with LPC or filter bank representations.
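
As an illustration of the second kind of parameter, the following minimal sketch computes energies in fixed frequency bands from the Fourier transform of a single frame. The function names, the band edges, the 8 kHz sampling rate and the synthetic 1 kHz test tone are all assumptions chosen for the example; they are not taken from any of the systems cited here.

```python
import cmath
import math

def power_spectrum(frame):
    """Naive DFT power spectrum (bins 0..N/2) of a real-valued frame."""
    n = len(frame)
    spectrum = []
    for k in range(n // 2 + 1):
        s = sum(x * cmath.exp(-2j * math.pi * k * t / n)
                for t, x in enumerate(frame))
        spectrum.append(abs(s) ** 2)
    return spectrum

def band_energies(frame, sample_rate, bands):
    """Sum the spectral power falling in each (low_hz, high_hz) band."""
    n = len(frame)
    power = power_spectrum(frame)
    energies = []
    for low, high in bands:
        energy = sum(p for k, p in enumerate(power)
                     if low <= k * sample_rate / n < high)
        energies.append(energy)
    return energies

# Hypothetical test signal: a 1 kHz sine sampled at 8 kHz (assumed values).
sample_rate = 8000
frame = [math.sin(2 * math.pi * 1000 * t / sample_rate) for t in range(64)]
bands = [(0, 500), (500, 1500), (1500, 4001)]  # illustrative band edges in Hz
energies = band_energies(frame, sample_rate, bands)
print(energies)
```

For this tone almost all of the energy falls in the middle band, so the three band energies already form a compact feature vector for the frame. A practical front end would use an FFT and overlapping windowed frames rather than the naive transform shown here.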
The classification algorithms are usually based on
criteria of minimum distance or maximum likelihood
between the input pattern and reference patterns,
where pattern matching is performed by dynamic
programming or by template matching following
linear time warping.
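
The minimum-distance criterion with dynamic-programming matching can be sketched as follows. This is a generic dynamic time warping formulation under assumed conventions (squared Euclidean frame distance, unit-step path constraints, one template per word); it is not the specific algorithm of any system cited above, and the toy templates labelled "uno" and "due" are hypothetical.

```python
from math import inf

def frame_distance(a, b):
    """Squared Euclidean distance between two feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def dtw_distance(seq, ref):
    """Cumulative distance of the best monotonic alignment of seq with ref."""
    n, m = len(seq), len(ref)
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = frame_distance(seq[i - 1], ref[j - 1])
            # Extend the cheapest of the three allowed predecessor paths.
            cost[i][j] = d + min(cost[i - 1][j],
                                 cost[i][j - 1],
                                 cost[i - 1][j - 1])
    return cost[n][m]

def classify(seq, templates):
    """Return the word whose reference pattern is at minimum DTW distance."""
    return min(templates, key=lambda word: dtw_distance(seq, templates[word]))

# Hypothetical one-dimensional reference patterns for two words.
templates = {
    "uno": [[0.0], [1.0], [2.0]],
    "due": [[2.0], [1.0], [0.0]],
}
# The same contour as "uno", spoken more slowly (frames repeated).
utterance = [[0.0], [0.0], [1.0], [2.0], [2.0]]
print(classify(utterance, templates))
```

The cost of this scheme grows with the product of the utterance length, the template length and the vocabulary size, which is why the search strategies mentioned in point (3) above matter for large vocabularies.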
Phoneme-labelling techniques, which perform a first
gross segmentation of the input word, can be used to
decrease the response time and the memory requirement,
but generally reduce the performance of the
system.(3,4)
Furthermore, as one or more reference patterns
must be stored for each word, changing the vocabulary
or the speaker requires a new training session. A
speaker-independent recognition system has been
proposed by Rabiner and Sambur,(5) based on the
extraction of robust features and the description of the
words in the lexicon in terms of gross phonemic
classes.
Reddy(1) suggests the need for such improvements as
more abstract and synthetic representations of the
reference patterns, more reliable segmentation and
labelling methods, and recognition strategies in which the more
likely candidates are selected before using expensive
matching techniques.
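
A strategy of this kind, in which a cheap description in gross phonemic classes selects the likely candidates before any expensive matching is attempted, might be sketched as below. The class inventory, the use of letters as stand-ins for phonemes and the toy lexicon are all assumptions made for the example, not the classes or vocabulary of any cited system.

```python
# Hypothetical mapping from letters (standing in for phonemes) to broad classes:
# V = vowel, N = nasal, P = plosive, F = fricative, L = liquid.
BROAD_CLASS = {
    "a": "V", "e": "V", "i": "V", "o": "V", "u": "V",
    "n": "N", "m": "N",
    "t": "P", "d": "P",
    "s": "F",
    "r": "L",
}

def broad_pattern(word):
    """Describe a lexicon entry as a string of broad phonetic classes."""
    return "".join(BROAD_CLASS[ch] for ch in word)

def select_candidates(observed_pattern, lexicon):
    """Keep only the words whose broad description matches the observation."""
    return [w for w in lexicon if broad_pattern(w) == observed_pattern]

def recognize(observed_pattern, lexicon, detailed_match):
    """Call the expensive matcher only when the broad classes are ambiguous."""
    candidates = select_candidates(observed_pattern, lexicon)
    if len(candidates) <= 1:
        return candidates[0] if candidates else None
    return detailed_match(candidates)

lexicon = ["uno", "due", "tre", "sei", "tue"]  # toy vocabulary (assumed)
print(select_candidates("VNV", lexicon))  # a single candidate remains
print(select_candidates("PVV", lexicon))  # two candidates need finer analysis
```

When the lexicon is redundant enough that most broad patterns are unique, the detailed matcher is rarely invoked, which is the saving this kind of strategy aims for.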
Weinstein et al.(6) show, for continuous speech,
that an accurate acoustic-phonetic analysis allows the
data to be reduced without a substantial loss of information.
Phonological rules could also be applied in order to
reduce the errors generated by a rough classification
into phoneme groups.(7)
Syntactic methods have not been widely applied to
the acoustic and phonetic levels in speech recognition,(9,15,22)
even though Fu(8) suggests that: "When
patterns are very rich in structural information, and
the recognition problem requires classification and
description, then the syntactic approach seems to be
necessary."
Mermelstein showed that it would be useful to apply
syntactical rules "already at the acoustic level and not
only at the phonological and higher linguistic levels in
order to take advantage of the constraints between
acoustic segments to reduce the number of alternative
hypotheses to be considered at higher linguistic levels".
He said furthermore that "the transformation from
acoustic segments to phones is a prerequisite for any
speech recognition system accepting continuous input
from many speakers and allowing the use of a larger
vocabulary."(9)
On the other hand variations in speech require some