Pattern Recognition Vol. 10, pp. 73-84. Pergamon Press Ltd. 1978. Printed in Great Britain. © Pattern Recognition Society 0031-3203/78/0401-0073 $02.00/0

AN ISOLATED-WORD RECOGNIZER BASED ON GRAMMAR-CONTROLLED CLASSIFICATION PROCESSES

S. RIVOIRA and P. TORASSO
C.E.N.S. - C.N.R., Politecnico di Torino, Istituto Scienza dell'Informazione, Università di Torino, Torino, Italy

(Received 25 October 1977; received for publication 16 December 1977)

Abstract - A recognizer of isolated words spoken in the Italian language is presented. Each level of recognition (segmentation, phonemic classification and lexical recognition) is controlled by the rules of appropriate grammars whose symbols are fuzzy linguistic variables. The recognition strategy depends on the lexical redundancy of the protocol and is based on a classification of speech units into broad phonetic classes, possibly followed by a classification into more detailed classes if ambiguities still remain.

Speech recognition   Speech analysis   Syntactic pattern recognition   Fuzzy languages and automata   Fuzzy questionnaires   Finite transducers

INTRODUCTION

Many recognition systems for isolated words have been developed in recent years (see (1,2) for a survey). Most of them use classical pattern recognition methods for the recognition strategy, which involves comparing the vector of feature representations of the incoming word with the vector of prototype reference patterns for each word in the lexicon. The performance of such systems depends on many design decisions, such as:
(1) the choice of the parameters, which affects accuracy, response time and reference-pattern storage requirements;
(2) the choice of reference patterns capturing variations in pronunciation for single or multiple speakers;
(3) the choice of similarity measures between two utterances and of search strategies suitable for reducing the computation time when the vocabulary is large.
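As an illustration only, the classical strategy outlined above (compare the feature sequence of the incoming word with one stored reference pattern per word and choose the closest) can be sketched in Python. This is a minimal sketch, not the method of any system cited here: the function names, the Euclidean frame distance, the use of dynamic-programming alignment as the similarity measure, and the toy two-dimensional feature frames are all assumptions made for the example.

```python
from math import inf, sqrt


def frame_distance(a, b):
    """Euclidean distance between two feature frames of equal length."""
    return sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def dtw_distance(word, reference):
    """Cumulative cost of the best dynamic-programming alignment
    between two sequences of feature frames."""
    n, m = len(word), len(reference)
    # cost[i][j] = best cost of aligning word[:i] with reference[:j]
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = frame_distance(word[i - 1], reference[j - 1])
            cost[i][j] = d + min(cost[i - 1][j],      # stretch the reference
                                 cost[i][j - 1],      # stretch the input word
                                 cost[i - 1][j - 1])  # advance both
    return cost[n][m]


def recognize(word, lexicon):
    """Minimum-distance rule: return the label of the closest reference."""
    return min(lexicon, key=lambda label: dtw_distance(word, lexicon[label]))


# Hypothetical lexicon: one invented reference pattern per word.
lexicon = {
    "uno": [[0.0, 1.0], [1.0, 1.0], [2.0, 0.0]],
    "due": [[2.0, 2.0], [1.0, 0.0], [0.0, 0.0]],
}
# Invented input: a slower, slightly perturbed version of "uno".
spoken = [[0.0, 1.0], [0.1, 1.0], [1.1, 0.9], [2.0, 0.1]]
print(recognize(spoken, lexicon))
```

The sketch makes the three design decisions concrete: the frames stand for the chosen parameters, the dictionary for the reference patterns, and `dtw_distance` together with the exhaustive `min` for the similarity measure and search strategy. The exhaustive search costs one alignment per lexicon entry, which is why the computation time grows with the vocabulary.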
The parameters generally used are LPC coefficients or energies in fixed frequency bands obtained by band-pass filtering or Fourier transformation. In a recent work, White (3) showed that similar performance can be obtained with LPC or filter-bank representations. The classification algorithms are usually based on criteria of minimum distance or maximum likelihood between the input pattern and the reference patterns, where pattern matching is performed by dynamic programming or by template matching following linear time warping. Phoneme-labelling techniques, which perform a first gross segmentation of the input word, can be used to decrease the response time and the memory requirement, but generally reduce the performance of the system.(3,4) Furthermore, as one or more reference patterns must be stored for each word, changing the vocabulary or the speaker requires a new training session.

A speaker-independent recognition system has been proposed by Rabiner and Sambur,(5) based on the extraction of robust features and the description of the words in the lexicon in terms of gross phonemic classes. Reddy (1) suggests the need for such improvements as more abstract and synthetic representations of the reference patterns, more reliable segmentation and labelling methods, and recognition strategies in which the more likely candidates are selected before expensive matching techniques are used. Weinstein et al. (6) show for continuous speech that an accurate acoustic-phonetic analysis allows the data to be reduced without a substantial loss of information. Phonological rules could also be applied in order to reduce the errors generated by a rough classification into phoneme groups.(7)

Syntactic methods have not been widely applied to the acoustic and phonetic levels in speech recognition,(9,15,22) even though Fu (8) suggests that: "When patterns are very rich in structural information, and the recognition problem requires classification and description, then the syntactic approach seems to be necessary." Mermelstein showed that it would be useful to apply syntactical rules "already at the acoustic level and not only at the phonological and higher linguistic levels in order to take advantage of the constraints between acoustic segments to reduce the number of alternative hypotheses to be considered at higher linguistic levels". He said furthermore that "the transformation from acoustic segments to phones is a prerequisite for any speech recognition system accepting continuous input from many speakers and allowing the use of a larger vocabulary."(9) On the other hand, variations in speech require some