Application of Neural Networks for POS Tagging and Intonation Control in Speech Synthesis for Polish

Artur Janicki
Warsaw University of Technology, Institute of Telecommunications, Division of Teletransmission Systems
ul. Nowowiejska 15/19, 00-665 Warsaw, Poland
email: ajanicki@elka.pw.edu.pl

Abstract - The paper describes the use of neural networks for POS (part-of-speech) tagging and intonation control, both needed in a speech synthesis system for the Polish language. Feedforward multilayered perceptrons are proposed for both purposes. Considerations in planning the network architecture, the training data used, the training process and the verification of the results are described.

I. INTRODUCTION AND PROBLEM DESCRIPTION

Control of intonation in TTS (text-to-speech) systems is one of the most difficult tasks in speech synthesis, and it has a significant impact on listening comfort. When generating synthetic speech, the first task is to produce a signal intelligible to the listener. Once that is achieved, the next task is to make the speech sound as natural as possible. If naturalness is missing, the speech tires the listener and discourages them from using such a TTS system. To achieve natural-sounding speech, we need careful control of the so-called prosodic parameters, i.e. duration, intonation, pausing, rhythm, energy, etc., of which intonation is one of the most important.

Controlling intonation means generating a proper F0 contour, i.e. the function describing changes of the fundamental frequency over time, corresponding to a given sequence of words. The task is far from trivial [9], because intonation depends on the meaning of a phrase and also carries information about the emotions of the speaker (anger, surprise, excitement, etc.). A mechanism is therefore needed that retrieves at least some basic information from a text in natural language (in this case: Polish) and generates from it a natural-sounding F0 contour.
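To make the notion of an F0 contour concrete, the sketch below represents one as a sequence of per-syllable fundamental-frequency targets. The linear declination baseline and all numeric values are illustrative assumptions only, not the method developed in this paper; real contours additionally encode accents and phrase boundaries.

```python
# Minimal sketch: an F0 contour for a neutral declarative phrase, modeled as
# a linearly declining baseline sampled once per syllable.
# All numbers are illustrative, not taken from the paper.
def f0_contour(n_syllables, f0_start=180.0, f0_end=120.0):
    """Return per-syllable F0 targets (Hz), falling linearly from start to end."""
    if n_syllables == 1:
        return [f0_start]
    step = (f0_start - f0_end) / (n_syllables - 1)
    return [f0_start - i * step for i in range(n_syllables)]

print(f0_contour(5))  # [180.0, 165.0, 150.0, 135.0, 120.0]
```

A real intonation module would perturb such a baseline with accent-related rises and falls; the point here is only that "generating an F0 contour" means producing such a sequence of frequency targets for a word sequence.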
It is possible that more than one F0 contour for a given sentence will sound natural to the listener; on the other hand, we should strongly avoid situations in which the system uses e.g. an incorrect accent and lets the listener realize that the TTS system does not really understand what it is saying.

II. PROPOSED APPROACH

It is proposed to correlate the F0 contour with the sequence of POS (part-of-speech) tags corresponding to the words in a phrase. The POS type, i.e. the information on whether a given word is a noun, a verb, an adjective or another part of speech, is clearly related to the role the word plays in the sentence and is thus related to its meaning. Similar approaches have been successful for other languages [3]. Conveying emotions is beyond the scope of this work; a neutral emotion will be applied instead.

To control intonation based on a POS-tag sequence, we must first be able to obtain the POS information for the given words, i.e. we need a POS tagger, in this case for Polish. For both purposes, i.e. for POS tagging and for generating F0 contours, neural networks are proposed, namely multilayered perceptrons (MLPs) [8]. Neural networks have been used in POS tagging before, e.g. for POS disambiguation in English [10]; however, the approach to POS recognition proposed here does not use a lexicon at all. The details are described in the following sections.

III. PART OF SPEECH TAGGING

A. Proposed neural network architecture

A neural network performing POS tagging is expected to have 15 binary outputs, corresponding to the 15 POS tags it is supposed to recognize (see Table 1). The question of what to take as the input requires more attention. First, it was decided to ignore case information. It may come in useful in the future, as it can help in disambiguation, detection of proper names, etc., but for the time being it is ignored in order not to make the architecture too complex.
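A minimal NumPy sketch of such an MLP tagger is given below. The tag names, the hidden-layer size, the fixed window of word-final characters and the one-hot input coding are all illustrative assumptions (the actual tag set is listed in Table 1, and the choice of input coding is discussed next); the weights here are random, i.e. the network is untrained.

```python
import numpy as np

# Hypothetical 15-tag set; the paper's actual tags are given in its Table 1.
TAGS = ["noun", "verb", "adjective", "adverb", "pronoun", "numeral",
        "preposition", "conjunction", "particle", "interjection",
        "participle", "gerund", "abbreviation", "foreign", "other"]

ALPHABET = "aąbcćdeęfghijklłmnńoópqrsśtuvwxyzźż"  # lower-case Polish letters

def encode_word(word, n_chars=6):
    """One-hot encode the last n_chars characters of a word (case ignored,
    as decided above). Shorter words are left-padded with all-zero slots.
    Using word endings is an assumption motivated by Polish inflection
    being largely suffix-based."""
    word = word.lower()[-n_chars:]
    vec = np.zeros(n_chars * len(ALPHABET))
    offset = n_chars - len(word)
    for i, ch in enumerate(word):
        if ch in ALPHABET:
            vec[(offset + i) * len(ALPHABET) + ALPHABET.index(ch)] = 1.0
    return vec

class MLPTagger:
    """Feedforward MLP: input -> hidden (sigmoid) -> 15 sigmoid outputs,
    one per POS tag; the most active output wins."""
    def __init__(self, n_in, n_hidden=40, n_out=len(TAGS), seed=0):
        rng = np.random.default_rng(seed)
        self.W1 = rng.normal(0.0, 0.1, (n_in, n_hidden))
        self.W2 = rng.normal(0.0, 0.1, (n_hidden, n_out))

    def forward(self, x):
        h = 1.0 / (1.0 + np.exp(-(x @ self.W1)))
        return 1.0 / (1.0 + np.exp(-(h @ self.W2)))

    def tag(self, word):
        return TAGS[int(np.argmax(self.forward(encode_word(word))))]

tagger = MLPTagger(n_in=6 * len(ALPHABET))
print(tagger.tag("domu"))  # untrained, so an arbitrary tag from TAGS
```

The trade-off governing `n_chars`, i.e. how many characters of the word to feed to the input, is exactly the design question discussed below.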
The next question was: do we need to analyze the whole word? The more characters we feed to the input, the more precise we are, but on the other hand the network loses its ability to generalize its answers. So when a new word comes, e.g. a neologism or an inflected form of a known word, the network would likely fail. The fewer characters we take, the more we are exposed to ambiguities, but the network becomes "wiser", because it learns rules instead of learning the words by heart. In this case, a smaller amount of training data is also needed. To sum up, we need to