Proceedings of the Seventh European Conference on Speech Communication and Technology (Eurospeech 2001), Aalborg, Denmark, September 3-7, v. 2, 967-970.

Generating Duration from a Cognitively Plausible Model of Rhythm Production

Plínio A. Barbosa
Lab. of Phonetics and Psycholinguistics & Dep. of Linguistics, IEL/UNICAMP, Brazil
plinio@iel.unicamp.br

Abstract

A dynamical model of rhythm production is presented. The model generates segmental duration from the interplay between a dynamical rhythmic system and a gestural score representation. The rhythmic level is implemented by a coupled-oscillator system which delivers beats of V-to-V (vowel-onset-to-vowel-onset) size to the gestural score. To account for segment and pause acoustic durations, the interaction between the rhythmic and gestural representations is achieved by a recurrent neural network. The model exhibits cognitively plausible language-universal and language-specific phonetic properties which explain the variability of acoustic duration data.

1. Introduction

The dynamical approach to cognition [1] advocates that “natural cognitive systems are dynamical systems and are best understood from the perspective of dynamics” or, in classical terms, that “to ignore movement is to ignore Nature” (Aristotle). Scientists “understand” a natural system when they are able to model it adequately, that is, when the variability exhibited at the model’s output closely follows the variability of natural data in response to equivalent input. Speaking and hearing involve the interplay between two kinds of knowledge: linguistic representations (subject to the properties of formal syntax), and biomechanical and biochemical systems (subject to the laws of dynamics). Unfortunately, few models of speech production adequately capture both kinds of knowledge, perhaps because doing so requires integrating discrete and continuous variables. The two speech production models presented in the next section constitute an exception.
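To make the rhythmic level concrete, the following is a minimal sketch of a coupled-oscillator beat generator of the kind the abstract describes: a fast “syllabic” oscillator, phase-coupled to a slower “stress-group” oscillator, emits a beat each time it completes a cycle, so that successive beat intervals (V-to-V durations) are modulated by the coupling. The function name, frequencies, coupling constant, and Kuramoto-style coupling rule are illustrative assumptions, not the paper’s actual implementation.

```python
import math

def vtov_beats(f_syl=5.0, f_stress=1.25, k=0.8, dt=0.001, t_max=2.0):
    """Return the times at which the syllabic oscillator completes a
    cycle, i.e. delivers a V-to-V beat (all parameters hypothetical)."""
    phi_s, phi_g = 0.0, 0.0   # phases of syllabic / stress-group oscillators
    beats = []
    t = 0.0
    while t < t_max:
        # Kuramoto-style coupling: the syllabic phase is pulled toward
        # the stress-group phase, perturbing successive beat durations.
        dphi_s = 2 * math.pi * f_syl + k * math.sin(phi_g - phi_s)
        dphi_g = 2 * math.pi * f_stress
        phi_s += dphi_s * dt
        phi_g += dphi_g * dt
        if phi_s >= 2 * math.pi:   # cycle completed -> beat delivered
            beats.append(t)
            phi_s -= 2 * math.pi
        t += dt
    return beats

beats = vtov_beats()
durations = [b2 - b1 for b1, b2 in zip(beats, beats[1:])]
```

With a weak coupling constant, the inter-beat durations stay near the syllabic period (here 200 ms) but vary systematically with the phase of the slower oscillator, which is the kind of structured durational variability the model is meant to deliver to the gestural score.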
On the other hand, recent technological solutions to speech synthesis ignore relevant issues in speech production modeling altogether.

2. Cognitively plausible models of speech production

Since the eighties, Articulatory Phonology [2], [3] (henceforth AP) has been a paradigmatic example of cognitively (and linguistically) plausible modeling. AP explains phonetic and phonological variation straightforwardly from abstract, dynamical representations of articulatory gestures, which are considered pre-linguistic forms of action. The success of this theory in modeling (linguistic) phonetic data is mainly due to the fact that abstract gestures have intrinsic time intervals, which allow them to overlap in time. Criticism of the theory has concentrated mainly on its alleged inability to take categorical phenomena into account [4]. This kind of criticism can be avoided by lexicalizing categorical allomorphy and allophony, as well as by associating morphological labels with gestures’ edges [5]. To our understanding, a more serious drawback of the AP framework is its inability to handle gesture coordination above the lexical level. In [6], a rhythmic tier is proposed to fill this gap, but the idea was not fully implemented and still lacks a clear definition of extrinsic timing. The so-called Temporal Phonology [7] explores this very notion and treats duration variability as a consequence of the dynamical coupling of a set of adaptive (or coupled) oscillators that are able to deal with metrics (structure) and cadence in speech rhythm. Both Articulatory and Temporal Phonology try to understand the nature and organization of phonological representations from accurate analyses of speech production data. As adequate models of speech production, both frameworks are able to generate acoustic parameters from abstract input, and thus serve the goals of articulatory as well as acoustic speech synthesis research.
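The adaptive oscillators that Temporal Phonology appeals to can be illustrated with a toy entrainment rule: the oscillator nudges its period toward each observed inter-onset interval, so it tracks the cadence of an accelerating or decelerating beat train. The function, the beat train, and the first-order adaptation rule below are assumptions chosen for exposition, not the mechanism of [7].

```python
def entrain(onsets, period=0.5, rate=0.5):
    """Adapt an oscillator period toward observed inter-onset intervals
    (toy first-order rule; all parameter values are illustrative)."""
    periods = [period]
    for prev, cur in zip(onsets, onsets[1:]):
        ioi = cur - prev                  # observed inter-onset interval
        period += rate * (ioi - period)   # pull the period toward the cadence
        periods.append(period)
    return periods

# A speeding-up beat train: the oscillator's period shrinks to follow it.
onsets = [0.0, 0.40, 0.76, 1.08, 1.36]
periods = entrain(onsets)
```

Starting from a 500 ms period, the sequence of adapted periods decreases monotonically as the input accelerates, which is the sense in which such an oscillator captures cadence while its cycle structure captures metrics.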
Two other lines of research in speech generation systems can be considered output-oriented models of speech production, and are perhaps too close to market needs.

3. Output-oriented models of speech generation

Certainly in order to respond to technological demands, recent speech synthesis research has preferred to accept its ignorance about the interplay between linguistic and physical parameters, and has refused to take into account some well-documented phonetic facts. Both lines of research use linguistic input as a way of classifying data, to guide the recording of huge speech corpora, and to guide the statistical analyses and search procedures which accurately describe the acoustic output. The so-called corpus synthesis [8] is an economically valuable technological solution to concatenative speech synthesis which refuses to treat the interaction between prosody and segments as a worthwhile issue for explaining the variability exhibited by the acoustic parameters. This kind of work builds huge speech corpora containing units of variable size recorded under different prosodic conditions, and uses complex search techniques which minimize paradigmatic discrepancy at the (abstract) input and syntagmatic discrepancy at the (physical) output. The powerful statistical techniques used in van Santen’s work to model segmental duration [9] are an example of such research blindness. By guiding its statistical analyses with results on the relation between linguistic and phonetic variables, this kind of research is able to predict segmental duration accurately. But it fails to explain crucial sources of variability such as speech rate, to consider the problem of pause emergence (without postulating pause duration and location from the beginning of the generation process), and to take into account some well-documented phonetic facts. One of