Improved Humanoid Vocalization Acquisition from a Human Tutor

Enzo Mumolo and Massimiliano Nolich

Abstract— This paper describes an approach to automatically acquire the vocalization for a humanoid robot by learning from a human tutor. The learned vocalization can be used for multimodal reproduction of speech, based on the articulatory and acoustic parameters that compose the vocalization database. The proposed algorithm can synthesize speech utterances from unrestricted text and generate articulatory and facial movements of the humanoid talking face synchronized with the generated speech. The fuzzy articulatory rules are derived from the International Phonetic Alphabet (IPA) to allow simpler adaptation to different languages. Experimental results show good subjective acceptance of the acquired vocalization in terms of quality, naturalness and synchronization. Although the algorithm has been implemented on a virtual talking face, it could eventually be used in mechanical vocalization systems as well.

I. INTRODUCTION

Interest in humanoid robotics is growing rapidly. A humanoid is a robot designed to work with humans as well as for them; humanoid robots have been designed to provide better services to human beings. A fundamental issue in humanoid robotics is therefore the interaction with humans. At MIT, the Cog project [1] was developed under the hypothesis that humanoid intelligence requires humanoid interactions with the world. The simplest form of communication between humans and humanoids is speech. Speech is a natural means of communication between humans in the real world, so it should also be a natural means of communication between humans and humanoids. Human-humanoid interaction by voice requires, on one hand, that the humanoid recognize the human's utterances and, on the other hand, that it respond to the human using artificial vocalization.
By artificial vocalization we mean that the message delivered to the human is made of artificial speech together with the corresponding movements of the articulatory organs. Consider our model of verbal/facial communication reported in Fig. 1, which spans three levels: concept formation and message generation at the abstraction and symbolic levels, feeding phonatory control and the phonatory organs at the parametric level, which in turn produce artificial speech and facial movements.

[Fig. 1. Simple communication scheme.]

This work was not supported by any organization. E. Mumolo is with the Faculty of Engineering, University of Trieste, Italy (mumolo@units.it). M. Nolich is with IFACE s.r.l., Trieste, Italy (mnolich@units.it).

Adapting Fig. 1 to a humanoid, the concept formation and message generation modules may be located at the cognitive level of the robot, while the control and phonatory modules are located at the vocalization level. Some researchers, for example [2][3], have approached the vocalization problem from a mechanical point of view. They have developed a mechanical replica of the human vocal tract, larynx and tongue, and are trying to control their movements to produce mechanically generated artificial speech using various open- and closed-loop strategies. In the closed-loop strategy, for example, the extraction of control parameters from speech is optimized by minimizing the distance between the original and the generated speech. Our work is inspired by the work described in [2][3] in the sense that we extract from the input utterances parameters that can be used, on one hand, to generate an artificial replica of the real utterances and, on the other hand, to control phonatory organs in order to reproduce those utterances. To this end, we extract articulatory parameters using a fuzzy model of the vocal tract. The related algorithms were introduced in [4][5]. In those works, the mechanical generation of speech was substituted by a virtual model, due to the unavailability of a mechanical system.
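The closed-loop strategy mentioned above can be illustrated with a minimal sketch. Everything here is hypothetical: `synthesize` is a toy one-formant stand-in for a real articulatory or mechanical synthesizer, and a simple grid search over candidate control parameters replaces whatever optimizer [2][3] actually use. The loop keeps the parameter whose synthetic output is closest to the original (tutor) speech:

```python
import math

def synthesize(formant_hz, n=200, sr=8000):
    """Toy 'vocal tract': a damped sinusoid at one formant frequency.
    A stand-in for a real articulatory synthesizer (hypothetical)."""
    return [math.exp(-3.0 * t / n) * math.sin(2 * math.pi * formant_hz * t / sr)
            for t in range(n)]

def distance(a, b):
    """Squared-error distance between original and generated speech."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def closed_loop_fit(target, candidates):
    """Closed loop: try each control parameter and keep the one whose
    synthetic output best matches the target utterance."""
    return min(candidates, key=lambda f: distance(synthesize(f), target))

target = synthesize(700.0)  # pretend this is the tutor's utterance
best = closed_loop_fit(target, [f * 25.0 for f in range(8, 60)])  # 200..1475 Hz grid
print(best)  # → 700.0
```

A real system would search a multi-dimensional articulatory parameter space with gradient-based or heuristic optimization rather than a one-dimensional grid, but the structure of the loop (synthesize, compare, update) is the same.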
The virtual system, however, yields important results in the generation of artificial vocalization, both in terms of speech and of facial movements. The present paper improves substantially on the algorithms described in [4][5]. The major points of improvement, which are the contributions of this paper, are the following: while the rules adopted in [4][5] were tailored to the Italian language, the rules developed in this paper use the International Phonetic Alphabet (IPA) [6], opening the possibility of extending the system to other languages; while in [4][5] we used single words and phrases, in this paper we describe a system able to produce vocalization from unrestricted text; the algorithm uses the average pitch extracted from the tutor as the base of the artificial prosody; and we include in this paper extensive subjective evaluations of the proposed system, which were previously impossible to perform. Unrestricted-text vocalization is performed by acquiring, in an initial training phase, a database of suitably defined small speech units. The acquisition is performed automatically using the speech of a human tutor. The tutor is asked to pronounce a number of given utterances, which are analyzed and automatically segmented into the defined small units. Our algorithm, inspired by the human process of language