Goal Babbling of Acoustic-Articulatory Models with Adaptive Exploration Noise

Anja Kristina Philippsen
Cognitive Interaction Technology Center (CITEC), Bielefeld University
aphilipp@techfak.uni-bielefeld.de

René Felix Reinhart
Research Institute for Cognition and Robotics (CoR-Lab), Bielefeld University
freinhart@uni-bielefeld.de

Britta Wrede
Applied Informatics Group, Bielefeld University
bwrede@techfak.uni-bielefeld.de

Abstract—We use goal babbling to bootstrap a parametric model of speech production for a complex 3D vocal tract model. The system learns to control the articulators for producing five different vowel sounds. Ambient speech influences learning on two levels: it organizes the learning process, because it is used to generate the space of goals in which exploration takes place, and a distribution learned from ambient speech provides the system with targets during exploration. Previous research with this vocal tract model showed that visual information has to be included to acquire the vowel [u] via reward-based optimization. Instead, we model the learning process with goal-directed exploration in which all targets are learned in parallel. As some vowels require more exploratory noise in the articulators than others, we propose a mechanism that adapts the noise amplitude depending on the system's competence in different regions of the goal space. We demonstrate that this self-aware learning leads to more stable results. The implemented system succeeds in acquiring vocalization skills for rounded as well as unrounded vowels using only a single modality.

I. INTRODUCTION

Learning how to speak can be interpreted as a motor coordination problem: Infants explore the capabilities of their vocal tract in order to discover articulatory trajectories that produce the desired speech sounds. Although it is still largely a mystery how infants achieve this, there is consensus that babbling plays a crucial role in early speech learning [1], [2]: By producing speech sounds and observing the outcomes, infants gradually learn to coordinate their articulators.

To model this development computationally, a system can be equipped with a vocal tract model. By executing this forward model, the system can generate speech signals from articulatory configurations, similar to how infants use their articulators. Acquiring articulatory control can then be defined as learning an inverse model that estimates from the acoustics which articulatory configuration is required to reproduce a given sound.

Findings in developmental psychology suggest that infants explore the space of possible motor configurations not randomly, but with targets in mind [3], [4], [5]. Accordingly, many developmental models of speech acquisition implement vocal learning as an imitative process [6], [7], [8]. Applying active motor exploration, these systems acquire in a babbling phase the articulatory configurations needed to imitate a set of speech sounds. However, the speech sounds are learned sequentially. Due to redundancies in the motor system (a number of articulatory trajectories might result in the same speech sound), this approach bears the risk that no valid inverse model can be trained from the collected acoustic-articulatory pairs [9]. The Elija model [10] removes redundancies after the babbling phase by consolidating the learned motor patterns based on their acoustic consequences. The models of [6] and [7] connect articulation and acoustics via a map of acquired speech sounds.
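To make this redundancy problem concrete, a minimal formalization may help (the notation here is ours, in the spirit of the goal babbling literature [11], [12]): let $f$ be the forward model implemented by the vocal tract,
\[
f : Q \rightarrow X, \qquad x = f(q),
\]
where $q \in Q$ is an articulatory configuration and $x \in X$ the resulting acoustic outcome. Because $f$ is many-to-one, directly regressing an inverse model $g : X \rightarrow Q$ on collected babbling pairs $(q_i, x_i)$ approximates the conditional average
\[
g(x) \approx \mathbb{E}\left[\, q \mid x \,\right],
\]
and since the set of configurations that produce a given $x$ is in general non-convex, this average of distinct valid configurations need not be valid itself, i.e., $f(g(x)) \approx x$ is not guaranteed.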
Goal babbling is an exploration mechanism first introduced for kinematic motor control learning; it resolves such redundancies by learning to achieve several targets in parallel and directly bootstrapping the inverse model during exploration [11], [12]. Goal babbling achieves high efficiency by organizing exploration in the so-called goal space, the space of (here: acoustic) outcomes. Studies by Moulin-Frier et al. [13], [14], [15], [16] have applied the idea of goal babbling to the speech domain. Using formant frequencies as goals, they could demonstrate the emergence of articulated speech sounds [16] and the bootstrapping of vowel sounds [13]. In a recent work [17], Liu and Xu used goal babbling to control F0 for the Chinese language. A limitation of these works is the low-dimensional acoustic representation that is required to make goal babbling efficient. Speech, however, is a very high-dimensional signal, as it exhibits high variability in the spectral as well as in the temporal domain.

In [18], we presented an approach to overcome this limitation by generating a goal space from high-dimensional acoustic features via dimension reduction. This method follows the idea that infants' learning is influenced by the ambient language which they perceive in their environment [19], [5]. Here, we extend this model and use it to learn articulatory control for imitating five vowel sounds with the 3D articulatory speech synthesizer VocalTractLab (VTL) [20]. In VTL, the vocal tract shape is determined by 20 articulatory (and additional glottis) parameters. It is physiologically more natural and produces more intelligible sound than the DIVA or Praat articulatory synthesizers that most speech acquisition models use [6], [10], [16], [17], [18], [21]. It also causes