Goal Babbling of Acoustic-Articulatory Models with Adaptive Exploration Noise

Anja Kristina Philippsen
Cognitive Interaction Technology Center (CITEC), Bielefeld University
aphilipp@techfak.uni-bielefeld.de

René Felix Reinhart
Research Institute for Cognition and Robotics (CoR-Lab), Bielefeld University
freinhart@uni-bielefeld.de

Britta Wrede
Applied Informatics Group, Bielefeld University
bwrede@techfak.uni-bielefeld.de

Abstract—We use goal babbling to bootstrap a parametric model of speech production for a complex 3D vocal tract model. The system learns to control the articulators for producing five different vowel sounds. Ambient speech influences learning on two levels: it organizes the learning process, because it is used to generate the space of goals in which exploration takes place, and a distribution learned from ambient speech provides the system with targets during exploration. Previous research with this vocal tract model showed that visual information has to be included to acquire the vowel [u] via reward-based optimization. Instead, we model the learning process with goal-directed exploration in which all targets are learned in parallel. As some vowels require more exploratory noise in the articulators than others, we propose a mechanism that adapts the noise amplitude depending on the system's competence in different regions of the goal space. We demonstrate that this self-aware learning leads to more stable results. The implemented system succeeds in acquiring vocalization skills for rounded as well as unrounded vowels using only a single modality.

I. INTRODUCTION

Learning how to speak can be interpreted as a motor coordination problem: Infants explore the capabilities of their vocal tract in order to discover articulatory trajectories that produce the desired speech sounds. Although it is still largely a mystery how infants achieve this, there is consensus that babbling plays a crucial role in early speech learning [1], [2]: By producing speech sounds and observing the outcomes, infants gradually learn to coordinate their articulators.

To model this development computationally, a system can be equipped with a vocal tract model. By executing this forward model, the system can generate speech signals from articulatory configurations, similar to how infants use their articulators. Acquiring articulatory control can then be defined as learning an inverse model that estimates from the acoustics which articulatory configuration is required to reproduce a given sound.

Findings in developmental psychology suggest that infants explore the space of possible motor configurations not randomly, but with targets in mind [3], [4], [5]. Accordingly, many developmental models of speech acquisition implement vocal learning as an imitative process [6], [7], [8]. Applying active motor exploration, these systems acquire in a babbling phase the articulatory configurations needed to imitate a set of speech sounds. However, the speech sounds are learned sequentially. Due to redundancies in the motor system (a number of articulatory trajectories might result in the same speech sound), this approach bears the risk that no valid inverse model can be trained from the collected acoustic-articulatory pairs [9]. The Elija model [10] removes redundancies after the babbling phase by consolidating the learned motor patterns based on their acoustic consequences. The models of [6] and [7] connect articulation and acoustics via a map of acquired speech sounds.
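To make this redundancy problem concrete, a minimal formalization may help (the notation here is ours, in the spirit of the goal babbling literature [11], [12]): let $f$ be the forward model implemented by the vocal tract,
\[
f : Q \rightarrow X, \qquad x = f(q),
\]
where $q \in Q$ is an articulatory configuration and $x \in X$ the resulting acoustic outcome. Because $f$ is many-to-one, directly regressing an inverse model $g : X \rightarrow Q$ on collected babbling pairs $(q_i, x_i)$ approximates the conditional average
\[
g(x) \approx \mathbb{E}\left[\, q \mid x \,\right],
\]
and since the set of configurations that produce a given $x$ is in general non-convex, this average of distinct valid configurations need not be valid itself, i.e., $f(g(x)) \approx x$ is not guaranteed.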
Goal babbling is an exploration mechanism first introduced for kinematic motor control learning; it resolves such redundancies by learning to achieve several targets in parallel and directly bootstrapping the inverse model during exploration [11], [12]. Goal babbling achieves high efficiency by organizing exploration in the so-called goal space, the space of (here: acoustic) outcomes. Studies by Moulin-Frier et al. [13], [14], [15], [16] have applied the idea of goal babbling to the speech domain. Using formant frequencies as goals, they could demonstrate the emergence of articulated speech sounds [16] and the bootstrapping of vowel sounds [13]. In a recent work [17], Liu and Xu used goal babbling to control F0 for the Chinese language. A limitation of these works is the low-dimensional acoustic representation that is required to make goal babbling efficient. Speech, however, is a very high-dimensional signal, as it exhibits high variability in the spectral as well as in the temporal domain.

In [18], we presented an approach to overcome this limitation by generating a goal space from high-dimensional acoustic features via dimension reduction. This method follows the idea that infants' learning is influenced by the ambient language which they perceive in their environment [19], [5]. Here, we extend this model and use it to learn articulatory control for imitating five vowel sounds with the 3D articulatory speech synthesizer VocalTractLab (VTL) [20]. In VTL, the vocal tract shape is determined by 20 articulatory (and additional glottis) parameters. It is physiologically more natural and produces more intelligible sound than the DIVA or Praat articulatory synthesizers that most speech acquisition models use [6], [10], [16], [17], [18], [21]. It also causes