Automatic measurement and analysis of the child verbal communication using classroom acoustics within a child care center

Maryam Najafian 1, Dwight Irvin 2, Ying Luo 2, Beth S. Rous 2, John H. L. Hansen 1

1 Center for Robust Speech Systems, University of Texas at Dallas, Richardson, TX, USA
2 College of Education, University of Kentucky, Lexington, KY, USA

[m.najafian,john.hansen]@utdallas.edu
[dwight.irvin,ying.luo,beth.rous]@uky.edu

Abstract

Understanding the language environment of early learners is a challenging task for both humans and machines, and it is critical in facilitating effective language development among young children. This paper presents a new application for existing diarization systems and investigates the language environment of young children using a turn-taking strategy. The strategy employs an i-vector based baseline that captures adult-to-child or child-to-child conversational turns across different classrooms in a child care center. Detecting speaker turns is necessary before more in-depth analysis of the audio, such as word counting, speech recognition, and keyword spotting, which can contribute to the design of future learning spaces specifically designed for typically developing children or for those at risk with communication limitations. Experimental results using naturalistic child-teacher classroom settings indicate that the proposed rapid child-adult speech turn-taking scheme is highly effective under noisy classroom conditions and yields a 27.3% relative error rate reduction compared to the baseline results produced by the LIUM diarization toolkit.

Index Terms: child speech, speech turn taking, language environment analysis

1. Introduction

The quality and number of interactions that accompany a rich language environment contribute to essential language developmental outcomes in early childhood [1].
For humans, analyzing such a large quantity of data is not practical, and building real-time solutions that provide actionable analysis is cost-prohibitive. For machines, on the other hand, scaling to process large quantities of data is possible, but there is a need to develop robust speech processing and location tracking systems that can bring consistency and reliability to the analysis. In this study, we employ the LENA recording device [2, 3] and robust analytical algorithms to lay the foundation for a machine-based solution.

In this study we recorded and tracked the location of 33 children aged 2.5 to 5 years across 4 classrooms in a high-quality childcare center in the United States at various time points during the day. We aim to determine how much of the child's interaction involves other children and how much involves classroom teachers. For this purpose, a speech activity detector followed by a diarization system is required in order to detect fast turn changes during child-adult conversations.

Our motivation is to automate assessment of a child's language environment, which may assist automatic monitoring of child language acquisition and development progress. In this study we describe the current state of the algorithms applied to similar tasks and their drawbacks for our current application. Next, we propose a system which addresses these challenges and categorizes every 1.5 seconds of data into four main categories: (i) speech produced by the child, (ii) speech directed towards the child by an adult, (iii) speech directed towards the child by another child, and (iv) non-speech (the stream of background music or conversation by other adults and children).

[Figure 1: Percentage of different classes of data in the database (classes shown: adult, primary child, secondary child, non-speech).]

Next, we provide an example analysis carried out on the data from a typical children's classroom (e.g., 15-20 students and 1-3 teachers) at three different time points.
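To make the four-way, 1.5-second categorization concrete, the following minimal Python sketch maps time-stamped annotations onto fixed-length windows. It is an illustration under stated assumptions, not the system described in this paper: the `(start_s, end_s, label)` interval format, the function name `label_windows`, the largest-overlap rule, and the default to non-speech for unannotated windows are all hypothetical choices made for the example.

```python
from collections import defaultdict

# The four categories named in the text; the identifier spellings are assumptions.
LABELS = ("primary_child", "adult", "secondary_child", "non_speech")
WINDOW_S = 1.5  # window length taken from the text


def label_windows(intervals, total_s, window_s=WINDOW_S):
    """Assign each fixed-length window the label with the largest overlap.

    intervals: list of (start_s, end_s, label) tuples (hypothetical format).
    Windows with no annotated overlap default to "non_speech" (an assumption).
    """
    n_windows = int(total_s // window_s)
    result = []
    for i in range(n_windows):
        w_start, w_end = i * window_s, (i + 1) * window_s
        overlap = defaultdict(float)
        for start, end, label in intervals:
            # Duration shared between the annotation and the current window.
            shared = min(end, w_end) - max(start, w_start)
            if shared > 0:
                overlap[label] += shared
        result.append(max(overlap, key=overlap.get) if overlap else "non_speech")
    return result


# Toy annotations covering 4.5 s of a 6 s recording.
example = [
    (0.0, 1.2, "primary_child"),
    (1.2, 3.4, "adult"),
    (3.4, 4.5, "secondary_child"),
]
windows = label_windows(example, total_s=6.0)
```

With the toy annotations above, the four windows are labeled primary child, adult, secondary child, and non-speech, in that order. A largest-overlap rule is only one possible convention; windows that straddle a turn change could equally be discarded or split.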
Finally, we present the confusion matrix for the child-adult turn-taking system and conclude with an outline of our future work.

2. Speech database description

For speech data collection, a lightweight, compact digital audio recorder (LENA device [2, 3]) was worn by 33 children aged 2.5 to 5 years. The audio was recorded throughout a typical day at a childcare center, during at least one of three different time points at which the subject was participating in different activities. We used 4.5 hours of audio recordings gathered by the LENA units attached to 18 children (approximately 15 minutes each) to train our speech analysis systems. In our experiments we used 3-fold cross validation, so no speaker appeared simultaneously in the training and test sets.

Table 1: Ground-truth analysis for the dataset

Segment class      Average duration   Average turn duration
Primary child      1.9 s              1.8 s
Secondary child    1.8 s              1.6 s
Adult              2.2 s              2.1 s

For system evaluation, this data was partitioned into approximately 1.5-second segments and each segment was labeled correspondingly. From the manual labels gathered, we es-