Automatic Assessment of Language Background in Toddlers Through Phonotactic and Pitch Pattern Modeling of Short Vocalizations

Hynek Bořil¹, Qian Zhang¹, Ali Ziaei¹, John H. L. Hansen¹∗, Dongxin Xu², Jill Gilkerson², Jeffrey A. Richards², Yiwen Zhang³, Xiaojuan Xu³, Hongmei Mao³, Lei Xiao³, Fan Jiang³

¹ Center for Robust Speech Systems (CRSS), University of Texas at Dallas, USA
² LENA Foundation, Boulder, Colorado, USA
³ Shanghai Children's Medical Center, Shanghai Jiao Tong University School of Medicine, No. 1678 Dong Fang Road, Shanghai 200127, P.R. China

{hynek,qian.zhang,ali.ziaei,john.hansen}@utdallas.edu
{dongxinxu,jillgilkerson,jeffrichards}@lenafoundation.org
zhangyiwen@hotmail.com

Abstract

This study utilizes phonotactic and pitch pattern modeling for automatic assessment of toddlers' language background from short vocalization segments. The experiments are conducted on audio recordings of twelve 25–31-month-old US-born and Shanghainese toddlers. Each recording captures the whole-day soundtrack of an ordinary day in a toddler's life, spent in the child's natural environment. In a preliminary study, we observed that despite the limited linguistic content of early-age child vocalizations, certain phonotactic and prosodic patterns were correlated with the child's language background. In the current effort, we analyze to what extent these language-salient cues can be leveraged for automatic language background classification. Besides traditional parallel phone recognition with statistical language modeling (PPRLM) and phone recognition with support vector machines (PRSVM), a novel scheme that utilizes pitch patterns (PPSVM) is proposed.
The classification results on very short vocalizations (on average less than 3 seconds long) confirm that both phonotactic and prosodic features capture language-specific content, reaching equal error rates (EER) of 32.45% for PRSVM, 31.33% for PPSVM, and 29.97% for a fusion of the PRSVM and PPSVM systems. The competitive performance of PPSVM suggests that pitch contours carry a significant portion of the language-specific information in toddlers' vocalizations.

Index Terms: language background assessment, toddlers, child vocalization, phonotactic modeling, pitch patterns, PPRLM, PRSVM, PPSVM.

∗ This project was funded by the University of Texas at Dallas from the Distinguished University Chair in Telecommunications Engineering held by J.H.L. Hansen.

1. Introduction

Thanks to recent breakthroughs in speech technology, the role of voice interfaces has gradually extended from an imperfect replacement for a computer keyboard to sophisticated applications in biometrics (user authentication, forensics), education (language learning), and health care (speech-language pathology). While a major part of the research in automatic speech processing has focused on adult users, recent studies demonstrate its great potential for child-oriented tasks as well, such as detection of language delay [1], early communication disorders [2], autism [3], computer-aided reading tutoring [4, 5], or emotional state assessment [6]. Other studies focus on automatic assessment of children's vocal development [7] and on boosting the process of early language learning [8].

Our recent study [9] focused on the analysis of vocalizations from children with American English (AE) and Shanghainese (Shang) language backgrounds.
While that study noted differences in the phonotactic and prosodic domains for the two language backgrounds, it is not clear whether the observed statistical differences are significant and consistent enough to be leveraged in language background discrimination. The main objective of the present study is to design an automatic language background assessment scheme utilizing phonotactics and pitch patterns, and to verify the significance of the background-specific production differences in a quantitative way. In addition to investigating the role of the two production domains in toddlers' background discrimination, the study aims at advancing the technology for children's speech assessment, which can benefit future automated child-computer interfaces with applications such as automatic detection of language switching in multilingual children or language acquisition assessment.

State-of-the-art language recognition systems for adult speech, as seen in recent National Institute of Standards and Technology Language Recognition Evaluation (NIST-LRE) [10] submissions, typically utilize one or a combination of several of the following strategies: cepstral coefficients with shifted delta cepstra (SDC) [11], Gaussian mixture modeling with universal background models (GMM-UBM) and GMM supervectors [12], i-vectors [13], phonotactic models realized by parallel phone recognizers and language modeling (PPRLM) [14], and phone recognizers combined with support vector machines (PRSVM) [15, 16]. In our study, PPRLM and PRSVM systems are used for