Developing a Children’s Filipino Speech Corpus for Application in Automatic Detection of Reading Miscues and Disfluencies

Ronald M. Pascual¹ and Rowena Cristina L. Guevara²
Digital Signal Processing Laboratory, Electrical and Electronics Engineering Institute, University of the Philippines Diliman
¹ronaldmpascual@gmail.com, ²gev@eee.upd.edu.ph

Abstract – Recognizing the potential of current speech processing technology to improve children’s literacy, researchers in the past few years have devoted their efforts to developing reading miscue detectors (RMDs) and automated reading tutors (ARTs). A primary challenge in developing speech technologies for children, however, may be the unavailability of a dedicated children’s speech corpus that can be used for system design and testing. In the past few years, children’s speech corpora have been developed for languages such as English, Dutch, Mandarin Chinese, Italian, German and Swedish. Because Filipino has features and an orthography that are distinct from those of other languages, this study focuses on the development of a children’s Filipino speech corpus (CFSC). In this paper, we present the CFSC design, reading text, data collection procedure and speech transcription method. We also present an initial analysis of the reading miscues and disfluencies found in the CFSC. The results of the miscue analysis suggest possible ways of modeling the reading miscues and possible methods for detecting them. Among these methods are acoustic model likelihood calculation and analysis of duration-based prosodic features. The CFSC presented in this study will be used for the development of an RMD and an ART for Filipino.

Keywords – speech technology for children; children’s speech corpus; reading miscue detector; automated reading tutor; Filipino speech

I. INTRODUCTION

Recognizing the potential of current speech processing technology to improve children’s literacy, researchers in the past few years have devoted their efforts to the development of computer-assisted oral reading assessment and learning systems such as reading miscue detectors (RMDs) and automated reading tutors (ARTs). For instance, the studies in [1], [2] and [3] presented three different RMD system designs for English, Mandarin Chinese and Dutch, respectively, while the studies in [4] and [5] presented Project LISTEN’s reading tutor and the Colorado Literacy Tutor, respectively, two of the most popular ART systems for English. The lack of equivalent studies for the Filipino language has motivated us to collect a children’s Filipino speech corpus (CFSC) as the first phase of a project* on the development of an ART system for children in the Philippines.

The problems that the Philippine primary education system currently faces, such as the poor reading performance of students and the shortage of teachers, inspired us to focus on the development of a computer-assisted learning system for Filipino children. According to the results of national achievement tests in the past few years and the reviews conducted by international organizations, the general quality of primary education in the Philippines falls below standard [6, 7, 8].

A primary challenge in developing speech technologies for children may be the unavailability of a dedicated children’s speech corpus that can be used for system design and testing. In recent years, the automatic speech recognition (ASR) community has become aware of the need to develop children’s speech corpora in order to improve ASR system performance, which appears to be generally poorer on children’s speech than on adults’ speech [9].
In fact, several studies in the past few years, such as [10], [11] and [12], have focused on the development of children’s speech corpora in languages such as English, Dutch, Italian, German and Swedish. It was noted in [9] that although an ASR system trained on adults’ speech can employ an adaptation technique to improve its performance on children’s speech, its performance is unlikely to exceed that of an equivalent system trained on children’s speech.

In this paper, we focus on the development of a children’s speech corpus in Filipino, the national language of the Philippines and a language used in the Philippine basic education system. Because Filipino has features and an orthography that are distinct from those of other languages, there is an apparent need to develop a language-specific children’s speech corpus. In terms of speech rhythm, for instance, Filipino is generally classified as a syllable-timed language [13]. In syllable-timed languages, unlike stress-timed languages such as English, every syllable is perceived as taking up roughly the same amount of time, apart from small variations due to prosody.

*This research was funded by the CHEDSEGS grant from the Commission on Higher Education of the Philippines, in collaboration with the Office of the Vice-Chancellor for Research and Development of the University of the Philippines Diliman.

We emphasize that the CFSC presented in this paper is designed primarily for developing an RMD system for children’s oral reading in Filipino. For this purpose, the CFSC also contains reading miscues and disfluencies that occurred naturally in the participants’ read speech. Although it is possible to simulate