International Journal of Emerging Science and Engineering (IJESE), ISSN: 2319-6378, Volume-2, Issue-2, December 2013

Isolated Swahili Words Recognition using Sphinx4

Shadrack K. Kimutai, Edna Milgo, David Gichoya

Abstract: Speech recognition is one of the frontiers in Human Computer Interaction. A number of tools for speech recognition are currently available. One such tool is Sphinx4 from Carnegie Mellon University (CMU). It has a recognition engine based on the discrete Hidden Markov Model (dHMM) and a modular structure that makes it adaptable to a diverse set of requirements. However, most efforts undertaken with this tool have focused on established dialects such as English and French. Despite Swahili being a major spoken language in Africa, a literature search indicates that little research has been undertaken on developing a speech recognition tool for this dialect. In this paper, we propose an approach to building a Swahili speech recognizer using Sphinx4 to demonstrate its adaptability to the recognition of spoken Swahili words. To realize this, we examined the Swahili language structure and sound synthesis processes. A 40-word Swahili acoustic model was then built, based on the observed language and sound structures, using CMU SphinxTrain and associated tools. The developed acoustic model was then tested using Sphinx4.

Keywords: Sphinx4, Swahili Language, Speech Recognition, Hidden Markov Model.

I. INTRODUCTION

Speech recognition has been an intense area of study. According to [1], researchers approach the field from various fronts of knowledge such as human sciences, statistics, artificial intelligence, linguistics, acoustic sciences, and information science, among many others. Speech can be regarded as a subset of sound since it shares all the characteristics of sound. Indeed, as observed by [2], human speech is usually sampled at a rate that lies between 8 kHz and 16 kHz.
Speech can in this sense be redefined as a means through which information is relayed through air after its production in the human speech-producing organs. Speech harbors many complex features that may not be easily recognized except under close scrutiny. These features include phones, phonemes, coarticulation, and segmentation, among others. Phones are the class of sounds, numbering around fifty (50), that are used in all human languages. A phoneme, on the other hand, is the smallest unit of sound that distinguishes meaning. At this juncture it is appropriate to indicate that the identification and documentation of Swahili phones is still ongoing, with the total estimated number varying from 31 to 37 [3]. Articles [3] and [5] observed that the Swahili dialect is made up of 32 phones, 5 of which are vowels. The Swahili alphabet can be further grouped as follows: 23 single letters and 9 digraphs [3].

Manuscript received December 2013.
Shadrack Kipchirchir Kimutai, IT Department, School of Information Science, Moi University, Eldoret, Kenya.
Edna Milgo, IT Department, School of Information Science, Moi University, Eldoret, Kenya.
David Gichoya, IT Department, School of Information Science, Moi University, Eldoret, Kenya.

II. RELATED STUDIES

A number of researchers have made efforts to develop an automatic speech recognition (ASR) system for the dialect. However, these efforts have encountered challenges, including the lack of a standardized acoustic model for the dialect. This has made research in this area costly. For instance, the researchers in [4] had to use crowdsourcing to construct their acoustic model for the dialect, and they justified this choice by the lack of an acoustic model that suited their needs. Another study is the Swahili Text-To-Speech System by [6]. However, that research did not develop an ASR for the dialect; it was limited to developing a system that could read out some Swahili text.
Besides these, other related studies in the area are notable. One such study undertook data-driven part-of-speech tagging, which proves quite useful, especially when working with bi-gram or tri-gram models [8]. Related to this, [9] presented the development of open source spell-checking for the Gikuyu dialect, which is a Bantu dialect in Kenya. It should be noted that Bantu languages and Swahili share many morphological characteristics, as identified by [5], [6] and [9].

III. STATEMENT OF THE PROBLEM

Despite Swahili being a major spoken language in Africa, a literature search reveals that little research has been undertaken on developing a speech recognition tool for this dialect. For this reason, we propose adapting the Sphinx4 ASR to the Swahili dialect. Sphinx4 is capable of achieving speech recognition in any given language. However, the challenge of developing an acoustic model and a language model for a given dialect is left to the researcher who wishes to develop an ASR for it. This is especially the case when the dialect has not previously been adapted to any ASR.

IV. PROPOSED SOLUTION

In this paper we propose an approach to building a Swahili speech recognizer to demonstrate the adaptability of Sphinx4 to the recognition of spoken Swahili words. To achieve this, we selected 40 words, of which 31 were used as the training sample and the remaining 9 as the testing sample. The 31 training words were selected based on the phones they contain, while the test sample was selected randomly, ensuring that the phones its words contain are represented in the training sample. After the selection of these words, multiple speech samples of each word were recorded. The training and development of the acoustic and language models were based on these speech recordings. Once developed, the models were plugged into the Sphinx4 recognition engine and tested.

V. SPHINX4

All ASRs in the Sphinx family use HMMs as their method of recognition. According to [12], the first member of the