Copyright © 2014 IJEIR, All rights reserved
International Journal of Engineering Innovation & Research
Volume 3, Issue 3, ISSN: 2277 – 5668

Marathi Isolated Words Speech Database for Agriculture Purpose

Pukhraj P. Shrishrimal
Email: pukhraj.shrishrimal@gmail.com

Ratnadeep R. Deshmukh
Email: rrdeshmukh.csit@bamu.ac.in

Vishal B. Waghmare
Email: vishal.b.waghmare@ieee.org

Abstract – Research on language technologies for Indian languages lags far behind that for the languages of developed nations. Work on Marathi, an Indo-Aryan language, is likewise behind. A speech database is the basic requirement for developing an automatic speech recognition system. The accuracy of speech recognition depends on the quality of the speech data collected and the quality of the training set. This paper describes progress in the development of an isolated-word speech database of the Marathi language for agricultural purposes.

Keywords – Speech Database, Speech Recognition, Marathi Language, Isolated Words, Speech Corpus.

I. INTRODUCTION

Humans communicate with each other through several means, such as writing, speech and sign language. Communication between humans is dominated by speech; it is the most prominent and common way to pass messages between people. A number of languages are spoken around the world. People have long considered using speech as a mode of communication between humans and computers, and speech has the potential to serve as such a mode of interaction. Human beings have long been motivated to create computers that can understand and talk like humans. In this direction, researchers have tried to develop systems for the analysis and classification of speech signals. Since the 1960s, researchers have been trying to develop systems that can record, interpret and understand human speech [1]. Language technologies may be very useful for a developing country like India.
Systems that can understand and interpret speech can prove very effective in the fields of agriculture, health, education and e-governance. Information in today's world is accessible only to those who are technologically literate, and it is presented in a specific language. Language technologies can serve as a natural interface to digital content for those who have no knowledge of the technology.

Hindi is the national language of India, and there are 22 languages recognized by the constitution of India. Apart from these, about 1652 dialects / native languages are spoken throughout the country. Counting English alongside the 22 constitutionally recognized languages gives the following 23 languages: 1) Assamese, 2) Bengali, 3) Bodo, 4) Dogri, 5) English, 6) Gujarati, 7) Hindi, 8) Kannada, 9) Kashmiri, 10) Konkani, 11) Maithili, 12) Malayalam, 13) Manipuri, 14) Marathi, 15) Nepali, 16) Oriya, 17) Punjabi, 18) Sanskrit, 19) Santali, 20) Sindhi, 21) Tamil, 22) Telugu, 23) Urdu [2].

For a multilingual country like India, language technologies can play a vital role. Most Indian languages are phonetic in nature. Hindi, the national language of India, and Marathi, one of the languages recognized by the constitution of India, are both written in the Devanagari script.

Globally, a great deal of speech recognition work has been completed for English and for various languages of developed nations. Many research projects have been completed or are in progress for various languages [3]. There is considerable scope for the development of speech recognition systems for Indian languages. The work currently in progress is mostly for the national language Hindi, followed by Tamil, Telugu, Bangla, Assamese and Marathi. The work for these languages is being carried out under the Linguistic Data Consortium for Indian Languages (LDC-IL), which is working on the development of continuous speech recognition systems [4].
Work on the Marathi language is limited, as it is done mostly at IIT Bombay and TIFR, Mumbai. This paper describes the development of an isolated-word speech database in the Marathi language. The paper is organized as follows. Section II describes the Marathi language. Section III describes the development of the text corpus. Section IV discusses the speech data collection. Section V describes the recording procedure followed, and Section VI explains the problems faced during the development of the database. Section VII focuses on the removal of background noise. The conclusion and the future scope of the work are described in Sections VIII and IX respectively.

II. MARATHI LANGUAGE

Marathi is one of the languages recognized by the constitution of India. It is written in the Devanagari script, like the national language Hindi. Devanagari is the script used for writing Sanskrit, from which these languages are derived. Marathi is an Indo-Aryan language spoken by the Marathi people of western and central India. There were 73 million speakers around the world in 2001. Marathi has the fourth largest number of native speakers in India [5]. Marathi is spoken throughout the state of Maharashtra, which covers a vast geographical area consisting of 35 districts. The major dialects of Marathi are Standard Marathi and Warhadi Marathi [6]. Other sub-dialects include Ahirani, Dangi, Vadvali, Samavedi, Khandeshi and Malwani. However, standard Marathi is the