Copyright © 2014 IJEIR, All right reserved
International Journal of Engineering Innovation & Research
Volume 3, Issue 3, ISSN: 2277 – 5668
Marathi Isolated Words Speech Database for
Agriculture Purpose
Pukhraj P. Shrishrimal
Email: pukhraj.shrishrimal@gmail.com
Ratnadeep R. Deshmukh
Email: rrdeshmukh.csit@bamu.ac.in
Vishal B. Waghmare
Email: vishal.b.waghmare@ieee.org
Abstract – Research in language technologies for Indian
languages lags far behind that for the languages of
developed nations. Work on Marathi, an Indo-Aryan
language, is also behind. Developing a speech database is a
basic requirement for building an automatic speech
recognition system. The accuracy of speech recognition
depends on the quality of the speech data collected and the
quality of the training set. This paper describes progress
in developing an isolated-word speech database of the
Marathi language for agricultural purposes.
Keywords – Speech Database, Speech Recognition, Marathi
Language, Isolated Words, Speech Corpus.
I. INTRODUCTION
Humans communicate with each other through several means,
such as writing, speech and sign language. Communication
between humans is dominated by speech, which is the most
prominent and common way to pass messages between people.
A large number of languages are spoken around the world.
Humans have long considered using speech as a mode of
communication between human and computer.
Speech has the potential to serve as a mode of
interaction between human and computer. Human beings
have long been motivated to create computers that can
understand and talk like humans. In this direction,
researchers have tried to develop systems for the analysis
and classification of speech signals. Since the 1960s,
researchers have been trying to develop systems that can
record, interpret and understand human speech [1].
Language technologies may be very useful for a
developing country like India. Systems that can
understand and interpret speech can prove very effective in
the fields of agriculture, health, education and
e-governance. Information in today's world is accessible
only to those who are technologically literate, and the
information is available only in a specific language.
Language technologies can serve as a natural interface to
digital content for those who have no knowledge of the
technology.
Hindi is the national language of India, and 22
languages are recognized by the constitution of India. Apart
from these, about 1652 dialects / native languages
are spoken throughout the country. The recognized
languages, counted together with English, give 23
languages: 1) Assamese, 2) Bengali, 3) Bodo, 4) Dogri,
5) English, 6) Gujarati, 7) Hindi, 8) Kannada, 9) Kashmiri,
10) Konkani, 11) Maithili, 12) Malayalam, 13) Manipuri,
14) Marathi, 15) Nepali, 16) Oriya, 17) Punjabi,
18) Sanskrit, 19) Santali, 20) Sindhi, 21) Tamil,
22) Telugu, 23) Urdu [2].
For a multilingual country like India, language
technologies can play a vital role. Most Indian
languages are phonetic in nature. Hindi, the national
language of India, and Marathi, one of the languages
recognized by the constitution of India, are both written
in the Devanagari script. In the global scenario of speech
recognition systems, a lot of work has been completed for
English and for various languages of developed nations
around the world. Many research projects have been
completed or are in progress for various languages [3].
There is a lot of scope for the development of speech
recognition systems in Indian languages. The work that is
currently in progress is mostly for the national
language Hindi, followed by Tamil, Telugu, Bangla,
Assamese and Marathi. The work for these
languages is being carried out under the Linguistic Data
Consortium for Indian Languages (LDC-IL), which is
working on the development of continuous speech
recognition systems [4]. Work on the Marathi language is
limited and has been done mostly at IIT Bombay and
TIFR, Mumbai.
This paper describes the development of an
isolated-word speech database in the Marathi language. The
paper is organized as follows. Section II describes
the Marathi language. Section III describes the
development of the text corpus. Section IV discusses the
details of the speech data collection. Section V describes
the recording procedure followed, and Section VI explains
the problems faced during the development of the database.
Section VII focuses on the removal of background noise.
The conclusion and the future scope of the work are
described in Sections VIII and IX respectively.
II. MARATHI LANGUAGE
Marathi is one of the languages recognized by the
constitution of India. It is written in the Devanagari
script, like the national language Hindi. Devanagari
is the script used for writing Sanskrit, from which
these languages are derived.
Marathi is an Indo-Aryan language spoken by the
Marathi people of western and central India. There were
73 million speakers around the world in 2001. Marathi has
the fourth largest number of native speakers in India [5].
Marathi is spoken throughout the state of Maharashtra,
which covers a vast geographical area consisting of 35
districts. The major dialects of Marathi are
Standard Marathi and Warhadi Marathi [6]. Other
sub-dialects include Ahirani, Dangi, Vadvali, Samavedi,
Khandeshi and Malwani. However, standard Marathi is the