Dash, Niladri Sekhar (2007) “Frequency-based analysis of words and morphemes in Bengali text corpus”. Indian Journal of Linguistics. Vol. 25. No. 26. Pp. 223-253, 2007.

FREQUENCY-BASED ANALYSIS OF WORDS AND MORPHEMES IN BENGALI TEXT CORPUS

NILADRI SEKHAR DASH
Indian Statistical Institute, Kolkata
Email: ns_dash@yahoo.com

ABSTRACT

The introduction of several new application-based sub-disciplines of linguistics calls for statistical results on language properties, since statistical information is a useful and necessary input for designing tools and systems for language technology, developing course books and text materials for teaching, and verifying previously proposed theories and observations. Keeping these needs in view, in this paper we furnish some frequency counts at the word and morpheme level obtained from a corpus of written Bengali texts. This is perhaps the first attempt of its kind, and it has the potential to yield new results that let us look at the language from a different perspective. The quantitative findings, together with their qualitative analysis, provide necessary and useful information and examples for designing robust tools and systems of language technology, developing language teaching texts, and compiling dictionaries in Bengali on the indispensable empirical basis that traditional works lack.

1. INTRODUCTION

The present paper is a logical continuation of our previous study (Dash 2004) in the sense that we have tried to furnish, in brief, some simple quantitative results on words and morphemes occurring in a sample Bengali corpus of written texts (Dash 2005). The study allows us to discover the varieties of words and morphemes occurring in the language, an issue that has never been addressed before with the support of an empirical database in the form of a corpus.
By looking at these results, we not only obtain a general picture of the use of words and morphemes in the language but also come to know which patterns of use are relatively normal and which are relatively abnormal.

The use of quantitative information in language study was established long before the introduction of the electronic corpus. More than half a century ago, Flesch (1946) presented some interesting observations on English words. He argued that in English, words with an average length of 1.12 syllables are ‘very easy’ to comprehend, while words with an average length of 1.23 syllables are ‘easy’, 1.39 syllables ‘fairly easy’, 1.47 syllables ‘standard’, 1.55 syllables ‘fairly difficult’, 1.67 syllables ‘difficult’, and 1.92 syllables or more ‘very difficult’ to comprehend for common language users. Dewey (1923) also studied a corpus to find out the average length of words in some English prose texts. Gibson (1962) used a corpus of literary texts to calculate the average length of words in the writings of Shakespeare as well as in the Authorized Version of the Bible. Similar works may also be credited to Elderton (1949), Herden (1956), Good (1957), Miller, Newman, and Friedman (1958), Edwards and Chambers (1964) and others.

After the introduction of language corpora in electronic form, word-level statistical studies have been attempted with renewed enthusiasm, since corpora of various types are easily accessible by computer. We may refer to a few such works, which are based on corpora of various types in English. In an interesting study, Leech, Francis, and Xu (1994) examine the existence of non-discrete categories in word meaning in English, while Kilgarriff (1996) examines similarities and differences in the lexical stocks of different text corpora of English.
McEnery and Wilson (1996) use several statistical methods to trace finer shades of lexical distinction between the Brown Corpus and the LOB Corpus. Xu (1996) also uses statistics to measure the average length of English words used in several corpora of English belonging to various disciplines. Biber, Conrad, and Reppen (1998) use different statistical methods to count the frequency of occurrence of different linguistic