(IJACSA) International Journal of Advanced Computer Science and Applications, Vol. 13, No. 5, 2022

A Novel Readability Complexity Score for Gujarati Idiomatic Text

Jatin C. Modh 1
Research Scholar, Gujarat Technological University, Ahmedabad, India

Jatinderkumar R. Saini 2 *
Symbiosis Institute of Computer Studies and Research, Symbiosis International (Deemed University), Pune, India

Ketan Kotecha 3
Symbiosis Centre for Applied Artificial Intelligence, Symbiosis International (Deemed University), Pune, India

*Corresponding Author

Abstract—The Gujarati language is spoken by more than 55 million people worldwide and is more than 1000 years old. It is the chief language of the Indian state of Gujarat. Gujarati has many dialects, such as Standard Gujarati, Amdawadi Gujarati, Kathiawadi Gujarati and Kutchi Gujarati. Like Hindi and other Indo-Aryan languages, Gujarati is morphologically very rich. Many readability tests are available for English, but no readability complexity test is available for Gujarati idiomatic text. The complexity score is a sub-concept of the readability test. To define the complexity level of a Gujarati text, its complexity score is calculated. We developed a novel method for calculating a readability complexity score that considers the number of letters in each word, the number of diacritics in each word, Gujarati idiomatic text of n-grams where n = 1 to 9, and Gujarati idiomatic text of m-meaning idioms where m = 1 to 7. The complexity score is calculated as the sum of the word complexity score, the diacritics complexity score, the n-gram complexity score of Gujarati idioms and the m-meaning complexity score of Gujarati idioms. We emphasize Gujarati idiomatic text in the calculation because idioms make text more difficult to understand. This is an innovative, first-of-its-kind work in the Gujarati language research community. The results are promising enough to employ the suggested complexity score method in developing a readability test method for natural language processing tasks for the Gujarati language.

Keywords—Complexity; Gujarati; idiomatic text; natural language processing (NLP); readability

I. INTRODUCTION

The Gujarati language is named after the Gurjar people, who are said to have established themselves in the middle of the 5th century CE. Gujarati is spoken by more than 55 million people worldwide and is an Indo-Aryan language more than 1000 years old. It ranks 26th among the most widely spoken native languages in the world. Gujaratis are spread all over the world. Gujarati is the chief language of the Indian state of Gujarat. It is also a main language in the union territories of Daman and Diu and Dadra and Nagar Haveli. Outside India, it is spoken in many countries, including the United States, Canada, the UK and Southeast African countries. Gujarati has many dialects, such as Standard Gujarati, Amdawadi Gujarati, Kathiawadi Gujarati and Kutchi Gujarati. The spelling of Gujarati words is based on pronunciation [1][2].

A. Gujarati Script

The Gujarati script is similar to the Devanagari script, except that it does not have the horizontal line above the characters. The Gujarati alphabet mainly consists of 34 consonants, 13 vowels and 10 digits, which serve as the building blocks of the language. The Sarth Gujarati dictionary contains more than 65,000 words, excluding technical and slang words [3]. Gujarati vowels and consonants can be written as independent letters or combined with diacritic marks. Diacritics play a very important role in building meaningful words and thus the vocabulary of the Gujarati language. Fig. 1 shows the use of diacritics with the letter ત. Gujarati diacritics and conjuncts make the Gujarati script more effective for writing and communication [4][5].

B. Gujarati Idioms

An idiom is a group of words whose meaning is established by usage rather than by the literal meaning of its individual words. Gujarati speakers use idioms to express thoughts, feelings and messages. Gujarati idioms are difficult to understand for non-Gujarati speakers as well as for children in lower school grades. They can be understood from the surrounding context [6]. Gujarati idioms can be classified on the basis of n-grams and on the basis of the number of meanings (m-meanings) [8]. They can also be classified as static idioms versus inflected idioms. Here we treat idioms as unfamiliar words. An example of a Gujarati idiom is જળ લેવું 'jala levum', i.e. to take a vow. It is a bigram (2-gram), single-meaning idiom.

C. Text Complexity

The English alphabet consists of 26 letters: 21 consonants and 5 vowels. Generally, three aspects are used to decide the complexity of English text: quantitative measures, qualitative measures, and considerations relating to the reader and the task [7]. The Gujarati language is morphologically very rich compared to English. It has 18 diacritics [6]. Diacritics attached before or after a letter make many word formations possible. Using diacritics, various inflectional forms are possible for Gujarati verbs and nouns [9]. Here only quantitative measures are considered for complexity, as our text is in written form only. Factors such as sentence length, word length and the frequency of unfamiliar words are used as quantitative measures of text complexity.
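
The quantitative features named above (letters per word, diacritics per word, and the n-gram order of an idiom) can be illustrated with a minimal Python sketch. It is not the scoring method itself: it only counts raw features, and it rests on assumptions that do not come from the text. Specifically, Unicode general categories are used as a proxy, with `Lo` counted as a letter and `Mn`/`Mc` counted as a diacritic, which may not coincide with the list of 18 diacritics cited above (for example, the virama used in conjuncts is counted as a diacritic here). The function names are hypothetical, and the idiom 'jala levum' is assumed to be spelled જળ લેવું.

```python
import unicodedata

# Idiom transliterated as 'jala levum'; the spelling જળ લેવું is an assumption.
# Escapes are used so the exact code points are unambiguous.
IDIOM = "\u0a9c\u0ab3 \u0ab2\u0ac7\u0ab5\u0ac1\u0a82"


def count_letters_and_diacritics(word: str) -> tuple[int, int]:
    """Return (letters, diacritics) for one word.

    Assumption: category 'Lo' approximates base letters, and the
    combining-mark categories 'Mn' and 'Mc' approximate diacritics.
    """
    letters = 0
    diacritics = 0
    for ch in word:
        category = unicodedata.category(ch)
        if category == "Lo":
            letters += 1
        elif category in ("Mn", "Mc"):
            diacritics += 1
    return letters, diacritics


def idiom_features(idiom: str) -> dict:
    """Collect raw counts for a whitespace-separated idiom (hypothetical helper)."""
    words = idiom.split()
    per_word = [count_letters_and_diacritics(w) for w in words]
    return {
        "ngram_order": len(words),  # number of words in the idiom
        "letters": sum(l for l, _ in per_word),
        "diacritics": sum(d for _, d in per_word),
        "per_word": per_word,
    }


if __name__ == "__main__":
    print(idiom_features(IDIOM))
```

Under these assumptions the first word contributes two letters and no diacritics, the second contributes two letters and three diacritics, and the idiom is reported as a 2-gram, in line with the bigram classification given in Section B. How such counts are mapped to component scores is left open here.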