A Sandhi Splitter for Malayalam

Devadath V V, Litton J Kurisinkel, Dipti Misra Sharma, Vasudeva Varma
Language Technology Research Centre
International Institute of Information Technology - Hyderabad, India
{devadathv.v,litton.jkurisinkel}@research.iiit.ac.in
{dipti,vv}@iiit.ac.in

Abstract

Sandhi splitting is the primary task for computational processing of text in Sanskrit and Dravidian languages. In these languages, words can join together with morpho-phonemic changes at the point of joining. This phenomenon is known as Sandhi. A sandhi splitter splits a string of conjoined words into its individual words. Accurate sandhi splitting is crucial for text processing tasks such as POS tagging, topic modelling and document indexing. We tried different approaches to address the challenges of sandhi splitting in Malayalam, and finally chose to exploit the phonological changes that take place in words when they join. This resulted in a hybrid method which statistically identifies the split points and splits using predefined character-level linguistic rules. Currently, our system gives an accuracy of 91.1%.

1 Introduction

Malayalam is one of the four main Dravidian languages and one of the 22 official languages of India. It is spoken in the state of Kerala, on the south-west coast of India. The language is believed to have originated from old Tamil, with a strong Sanskrit influence on its vocabulary. Like any other Dravidian language, Malayalam is inflectionally rich and agglutinative. Agglutination in turn leads to the process of sandhi.

Sandhi is the process of joining two words or characters, where morphophonemic changes occur at the point of joining. Sandhi is abundant in Sanskrit and all Dravidian languages, and it is more frequent in Malayalam than in the other Dravidian languages.
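
To make the definition concrete, the following is a minimal sketch of one character-level change at a join, written over romanized strings. It is not the system described in this paper: the function names `join` and `split_at`, the romanization (with `M` standing for the anusvara), and the single rule (word-final `M` surfaces as `m` before a vowel, as in ceyyuM + enkil = ceyyumenkil) are illustrative assumptions. The split index is supplied by the caller; identifying it statistically is not modelled here.

```python
from typing import Tuple

# Assumption: romanized input, with "M" standing for the anusvara.
VOWELS = set("aeiou")


def join(left: str, right: str) -> str:
    """Join two romanized words, applying one illustrative sandhi rule."""
    # Assumed rule: word-final "M" surfaces as "m" before a vowel-initial word.
    if left.endswith("M") and right[:1] in VOWELS:
        return left[:-1] + "m" + right
    # Otherwise the words are simply concatenated.
    return left + right


def split_at(word: str, i: int) -> Tuple[str, str]:
    """Split `word` at index `i` (start of the right-hand word), undoing the rule."""
    left, right = word[:i], word[i:]
    # Reverse the assumed rule: "m" before a vowel is restored to "M".
    if left.endswith("m") and right[:1] in VOWELS:
        left = left[:-1] + "M"
    return left, right


if __name__ == "__main__":
    print(join("ceyyuM", "enkil"))
    print(split_at("ceyyumenkil", 6))
```

The inverse rule is deliberately naive: it restores `M` for every `m` at a split point followed by a vowel, which would be wrong for a word that genuinely ends in `m`. A realistic rule set would need many more rules and some way to choose among them.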
Even a full sentence may exist as a single string due to Sandhi. For example, അവനാരാണ് (avanaaraaN) is a sentence in Malayalam which means “Who is he?”. It is composed of three independent words, namely അവൻ (avan, “he”), ആര് (aar, “who”) and ആണ് (aaN, “is”). However, a word rarely has ambiguous splits in Malayalam.

Sandhis are of two types, Internal and External. Internal Sandhi exists between a root or a stem and a suffix or a morpheme, as in the following example:

പറ (para) + ഉന്നു (unnu) = പറയുന്നു (parayunnu)

Here പറ (para) is a verb root meaning “to say” and ഉന്നു (unnu) is an inflectional suffix marking present tense. They join to form പറയുന്നു (parayunnu), meaning “say” (PRES).

External Sandhi occurs between words: two or more words join to form a single string of conjoined words.

ചെയ്യും (ceyyuM) + എങ്കിൽ (enkil) = ചെയ്യുമെങ്കിൽ (ceyyumenkil)

ചെയ്യും (ceyyuM) is a finite verb meaning “will do” and എങ്കിൽ (enkil) is a connective meaning “if”. They join to form the single string ചെയ്യുമെങ്കിൽ (ceyyumenkil).

For most text processing tasks, such as POS tagging, topic modelling and document indexing, External Sandhi is a matter of concern. All these tasks require individual words in the text to