Ezafe Prediction in Phrases of Farsi Using CART

Abbas Koochari a,1, Behrang QasemiZadeh b, Mojtaba Kasaeiyan a
a Amirkabir University, Tehran, Iran
b Iran University of Science and Technology, Narmak, Tehran, Iran

Persian, also known as Farsi, is the official language of Iran and Tajikistan and one of the two main languages spoken in Afghanistan. Persians adopted a unified Arabic script for writing. As a consequence, short vowels are usually not written in Farsi text. The Ezafe marker, the genitive marker of Farsi, is usually realized as a short vowel and is therefore not written; as a result, written Farsi text appears as a series of consecutive nouns without any overt links or boundaries. This can lead to complexity and inefficiency in the syntactic analysis of Farsi. This paper introduces a corpus-based method for detecting the Ezafe marker in written Farsi text to overcome this problem. A Classification and Regression Tree (CART) is used to predict the presence or absence of the Ezafe marker. The evaluation shows promising results for our approach.

Keywords: Persian, Natural Language Processing, CART, Corpus

1 INTRODUCTION

Farsi, also known as Persian, is the official language of Iran and Tajikistan and one of the two main languages spoken in Afghanistan. Farsi is a member of the Indo-Iranian branch of the Indo-European languages and has properties of agglutinative languages [1][2]. After the Arab conquest in 651 A.D., the Persians adopted an extension of the unified Arabic script for writing [3]. Salient characteristics of the Arabic script are: the existence of various connecting letters, varying graphic forms for many letters depending on their position in a word, varying letter width, the absence of full-size characters for vowels (vowels are represented as particular signs above and below characters), the existence of a number of digraphs and composite letters, a right-to-left writing direction, and the absence of upper-case and lower-case letters.
Because Farsi uses an extension of the Arabic writing system, short vowels are not written in Farsi texts [4]. In Farsi, the Ezafe marker, a suffix that connects the elements in a phrase, may accompany nouns and adjectives. The Ezafe marker usually appears as a short vowel named Kasre, which is pronounced "e". It acts like the possessive "'s" (e.g., John's car) or the preposition "of" (e.g., the car of her brother) in English. A noun accompanied by the Ezafe marker can be considered the genitive case of that noun. Because the Ezafe marker appears as a short vowel, Farsi orthography allows it to be omitted from written text. The result, in written Farsi text, is a series of consecutive nouns without any overt links or boundaries, as shown in the following example [5]:

Māshīn dūst brādr Ali
Car friend brother Ali
Ali's brother's friend's car

The actual pronunciation of this example in the spoken language is “Māshīn-e dūst-e barādar-e Ali”, where the Ezafe marker is represented by "-e" [5]. An important property of the Ezafe marker is that all elements occurring between the head noun and the possessor noun phrase are linked to the head noun and to one another by the Ezafe marker. In addition, within an adjectival phrase, Ezafe may link the adjectival head to its unique complement. There are several studies of Ezafe in Farsi from a linguistic point of view. Samiian [6] is the first detailed study of Ezafe in Persian within a modern syntactic framework, namely X-bar theory. The empirical facts mentioned by Samiian have been taken

1 Corresponding Author: Abbas Koochari, a_koochari@yahoo.com