An Exhaustive Rule-Based Affix Extraction for Stemming in Tagalog

Laurenz Adriel Tolentino
College of Computer Studies
De La Salle University
Philippines
laurenz_tolentino@dlsu.edu.ph

Allan Borra
College of Computer Studies
De La Salle University
Philippines
allan.borra@dlsu.edu.ph

ABSTRACT
This paper presents an exhaustive rule-based stemming approach for extracting affixes in Tagalog. The approach aims to reduce the understemming and unstemmed errors that are present in existing works on Tagalog. It performs stemming by generating a tree in which each node represents a word form derived from the input, each node has three branches representing the bound morpheme positions (prefix, infix, and suffix), and each leaf represents a root word candidate. Because the tree exhaustively shows all stemming possibilities, it removes the need for a fixed stemming routine, which is one of the sources of understemming and unstemmed errors. A corpus containing 536 sentences with 3,466 unique words of formal and literary Tagalog was sourced from texts such as newspaper articles, novels, and government reports, and was used to test the performance of the proposed exhaustive rule-based approach and of an existing work that employs an example-based approach. The exhaustive approach has an accuracy of 74.29%, while the example-based approach has a lower accuracy of 60.04%, making the proposed approach better at extracting affixes in Tagalog.

Keywords
Stemmer; Morphology; Lemmatization; Tagalog; Natural Language Processing

1. INTRODUCTION
Tagalog, one of the many languages of the Philippines, is the most widely spoken language in the country: nearly 35.1% of the population use Tagalog as their primary language, and some regions of the country speak Tagalog as their second or third language aside from English [3].
Languages such as English, Malay, and Spanish have also influenced Tagalog both phonologically and lexically, making Tagalog’s morphosyntactic properties diverse and complex [1]. Tagalog exhibits morphological phenomena such as affixation (prefix, infix, and suffix), reduplication, stress shifting, vowel reduction, and consonant alternation, as well as morphophonemic changes such as vowel loss, phoneme changes, and gradation. These phenomena make morphological analysis and stemming, a computational procedure that reduces words with the same root to their common form [3], much more challenging than in English.

Affix extraction is an important process in stemming, which is used to extract root words from a text. Affix extraction is also used by morphological analyzers, which extract not only root words but also their morphological properties; these properties can only be identified by extracting the affixes attached to a word. Unfortunately, existing works on Tagalog fail to extract affixes properly because they do not cover all the morphophonemic changes and morphological phenomena present in Tagalog. Pioneering works on linguistic tools for Tagalog, such as Fortes’ [6] morphological analyzer for Tagalog verbs, Bonus’ [3] rule-based Tagalog Stemming Algorithm (TagSA), and See’s [8] example-based revised Word Frame model for Tagalog, face the challenge of modelling all Tagalog morphological phenomena. The lack of electronic resources, such as digital dictionaries and Tagalog documents that can be used for analysis, is another obstacle that researchers have to face. The slow research and development of even the most basic language tools affects other developments, such as information extraction, which requires stemmers for querying.

2. TAGALOG MORPHOLOGY
In the study of natural language, or linguistics, morphology is the study of the form and the structure of words [2].
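To make the idea of word structure and affix extraction concrete, the following is a minimal sketch of removing one prefix, infix, or suffix at a time and collecting the forms from which nothing more can be removed, in the spirit of the candidate tree described in the abstract. The affix lists, the minimum length, and the names strip_once and candidates are illustrative assumptions and are not the rule set of this paper; reduplication and morphophonemic changes are ignored.

```python
# Hypothetical, deliberately tiny affix inventories (assumptions for illustration).
PREFIXES = ("mag", "nag", "ma", "pag")
INFIXES = ("um", "in")
SUFFIXES = ("han", "hin", "an", "in")
VOWELS = "aeiou"
MIN_LEN = 3  # assumed minimum length of a remaining word form


def strip_once(word):
    """Return (position, affix, remaining form) for every single-affix removal."""
    results = []
    for p in PREFIXES:
        if len(word) - len(p) >= MIN_LEN and word.startswith(p):
            results.append(("prefix", p, word[len(p):]))
    for i in INFIXES:
        # Simplification: an infix sits right after an initial consonant (s-um-ulat).
        if (len(word) - len(i) >= MIN_LEN
                and word[0] not in VOWELS
                and word[1:1 + len(i)] == i):
            results.append(("infix", i, word[0] + word[1 + len(i):]))
    for s in SUFFIXES:
        if len(word) - len(s) >= MIN_LEN and word.endswith(s):
            results.append(("suffix", s, word[:-len(s)]))
    return results


def candidates(word):
    """Walk the tree depth-first; forms with no removable affix are the leaves."""
    children = strip_once(word)
    if not children:
        return {word}
    leaves = set()
    for _position, _affix, form in children:
        leaves |= candidates(form)
    return leaves


if __name__ == "__main__":
    for w in ("sumulat", "magsulat", "sulatan"):
        print(w, sorted(candidates(w)))
```

Because every matching affix is tried, the sketch can also return spurious forms (for example, removing "ma" from "magsulat"), so in practice the leaves would still need to be checked against some resource such as a root word list; that check is outside this sketch.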
Studying the internal structure of words involves studying morphemes, often defined as the smallest units of a language that carry grammatical meaning; they can be classified as either bound or free. Free morphemes are morphemes that can be used on their own without any modification, such as root words, and that have their own syntactic function, such as a part of speech. Bound morphemes are morphemes that cannot be used on their own and must be attached to a free morpheme, which creates a new word. Most bound morphemes are affixes, which are used to modify a free morpheme’s grammatical function, syntactic function, or tense.

Any modification done to a word forms a new word, which can be classified as an inflectional word, a derivational word, or a compound word [6, 3]. Inflectional words are formed when bound morphemes change the case, gender, number, tense, person, mood, or voice of the stem they are attached to. Derivational words are created when the attached bound morphemes change a base’s syntactic and grammatical category. Compound words are formed when two unrelated and independent root words are concatenated. A compound word is a new word that is considered a root and can undergo derivational or inflectional changes.

The morpheme or word that undergoes morphological transformations can be categorized into three, namely: base, root word, or stem [5, 2]. Roots are words or free morphemes that cannot be further reduced. A stem can be used