Comparative analysis of actual language usage and selected grammar and orthographical rules for Filipino, Cebuano-Visayan and Ilokano: a Corpus-based Approach

Joel P. Ilao¹, Timothy Israel D. Santos² and Rowena Cristina L. Guevara³
Digital Signal Processing Laboratory
Electrical and Electronics Engineering Institute, University of the Philippines Diliman
{joel.ilao¹, timothy_israel.santos²}@up.edu.ph, gev@eee.upd.edu.ph³

ABSTRACT
The implementation of Mother Tongue-Based Multilingual Education (MTBMLE) will require definitive rules for orthography and grammar. While such rules exist for some Philippine languages, there is a need to determine where the rules and actual usage agree and where they depart, in order to avoid confusion. This study makes an objective analysis of the levels of agreement, in terms of grammar and orthographic rules, between reference books and actual usage as evidenced by web-mined text corpora for three major Philippine languages, namely Filipino, Cebuano-Visayan and Ilokano. A list of language rules on grammar and orthography was selected from standard reference books for each of these languages. Alternative forms of usage for each selected language rule were identified, and frequency counts were made, to serve as bases for a comparative analysis between the rules prescribed by standard reference books and actual language usage. The techniques used in this study are important in language education because they help identify areas of confusion in language use in aspects of grammar and orthography.

Categories and Subject Descriptors
J.5 [Arts and Humanities]: Linguistics

General Terms
Human Factors, Standardization, Verification

Keywords
Orthography, Corpus Linguistics

1. Introduction

Since the Department of Education (DepEd) officially presented an MTBMLE initiative through its department order circular in 2009 [1], mechanisms have been put in place so that preparations for rolling out the MTBMLE program, toward its full implementation in June 2012, gain significant momentum. Dekker (2010) pointed out that a crucial component of any Mother Tongue Language Education initiative is the development of a “writing system that will be acceptable to the majority of mother tongue speakers and to the government and will encourage members of the language communities to continue reading and writing in their language” [2], a requirement echoed by Nolasco (2011) for community-based MTBMLE programs in his MLE Primer [3]. Looking at the 2009 DepEd circular, and considering the papers that show successful MTBMLE practice, it is evident that an important prerequisite to this program is a working orthography that is widely acceptable to the learning community and compatible with that language's intellectualization.

The linguistic diversity of the Philippines, with 171 living languages and around 500 dialects, poses a major challenge to such an initiative, because the requisite maturity of the orthographic system of each candidate language of instruction cannot be guaranteed. Moreover, as the MTBMLE program matures, the grammatical and orthographic rules of the language of instruction will need to be refined as the language is increasingly used in the academic setting. These scenarios argue for a system that can periodically monitor the state of a language's development by observing how it is being used by a population of users. In this paper, we present examples of the current state of language development for three major Philippine languages, through our corpus-based analysis of Filipino/Tagalog, Cebuano and Ilokano.
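The frequency-count step described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names `count_variants` and `prescribed_share`, the tokenization rule, and the example variant pair are all assumptions made for illustration. The sketch tallies how often each alternative form of a rule occurs in a text and reports the share of occurrences that follow the prescribed form.

```python
import re
from collections import Counter

# Hypothetical helper names; the tokenization rule is an assumption,
# not the procedure used in the paper.
TOKEN_PATTERN = re.compile(r"[^\W\d_]+(?:-[^\W\d_]+)*")


def count_variants(text, variant_groups):
    """Count occurrences of each alternative form, grouped by rule.

    `variant_groups` maps a rule label to the list of competing forms.
    Matching is case-insensitive and works on whole tokens only.
    """
    tokens = Counter(TOKEN_PATTERN.findall(text.lower()))
    return {
        rule: {form: tokens[form.lower()] for form in forms}
        for rule, forms in variant_groups.items()
    }


def prescribed_share(counts, prescribed):
    """Return the fraction of occurrences that use the prescribed form.

    Returns None when no form of the rule occurs in the corpus at all.
    """
    total = sum(counts.values())
    return counts[prescribed] / total if total else None


if __name__ == "__main__":
    # Toy text and variant pair, invented for illustration only.
    sample = "Kumusta ka? Kamusta na kayo? Kumusta po."
    groups = {"greeting": ["kumusta", "kamusta"]}
    counts = count_variants(sample, groups)
    print(counts)
    print(prescribed_share(counts["greeting"], "kumusta"))
```

Treating one form in each group as the prescribed one reduces the comparison between reference books and usage to a single ratio per rule, which can then be tracked across corpora or over time.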
This paper is organized as follows: Section 1 focuses on the developments in the orthography of the three major languages just mentioned; Section 2 introduces our methodology for detecting groups of spelling variants extracted from corpora for the three major languages; Section 3 presents the data gathered, the list of spelling variant groups culled from the text corpora on hand, and our analysis of the experimental results. We end this paper with our conclusions and recommendations for future work.

1.1 System of Orthography for the top three major Philippine languages according to population of users

According to a census conducted by the National Statistics Office (NSO) in 2000, at least 10 of the 171 living languages of the Philippines are considered major languages, where a major language is defined as one spoken by at least 1 million people. Based on data from the same census, the top three languages, ranked by number of speakers, are Tagalog (~21.5 million native speakers or 28.15% of the total Filipino population), Cebuano (~20 million native speakers or 26.1% of the total population), and Ilokano (~7.7 million native speakers or 10.1% of the total population). These three languages are also considered regional lingua francas, with Ilokano spoken in regions in the northern part of Luzon