Bisaya Sentiment Analysis: A Supervised Machine Learning Approach

Eric P. Ortega
College Faculty
University of Cebu - Banilad
Cebu City
epo.ortega@gmail.com

ABSTRACT
The significant volume and variety of unstructured textual data from online reviews, opinions, and comments is a potential asset for developing intelligent applications that span verticals and languages. This paper presents the Bisaya language and a supervised machine learning approach to sentiment analysis that classifies the polarity of a Bisaya sentence. The application is useful across verticals: in government, for routing queries to a specific bureau; in faculty evaluation, for classifying students’ comments as positive or negative; and in marketing and sales, for positioning products and supporting consumers in their purchasing decisions. The study also develops a Bisaya corpus as a gold-standard dataset that serves as the basis for developing and testing the sentiment analysis system. From these large collections of written resources, Natural Language Processing (NLP) rules are crafted for a selected group of annotators, then further evaluated and examined by respected linguists. A bag-of-words approach models each document by counting the occurrences of each unique keyword mapped to a certain polarity. To provide a better understanding of the data at large, a feature extraction scoring technique was formulated from the term counts. The scoring methods are the central tool for unlocking the polarity value of each feature. Polar weights are lexicon estimates of the attitude of each unigram in the bag-of-words. The study devised a Multinomial Naïve Bayes (MNB) classifier model trained on the Bisaya corpus. It also examines how MNB works in machine learning, given its suitability for the discrete count data of the unigram features.
The performance of the model was evaluated with appropriate test metrics to observe the usefulness and efficacy of the feature extraction methods. A stratified ten-fold cross-validation checks whether the system’s performance improves. Values produced from the initial experiments are baseline data leading to better learning performance and further interpretability of the model, possibly through the feature extraction and selection and the annotation guidelines, to determine a set of effective features appropriate for the Bisaya sentiment analysis system.

Keywords
Bisaya Sentiment Analysis; Bisaya language; Bisaya corpora; machine learning; Text Mining; Multinomial Naïve Bayes; Natural Language Processing.

1. INTRODUCTION
Web applications have enabled social media to revolutionize the exchange of ideas through weblogs, events, forums, and news made available by citizens through “participatory sensing.” Over time, individuals have taken a proactive role in publishing comments and feedback, reactions, and complaints [5] on various pervasive network sites (social, e-commerce, and review sites), producing a tremendous amount of user-generated data. This vast user-generated content is considered a potential asset for developing intelligent applications critical to fields such as commerce, politics, health, education, and government, for various stakeholders (social analysts, psychologists, researchers) and purposes. However, the essential content of the shared information is poorly organized and stored in a form that makes it almost impossible for an individual to digest all the reviews. Mining and analyzing the bulk of user-generated content requires automated techniques to discover interesting information and knowledge from unstructured documents, which poses severe challenges for the classification of sentiments. Proper classification of this information entails text mining, machine learning, and natural language processing.
Text mining, or text analytics, is the activity of discovering new information from a large collection of written resources by establishing structure in unstructured text for further analysis. It extracts a model describing important data classes. Text mining is facilitated by NLP techniques for applications such as sentiment analysis, language identification, document filtering, and more. Classification in sentiment analysis is a form of data analysis used to devise a machine learning model, a classifier, that predicts categorical class labels.

Sentiment analysis is a dynamic information management task that automatically identifies the emotional tone of expressed opinions to determine the attitudes and emotions of the speaker. It has been performed at many levels of granularity, depending on whether the target of the study is the word [6], the sentence [5], or the entire document [4], and in different languages. Sentence-level sentiment analysis has been approached either by exploiting the structure of the document or by treating the text as a bag of words.

Learning the structure of the problem primarily requires engineering effective features. Features are attributes selected from the dataset that carry special meaning and are thought to capture the patterns in the data for the predictive model at hand. Feature engineering is an important task for getting the most out of the data so that the predictive model achieves improved accuracy on new instances. Supervised machine learning techniques have used several features, such as n-gram presence or frequency [6], POS tags [10], and syntactic features [8], or a combination of these, to represent the data. This daunting task poses a serious problem in constructive induction, as many possible features can be generated to enrich the representation of the sentence.
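
To make the bag-of-words idea concrete, the following is a minimal sketch of unigram frequency features feeding a Multinomial Naïve Bayes classifier, written in plain Python. It is an illustration under stated assumptions, not the system described in this paper: the four training sentences are toy examples invented here and are not drawn from the Bisaya corpus, the function names (tokenize, train_mnb, predict) are hypothetical, whitespace tokenization and add-one smoothing are assumed, and the study's polar-weight scoring is not reproduced.

```python
import math
from collections import Counter, defaultdict


def tokenize(sentence):
    """Lowercase and split on whitespace; each token is one unigram feature."""
    return sentence.lower().split()


def train_mnb(docs, labels, alpha=1.0):
    """Fit Multinomial Naive Bayes on unigram counts (add-alpha smoothing assumed)."""
    class_docs = Counter(labels)          # number of documents per class
    term_counts = defaultdict(Counter)    # unigram frequencies per class
    for doc, label in zip(docs, labels):
        term_counts[label].update(tokenize(doc))
    vocab = {t for counts in term_counts.values() for t in counts}

    model = {"vocab": vocab, "log_prior": {}, "log_lik": {}}
    for c in class_docs:
        # Smoothed denominator: total tokens in class c plus alpha per vocabulary term.
        total = sum(term_counts[c].values()) + alpha * len(vocab)
        model["log_prior"][c] = math.log(class_docs[c] / len(docs))
        model["log_lik"][c] = {
            t: math.log((term_counts[c][t] + alpha) / total) for t in vocab
        }
    return model


def predict(model, sentence):
    """Return the class with the highest log posterior; unseen tokens are skipped."""
    tokens = [t for t in tokenize(sentence) if t in model["vocab"]]
    scores = {
        c: model["log_prior"][c] + sum(model["log_lik"][c][t] for t in tokens)
        for c in model["log_prior"]
    }
    return max(scores, key=scores.get)


# Toy, hand-written sentences for illustration only (not from the paper's corpus).
TRAIN = [
    ("maayo kaayo ang serbisyo", "positive"),
    ("nindot ang klase", "positive"),
    ("bati kaayo ang serbisyo", "negative"),
    ("ngil-ad ang klase", "negative"),
]

model = train_mnb([s for s, _ in TRAIN], [y for _, y in TRAIN])
print(predict(model, "maayo ang klase"))
print(predict(model, "bati ang klase"))
```

In this sketch the words shared by both classes ("ang", "klase") contribute equally to each score, so the prediction is driven by the polarity-bearing unigrams. A realistic evaluation would replace the toy data with an annotated corpus and estimate performance with stratified cross-validation rather than two hand-picked sentences.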