A Boosted SVM based Ensemble Classifier for Sentiment Analysis of Online Reviews

Anuj Sharma
Chandragupt Institute of Management
Hindi Bhavan, Patna – 800001, India
f09anujs@iimidr.ac.in

Shubhamoy Dey
Indian Institute of Management
Prabandh Shikhar, Rau, Indore – 453331, India
shubhamoy@iimidr.ac.in

ABSTRACT
In recent years, several approaches have been proposed for sentiment-based classification of online text. Among the contemporary approaches, supervised machine learning techniques such as Naive Bayes (NB) and Support Vector Machines (SVM) have been reported in the literature to be very effective. However, some studies have reported that the conditional independence assumption of NB makes feature selection a crucial problem. SVM, in turn, suffers from issues such as the selection of kernel functions, skewed vector spaces and heterogeneity in the training examples. In this paper, we propose a hybrid method that integrates "weak" support vector machine classifiers using boosting techniques. The proposed model exploits the classification performance of boosting while using SVM as the base classifier, and is applied to sentiment-based classification of online reviews. Results on movie and hotel review corpora of 2000 reviews show that the proposed approach succeeds in improving the performance of SVM: the resulting ensemble classifier performs better than the single base SVM classifier, and the results confirm that ensemble SVM with boosting significantly outperforms single SVM in terms of accuracy.¹

Categories and Subject Descriptors
I.5.2 [Pattern Recognition]: Design Methodology – classifier design and evaluation, feature evaluation and selection; I.5.1 [Pattern Recognition]: Models – SVM; I.2.7 [Natural Language Processing] – Text analysis

General Terms
Performance, Design, Experimentation, Theory

Keywords
SVM, Sentiment Analysis, Classification, Sentiment Lexicon, Text Mining

¹ Copyright is held by the authors. This work is based on an earlier work: RACS'13 Proceedings of the 2013 ACM Research in Adaptive and Convergent Systems, Copyright 2013 ACM 978-1-4503-2348-2/13/10. http://doi.acm.org/10.1145/2513328.2513311

1. INTRODUCTION
Sentiment analysis and opinion mining of online user-generated text content has already proved to be a promising research domain with the growing popularity of Web 2.0 social media [12, 15]. Consumers and users have enthusiastically raised their voices and expressed their sentiments in the form of textual posts on social media about virtually anything they care about. Web 2.0 based media such as message forums, blogs and review sites have emerged as good sources of expressed opinion and sentiment on a large scale [23].
The large-scale opinionated text available on the Internet and Web 2.0 social media has created ample research opportunities for business and academia. Different research works have associated opinion expressed in online reviews with product sales [42], opinion in online discussion with the prediction of best travel destinations [41], and public sentiment in political debates with the results of general elections [35]; the list goes on. In the case of online reviews, researchers have concluded that web-based opinions are a good proxy for word-of-mouth [5, 23].
With the rapid growth of social media, more and more users post reviews of all types of products and services on online forums. It is becoming common practice for a potential consumer to learn how much others like or dislike a product before arriving at a purchase decision. By processing the reviews, product manufacturers and marketing professionals can keep track of customer opinions of their products, with the aim of improving user satisfaction.
However, as the number of reviews available for any given product grows, it becomes more time consuming for buyers to understand and evaluate the prevailing opinion trend about the product. From the users' point of view, reading these millions of reviews from different Web 2.0 based sources is nearly impossible. Moreover, it is also expensive for companies to track opinion about their products or services in the large volume of online reviews.
The large volume of opinionated data poses severe challenges for data processing and sentiment extraction. Contemporary solutions based on machine learning, dictionary, statistical, and semantic approaches have been proposed for sentiment analysis of online textual data [6, 23, 37]. Existing machine learning approaches have given promising results [16, 30]. Therefore, it is important to enhance these existing techniques for extracting knowledge from voluminous subjective or opinionated texts. Although other approaches, such as dictionary-based and semantic-orientation-based approaches, run faster than machine learning based approaches and do not require pre-annotated text, studies have reported poor results, in terms of accuracy, in real-life applications.
APPLIED COMPUTING REVIEW DEC. 2013, VOL. 13, NO. 4 43
The maintenance of sentiment