Semi-supervised Cross Domain Sentiment Classification on Tweets Using Optimized Topic-Adaptive Word Expansion Technique

Savitha Mathapati, Ayesha Nafeesa, S H Manjula and Venugopal K R
Department of Computer Science and Engineering, University Visvesvaraya College of Engineering, Bangalore University, Bengaluru, India.
hiremathsavitha@gmail.com

Abstract- The enormous growth of Internet usage and of social interactions and activities on social networking sites has led users to post their opinions on products. An automated system called a sentiment classifier is required to extract sentiments and opinions from social media data. A classifier trained using the labeled tweets of one domain may not efficiently classify tweets from another domain. This is a basic problem with tweets, as Twitter data is very diverse; therefore, cross domain sentiment classification is required. In this paper, we propose a semi-supervised cross domain sentiment classifier with the Optimized Topic-Adaptive Word Expansion (OTAWE) technique on tweets. Initially, the classifier is trained on common sentiment words and mixed labeled tweets from various topics. Then, the OTAWE algorithm selects the more reliable unlabeled tweets from a particular domain and updates the domain-adaptive words in every iteration. OTAWE outperforms existing domain adaptive algorithms as it saves the feature weights after every iteration. This ensures that moderate sentiment words are not missed out and avoids the inclusion of weak sentiment words.

Index Terms- Cross Domain Sentiment Classification, Opinion Mining and Sentiment Analysis, SVM Classifier, Topic Adaptive Features, Tweets.

I. INTRODUCTION

Social media such as Facebook, Twitter and microblogs are platforms where people build social relations and share their opinions on various topics. People post their views on products to guide others in deciding whether or not to buy them [1]. These reviews help businesses in many ways.
The tweets or reviews posted by users are so voluminous that it is difficult to analyze the complete information. To overcome this problem, we use Sentiment Analysis, a powerful method of gaining an overview of public opinion on a particular topic. Different topics can be handled by the same classifier under the assumption that the topics have certain words in common that can effectively be used to compute the overall sentiment. This methodology is more efficient in the case of reviews that describe the quality of a product. Tweets have more diverse data, and it is sometimes difficult to predict the topics referred to on Twitter. Hence, a common classifier built for different tweets may not work efficiently.

Twitter data contains information on a variety of topics. A classifier trained using the data of one topic may not work efficiently on the data of other topics. This is because some sentiment words differ from topic to topic. For example, consider the following sentences.
1. “This book is very interesting”
2. “Food is very tasty”
In sentence 1, “interesting” is a topic adaptive sentiment word of the Book domain, whereas in sentence 2, “tasty” is a topic adaptive sentiment word of the Kitchen domain. A classifier trained using labeled data or tweets of the Book domain may not give good accuracy while classifying Kitchen domain tweets. Thus, topic adaptation is required before sentiment classification, as the tweets are from mixed topics. This process is called Cross Domain Sentiment Classification. Several works focus on building a bridge that connects common features and domain dependent features [2]. These bridges assume that similar sentiment words are present between a pair of domains or topics. This cannot work on tweets, as they come from multiple topics or domains and finding the topics of the tweets is a challenge.
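The vocabulary mismatch between the Book and Kitchen examples above can be illustrated with a toy sketch. It assumes a simple word-count scorer rather than the SVM classifier used in this paper; the training sentences and the names `BOOK_TRAIN`, `train_word_scores` and `classify` are hypothetical and chosen only for illustration.

```python
from collections import Counter

# Toy labeled Book-domain data (hypothetical, for illustration only).
# Label 1 = positive sentiment, -1 = negative sentiment.
BOOK_TRAIN = [
    ("this book is very interesting", 1),
    ("an interesting and good story", 1),
    ("this book is very boring", -1),
    ("a boring and bad plot", -1),
]


def train_word_scores(labeled_texts):
    """Score each word as (positive occurrences - negative occurrences)."""
    scores = Counter()
    for text, label in labeled_texts:
        for word in text.split():
            scores[word] += label
    return scores


def classify(text, scores):
    """Return 1, -1, or 0 (undecided) from the sum of known word scores."""
    total = sum(scores.get(word, 0) for word in text.split())
    return (total > 0) - (total < 0)


scores = train_word_scores(BOOK_TRAIN)

# In-domain sentence: "interesting" was seen during training.
print(classify("this book is very interesting", scores))

# Kitchen-domain sentence: "tasty" never occurred in Book-domain data,
# so the scorer has no evidence and stays undecided.
print(classify("food is very tasty", scores))

# A common sentiment word such as "good" still transfers across domains.
print(classify("food is very good", scores))
```

In this sketch the Book-trained scorer handles the common word "good" but has no weight for "tasty", which is the gap that topic adaptation is meant to close.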
International Journal of Computer Science and Information Security (IJCSIS), Vol. 15, No. 5, May 2017 370 https://sites.google.com/site/ijcsis/ ISSN 1947-5500

Motivation: A classifier trained on sentiment data from one topic often performs poorly on test data from another topic. Shenghua Liu et al. [3] proposed the Semi-supervised Topic Adaptive Sentiment Classifier Algorithm. It initially trains a common sentiment classifier for multiple domains and then transforms it into a specific one for an emerging topic or domain in an iterative process. In every iteration, a predefined number of topic adaptive words are added to the list of topic adaptive words and the remaining ones are discarded. The drawback of this method is that a few strong topic adaptive words are missed if many strong sentiment words are found in that particular iteration. Similarly, weak sentiment words may get selected if not many strong sentiment words are generated in that iteration. The feature weight of each topic adaptive word that is not selected in an iteration is not saved and when the same