Semi-supervised Cross Domain Sentiment Classification on Tweets Using Optimized Topic-Adaptive Word Expansion Technique

Savitha Mathapati, Ayesha Nafeesa, S H Manjula and Venugopal K R
Department of Computer Science and Engineering, University Visvesvaraya College of Engineering, Bangalore University, Bengaluru, India.
hiremathsavitha@gmail.com

Abstract- The enormous growth of Internet usage and of social interactions and activities on social networking sites has led users to post their opinions on products. An automated system called a sentiment classifier is required to extract sentiments and opinions from social media data. A classifier trained using the labeled tweets of one domain may not efficiently classify tweets from another domain. This is a basic problem with tweets because Twitter data is very diverse; therefore, cross domain sentiment classification is required. In this paper, we propose a semi-supervised cross domain sentiment classifier for tweets that uses the Optimized Topic-Adaptive Word Expansion (OTAWE) technique. Initially, the classifier is trained on common sentiment words and mixed labeled tweets from various topics. Then, the OTAWE algorithm selects the more reliable unlabeled tweets from a particular domain and updates the domain-adaptive words in every iteration. OTAWE outperforms existing domain-adaptive algorithms because it saves the feature weights after every iteration. This ensures that moderate sentiment words are not missed and avoids the inclusion of weak sentiment words.

Index Terms- Cross Domain Sentiment Classification, Opinion Mining and Sentiment Analysis, SVM Classifier, Topic Adaptive Features, Tweets.

I. INTRODUCTION

Social media such as Facebook, Twitter and microblogs are platforms where people build social relations and share their opinions on various topics. People post their views on products to guide others in deciding whether or not to buy them [1]. These reviews help businesses in many ways.
The tweets or reviews posted by users are so voluminous that it is difficult to analyze the complete information. To overcome this problem, we use Sentiment Analysis, a powerful method of gaining an overview of public opinion on a particular topic. Different topics can be handled by the same classifier under the assumption that the topics have certain words in common that can effectively be used to compute the overall sentiment. This approach works better for reviews that describe the quality of a product. Tweets contain more diverse data, and it is sometimes difficult to predict which topics are referred to on Twitter. Hence, a common classifier built for different tweets may not work efficiently.

Twitter data contains information on a variety of topics. A classifier trained using the data of one topic may not work efficiently on the data of other topics, because some sentiment words differ from topic to topic. For example, consider the following sentences.

1. “This book is very interesting”
2. “Food is very tasty”

In sentence 1, “interesting” is a topic-adaptive sentiment word of the Book domain, whereas in sentence 2, “tasty” is a topic-adaptive sentiment word of the Kitchen domain. A classifier trained using labeled data or tweets of the Book domain may not give good accuracy when classifying Kitchen domain tweets. Thus, topic adaptation is required before sentiment classification, as the tweets come from mixed topics. This process is called Cross Domain Sentiment Classification.

Several works focus on building a bridge that connects common features and domain-dependent features [2]. These bridges assume that each pair of domains or topics shares similar sentiment words. This assumption does not hold for tweets, which come from multiple topics or domains, and identifying the topic of a tweet is itself a challenge.

Motivation: A classifier trained on sentiment data from one topic often performs poorly on test data from another topic.
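The Book and Kitchen example above can be made concrete with a toy sketch. The code below is only an illustration under stated assumptions: the training sentences, the word-count scoring and the names `BOOK_TRAIN`, `train_word_scores` and `predict` are all invented for this example and are not the SVM classifier or the data used in this paper. It shows that a common sentiment word such as "good" carries over to a new topic, while a topic-adaptive word such as "tasty" does not.

```python
from collections import Counter

# Toy labeled Book-domain sentences (invented for illustration only).
# Label +1 is positive, -1 is negative.
BOOK_TRAIN = [
    ("this book is very interesting", 1),
    ("a good and interesting story", 1),
    ("this book is very boring", -1),
    ("a bad and boring plot", -1),
]


def train_word_scores(labeled_sentences):
    """Score each word as (positive occurrences - negative occurrences)."""
    scores = Counter()
    for text, label in labeled_sentences:
        for word in text.split():
            scores[word] += label
    return scores


def predict(scores, text):
    """Return +1, -1, or 0 (undecided) from the summed word scores."""
    total = sum(scores.get(word, 0) for word in text.split())
    return (total > 0) - (total < 0)


scores = train_word_scores(BOOK_TRAIN)

# In-domain sentence: "interesting" was learned from Book data.
print(predict(scores, "this story is interesting"))
# Kitchen sentence with a common sentiment word: "good" transfers.
print(predict(scores, "food is good"))
# Kitchen sentence with a topic-adaptive word: "tasty" was never seen,
# so the Book-trained scores cannot decide.
print(predict(scores, "food is very tasty"))
```

In this sketch the last call is undecided because "tasty" has no learned score, which is the gap that topic adaptation is meant to close.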
Shenghua Liu et al. [3] proposed a Semi-supervised Topic-Adaptive Sentiment Classifier algorithm. It initially trains a common sentiment classifier for multiple domains and then transforms it, in an iterative process, into a classifier specific to an emerging topic or domain. In every iteration, a predefined number of topic-adaptive words are added to the list of topic-adaptive words and the remaining ones are discarded. The drawback of this method is that some strong topic-adaptive words are missed if many strong sentiment words are found in a particular iteration. Similarly, weak sentiment words may get selected if few strong sentiment words are generated in that iteration. The feature weight of each topic-adaptive word that is not selected in an iteration is not saved and when the same
International Journal of Computer Science and Information Security (IJCSIS), Vol. 15, No. 5, May 2017 370 https://sites.google.com/site/ijcsis/ ISSN 1947-5500