International Journal of Computer Applications (0975-8887), Volume 181, No. 37, January 2019

Cross Domain Sentiment Classification Techniques: A Review

Parvati Kadli, Associate Professor, PDIT, Hosapete, Karnataka, India
Vidyavathi B. M., Professor, BITM, Ballari, Karnataka, India

ABSTRACT
With the explosive growth in the availability of online resources, sentiment analysis has become an active topic for researchers working in natural language processing and text mining. A social media corpus can span many different domains, and it is difficult to obtain annotated data for every domain in order to train a learning model. Continuous efforts have therefore been made to tackle this issue, and many techniques have been designed to improve cross domain sentiment analysis. In this paper we present a literature review of the methods and techniques employed for cross domain sentiment analysis. The aim of the review is to give an overview of the techniques, approaches and datasets used to solve the cross domain sentiment classification problem in research carried out in recent years.

General Terms
Sentiment Analysis, Classifier, Dataset, Features.

Keywords
Cross Domain Sentiment Classification (CDSC), Source Domain, Target Domain.

1. INTRODUCTION
Users express their opinions about the products and services they consume on social media such as review sites, blogs, shopping sites and Twitter. Sentiment analysis is the computational study of people's attitudes, appraisals and opinions about individuals, issues, entities, topics, events and products [1]-[5]. It draws on concepts from natural language processing, machine learning and computational linguistics, and aims at classifying sentiment data into polarity categories. Users do not state sentiment polarity explicitly, so it has to be predicted from the text they generate. One of the main requirements for accurate performance is annotated data in various domains.
Annotating data for a large number of domains is very costly, and treating each domain separately prevents us from exploiting the information shared across domains. It is also not feasible to develop a separate classification model for every domain. Research has been taken up to solve this issue. One feasible solution is to develop a single sentiment classification system using labeled and unlabeled data from different domains and to apply it to any target domain. This is Cross Domain Sentiment Analysis. This study presents recent work on such cross-domain sentiment classification. The paper is organized as follows: Section 2 explains the challenges in CDSC, and Section 3 briefly covers early research and baseline methods. Section 4 explains the key techniques for CDSC. The last sections present a general discussion and the conclusion.

2. CHALLENGES IN CDSC
The most critical challenge is that sentiment analysis is highly domain dependent, i.e. a technique that performs well on one domain might perform poorly on another. Machine learning techniques perform well when labeled documents from the domain of interest are available, and are therefore highly domain sensitive. A mismatch between review ratings and review text also affects performance [19]. Results are inconsistent because the target domain has little or no labeled data compared with the richly labeled source domain on which the classifier is trained. Some of the main challenges are as follows:
Sparsity: The target corpus contains words or phrases that do not appear, or only rarely appear, in the source domain.
Polysemy: The meaning of the same word appearing in the source and target domains changes with the context of the respective domain.
Feature Divergence: The classifier is trained on source-specific features, which may not match the domain-specific features of the domain to which the classifier is applied.
Feature divergence refers to the mismatch between source-domain-specific features and target-domain-specific features [6]-[7].
Polarity Divergence: The same word may have different polarity in different domains. For example, "cheap" may be positive in one domain and negative in another.

3. EARLY RESEARCH AND BASELINE METHODS
In the early days, classifiers were trained and tested on the same domain. This is single domain classification. The first results of polarity classification using machine learning techniques were reported by Pang et al. [8], using movie reviews extracted from IMDB. The first results on CDSC were given by Blitzer et al. [20], who used reviews from the Books, Electronics, DVDs and Kitchen domains. In other approaches, groups of classifiers were trained on source domains [9]. For example, in TPLSA (Topic-Bridged Probabilistic Latent Semantic Analysis), developed by [21], a joint probabilistic model is used to bridge the training and test domains. The principal topics are identified by simultaneously decomposing the contingency tables built from term occurrences in the documents of both the training and test domains. Later, collaborative dual PLSA was developed by [22], which exploits both the commonality and the distinctions among multiple domains. Document class and word concept are the two latent concepts of this model. Baseline methods such as SCL, SFA and SCL-MI are used to evaluate newly developed approaches.
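
As a rough illustration of the kind of step an SCL-MI style baseline relies on, the sketch below ranks candidate pivot features, i.e. words that occur often enough in both domains and have high mutual information with the source-domain sentiment label. It is a minimal sketch under stated assumptions, not the implementation used in the surveyed papers: the function names, the toy reviews and the `k` and `min_count` thresholds are hypothetical and chosen only for illustration.

```python
import math
from collections import Counter


def mutual_information(docs, labels, word):
    """Mutual information (in nats) between presence of `word` and the label."""
    n = len(docs)
    # Joint counts over (word present?, label) pairs.
    joint = Counter((word in doc, y) for doc, y in zip(docs, labels))
    mi = 0.0
    for (present, y), count in joint.items():
        p_xy = count / n
        p_x = sum(c for (p, _), c in joint.items() if p == present) / n
        p_y = sum(c for (_, l), c in joint.items() if l == y) / n
        mi += p_xy * math.log(p_xy / (p_x * p_y))
    return mi


def select_pivots(source_docs, source_labels, target_docs, k=2, min_count=2):
    """Return up to k words frequent in both domains, ranked by source-label MI."""
    src_df = Counter(w for doc in source_docs for w in set(doc))
    tgt_df = Counter(w for doc in target_docs for w in set(doc))
    # Domain-specific words (seen in only one domain) are filtered out here.
    shared = [w for w in src_df
              if src_df[w] >= min_count and tgt_df[w] >= min_count]
    source_sets = [set(doc) for doc in source_docs]
    # Highest MI first; ties are broken alphabetically for determinism.
    ranked = sorted(
        shared,
        key=lambda w: (-mutual_information(source_sets, source_labels, w), w),
    )
    return ranked[:k]


# Hypothetical toy data: labeled book reviews (source), unlabeled kitchen
# reviews (target). Label 1 = positive, 0 = negative.
SOURCE_DOCS = [
    ["excellent", "story", "bought"],
    ["excellent", "plot"],
    ["poor", "story", "bought"],
    ["poor", "plot"],
]
SOURCE_LABELS = [1, 1, 0, 0]
TARGET_DOCS = [
    ["excellent", "blender", "bought"],
    ["poor", "blender", "bought"],
    ["excellent", "knife"],
    ["poor", "toaster"],
]

pivots = select_pivots(SOURCE_DOCS, SOURCE_LABELS, TARGET_DOCS)
print(pivots)
```

In this toy setting, words such as "story" or "blender" are dropped because they occur in only one domain (the sparsity and feature divergence challenges of Section 2), while a shared but sentiment-neutral word such as "bought" is ranked below the shared sentiment words.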