Combining Lexicon and Learning based Approaches for Concept-Level Sentiment Analysis

Andrius Mudinas
DCSIS, Birkbeck, University of London, London WC1E 7HX, UK
andrius@dcs.bbk.ac.uk

Dell Zhang
DCSIS, Birkbeck, University of London, London WC1E 7HX, UK
dell.z@ieee.org

Mark Levene
DCSIS, Birkbeck, University of London, London WC1E 7HX, UK
mark@dcs.bbk.ac.uk

ABSTRACT
In this paper, we present the anatomy of pSenti, a concept-level sentiment analysis system that seamlessly integrates lexicon-based and learning-based approaches to opinion mining. Compared with pure lexicon-based systems, it achieves significantly higher accuracy in sentiment polarity classification as well as sentiment strength detection. Compared with pure learning-based systems, it offers more structured and readable results with aspect-oriented explanation and justification, while being less sensitive to the writing style of text. Our extensive experiments on two real-world datasets (CNET software reviews and IMDB movie reviews) confirm the superiority of the proposed hybrid approach over state-of-the-art systems like SentiStrength.

Categories and Subject Descriptors
H.2.8 [Database Management]: Database Applications - data mining; I.2.6 [Artificial Intelligence]: Learning; I.5.2 [Pattern Recognition]: Design Methodology - classifier design and evaluation

General Terms
Algorithms, Experimentation, Performance

Keywords
Opinion Mining, Sentiment Analysis, Natural Language Processing, Supervised Learning.

1. INTRODUCTION
Every day a large number of opinion-related documents are put on the Internet: people post product reviews, express their political views, and share their feelings. The ability to extract sentiments from such sources can provide invaluable information about people's views on various topics.
Many of today's sentiment analysis systems are based on so-called lexicon design, having domain-specific sentiment lexicons as their main sentiment information source [6, 20, 21]. Such an approach is usually implemented in two separate steps: lexicon detection/extension and sentiment strength measurement. On the other hand, sentiment detection can be treated as a simple classification problem, and very high accuracy can be achieved by employing various machine learning algorithms, such as Naïve Bayes or Support Vector Machine (SVM). Yet simple classification provides limited information about sentiment topic or rationale.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. WISDOM'12, August 12, 2012, Beijing, China. Copyright 2012 ACM 978-1-4503-1543-2/12/08 ...$15.00.

In this paper, we present the anatomy of pSenti, a concept-level sentiment analysis system that seamlessly integrates lexicon-based and learning-based approaches to opinion mining. The main idea is to generate the feature vectors for supervised machine learning in the same fashion as is seen in lexicon-based sentiment analysis systems. Compared with pure lexicon-based systems, it achieves significantly higher accuracy in sentiment polarity classification as well as sentiment strength detection. Compared with pure learning-based systems, it offers more structured and readable results with aspect-oriented explanation and justification, while being less sensitive to the writing style of text.
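As a rough illustration of the main idea stated above (building the feature vectors for a supervised learner in the way a lexicon-based system scores text), the following Python sketch uses a tiny toy lexicon and a one-word negation rule. The lexicon entries, the `NEGATORS` set, and the function names `lexicon_features` and `lexicon_score` are assumptions made for illustration only; they are not the actual pSenti implementation.

```python
import re
from collections import Counter

# Hypothetical toy sentiment lexicon: word -> prior polarity score.
LEXICON = {"good": 1.0, "great": 2.0, "bad": -1.0, "terrible": -2.0}
# Hypothetical negators that flip the polarity of the next sentiment word.
NEGATORS = {"not", "never", "no"}


def lexicon_features(text):
    """Map a text to a sparse feature vector keyed by lexicon entries.

    Each sentiment word found in the lexicon becomes a feature whose value
    is its accumulated (possibly negated) lexicon score. Words outside the
    lexicon are ignored, as a lexicon-based system would ignore them.
    """
    tokens = re.findall(r"[a-z']+", text.lower())
    features = Counter()
    negate = False
    for token in tokens:
        if token in NEGATORS:
            negate = True
            continue
        if token in LEXICON:
            score = LEXICON[token]
            if negate:
                # Negated words get their own feature with flipped sign.
                features["NOT_" + token] += -score
            else:
                features[token] += score
        # A negator only affects the immediately following word.
        negate = False
    return dict(features)


def lexicon_score(features):
    """Pure lexicon-based baseline: the sum of the feature values."""
    return sum(features.values())
```

Under these assumptions, a pure lexicon-based system would stop at `lexicon_score`, whereas a hybrid system could pass the vectors from `lexicon_features` to a supervised classifier (such as the SVM mentioned above) so that training data adjusts the weight of each lexicon entry.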
The ability to perform cross-style sentiment analysis is very meaningful, as it implies that we can train the system using formal professional reviews as training examples and then apply it to sentiment analysis of informal customer reviews from data sources such as blogs or Twitter. Our extensive experiments on two real-world datasets (CNET software reviews and IMDB movie reviews) have confirmed the superiority of the proposed hybrid approach over state-of-the-art systems like SentiStrength [20, 21].

The rest of this paper is organised as follows. In Section 2, we review the related work. In Section 3, we present our pSenti system based on the hybrid approach in detail. In Section 4, we show the experimental results on two real-world datasets. In Section 5, we draw conclusions.

2. RELATED WORK
In recent years, opinion mining, also known as sentiment analysis, has attracted a lot of interest and has been studied by many researchers. In their early work, Hatzivassiloglou and McKeown [7] reported that it is possible to identify sentiment words (adjectives) and their polarity in sentences with a high accuracy of 82%. Following this finding, various sentiment analysis algorithms have been proposed. For example, Turney [22] introduced one of the first algorithms for document-level sentiment analysis, which achieved an average accuracy of 74% for product reviews; on movie reviews, however, the performance was much worse, at only 66%. In his design, rather than focusing on isolated adjectives, Turney proposed to de-