Scalable and Real-time Sentiment Analysis of Twitter Data

Maria Karanasou, Anneta Ampla, Christos Doulkeridis and Maria Halkidi
Department of Digital Systems, School of Information and Communication Technologies
University of Piraeus, Piraeus, Greece
Email: karanasou@gmail.com, anneta.ampla@hotmail.com, {cdoulk,mhalk}@unipi.gr

Abstract—In this paper, we present a system for scalable and real-time sentiment analysis of Twitter data. The proposed system relies on feature extraction from tweets, using both morphological features and semantic information. For the sentiment analysis task, we adopt a supervised learning approach, where we train various classifiers based on the extracted features. Finally, we present the design and implementation of a real-time system architecture in Storm, which contains the feature extraction and classification tasks, and scales well with respect to input data size and data arrival rate. By means of an experimental evaluation, we demonstrate the merits of the proposed system in terms of classification accuracy as well as scalability and performance.

I. INTRODUCTION

Online social networking platforms (such as Twitter, Tumblr, and Weibo), where users can send short messages and express their opinions and sentiments on specific topics, have grown rapidly over the last few years. The amount of posted information keeps increasing. Thus, the need for data analysis techniques that process posts online and help extract interesting patterns of knowledge from them is stronger than ever.

Sentiment analysis and opinion mining have recently attracted the attention of the research community, due to numerous applications related to the automated processing and analysis of text corpora. Traditional sentiment analysis approaches have been designed for static and well-controlled scenarios [11].
In a microblogging environment, real-time interaction is a key feature, and thus the ability to automatically analyze information and predict user sentiments as discussions develop is a challenging issue. The challenges that data analysis has to tackle in the case of microblogging data are the use of informal, abbreviated, and evolving language, as well as the lack of information due to the short length of the messages exchanged.

In this paper, we address the above challenges by designing a scalable real-time sentiment analysis system. We propose a methodology for extracting useful features from posts in order to represent them in the sentiment analysis process. Moreover, we develop a scalable system that processes tweets in real time and uses supervised learning techniques to predict their sentiments. Our sentiment analysis models adapt to the evolution of microblogging data by exploiting the feedback that experts provide.

Summarizing, the main contributions of this paper are as follows:
• We develop a framework for sentiment analysis of Twitter data based on supervised learning techniques. The main components of this framework are: (i) a preprocessing module that assists with refining the data collection and selecting the features that properly represent the Twitter data, and (ii) a supervised learning module that aims to identify the sentiment polarity of Twitter data and classify it accordingly.
• We study the use of ensemble learning methods in the context of sentiment analysis, and we present a feedback mechanism in the sentiment analysis process that is adaptable to dynamic content.
• We design a real-time system architecture based on Storm to deal with the evolution and volume of Twitter data.
• We evaluate our approach using various datasets. The collection of tweets is selected so that it contains a variety of words, expressions, and emotional signals, as well as indicative examples of sarcastic, ironic, and metaphoric language.
We also conduct experiments that combine multiple features (including prior polarity, text similarity, and pattern detection).

The rest of this paper is organized as follows: Section II provides an overview of related work. Section III gives an overview of our approach, including feature extraction and classification. In Section IV, we present the system implementation for real-time sentiment analysis using Storm. In Section V, we present the experimental study, and in Section VI we conclude the paper.

II. RELATED WORK

In this section, we briefly discuss approaches related to sentiment analysis in microblogging data. For a brief survey we refer to [9], while we point to [5], [6] for a recent overview of the topic of Big Social Data Analysis.

Scalable Sentiment Analysis. Scalable systems for sentiment analysis can be categorized into real-time systems [11], [24] and systems for batch processing [15]. In [24], a system is presented for real-time analysis of the sentiment expressed in Twitter streaming data toward presidential candidates (US 2012). Results are delivered continuously and instantly, and feedback based on human annotation is proposed; however, the online feedback loop and the update of the trained model are left as future work. Real-time sentiment analysis is also targeted in [11] by means of transfer learning, where several challenges are identified,