Unleashing the Power of Hashtags in Tweet Analytics with Distributed Framework on Apache Storm
Vibhuti Gupta and Rattikorn Hewett
Department of Computer Science
Texas Tech University, Lubbock, TX 79415
Emails: vibhuti.gupta@ttu.edu, rattikorn.hewett@ttu.edu
Abstract—Twitter is a popular social network platform where users can interact and post texts of up to 280 characters called tweets. Hashtags, hyperlinked words in tweets, have increasingly become crucial for tweet retrieval and search. Using hashtags for tweet topic classification is a challenging problem because of context dependence among words, slang, abbreviations and emoticons in a short tweet, along with the evolving use of hashtags. Since Twitter generates millions of tweets daily, tweet analytics is a fundamental Big data stream problem that often requires real-time distributed processing. This paper proposes a distributed online approach to tweet topic classification with hashtags. Implemented on Apache Storm, a distributed real-time framework, our approach incrementally identifies and updates a set of strong predictors in the Naïve Bayes model for classifying each incoming tweet instance. Preliminary experiments show promising results, with up to 97% accuracy and a 37% increase in throughput on eight processors.
Keywords—Twitter; Hashtags; Social Media; Big Data Stream; Ontology; Apache Storm
I. INTRODUCTION
The proliferation of social media networks in the last few years has produced enormous volumes of data and has become a common source of Big data. Twitter is one of the most popular social media platforms, where users post short text messages of up to 280 characters, known as tweets, for communication. On average, 6000 tweets are generated per second and 500 million tweets per day. Since Twitter generates a huge, unstoppable, fast-growing and unstructured Big data stream of tweets daily, tweet analytics is a fundamental Big data stream problem that often requires real-time distributed processing.
Hashtags, user-defined hyperlinked words in tweets that typically denote topics, facilitate efficient information sharing [14]. Hashtags begin with a hash symbol and represent various subjects; for example, #election, #happy, #partying, #nba and #Oscars2016 convey a topic, emotion, action, official organization and event, respectively. They are crucial for trend/event detection, search/retrieval and advertisement. Hashtags have been adopted and have quickly become common in many blogging sites and social media platforms, including Facebook, Instagram, Flickr, Tumblr and Pinterest. Recent research in tweet analytics has studied how hashtags can be effectively applied [2, 3, 6, 11-14, 17]. While many hashtag applications are successful, tweet classification remains challenging, largely due to the nature of tweets and of hashtags, whose trends can quickly evolve. Tweets have a limited number of words, making it hard to derive context from dependent words. We also have to cope with ambiguity, slang, abbreviations and emoticons in tweets. To make things worse, there is no standard for how hashtags are created or expressed. The same subject can have different hashtags defined by different users (e.g., #omg, #ohmygod). The majority of tweet classification work [3, 12, 13, 14] deals with sentiment analysis, where sentiment classes can be described by the semantics of keywords or hashtags, whereas a topic requires a diverse set of hashtags to cover its various aspects. Our recent work [7] introduced a hybrid hashtag approach to cope with the challenges of tweet topic classification using hashtags. Hybrid Hashtags consist of two types of hashtags: 1) those that are extracted from input tweet data and 2) those derived from a knowledge base of topic (or class) concepts (or topic ontology) by using hashtagify [18], a tool to generate "similar" hashtags from a given term (see more details in [7]). We evaluated the effectiveness of this semi-automated approach using a batch analysis with the Naïve Bayes algorithm.
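As an illustration of the two sources of Hybrid Hashtags described above, the following is a minimal Python sketch, not the implementation from [7]. It assumes that the two sets are simply combined by set union, which the text does not state. The ontology `TOPIC_CONCEPTS`, the lookup table `SIMILAR_HASHTAGS` (a hard-coded stand-in for a tool such as hashtagify [18], which is not called here) and all function names are hypothetical.

```python
import re

# Hypothetical topic ontology: topic -> concept terms (illustrative values only).
TOPIC_CONCEPTS = {
    "politics": ["election", "vote"],
    "sports": ["nba", "basketball"],
}

# Hypothetical stand-in for a "similar hashtag" lookup such as hashtagify [18].
# A hard-coded table is used here; the real tool is not invoked.
SIMILAR_HASHTAGS = {
    "election": ["#election", "#vote"],
    "vote": ["#vote"],
    "nba": ["#nba", "#basketball"],
    "basketball": ["#basketball", "#hoops"],
}

# A hash symbol followed by one or more word characters.
HASHTAG_RE = re.compile(r"#\w+")


def extract_hashtags(tweet):
    """Type 1: hashtags found in the tweet text, lower-cased."""
    return {tag.lower() for tag in HASHTAG_RE.findall(tweet)}


def derived_hashtags(topic):
    """Type 2: hashtags derived from the topic's ontology concepts."""
    tags = set()
    for concept in TOPIC_CONCEPTS.get(topic, []):
        tags.update(SIMILAR_HASHTAGS.get(concept, []))
    return tags


def hybrid_hashtags(tweets, topic):
    """Union of extracted and ontology-derived hashtags (an assumption)."""
    extracted = set()
    for tweet in tweets:
        extracted |= extract_hashtags(tweet)
    return extracted | derived_hashtags(topic)
```

For example, `hybrid_hashtags(["Great game #nba"], "sports")` combines the hashtag found in the tweet with those looked up for the concepts of the hypothetical "sports" topic.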
Applying this approach to real Big data streams of tweets requires an online and distributed approach to deal with the fast and dynamic arrival rates of tweets. Thus, real-time processing with minimum latency is desirable. This paper differs from our previous work [7] in that it presents a fully automated, online and distributed system for tweet topic classification using Hybrid Hashtags, as opposed to finding the most effective way to use hashtags for tweet classification in a non-distributed environment. Our contribution in this paper is twofold. First, we propose an online approach (for both data pre-processing and analytics) to analyzing each tweet to identify appropriate hybrid hashtags and incrementally updating an accumulated set of hybrid