Exploring a Scalable Solution to Identifying Events in Noisy Twitter Streams

Shamanth Kumar†, Huan Liu†, Sameep Mehta⋆, and L. Venkata Subramaniam⋆
†Computer Science & Engineering, CIDSE, Arizona State University, Tempe, AZ
⋆IBM India Research Lab, New Delhi, India
Email: {shamanth.kumar, huan.liu}@asu.edu, {sameepmehta, lvsubram}@in.ibm.com

Abstract—The unprecedented use of social media through smartphones and other web-enabled mobile devices has enabled the rapid adoption of platforms like Twitter. Event detection has found many applications on the web, including breaking news identification and summarization. The recent increase in the usage of Twitter during crises has attracted researchers to focus on detecting events in tweets. However, current solutions have focused on static Twitter data. The necessity to detect events in a streaming environment during fast-paced events such as a crisis presents new opportunities and challenges. In this paper, we investigate event detection in the context of real-time Twitter streams as observed in real-world crises. We highlight the key challenges in this problem: the informal nature of text, and the high-volume and high-velocity characteristics of Twitter streams. We present a novel approach to address these challenges using single-pass clustering and the compression distance to efficiently detect events in Twitter streams. Through experiments on large Twitter datasets, we demonstrate that the proposed framework is able to detect events in near real-time and can scale to large and noisy Twitter streams.

I. INTRODUCTION

Social networking sites like Twitter have proven to be popular outlets for information dissemination during crises. It has been observed that information related to a crisis is released on social media sites before traditional news sites [1]. This motivates us to study the problem of event detection, which is an interesting and important problem in this domain.
Event detection approaches designed for documents cannot be directly applied to tweets, because tweets differ from documents in their characteristics. Unlike a traditional document stream, a Twitter stream suffers from informality of language and differs in both volume and velocity. Existing approaches to event detection in tweets focus on the offline setting, where the corpus is static and multiple passes over the data can be employed. However, event detection in a streaming environment presents unique challenges, which prevent the direct application of existing approaches. Detecting events in streaming Twitter data poses the following new challenges:

• Informal use of language: Twitter users produce and consume information in an informal manner [2]. Misspellings, abbreviations, contractions, and slang are rampant in tweets, a tendency encouraged by the length restriction (a tweet can have no more than 140 characters).
• Noise: While traditional event detection approaches assume that all documents are relevant, Twitter data typically contains a vast amount of noise, and not every tweet is related to an event [3].
• Dynamicity: Twitter streams are highly dynamic, with high volume and high velocity. Approximately 400 million tweets are now posted on Twitter every day [4]. Event detection methods need to be scalable to handle this high volume of tweets.

Social media such as Twitter empower their users to publish information as soon as a real-world event occurs. However, this information is not curated, unlike traditional documents such as news articles. Whereas each news article is part of an event, not every tweet is expected to be part of an event, as there is a significant amount of noise and interpersonal communication.
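To make the informal-language challenge concrete, the following is a minimal sketch of a normalized compression distance, the family of measures the abstract refers to as "the compression distance". It rests on several assumptions that are not taken from this section: zlib as the compressor, the common min/max normalization, and the names `compressed_size` and `ncd`, which are purely illustrative. The example tweets are invented for illustration.

```python
import zlib


def compressed_size(text: str) -> int:
    """Size in bytes of the zlib-compressed UTF-8 encoding of `text`."""
    return len(zlib.compress(text.encode("utf-8"), 9))


def ncd(x: str, y: str) -> float:
    """Normalized compression distance (assumed min/max normalization).

    Smaller values mean the two strings share more compressible structure.
    """
    cx, cy = compressed_size(x), compressed_size(y)
    cxy = compressed_size(x + y)
    return (cxy - min(cx, cy)) / max(cx, cy)


# Invented example tweets: a clean one, a noisy rewrite, and an unrelated one.
clean = "earthquake hits the city center, buildings damaged and people evacuated"
noisy = "earthqake hits teh city centre, bldgs damaged n ppl evacuated"
unrelated = "my new phone arrived today and the camera is absolutely amazing lol"

print("clean vs noisy:    ", ncd(clean, noisy))
print("clean vs unrelated:", ncd(clean, unrelated))
```

Because the measure works on shared character sequences rather than exact token matches, a misspelled or abbreviated tweet can still end up closer to its clean counterpart than to an unrelated tweet. On strings this short, compressor overhead makes the absolute values noisy, so only the relative ordering should be read from the output.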
In this paper, we address the above challenges through a novel approach which can:

• Effectively handle the informality of language in a Twitter stream through an appropriate distance measure;
• Capture evolving events without the need for labeled events; and
• Scale to high-volume streaming Twitter data.

II. RELATED WORK

Event detection in traditional media is known as Topic Detection and Tracking (TDT). In [5], news articles were modeled as documents to detect topics. The documents were transformed into a vector space using TF-IDF, and the authors evaluated Group-Average Agglomerative Clustering (GAAC) for retrospective event detection and Incremental Clustering for new event detection. Allan et al. [6] focused on online event detection. The authors constructed a query from the k most frequent words in a story. If a new document did not trigger any existing query, it was considered to be part of a new event.

In [7], the authors addressed the problem of detecting hot bursty events. They introduced a parameter-free clustering approach called feature-pivot clustering, which attempted to detect and cluster bursty features into hot stories. Similarly, [8] interpreted events as hashtag clusters and proposed a hierarchical spatio-temporal clustering of tweets into events. An attempt to detect earthquakes using Twitter users as social sensors was made in [9]. The temporal aspect of an event was modeled as an exponential distribution, and the probability of the event was determined based on the likelihood of each sensor being incorrect. In [10], the authors constructed word signals using the Wavelet Transformation and applied a modularity-based graph partitioning approach on the correlation matrix to obtain event clusters. The authors of [11] identified bursty segments in tweets and clustered them to identify events.
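The incremental clustering used for new event detection in [5] follows a single-pass pattern: each incoming document is compared with the existing clusters and either joins the most similar one or starts a new cluster. The sketch below illustrates that pattern only and is not the implementation from [5]. It assumes raw term counts instead of TF-IDF weights, cosine similarity, and a hypothetical threshold of 0.3; the function names and the sample stream are illustrative.

```python
import math
from collections import Counter


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(count * b[term] for term, count in a.items() if term in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def single_pass_cluster(docs, threshold=0.3):
    """Assign each document to a cluster in one pass over the stream.

    `threshold` is a hypothetical value chosen for this example.
    """
    centroids = []  # summed term counts per cluster (cosine ignores scale)
    labels = []
    for doc in docs:
        vec = Counter(doc.lower().split())
        best, best_sim = None, 0.0
        for i, centroid in enumerate(centroids):
            sim = cosine(vec, centroid)
            if sim > best_sim:
                best, best_sim = i, sim
        if best is not None and best_sim >= threshold:
            centroids[best].update(vec)  # fold the document into the cluster
            labels.append(best)
        else:
            centroids.append(vec)  # no similar cluster: start a new event
            labels.append(len(centroids) - 1)
    return labels


stream = [
    "earthquake hits city center",
    "strong earthquake shakes city",
    "new phone camera review",
    "earthquake damage in city center",
    "phone camera is great",
]
labels = single_pass_cluster(stream)
print(labels)  # [0, 0, 1, 0, 1]
```

Each document is examined once and compared only with cluster summaries, which is why this pattern suits streams where multiple passes over the data are not possible. The number of clusters still grows with the stream, so a practical system would need some policy for retiring old clusters.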
In [12], the authors modeled social text streams, including blogs and emails, as a multi-graph and clustered the streams using textual, temporal, and social information to detect events. A hybrid network and content based clustering approach was