Unsupervised Spatial Event Detection in Targeted Domains with Applications to Civil Unrest Modeling

Liang Zhao 1*, Feng Chen 2, Jing Dai 3, Ting Hua 1, Chang-Tien Lu 1, Naren Ramakrishnan 1

1 Department of Computer Science, Virginia Tech, Falls Church, Virginia, United States of America, 2 Department of Computer Science, University at Albany-SUNY, Albany, New York, United States of America, 3 Google, New York City, New York, United States of America

Abstract

Twitter has become a popular data source as a surrogate for monitoring and detecting events. Targeted domains such as crime, election, and social unrest require the creation of algorithms capable of detecting events pertinent to these domains. Due to the unstructured language, short-length messages, dynamics, and heterogeneity typical of Twitter data streams, it is technically difficult and labor-intensive to develop and maintain supervised learning systems. We present a novel unsupervised approach for detecting spatial events in targeted domains and illustrate this approach using one specific domain, viz. civil unrest modeling. Given a targeted domain, we propose a dynamic query expansion algorithm to iteratively expand domain-related terms, and generate a tweet homogeneous graph. An anomaly identification method is utilized to detect spatial events over this graph by jointly maximizing local modularity and spatial scan statistics. Extensive experiments conducted in 10 Latin American countries demonstrate the effectiveness of the proposed approach.

Citation: Zhao L, Chen F, Dai J, Hua T, Lu C-T, et al. (2014) Unsupervised Spatial Event Detection in Targeted Domains with Applications to Civil Unrest Modeling. PLoS ONE 9(10): e110206. doi:10.1371/journal.pone.0110206

Editor: Renaud Lambiotte, University of Namur, Belgium

Received June 20, 2014; Accepted August 25, 2014; Published October 28, 2014

Copyright: © 2014 Zhao et al.
This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

Data Availability: The authors confirm that, for approved reasons, some access restrictions apply to the data underlying the findings. Data are available from Twitter API: https://dev.twitter.com/. Twitter data used in this paper was purchased from Datasift Inc (http://datasift.com/). All analyses here are done in compliance with Twitter and Datasift terms of use. Twitter data is available through either the public Twitter API (https://dev.twitter.com/) or through authorized resellers such as Gnip.com and Datasift.com. (Gnip.com has recently been acquired by Twitter). The Twitter data for this paper was purchased from Datasift.com and analysis has been conducted in compliance with the Twitter and Datasift terms of use. Readers interested in purchasing data similar to that used in our paper can contact Datasift using the contact form (as we did) at: http://datasift.com/contact-us/. Different representatives exist at Datasift to cater to different user segments and geographical regions and this contact form provides the best way to reach a representative who can address a specific reader’s query of interest.

Funding: This work is supported by the Intelligence Advanced Research Projects Activity (IARPA) via Department of Interior National Business Center (DoI/NBC) contract number D12PC000337. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

Competing Interests: Co-author Jing Dai is an employee of Google Inc. This does not alter the authors’ adherence to all the PLOS ONE policies on sharing data and materials.

* Email: liangz8@vt.edu

Introduction

Microblogs such as Twitter and Weibo are experiencing an explosive level of growth.
Millions of worldwide microblog users broadcast their daily observations on an enormous variety of domains, e.g., crime, sports, and politics. Traditional media, in contrast, is monopolized by closed groups, and on occasion may even be under threat from criminal organizations in localities suffering from conflicts and high crime rates [1]. When a social event occurs, it usually takes hours or even days to be reported by traditional media, which is why social media like Twitter have come to play a major role as a real-time information platform for social events [2,3]. Beyond items of public interest, event-related microblogs can provide highly detailed and timely information for those interested in public safety, homeland security, and financial stability. Figure 1 depicts event hotspots related to the protests on September 27th, 2012 in Mexico. Based on tweets posted on that day, the new approach proposed here automatically and immediately identified these events, some of which were not reported by traditional media until several days later.

Although identifying events from news reports has been well studied [4], analyzing tweets to reveal event information requires more sophisticated techniques. Tweets are written in unstructured language and often contain typos, non-standard acronyms, and spam. In addition to the textual content, Twitter data forms a heterogeneous information network where users, tweets, and hashtags have mutual relationships. These features of Twitter data pose a challenge for event detection methods developed for traditional media.

Although there has been a considerable body of work on event detection in Twitter, most of the work published has targeted events of general interest. Methods for general interest events typically focus on the “hotness” of events but are not sufficient for tracking events in specific domains.
It is of high social significance to continuously and closely monitor crucial domains such as crime [5], earthquakes [6], civil unrest [7], and disease outbreaks [8]. Existing methods in event detection suffer from the following shortcomings: 1) their restricted ability to model heterogeneity and network properties of Twitter data. Existing methods typically treat Twitter data as a set of plain textual documents. However, “tweet”, “word”, “hashtag”, and “user” are of different entity types. For example, a “user” can post a “tweet”, “tweets” can be tagged by a “hashtag” and a “tweet” can reply to another “tweet”. In general, these heterogeneous relationships and properties are not effectively harnessed by existing methods; 2) their limited ability to handle the dynamic properties of Twitter data. Existing methods treat fixed keywords as features for classifying tweets. However, the expression in tweets dynamically evolves, which makes the use of fixed features and historical

PLOS ONE | www.plosone.org 1 October 2014 | Volume 9 | Issue 10 | e110206