SEED: A Framework for Extracting Social Events from Press News

Salvatore Orlando
DAIS - Università Ca’ Foscari Venezia, Italy
orlando@unive.it

Francesco Pizzolon
DAIS - Università Ca’ Foscari Venezia, Italy
pizzolon.francesco@gmail.com

Gabriele Tolomei
DAIS - Università Ca’ Foscari Venezia, Italy
gabriele.tolomei@unive.it

ABSTRACT
Every day, people exchange a huge amount of data through the Internet. Most of these data consist of unstructured texts, which often contain references to structured information (e.g., person names, contact records, etc.). In this work, we propose a novel solution to discover social events from actual press news edited by humans. Concretely, our method is divided into two steps, each one addressing a specific Information Extraction (IE) task. First, we use a technique to automatically recognize four classes of named entities from press news: Date, Location, Place, and Artist. Second, we detect social events by extracting ternary relations between such entities, also exploiting evidence from external sources (i.e., the Web). Finally, we evaluate both stages of our proposed solution on a real-world dataset. Experimental results highlight the quality of our first-step Named-Entity Recognition (NER) approach, which performs consistently with state-of-the-art solutions. We also show how to precisely select true events from the list of all candidate events (i.e., all the ternary relations) produced by our second-step Relation Extraction (RE) method. In particular, we find that true social events can be detected if enough evidence of them is found in the result lists of Web search engines.

Categories and Subject Descriptors
I.2.7 [Artificial Intelligence]: Natural Language Processing—Text analysis; I.5.4 [Pattern Recognition]: Applications—Text processing

Keywords
Information extraction; Named-entity recognition; Relation extraction; Social event discovery

Copyright is held by the International World Wide Web Conference Committee (IW3C2). IW3C2 reserves the right to provide a hyperlink to the author’s site if the Material is used in electronic media. WWW 2013 Companion, May 13–17, 2013, Rio de Janeiro, Brazil. ACM 978-1-4503-2038-2/13/05.

1. INTRODUCTION
In the last two decades, a huge amount of data has become available due to the exponential growth of the World Wide Web. Though heterogeneous, the vast majority of these data are unstructured texts, which nevertheless often refer to more structured information, such as person names, company names, contact records, etc.

In this paper, we propose a solution to a real problem raised by a Web company, namely detecting structured information about social events from unstructured press news. The company’s mission is to spread, advertise, and recommend cultural events to people for their leisure time, mostly, yet not only, through the Web. Concretely, it focuses on Italian events occurring in several places and on several dates, performed by several national and international artists.

So far, events are manually recognized by members of the company’s editorial office before being published on the company’s Web site. In a nutshell, several journalists carefully read and inspect long and ambiguous press news looking for significant information about actual events. This process can be tedious and waste working hours. Therefore, the (semi-)automatic discovery of events from press news, although a challenging task, may help the company speed up its whole business.

This is an instance of the more general Information Extraction (IE) problem, which refers to the discovery of structured information from unstructured data sources (typically texts). More precisely, in this work we consider two IE-related tasks: (i) Named-Entity Recognition (NER) and (ii) Relation Extraction (RE).
The former aims to extract and classify entities from unstructured text. In our scenario, this amounts to detecting the following classes of entities in press news: Date, Location (i.e., municipalities in Italy), Place (i.e., places affiliated with the company), and Artist. The latter tries to identify relations between entities. In our case, relations represent events by means of 3-ary tuples connecting our entity classes in the following way: (Artist, Location, Date) and (Artist, Place, Date). These two kinds of tuples describe entertainment events well, indicating that a specific artist is performing in a certain place or location on a precise date.

In this work, we introduce Social Entertainment Event Detection (SEED), a framework that addresses both IE tasks. Concerning the NER stage, SEED does not make use of any statistical learning method. In fact, since entities are well known to the company, they can be extracted simply through regular expressions and exact matching against an existing back-end database of entities (i.e., gazetteers).

Conversely, the main novelty of this work regards the RE task. Usually, solutions to RE limit their scope to an individual sentence of the single text document from which the entities have been previously extracted. However, in our scenario, relations can span beyond a single sentence and sometimes even across several press news items. For instance, it may happen that an artist, a place, and a date, all named in the same sentence, do not refer to a true entertainment event.
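As an illustration of the two tasks just described, the following minimal Python sketch recognizes entities by exact gazetteer matching plus a regular expression for dates, and then enumerates candidate (Artist, Location, Date) and (Artist, Place, Date) tuples. It is not SEED's actual implementation: the gazetteer contents, the single dd/mm/yyyy date pattern, and all function names are assumptions made for this example, and the selection of true events among the candidates is not covered.

```python
import re
from itertools import product

# Toy gazetteers (hypothetical entries); in the paper's setting these would
# come from the company's back-end database of known entities.
ARTISTS = {"Mario Rossi Quartet"}
LOCATIONS = {"Venezia", "Padova"}
PLACES = {"Teatro Esempio"}

# Assumed date format for this sketch: dd/mm/yyyy only.
DATE_RE = re.compile(r"\b\d{1,2}/\d{1,2}/\d{4}\b")


def find_gazetteer(text, names):
    """Return the gazetteer entries that occur verbatim in the text."""
    found = []
    for name in sorted(names):
        # Exact, case-sensitive match that is not embedded in a longer word.
        if re.search(r"(?<!\w)" + re.escape(name) + r"(?!\w)", text):
            found.append(name)
    return found


def recognize_entities(text):
    """NER step: map each of the four entity classes to its mentions."""
    return {
        "Artist": find_gazetteer(text, ARTISTS),
        "Location": find_gazetteer(text, LOCATIONS),
        "Place": find_gazetteer(text, PLACES),
        "Date": [m.group(0) for m in DATE_RE.finditer(text)],
    }


def candidate_events(entities):
    """RE step (candidates only): build every ternary tuple of both kinds."""
    by_location = product(entities["Artist"], entities["Location"], entities["Date"])
    by_place = product(entities["Artist"], entities["Place"], entities["Date"])
    return list(by_location) + list(by_place)


news = "The Mario Rossi Quartet will play at Teatro Esempio in Venezia on 12/07/2013."
entities = recognize_entities(news)
events = candidate_events(entities)
for event in events:
    print(event)
```

Enumerating the full Cartesian product deliberately over-generates: as noted above, entities that co-occur in a text do not necessarily form a true event, so the candidates would still need to be filtered by a later step.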