Information Extraction from Microblogs

Prashant Bhardwaj
Computer Science and Engineering
National Institute of Technology Agartala
cse.pbh@gmail.com

Partha Pakray
Computer Science and Engineering
National Institute of Technology Mizoram
parthapakray@gmail.com

ABSTRACT
Microblogging sites contain the emotions and expressions of the public in raw form. This data can be mined for meaningful information that could be used to develop technologies for future use. Numerous microblogging sites are available these days, used in different contexts: some primarily for conversation, some for image and video sharing, and some for formal and official purposes. Twitter is one of the most outspoken platforms for sharing emotions and comments on almost every topic, from sport to entertainment and religion to politics. This paper attempts to extract information from a database of tweets collected from Twitter. The task is to develop methodologies for extracting tweets that are relevant to each topic with high precision. This paper presents the nita_nitmz team's participation in the FIRE 2016 Microblog track.

CCS Concepts
Computing methodologies ~ Natural language processing
Information systems ~ Information extraction

Keywords
Information Retrieval; Microblogging; Twitter

1 INTRODUCTION
The rapid growth of the Internet in the current period provides new sources of information. Nowadays people prefer to express themselves on social sites more often than in any print medium. The idea of information extraction from microblogs posted during disasters was introduced by Sarah Vieweg et al. in the 2010 Proceedings of the SIGCHI Conference on Human Factors in Computing Systems [1]. It has since become one of the most researched topics, considering the possibilities it holds for the proper assessment of any incident.
The importance of the topic can be attributed to the fact that people rarely provide false information on social sites and pour out their emotions according to their knowledge and wisdom. This paper presents the experiments carried out at National Institute of Technology Agartala as part of the participation in the Forum for Information Retrieval Evaluation (FIRE) 2016 task on Information Extraction from Microblogs Posted during Disasters [14]. Our experiments for FIRE 2016 are based on stemming, zonal indexing, theme identification, a TF-IDF based ranking model and positional information. The data contained 48,845 tweets out of the 50,000 tweets mentioned on the workshop website. Queries were provided by the organizing committee, each specified in a title, narration and description format.

2 RELATED WORKS
The problem of information extraction from microblogs posted during disasters has been researched since 2010, starting with Sarah Vieweg et al. [1] and Leysia Palen et al. [2]. There has been tremendous work since then, and a new field of information retrieval has come into existence. Sudha Verma et al. wrote on situational awareness through tweets [3]. Research on locating disaster-hit areas, on response, and on information extraction has been ongoing since then [4][5][6][7]. One important part of information retrieval is part-of-speech tagging of code-mixed microblog data [8][9][10][11][12][13]. Several researchers even work on information extraction from mixed-script analysis of social media websites and forums. English previously dominated microblogging sites such as Twitter and Facebook.

3 TASK DESCRIPTION
A large set of microblogs (tweets) posted during a recent disaster event was made available, along with a set of topics (in TREC format).
Each ‘topic’ identified a broad information need during a disaster, such as what resources are needed by the population in the disaster-affected area, what resources are available, what resources are required or available in which geographical region, and so on. Specifically, each topic contained a title, a brief description, and a more detailed narrative on what type of tweets would be considered relevant to the topic. The participants were required to develop methodologies for extracting tweets relevant to each topic with high precision as well as high recall. The data contained:
- Around 50,000 microblogs (tweets) from Twitter, which were posted during the Nepal earthquake in April 2015. Only the tweet IDs were provided, along with a script that could be used to download the tweets using the Twitter API. Out of the 50,000 tweets, only 48,845 could be downloaded on our experimental setup.
- A set of 5 – 8 topics in TREC format, each containing a title, a brief description, and a more detailed narrative.

4 METHODOLOGY
For the given task we created the required search configuration on Apache Nutch 0.9, a highly extensible and scalable open-source web crawler project. The implementation of the task was done in two steps: first, creating the search environment; second, applying test queries to retrieve results from the previously configured Nutch using a Tomcat server.

4.1 Preparation of the data
The Python script provided by the organizers helped to download the tweets, but the file generated was in JSON format. So another script had
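A conversion script of this kind can be sketched as below. This is a minimal illustration, not the organizers' actual script: it assumes the download step produced one JSON object per line, and the field names `id_str` and `text` (standard in Twitter API v1.1 tweet objects) are assumptions about the dump's layout.

```python
import json

def tweets_json_to_text(json_path, out_path):
    """Convert a file of downloaded tweets (one JSON object per line)
    into plain text, one tab-separated "id<TAB>text" line per tweet,
    suitable for feeding to an indexer such as Nutch."""
    with open(json_path, encoding="utf-8") as src, \
         open(out_path, "w", encoding="utf-8") as dst:
        for line in src:
            line = line.strip()
            if not line:
                continue  # skip blank lines in the dump
            tweet = json.loads(line)
            # 'id_str' and 'text' are assumed field names; adjust
            # them if the actual JSON dump differs.
            dst.write(f"{tweet['id_str']}\t{tweet['text']}\n")
```

Writing the tweet ID alongside the text keeps the mapping needed later, when ranked results must be reported back as tweet IDs.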
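The TF-IDF based ranking model named among the experiments can be illustrated with the toy sketch below. This is not the actual Nutch/Lucene scoring configuration used in the runs, only a minimal standalone version of the same idea: tweets that contain a query term often, when that term is rare across the collection, rank higher.

```python
import math
from collections import Counter

def tf_idf_rank(query_terms, docs):
    """Rank documents against query terms with a plain TF-IDF model.

    docs: list of documents, each a list of tokens.
    Returns (doc_index, score) pairs, best-scoring first.
    """
    n = len(docs)
    # Document frequency of each query term across the collection.
    df = {t: sum(1 for d in docs if t in d) for t in query_terms}
    ranked = []
    for i, d in enumerate(docs):
        tf = Counter(d)
        # Sum tf * idf over query terms that occur somewhere.
        score = sum(tf[t] * math.log(n / df[t])
                    for t in query_terms if df[t] > 0)
        ranked.append((i, score))
    return sorted(ranked, key=lambda pair: pair[1], reverse=True)
```

In practice the stemming and zonal-indexing steps described above would run before this scoring, so that morphological variants of a query term count toward the same term frequency.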