Design of a Disaster Big Data Platform for Collecting and Analyzing Social Media

Van-Quyet Nguyen, Sinh-Ngoc Nguyen, Giang-Truong Nguyen, Kyungbaek Kim
Dept. of Electronics and Computer Engineering, Chonnam National University
e-mail: quyetict@utehy.edu.vn, sinhngoc.nguyen@gmail.com, truongnguyengiang.bk@gmail.com, kyungbaekkim@jnu.ac.kr

Abstract

Recently, emergency response during disasters has benefited from the early transmission of disaster-related notifications on social media networks (e.g., Twitter or Facebook). Intuitively, with their characteristics (e.g., real-time updates and mobility) and large communities whose users can act as volunteers, social networks have proven to play a crucial role in disaster response. However, the amount of data transmitted during disasters is an obstacle to filtering informative messages, because the messages are diverse, voluminous, and very noisy. This large volume of data can be seen as Social Big Data (SBD). In this paper, we propose a big data platform for collecting and analyzing disaster data from SBD. First, we designed a collecting module that can rapidly extract disaster information from Twitter using big data frameworks that support streaming data on distributed systems, such as Kafka and Spark. Second, we developed an analyzing module that learns from SBD to distinguish useful information from irrelevant information. Finally, we designed a real-time visualization on a web interface for displaying the results of the analysis phase. To show the viability of our platform, we conducted experiments on the collecting and analyzing phases over 10 days, for both real-time and historical tweets about disasters that happened in South Korea.
The results show that our big data platform can be applied to disaster-information-based systems by providing a large amount of relevant data, drawn from 21,000 collected tweets, which can be used to infer affected regions and victims in disaster situations.

1. Introduction

In the past few years, a number of studies have focused on collecting and analyzing social media data for detecting disasters using machine learning algorithms. These studies utilized Twitter data, which comes with well-known limitations such as demographic bias, as a source of particular interest. Sakaki et al. [1] investigated the real-time nature of Twitter for earthquake event detection by applying Kalman filtering and particle filtering to estimate the center of a burst earthquake. However, users are required to specify the detected events explicitly, and a new classifier needs to be trained to detect each new event, which makes the approach difficult to extend. Imran et al. [2] employed machine learning to successfully extract structured information from unstructured, text-based Twitter messages, and compared their results with manual classifications based on crowdsourcing. Vieweg et al. [3] analyzed Twitter messages during the 2009 flooding of the Red River Valley in the US and Canada, seeking to discern activity patterns and extract useful information. Starbird et al. [4] not only tested the hypothesis that crowd behavior could serve as a collaborative filter for identifying people tweeting, but also found that machine learning techniques could be effective in identifying them. Although some systems have been proposed for collecting and analyzing disaster information, they are still restricted in the type of data (e.g., only historical data), the type of storage (e.g., a single local disk on one computer), or the size of data they can handle. In this paper, we propose a big data platform for collecting and analyzing disaster data from SBD.
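The collecting module described above streams tweets through Kafka into Spark for filtering. The snippet below is a minimal stand-in that emulates the producer/consumer hand-off with Python's standard-library queue rather than a real Kafka broker; the keyword list and tweet field names are illustrative assumptions, not the platform's actual configuration.

```python
import json
import queue

# Emulated "topic": in the real platform this would be a Kafka topic
# consumed by a Spark Streaming job.
tweet_topic = queue.Queue()

# Assumed keyword list for illustration only.
DISASTER_KEYWORDS = {"earthquake", "flood", "typhoon", "fire"}

def produce(raw_tweets):
    """Push raw tweet JSON strings onto the topic (Kafka producer stand-in)."""
    for raw in raw_tweets:
        tweet_topic.put(raw)

def consume_and_filter():
    """Drain the topic, keeping only tweets that mention a disaster keyword."""
    kept = []
    while not tweet_topic.empty():
        tweet = json.loads(tweet_topic.get())
        words = set(tweet.get("text", "").lower().split())
        if words & DISASTER_KEYWORDS:
            kept.append(tweet)
    return kept

produce([
    json.dumps({"id": 1, "text": "Earthquake felt in Gyeongju"}),
    json.dumps({"id": 2, "text": "Nice weather today"}),
])
print([t["id"] for t in consume_and_filter()])  # -> [1]
```

In the actual platform the producer and consumer run as separate distributed processes, so the hand-off is durable and scales beyond a single machine, unlike this in-process sketch.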
Our work makes the following contributions. First, we designed and implemented a collecting module that rapidly extracts disaster information from Twitter using big data frameworks that support streaming data on distributed systems, such as Kafka and Spark. Second, we implemented algorithms for analyzing tweets to distinguish useful information from irrelevant information; we adapt keyword-based and topic-based filtering, which are common approaches for analyzing Twitter messages. Finally, we designed and implemented a real-time visualization on a web interface for illustrating the results of the analysis phase.

Proceedings of the 2017 Spring Conference, Vol. 24, No. 1 (Apr. 2017)
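As a rough illustration of the two filtering approaches named in the contributions, the sketch below scores a tweet both by direct keyword hits and by overlap with small per-topic vocabularies. The keyword set, topic names, and vocabularies are invented for illustration and do not reflect the platform's actual dictionaries or its learned models.

```python
# Hypothetical keyword list and topic vocabularies, for illustration only.
KEYWORDS = {"earthquake", "flood", "typhoon"}
TOPICS = {
    "disaster": {"damage", "victims", "rescue", "evacuate", "collapsed"},
    "weather":  {"sunny", "rain", "forecast", "cloudy"},
}

def keyword_filter(text):
    """Keyword-based filtering: keep a tweet if any keyword appears."""
    return bool(set(text.lower().split()) & KEYWORDS)

def topic_scores(text):
    """Topic-based filtering: fraction of each topic vocabulary in the tweet."""
    words = set(text.lower().split())
    return {t: len(words & vocab) / len(vocab) for t, vocab in TOPICS.items()}

tweet = "rescue teams evacuate victims after the earthquake"
print(keyword_filter(tweet))        # True
scores = topic_scores(tweet)
print(max(scores, key=scores.get))  # disaster
```

Keyword filtering is cheap but brittle to wording; topic-based scoring tolerates paraphrase at the cost of maintaining topic vocabularies, which is why the two are typically combined.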