Traffic Measurement and Analysis with Hadoop

Dipti J. Suryawanshi, Prof. Mr. U. A. Mande
Department of Computer Engineering, Sinhgad College of Engineering, University of Pune, India

ABSTRACT: In computer networking, traffic measurement is the process of measuring the amount and type of traffic on a particular network. Internet traffic measurement and analysis are widely used to characterize network usage and user behavior, but they face a scalability problem under the explosive growth of Internet traffic and high-speed access. As the number of network elements, such as routers, switches, and user devices, has increased and their performance has improved rapidly, it has become increasingly difficult for Internet Service Providers (ISPs) to efficiently collect and analyze large data sets of raw packet dumps, flow records, and activity logs for accounting, management, and security. To satisfy demands for deep analysis of ever-growing Internet traffic data, ISPs need a traffic measurement and analysis system whose computing and storage resources can be scaled out. Scalable Internet traffic measurement and analysis is difficult because a large data set requires matching computing and storage resources. Hadoop, an open-source computing platform consisting of MapReduce and a distributed file system, has become a popular infrastructure for massive data analytics because it facilitates scalable data processing and storage services on a distributed computing system built from commodity hardware. This paper presents a Hadoop-based traffic monitoring system that performs IP, TCP, HTTP, and NetFlow analysis of multiple terabytes of Internet traffic in a scalable manner. It also explains the performance issues related to traffic analysis MapReduce jobs.

Keywords: Hadoop, Hive, MapReduce, NetFlow, pcap, packet, traffic analysis, traffic measurement

I. INTRODUCTION

Traffic measurement and analysis require a large amount of data storage and high-performance computing power to manage huge traffic data sets. Google is one example of a search engine operator that scales out easily with MapReduce and GFS [10, 11]. MapReduce [10] is a programming model, developed by Google, for processing large data sets, and it is generally used for data-intensive tasks. It is typically used for distributed computing on clusters of computers. The MapReduce jobs executed on Google's clusters process a total of more than twenty petabytes of data per day.

Because Google's MapReduce and Google File System (GFS) [11] are proprietary, Hadoop [12], an open-source computing platform consisting of MapReduce and a distributed file system, was developed to provide capabilities similar to those of Google's MapReduce platform using thousands of cluster nodes. The Hadoop Distributed File System (HDFS), an important component of Hadoop, corresponds to GFS. Yahoo!, Amazon, Facebook, IBM, Rackspace, Last.fm, Netflix, and Twitter are using Hadoop; Facebook, for example, uses it to analyze web log data for its social network service.

Because of its scalability in storage and computing power, Hadoop is a suitable platform for Internet traffic measurement and analysis. In this paper, we develop a Hadoop-based scalable Internet traffic measurement and analysis system that can manage packet and NetFlow data on HDFS. Applying Hadoop to Internet traffic measurement and analysis raises three challenges: 1) to parallelize MapReduce I/O of packet dumps and NetFlow records in an HDFS-aware manner, 2) to devise traffic analysis algorithms, especially for TCP flows dispersed in HDFS, and 3) to design and implement an integrated Hadoop-based Internet traffic monitoring and analysis system that is practically useful to operators and researchers.
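The MapReduce programming model introduced in this section can be illustrated with a minimal single-process sketch in Python. It is not Hadoop code and is not taken from the system described in this paper; the flow record layout and all function names (map_flow, shuffle, reduce_bytes, total_bytes_per_source) are hypothetical and chosen only to show the map, group-by-key, and reduce steps on a traffic-style workload.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple

# Hypothetical flow record: (source IP, destination IP, bytes transferred).
FlowRecord = Tuple[str, str, int]


def map_flow(record: FlowRecord) -> Iterator[Tuple[str, int]]:
    """Map step: emit one (source IP, byte count) pair per flow record."""
    src_ip, _dst_ip, num_bytes = record
    yield src_ip, num_bytes


def shuffle(pairs: Iterable[Tuple[str, int]]) -> Dict[str, List[int]]:
    """Group mapped values by key, as a MapReduce framework does between map and reduce."""
    grouped: Dict[str, List[int]] = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_bytes(key: str, values: List[int]) -> Tuple[str, int]:
    """Reduce step: total bytes sent by one source IP."""
    return key, sum(values)


def total_bytes_per_source(records: Iterable[FlowRecord]) -> Dict[str, int]:
    """Run map, shuffle, and reduce in sequence within a single process."""
    mapped = (pair for record in records for pair in map_flow(record))
    return dict(reduce_bytes(k, v) for k, v in shuffle(mapped).items())


if __name__ == "__main__":
    sample = [
        ("10.0.0.1", "10.0.0.9", 1500),
        ("10.0.0.2", "10.0.0.9", 400),
        ("10.0.0.1", "10.0.0.7", 500),
    ]
    print(total_bytes_per_source(sample))
```

In a real cluster the map and reduce functions would run in parallel on many nodes and the grouping step would be performed by the framework; the sketch only shows how a per-host byte count decomposes into the two user-supplied functions.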
To meet these three challenges, we first propose a binary input format for reading packet and NetFlow records concurrently in HDFS. We then present MapReduce analysis algorithms for NetFlow, IP, TCP, and HTTP traffic, and show how TCP performance metrics can be analyzed efficiently with MapReduce in a distributed computing environment. Finally, we create a web-based agile traffic warehousing system using Hive [13]. The resulting Hadoop-based Internet traffic analysis system can quickly process large volumes of IP packets as well as NetFlow data through scalable MapReduce-based analysis algorithms for large IP, TCP, and HTTP data. We also show that the data warehousing tool Hive is useful for providing an agile and elastic traffic analysis framework.

The remainder of this paper is organized as follows. In Section 2, we describe related work on traffic measurement and analysis as well as work on MapReduce and Hadoop. The architecture of the Hadoop-based traffic measurement and analysis system and its components are explained in Section 3, and the experimental results are presented in Section 4. Finally, Section 5 concludes this paper.

2980 International Journal of Engineering Research & Technology (IJERT) Vol. 2 Issue 10, October - 2013 ISSN: 2278-0181 www.ijert.org IJERTV2IS101130
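To make the idea of reading binary packet records more concrete, the following Python sketch iterates over the records of a packet trace held in memory. It assumes the classic libpcap file layout (a 24-byte global header followed by 16-byte per-packet record headers) in little-endian byte order with microsecond timestamps. It is not the input format proposed in this paper: the names iter_packets and build_sample are hypothetical, and the harder problem of locating a record boundary when an HDFS block begins in the middle of a packet is not addressed.

```python
import struct
from typing import Iterator, Tuple

PCAP_MAGIC = 0xA1B2C3D4  # classic libpcap magic number (microsecond timestamps)
# magic, version major, version minor, thiszone, sigfigs, snaplen, link type
GLOBAL_HEADER = struct.Struct("<IHHiIII")  # 24 bytes
# ts_sec, ts_usec, captured length, original length
RECORD_HEADER = struct.Struct("<IIII")     # 16 bytes


def iter_packets(data: bytes) -> Iterator[Tuple[float, int, bytes]]:
    """Yield (timestamp, original length, captured bytes) for each packet record."""
    magic = GLOBAL_HEADER.unpack_from(data, 0)[0]
    if magic != PCAP_MAGIC:
        raise ValueError("not a little-endian microsecond pcap buffer")
    offset = GLOBAL_HEADER.size
    while offset + RECORD_HEADER.size <= len(data):
        ts_sec, ts_usec, incl_len, orig_len = RECORD_HEADER.unpack_from(data, offset)
        offset += RECORD_HEADER.size
        payload = data[offset:offset + incl_len]
        if len(payload) < incl_len:
            break  # final record is truncated; stop instead of yielding partial data
        offset += incl_len
        yield ts_sec + ts_usec / 1_000_000, orig_len, payload


def build_sample() -> bytes:
    """Build a tiny two-packet trace in memory (payloads are placeholder bytes)."""
    header = GLOBAL_HEADER.pack(PCAP_MAGIC, 2, 4, 0, 0, 65535, 1)
    packets = [(1, 0, b"abc"), (2, 500000, b"defgh")]
    body = b"".join(
        RECORD_HEADER.pack(sec, usec, len(p), len(p)) + p for sec, usec, p in packets
    )
    return header + body


if __name__ == "__main__":
    for timestamp, length, payload in iter_packets(build_sample()):
        print(timestamp, length, payload)
```

Because each record header carries the captured length, records can be walked sequentially without parsing the packet contents; a reader that starts at an arbitrary offset, as a MapReduce task on an HDFS block would, needs an additional way to find the first valid record header.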