A Framework for Classifying IPFIX Flow Data, Case KNN Classifier

Jussi Nieminen, Jorma Ylinen, Timo Seppälä, Teemu Alapaholuoma, Pekka Loula
Telecommunication Research Center
Tampere University of Technology, Pori Unit
Pori, Finland
jussi.nieminen@tut.fi, jorma.ylinen@tut.fi, timo.a.seppala@tut.fi, teemu.alapaholuoma@tut.fi, pekka.loula@tut.fi

Abstract: Flow-level measurement applications and analysis in IP networks are gaining popularity as the amount of data transmitted on the Internet keeps increasing. It is not reasonable, or even possible, to examine every packet traversing a network. Our research focuses on passive flow-level data classification and characteristic identification. More precisely, our goal is to design a framework for extracting certain classes, features and behavior from IP flow data. One of the goals is to achieve this without examining the payload of any IP packet and without compromising the anonymity of the flow counterparts. Traditionally, Deep Packet Inspection or port mapping techniques have been applied for this purpose. In this paper, we present an alternative framework for classifying IP traffic, which we aim to use in the future for separating classes from IP traffic for information security purposes.

Keywords: Flow; IP; IPFIX; KNN; Classification

I. INTRODUCTION

In this paper, we study the possibility of identifying traffic characteristics from IP traffic, and more precisely from the IP/TCP/UDP/ICMP header data. We apply the KNN (K Nearest Neighbors) classifier through passive data analysis on IPFIX [1] [2] flow data. The motivation for our research comes from the area of information security. We are interested in methods for separating classes from the data, so that in future analysis work a measurable unit (an IPFIX flow in this case) can be identified, for example, as normal or malicious.
In this paper, we present a framework that can be used for that purpose. Our research relies on total anonymity. The IP addresses are either anonymized or removed before the analysis is executed. The payload of each IP packet is removed in the data capture phase, so all details that could compromise the privacy of the connection counterparts are discarded.

The KNN classifier determines the class of a new data point based on its K nearest neighbors in a selected feature space. The class that occurs most often among the K nearest neighbors is assigned to the test data point. The KNN classifier is thus based simply on a distance metric between data points. The Euclidean distance is the most common metric, although other metrics are available. Consequently, a variety of KNN implementations have been introduced.

Our data for the analysis was captured from a large-scale local area network. The selected network is known to have a large number of hosts and a good set of active services. It is also known that the information security policy does not restrict the usage of any service in the network. This is a clear advantage from the analysis point of view, because the captured data has not been restricted or filtered at any point. The data was captured from the network and stored to disk in IPFIX format.

In the analysis phase, the data was first divided into two classes. In this paper, we use a class distribution of WWW-type traffic versus other traffic. WWW as a service provides interesting viewpoints for future analysis, as it is commonly used, uses standard port numbers, and therefore also has many information security aspects. The following step was to select the parameters for the classification execution. K-fold cross-validation was used as the classification framework to determine the best value for the constant 'K' in the KNN classifier.
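The neighbor vote and the cross-validated choice of K described above can be sketched as follows. This is a minimal illustration in Python, not the implementation used in this work: the function names (`euclidean`, `knn_predict`, `choose_k`, `make_toy_flows`), the synthetic three-dimensional feature vectors and the labels "www" and "other" are assumptions made for the example. Note that the number of folds in the cross-validation is a separate quantity from the K of the classifier.

```python
import math
import random
from collections import Counter


def euclidean(a, b):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def knn_predict(train, query, k):
    """Return the majority label among the k training points nearest to query.

    train is a list of (features, label) pairs.
    """
    neighbors = sorted(train, key=lambda item: euclidean(item[0], query))[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]


def choose_k(data, candidate_ks, folds=5, seed=0):
    """Pick the candidate k with the highest cross-validated accuracy.

    The data is shuffled and split into `folds` parts; each part is held out
    once while the remaining parts serve as training data.
    """
    rng = random.Random(seed)
    shuffled = data[:]
    rng.shuffle(shuffled)
    parts = [shuffled[i::folds] for i in range(folds)]
    scores = {}
    for k in candidate_ks:
        correct = 0
        for i, held_out in enumerate(parts):
            train = [row for j, part in enumerate(parts) if j != i for row in part]
            correct += sum(knn_predict(train, x, k) == y for x, y in held_out)
        scores[k] = correct / len(data)
    best_k = max(scores, key=scores.get)
    return best_k, scores


def make_toy_flows(n_per_class=30, seed=1):
    """Synthetic stand-in for labeled flows (hypothetical data, two clusters)."""
    rng = random.Random(seed)
    data = []
    for _ in range(n_per_class):
        data.append((tuple(rng.gauss(0, 1) for _ in range(3)), "www"))
        data.append((tuple(rng.gauss(6, 1) for _ in range(3)), "other"))
    return data


if __name__ == "__main__":
    flows = make_toy_flows()
    best_k, scores = choose_k(flows, candidate_ks=[1, 3, 5, 7])
    print("best k:", best_k, "accuracy per k:", scores)
```

With two classes, odd candidate values of K avoid tied votes. In this sketch a tie would fall to whichever label appears first among the sorted neighbors, which is an arbitrary design choice rather than part of the method.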
Another important factor was to select suitable input parameters (features) for the classification. We came up with a set of three parameters. Once the parameters were selected, the actual classification was executed, which yielded details about how well the classification succeeded. The results were studied and recorded, along with conclusions and observations about the functionality of the analysis framework and the methods used. Based on the analysis, we present our framework for classifying IP flow data. In addition, some thoughts on how the results could be used in practice are provided.

This paper consists of seven sections. In the next section, the related work in the field of IP traffic data classification is presented and analyzed briefly. Section three presents the data: how it is obtained, how it is preprocessed, how much data there is in total, and how it corresponds to real time. The theory behind the analysis is presented in Section four. Section five presents the analysis framework and the execution of each step during the analysis. The observations and results of the analysis are presented in Section six. Finally, conclusions and future plans are given in Section seven.

II. RELATED WORK

The quest for finding solutions for extracting IP-traffic characteristics from IP traffic has been a challenge for

Copyright (c) IARIA, 2012. ISBN: 978-1-61208-186-1. ICNS 2012: The Eighth International Conference on Networking and Services