Neural Networks 32 (2012) 275–284
Journal homepage: www.elsevier.com/locate/neunet

2012 Special Issue

Application of growing hierarchical SOM for visualisation of network forensics traffic data

E.J. Palomo a,∗, J. North b, D. Elizondo b, R.M. Luque a, T. Watson b

a Department of Computer Science, University of Malaga, Malaga, Spain
b Cyber Security Centre, Department of Computer Technology, De Montfort University, Leicester, United Kingdom

Keywords: Network forensics; Hierarchical self-organisation; Data clustering; Data visualisation; Feature extraction

Abstract

Digital investigation methods are becoming increasingly important due to the proliferation of digital crimes and crimes involving digital evidence. Network forensics is a research area that gathers evidence by collecting and analysing network traffic data logs. This analysis can be a difficult process, especially because of the high variability of attacks and the large amount of data involved. Therefore, software tools that can help with these digital investigations are in great demand. In this paper, a novel approach to analysing and visualising network traffic data based on growing hierarchical self-organising maps (GHSOM) is presented. The self-organising map (SOM) has been shown to be successful for the analysis of high-dimensional input data in data mining applications, as well as for visualising data in a more intuitive and understandable manner. However, the SOM has some problems related to its static topology and its inability to represent hierarchical relationships in the input data. The GHSOM tries to overcome these limitations by generating a hierarchical architecture that is automatically determined according to the input data and reflects the inherent hierarchical relationships among them.
Moreover, the proposed GHSOM has been modified to correctly treat the qualitative features that are present in the traffic data in addition to the quantitative features. Experimental results show that this approach can be very useful for a better understanding of network traffic data, making it easier to search for evidence of attacks or anomalous behaviour in a network environment.

© 2012 Elsevier Ltd. All rights reserved.

∗ Correspondence to: Department of Computer Science, E.T.S.I. Informatica, University of Malaga, Bulevar Louis Pasteur, 35, 29071, Malaga, Spain. Tel.: +34 952 132 847; fax: +34 952 131 397. E-mail address: ejpalomo@lcc.uma.es (E.J. Palomo).

1. Introduction

The network has become a staple method of transferring information to support both personal and business requirements. However, as different services have been enabled across the network environment, the potential for cyber-crime has grown with them. Unfortunately, not only are criminals exploiting this medium to an unprecedented degree, but the potential for cyber-warfare or cyber-terrorism must now also be considered.

Digital devices can often be configured to record the traffic and data fed to them in the form of logs. Digital forensics is the preservation and extraction of this information in a manner that maintains its integrity and soundness. This information and its interpretation can be used in criminal courts by both the defence and the prosecution (Kruse & Heiser, 2001). Although digital forensics can take many different forms, this paper looks specifically at a sub-field of forensics that involves analysing network traffic. Network forensics typically involves analysing any available audit trails for the specific streams identifying the offending activity (Mukkamala & Sung, 2003).
These audit trails can be created using reconstructive analysis of the log files, which can be produced by many different devices and software services on the network, including routers, firewalls, web servers and databases.

Although this kind of analysis is clearly desirable, it is a non-trivial task (Roussev & III, 2004). There are two main reasons for this. The first is the amount of data that needs to be analysed to find potentially very small telltale signs. This is not limited to the number of records; it also extends to the number of different features each record may contain. The second is the pattern that the offending data takes. When analysing datasets, there are two distinct analyses that can be done: the first is to look for known data patterns, which correspond to attacks that have been seen before, and the second is to look for attacks that have not been seen or identified before. This paper concentrates on identifying attacks which may or may not have been seen before, meaning that the form of the data patterns to be identified is not known.

0893-6080/$ – see front matter. doi:10.1016/j.neunet.2012.02.021

The identification of information, or patterns, in large subsets of data falls within the fields of data mining and feature extraction. Unsupervised learning techniques are a subset of these fields which enable the identification and grouping of