DFRWS 2020 EU - Proceedings of the Seventh Annual DFRWS Europe

Big Data Forensics: Hadoop 3.2.0 Reconstruction

Edward Harshany a, *, Ryan Benton a, David Bourrie a, William Glisson b

a University of South Alabama, School of Computing, Mobile, 36688, USA
b Sam Houston State University, College of Science and Engineering Technology, Huntsville, 77342, USA

Keywords: Hadoop, Forensics, Big data, Reconstruction

Abstract: Conducting digital forensic investigations in a big data distributed file system environment presents significant challenges to an investigator given the high volume of physical data storage space. Presented is an approach with which the Hadoop Distributed File System logical file space is mapped to the physical data location. This approach uses metadata collection and analysis to reconstruct events in a finite time series.

Published by Elsevier Ltd. This is an open access article under the CC BY-NC-ND license (http://creativecommons.org/licenses/by-nc-nd/4.0/).

* Corresponding author.

1. Introduction

Metadata management is vital to the Hadoop Distributed File System (HDFS). HDFS is designed to centrally manage all distributed file system metadata through the master server, called the Namenode. The metadata details the structure of the distributed file system abstraction through file and directory attributes, the mapping of data to data storage locations, and the namespace hierarchy (Hadoop - Apache Hadoop 3, 2019). Expeditious assimilation of metadata is critical to successful data evidence recovery in a distributed system with high ingestion rates (Grispos et al., 2013). HDFS is at the core of the Hadoop ecosystem, which has evolved through version releases (White, 2015).
Hadoop version 3.2.0, released in 2018, includes 1) Hadoop Common, containing utilities that support the other Hadoop modules; 2) HDFS, the Hadoop distributed file system architecture; 3) MapReduce, a YARN-based system for parallel data processing; and 4) Yet Another Resource Negotiator (YARN), a resource management layer handling job scheduling and cluster resources, which together with MapReduce forms the data-computation framework (Hadoop - Apache Hadoop 3, 2019). The Hadoop ecosystem is comprised of these four main layers, employed in varying configurations to provide data storage and processing solutions (White, 2015). This study aims to investigate the effectiveness of utilizing a subset of metadata generated at the HDFS data storage layer to reconstruct file system operations and map data to its physical location. Once mapped, data evidence could be prioritized and targeted for preservation or further analysis.

2. Methodology and experimental setting

Methods were confined to the construction of directories and to file operations, addition and deletion, performed in a specific order. The timeline creates a directory structure within the HDFS namespace, adds files to the HDFS namespace, and deletes specific files from the HDFS namespace. The goal is to reconstruct the sequence of operations over this time period and discover file locations from the HDFS metadata. Fig. 1 shows the setting, configured in fully distributed mode with Hadoop 3.2.0 on available commodity hardware, each node running the 64-bit Ubuntu 18.04.2 operating system with the version 4.15 Linux kernel.

The data block ID is used within the logical namespace to identify the data blocks belonging to a file. The Datanode uses the block ID as the file name when creating files in its native file system, stored in the Ext4 file's inode structure. Block replicas on Datanodes are represented by two files in the local file system: one contains the data itself, and the other records metadata, including checksums for the data and the generation stamp (Sremack, 2015).
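As an illustration of this two-file replica layout, the sketch below walks a Datanode storage directory and pairs each `blk_<blockID>` data file with its `blk_<blockID>_<genstamp>.meta` checksum file. The directory path passed in is an assumption for the example; on a real cluster the location is governed by the `dfs.datanode.data.dir` setting in `hdfs-site.xml`.

```python
import os
import re
from collections import defaultdict

# Replica naming convention on the Datanode's local file system:
#   blk_<blockID>                 -> the block data itself
#   blk_<blockID>_<genstamp>.meta -> checksums + generation stamp metadata
BLOCK_RE = re.compile(r'^blk_(\d+)(?:_(\d+)\.meta)?$')

def find_replicas(storage_dir):
    """Walk a Datanode storage directory and pair each block's data
    file with its .meta file, keyed by block ID."""
    replicas = defaultdict(dict)
    for dirpath, _, filenames in os.walk(storage_dir):
        for fn in filenames:
            m = BLOCK_RE.match(fn)
            if not m:
                continue
            block_id, genstamp = m.groups()
            path = os.path.join(dirpath, fn)
            if genstamp is None:
                replicas[int(block_id)]['data'] = path
            else:
                replicas[int(block_id)]['meta'] = path
                replicas[int(block_id)]['genstamp'] = int(genstamp)
    return dict(replicas)
```

Pairing the data file with the generation stamp parsed from its companion `.meta` file name gives the investigator the physical location of each block replica together with the identifier needed to match it against the Namenode metadata.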
3. Analysis and findings

HDFS image and edits files were recovered from the live system and converted to .XML files for offline analysis. These files contain serialized image data that must be converted for viewing. The XML files were then read by a parser developed to extract attributes. The most recently discovered image represented the most recent state, as all edits had been applied to the image. The image contains information on each individual INode, including type, name, replication, timestamp information, associated block information, and the sequential generation stamp for each data block present. This information can be compared with the INode directory section, and a resultant HDFS logical file system namespace can be reconstructed.

The processed image data structures reveal absent inodes, data blocks, and generation stamps, indicating file system modification from a previous state. Generation stamps are associated with data blocks during data block

E-mail address: eh1721@jagmail.southalabama.edu

Forensic Science International: Digital Investigation 32 (2020) 300909. https://doi.org/10.1016/j.fsidi.2020.300909
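A minimal sketch of the kind of parser described above, assuming the element layout produced by Hadoop's offline image viewer (`hdfs oiv -p XML`): it collects each inode's type, name, and block/genstamp pairs from the `INodeSection`, then uses the parent-child records in the `INodeDirectorySection` to reconstruct full namespace paths for files. This is an illustrative reconstruction under those layout assumptions, not the authors' actual tool.

```python
import xml.etree.ElementTree as ET

def parse_fsimage(xml_path):
    """Parse an fsimage XML dump and map each file's full HDFS path
    to its list of (block ID, generation stamp) pairs."""
    root = ET.parse(xml_path).getroot()

    # INodeSection: one <inode> per file or directory.
    inodes = {}  # inode id -> (type, name, [(block id, genstamp), ...])
    for inode in root.iter('inode'):
        iid = int(inode.findtext('id'))
        itype = inode.findtext('type')
        name = inode.findtext('name') or ''
        blocks = [(int(b.findtext('id')), int(b.findtext('genstamp')))
                  for b in inode.iter('block')]
        inodes[iid] = (itype, name, blocks)

    # INodeDirectorySection: <directory> records give the hierarchy.
    parent = {}  # child inode id -> parent inode id
    for d in root.iter('directory'):
        pid = int(d.findtext('parent'))
        for c in d.findall('child'):
            parent[int(c.text)] = pid

    def full_path(iid):
        # Climb parent links to the root (which has no parent entry).
        parts = []
        while iid in parent:
            parts.append(inodes[iid][1])
            iid = parent[iid]
        return '/' + '/'.join(reversed(parts))

    return {full_path(i): blocks
            for i, (itype, _, blocks) in inodes.items() if itype == 'FILE'}
```

Comparing two such reconstructed namespaces, or noting gaps in the otherwise sequential block IDs and generation stamps within one image, surfaces the absent inodes and blocks that indicate modification from a previous state.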