DFRWS 2020 EU - Proceedings of the Seventh Annual DFRWS Europe

Big Data Forensics: Hadoop 3.2.0 Reconstruction

Edward Harshany a, *, Ryan Benton a, David Bourrie a, William Glisson b

a University of South Alabama, School of Computing, Mobile, 36688, USA
b Sam Houston State University, College of Science and Engineering Technology, Huntsville, 77342, USA

Keywords: Hadoop, Forensics, Big data, Reconstruction

Abstract: Conducting digital forensic investigations in a big data distributed file system environment presents significant challenges to an investigator given the high volume of physical data storage space. Presented is an approach with which the Hadoop Distributed File System logical file space is mapped to the physical data location. This approach uses metadata collection and analysis to reconstruct events in a finite time series.

Published by Elsevier Ltd. This is an open access article under the CC BY-NC-ND license (http://creativecommons.org/licenses/by-nc-nd/4.0/).

* Corresponding author.

1. Introduction

Metadata management is vital to the Hadoop Distributed File System (HDFS). HDFS is designed to centrally manage all distributed file system metadata through the master server, called the Namenode. The metadata details the structure of the distributed file system abstraction through file and directory attributes, the mapping of data to data storage locations, and the namespace hierarchy (Hadoop - Apache Hadoop 3, 2019). Expeditious assimilation of metadata is critical to successful data evidence recovery in a distributed system with high ingestion rates (Grispos et al., 2013). HDFS is at the core of the Hadoop ecosystem, which has evolved through version releases (White, 2015).
Hadoop version 3.2.0, released in 2018, includes 1) Hadoop Common, containing utilities that support the other Hadoop modules; 2) HDFS, the Hadoop distributed file system architecture; 3) MapReduce, a YARN-based system for parallel data processing; and 4) Yet Another Resource Negotiator (YARN), a resource management layer handling job scheduling and cluster resources, which together with MapReduce forms the data-computation framework (Hadoop - Apache Hadoop 3, 2019). The Hadoop ecosystem is comprised of these four main layers, employed in varying configurations to provide data storage and processing solutions (White, 2015). This study aims to investigate the effectiveness of utilizing a subset of metadata generated at the HDFS data storage layer to reconstruct file system operations and map data to its physical location. Once mapped, data evidence could be prioritized and targeted for preservation or further analysis.

2. Methodology and experimental setting

Methods were confined to the construction of directories and to file operations, addition and deletion, performed in a specific order. The timeline creates a directory structure within the HDFS namespace, adds files to the HDFS namespace, and deletes specific files from the HDFS namespace. The goal is to reconstruct the sequence of operations over this time period and discover file locations from the HDFS metadata. Fig. 1 shows the setting, configured in fully distributed mode with Hadoop 3.2.0 on available commodity hardware, each node running the 64-bit Ubuntu 18.04.2 operating system with the version 4.15 Linux kernel.

The data block ID is used within the logical namespace to identify the data blocks belonging to a file. The Datanode uses the block ID as the file name when creating files in its native file system, stored in the Ext4 file's inode structure. Block replicas on Datanodes are represented by two files in the local file system: one contains the data itself, and the other records metadata, including checksums for the data and the generation stamp (Sremack, 2015).
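As an illustration of this two-file replica layout, the sketch below walks a Datanode storage directory and pairs each `blk_<blockID>` data file with its `blk_<blockID>_<genstamp>.meta` checksum file. The directory path passed in is an assumption for the example; on a real cluster the location is governed by the `dfs.datanode.data.dir` setting in `hdfs-site.xml`.

```python
import os
import re
from collections import defaultdict

# Replica naming convention on the Datanode's local file system:
#   blk_<blockID>                 -> the block data itself
#   blk_<blockID>_<genstamp>.meta -> checksums + generation stamp metadata
BLOCK_RE = re.compile(r'^blk_(\d+)(?:_(\d+)\.meta)?$')

def find_replicas(storage_dir):
    """Walk a Datanode storage directory and pair each block's data
    file with its .meta file, keyed by block ID."""
    replicas = defaultdict(dict)
    for dirpath, _, filenames in os.walk(storage_dir):
        for fn in filenames:
            m = BLOCK_RE.match(fn)
            if not m:
                continue
            block_id, genstamp = m.groups()
            path = os.path.join(dirpath, fn)
            if genstamp is None:
                replicas[int(block_id)]['data'] = path
            else:
                replicas[int(block_id)]['meta'] = path
                replicas[int(block_id)]['genstamp'] = int(genstamp)
    return dict(replicas)
```

Pairing the data file with the generation stamp parsed from its companion `.meta` file name gives the investigator the physical location of each block replica together with the identifier needed to match it against the Namenode metadata.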
3. Analysis and findings

HDFS image and edits files were recovered from the live system and converted to .XML files for offline analysis. These files contain serialized image data that must be converted for viewing. The XML files were then read by a parser developed to extract attributes. The most recently discovered image represented the most recent state, as all edits had been applied to the image. The image contains information on each individual INode, including type, name, replication, timestamp information, associated block information, and the sequential generation stamp for each data block present. This information can be compared with the INode directory section, and a resultant HDFS logical file system namespace can be reconstructed.

The processed image data structures reveal absent inodes, data blocks, and generation stamps, indicating file system modification from a previous state. Generation stamps are associated with data blocks during data block

E-mail address: eh1721@jagmail.southalabama.edu

Forensic Science International: Digital Investigation 32 (2020) 300909. https://doi.org/10.1016/j.fsidi.2020.300909
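A minimal sketch of the kind of parser described above, assuming the element layout produced by Hadoop's offline image viewer (`hdfs oiv -p XML`): it collects each inode's type, name, and block/genstamp pairs from the `INodeSection`, then uses the parent-child records in the `INodeDirectorySection` to reconstruct full namespace paths for files. This is an illustrative reconstruction under those layout assumptions, not the authors' actual tool.

```python
import xml.etree.ElementTree as ET

def parse_fsimage(xml_path):
    """Parse an fsimage XML dump and map each file's full HDFS path
    to its list of (block ID, generation stamp) pairs."""
    root = ET.parse(xml_path).getroot()

    # INodeSection: one <inode> per file or directory.
    inodes = {}  # inode id -> (type, name, [(block id, genstamp), ...])
    for inode in root.iter('inode'):
        iid = int(inode.findtext('id'))
        itype = inode.findtext('type')
        name = inode.findtext('name') or ''
        blocks = [(int(b.findtext('id')), int(b.findtext('genstamp')))
                  for b in inode.iter('block')]
        inodes[iid] = (itype, name, blocks)

    # INodeDirectorySection: <directory> records give the hierarchy.
    parent = {}  # child inode id -> parent inode id
    for d in root.iter('directory'):
        pid = int(d.findtext('parent'))
        for c in d.findall('child'):
            parent[int(c.text)] = pid

    def full_path(iid):
        # Climb parent links to the root (which has no parent entry).
        parts = []
        while iid in parent:
            parts.append(inodes[iid][1])
            iid = parent[iid]
        return '/' + '/'.join(reversed(parts))

    return {full_path(i): blocks
            for i, (itype, _, blocks) in inodes.items() if itype == 'FILE'}
```

Comparing two such reconstructed namespaces, or noting gaps in the otherwise sequential block IDs and generation stamps within one image, surfaces the absent inodes and blocks that indicate modification from a previous state.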