An Access Control Scheme for Big Data Processing

Vincent C. Hu, Tim Grance, David F. Ferraiolo, D. Rick Kuhn
National Institute of Standards and Technology
Gaithersburg, MD, USA
{vhu, grance, dferraiolo, kuhn}@nist.gov

Abstract: Access Control (AC) systems are among the most critical of network security components. A system's privacy and security controls are more likely to be compromised due to the misconfiguration of access control policies than due to the failure of cryptographic primitives or protocols. This problem becomes increasingly severe as software systems grow more complex, as with Big Data (BD) processing systems, which are deployed to manage large amounts of sensitive information and resources organized into a sophisticated BD processing cluster. Fundamentally, BD access control requires collaboration among cooperating processing domains, which must be protected as computing environments consisting of computing units under distributed AC management. Many BD architecture designs have been proposed to address BD challenges; however, most of them focus on the processing capabilities of the three Vs (Velocity, Volume, and Variety). Considerations for securing BD are mostly ad hoc patch efforts. Even with the inclusion of some security in recent BD systems, a critical security component, AC (Authorization), for protecting BD processing components and their users from insider attacks, remains elusive. This paper proposes a general-purpose AC scheme for distributed BD processing clusters.

Keywords: Access Control, Authorization, Big Data, Distributed System

I. INTRODUCTION

Data IQ News [1] estimates that the global data population will reach 44 zettabytes (one zettabyte is one billion terabytes) by 2020. This growth trend is influencing the way data is being mass collected and produced for high-performance computing or operations and planning analysis.
Big Data (BD) refers to data that is too large to process with a traditional data processing system, for example, when analyzing Internet data traffic or editing hundreds of gigabytes of video data. (Note that what counts as BD depends on a system's capabilities: it has been argued that for organizations able to process terabytes of text, audio, and video per day, such data is not BD, while for organizations that cannot process it efficiently, it is [2].) BD technology is gradually reshaping current data systems and practices. Government Computer News [3] estimates that the volume of data stored by federal agencies alone will increase from 1.6 to 2.6 petabytes within two years, and U.S. state and local governments are just as keen on harnessing the power of BD to boost security, prevent fraud, enhance service delivery, and improve emergency response. It is estimated that successfully leveraging BD technologies can reduce IT costs by an average of 48% [4]. BD is denser and of higher resolution, including media, photos, and videos from sources such as social media, mobile applications, public records, and databases; the data arrives either in static batches or is dynamically generated by machines and users through the advanced capabilities of hardware, software, and network technologies. Examples include data from sensor networks or from tracking user behavior. Rapidly increasing volumes of data and data objects put enormous pressure on existing IT infrastructures, with scaling difficulties in data storage, advanced analysis, and security. These difficulties result from BD's large and growing files, arriving at high speed and in various formats, as measured by: Velocity (the data comes at high speed, e.g., scientific data such as data from weather patterns); Volume (the data comes in large files, e.g., Facebook generates 25 TB of data daily); and Variety (the files come in various formats: audio, video, text messages, etc. [2]).
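The split/process/merge pattern underlying the parallel BD processing discussed here can be sketched with a toy example. The following Python sketch is purely illustrative (the data, worker count, and function names are invented for this illustration, not part of any BD framework): it divides input across worker processes and merges the partial results, the same pattern that MapReduce-style frameworks apply at cluster scale.

```python
# Toy illustration of "massively parallel processing on commodity machines":
# split the input into chunks, process each chunk in a separate worker
# process (map), then merge the partial results (reduce).
from collections import Counter
from multiprocessing import Pool


def count_words(chunk):
    """Map step: count word occurrences in one chunk of the data."""
    return Counter(chunk.split())


def parallel_word_count(chunks, workers=4):
    """Scatter chunks to worker processes, then merge the partial counts."""
    with Pool(workers) as pool:
        partials = pool.map(count_words, chunks)
    total = Counter()
    for partial in partials:
        total.update(partial)  # Reduce step: merge partial counts
    return total


if __name__ == "__main__":
    data = ["big data big cluster", "data processing cluster", "access control"]
    print(parallel_word_count(data, workers=2))
```

A real cluster replaces the in-process pool with many machines and adds fault tolerance, scheduling, and storage, but the data-flow shape is the same, which is why scaling the three Vs stresses every layer of the stack at once.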
Therefore, BD processing systems must be able to collect, analyze, and secure very large data sets that defy conventional data management, analysis, and security technologies. In simple cases, some solutions use a dedicated system for BD processing. However, to maximize scalability and performance, most BD processing systems apply massively parallel software running on many commodity computers in distributed computing frameworks that may include columnar databases and other BD management solutions [5]. Access Control (AC) systems are among the most critical of network security components. It is more likely that privacy or security will be compromised due to the misconfiguration of access control policies than from a failure of a cryptographic primitive or protocol. This problem becomes increasingly severe as software systems grow more complex, as with BD processing systems, which are deployed to manage large amounts of sensitive information and resources organized into a sophisticated BD processing cluster. Fundamentally, BD AC systems require collaboration among cooperating processing domains as protected computing environments, which consist of computing units under distributed AC management [6]. Many architecture designs have been proposed to address BD challenges; however, most of them focus on the processing capabilities of the three Vs (Velocity, Volume, and Variety). Security considerations for protecting BD are mostly ad hoc patch efforts. Even with the inclusion of some security capability in recent BD systems,

COLLABORATECOM 2014, October 22-25, Miami, United States
Copyright © 2014 ICST
DOI 10.4108/icst.collaboratecom.2014.257649