An Access Control Scheme for Big Data Processing
Vincent C. Hu, Tim Grance, David F. Ferraiolo, D. Rick Kuhn
National Institute of Standards and Technology
Gaithersburg, MD, USA
{vhu, grance, dferraiolo, kuhn}@nist.gov
Abstract— Access Control (AC) systems are among the most
critical of network security components. A system’s privacy and
security controls are more likely to be compromised through
misconfiguration of access control policies than through failure
of cryptographic primitives or protocols. This problem becomes
increasingly severe as software systems become more and more
complex, such as Big Data (BD) processing systems, which are
deployed to manage a large amount of sensitive information and
resources organized into a sophisticated BD processing cluster.
Fundamentally, BD access control requires collaboration among
cooperating processing domains, which must be protected as
computing environments consisting of computing units under
distributed AC management. Many BD architecture designs have been
proposed to address BD challenges; however, most have focused
on the processing capabilities of the “three Vs” (Velocity,
Volume, and Variety). Considerations for security in protecting
BD are mostly ad hoc patch efforts. Even when recent BD
systems include some security, a critical security component,
AC (Authorization), that protects BD processing components
and their users from insider attacks remains
elusive. This paper proposes a general purpose AC scheme for
distributed BD processing clusters.
Keywords—Access Control, Authorization, Big Data,
Distributed System
I. INTRODUCTION
Data IQ News [1] estimates that the global data population
will reach 44 zettabytes (a zettabyte is one billion terabytes) by 2020. This
growth trend is influencing the way data is being mass
collected and produced for high-performance computing or
operations and planning analysis. Big Data (BD) refers to data
too large to process with a traditional data processing system,
for example, analyzing Internet data traffic or editing video
data of hundreds of gigabytes. (Note that the designation
depends on a system’s capabilities; it has been argued that
terabytes of text, audio, and video data per day are not BD for
organizations that can process them, but are BD for
organizations that cannot process them efficiently [2].) BD
technology is gradually reshaping current data systems and
practices. Government Computer News [3] estimates that the
volume of data stored by federal agencies alone will increase
from 1.6 to 2.6 petabytes within two years, and U.S. state and
local governments are just as keen on harnessing the power of
BD to boost security, prevent fraud, enhance service delivery,
and improve emergency response. It is estimated that
successfully leveraging technologies for BD can reduce the IT
cost by an average of 48% [4].
BD is denser and of higher resolution, comprising media such
as photos and videos from sources such as social media, mobile
applications, public records, and databases; the data arrives in
static batches or is dynamically generated by machines and users
through the advanced capacities of hardware, software, and
network technologies. Examples include data from sensor networks
or from tracking user behavior. Rapidly increasing volumes of data and
data objects add enormous pressure on existing IT
infrastructures, which struggle to scale their capabilities for
data storage, advanced analysis, and security. These difficulties
arise because BD consists of large and growing files, arriving at
high speed and in various formats, as characterized by: Velocity (the
data arrives at high speed, e.g., scientific data such as weather-pattern
data); Volume (the data comprises large files, e.g.,
Facebook generates 25 TB of data daily); and Variety (the files
come in various formats: audio, video, text messages, etc. [2]).
Therefore, BD processing systems must collect, analyze, and
secure very large data sets that defy conventional data
management, analysis, and security technologies. In the simplest
case, some solutions use a single dedicated system for BD
processing. However, to maximize scalability and
performance, most BD processing systems run massively
parallel software on many commodity computers within
distributed computing frameworks, which may include columnar
databases and other BD management solutions [5].
Access Control (AC) systems are among the most critical
of network security components. It is more likely that privacy
or security will be compromised through misconfiguration of
access control policies than through failure of a cryptographic
primitive or protocol. This problem becomes increasingly
severe as software systems become more and more complex
such as BD processing systems, which are deployed to manage
a large amount of sensitive information and resources
organized into a sophisticated BD processing cluster.
Fundamentally, BD AC systems require collaboration among
cooperating processing domains, treated as protected computing
environments that consist of computing units under distributed
AC management [6].
Many architecture designs have been proposed to address
BD challenges; however, most have focused on the
processing capabilities of the “three Vs” (Velocity,
Volume, and Variety). Considerations for security in protecting
BD are mostly ad hoc patch efforts. Even with the
inclusion of some security capability in recent BD systems,
COLLABORATECOM 2014, October 22-25, Miami, United States
Copyright © 2014 ICST
DOI 10.4108/icst.collaboratecom.2014.257649