Concept of Operations for Knowledge Discovery from “Big Data” Across Enterprise Data Warehouses

Sreenivas R. Sukumar, Mohammed M. Olama, Allen W. McNair and James J. Nutaro
Computational Sciences and Engineering Division, Oak Ridge National Laboratory
1 Bethel Valley Road, Oak Ridge, TN, USA, 37831
Email: {sukumarsr@ornl.gov, olamahussemm@ornl.gov, mcnairaw@ornl.gov, nutarojj@ornl.gov}

ABSTRACT

The success of data-driven business in government, science, and private industry is driving the need for seamless integration of intra- and inter-enterprise data sources to extract knowledge nuggets in the form of correlations, trends, patterns and behaviors that previously went undiscovered because the datasets were physically and logically separated. Today, as the volume, velocity, variety and complexity of enterprise data keep increasing, next-generation analysts face several challenges in the knowledge extraction process. To address these challenges, data-driven organizations that rely on the success of their analysts have to make investment decisions for sustainable data/information systems and knowledge discovery. The options that organizations are considering include newer storage/analysis architectures, better analysis machines, redesigned analysis algorithms, collaborative knowledge management tools, and query builders, among many others. In this paper, we present a concept of operations for enabling knowledge discovery that data-driven organizations can leverage in making their investment decisions. We base our recommendations on the experience gained from integrating multi-agency enterprise data warehouses at the Oak Ridge National Laboratory to design the foundation of future knowledge-nurturing data-system architectures.

Keywords: “Big Data”, data integration, multi-agency data integration, concept of operations

1.
INTRODUCTION

The ‘Big Data’ analysis grand challenge is the discovery of “the hidden laws and processes” underlying observable behaviors by creating the ability to understand, interpret, query and model the activity of the complex systems generating the data. From a data scientist's perspective, this translates to the design of scalable infrastructure and intelligent algorithms to build predictive models of hindsight, insight and foresight. Data-science-driven success stories, such as Kroger and DunnHumby’s loyalty program [1], which integrated customer transaction data with psychographic and demographic data to design a targeted marketing campaign, and Capital One’s credit assessment program [2], are paving the way for data-driven business in both industry and government. Our motivation for this paper is to share our understanding of these success stories, along with our own experience in integrating multi-agency data for the government, to answer the following question: How should organizations and analysts construct their concept of operations in a principled, methodical and sustainable manner to enable knowledge discovery from ‘Big Data’ across enterprise data warehouses?

We begin by observing that the knowledge discovery workflow has undergone a paradigm shift in recent years. We illustrate the shift as a move from the data-to-decisions process triangle to a constantly-at-work gear system, as displayed in Figure 1. Traditional information technology (IT) systems assume a workflow in which analysts, once provided with access, filter the data by applying domain-specific logic and then transform the data into knowledge by applying advanced algorithms that lead to actionable discoveries. As depicted, it is a linear bottom-up flow towards the apex of the triangle. The output of the data analysis is either a very small subset of the data or a report of the results of executing a query of interest.
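To make the traditional linear workflow concrete, the following is a minimal sketch in Python, not a description of any system discussed in this paper. The record fields, the filter thresholds, and the function names (`filter_records`, `transform`, `report`, `run_pipeline`) are hypothetical and chosen only for illustration. Each stage consumes the output of the previous one, and the final product is a short report rather than data.

```python
# Hypothetical records standing in for an enterprise data source.
RECORDS = [
    {"customer": "a", "region": "east", "amount": 40.0},
    {"customer": "b", "region": "west", "amount": 15.0},
    {"customer": "c", "region": "east", "amount": 75.0},
    {"customer": "d", "region": "east", "amount": 5.0},
]


def filter_records(records, region, min_amount):
    """Stage 1: apply domain-specific logic to narrow the data."""
    return [
        r for r in records
        if r["region"] == region and r["amount"] >= min_amount
    ]


def transform(records):
    """Stage 2: reduce the filtered data to a summary (the 'knowledge')."""
    total = sum(r["amount"] for r in records)
    mean = total / len(records) if records else 0.0
    return {"count": len(records), "total": total, "mean": mean}


def report(summary):
    """Stage 3: the small output handed to a decision maker."""
    return f"{summary['count']} purchases, mean {summary['mean']:.2f}"


def run_pipeline(records):
    """Linear bottom-up flow: filter, then transform, then report."""
    filtered = filter_records(records, region="east", min_amount=10.0)
    return report(transform(filtered))


print(run_pipeline(RECORDS))
```

Note that nothing in this sketch flows back into the data source: the pipeline ends at the report, which is the property of the triangle workflow that the gear-system view revisits.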
Today, however, we are learning that knowledge discovery is a limitless thirst. Data drives analysis and discoveries from the analysis drive the need to collect and integrate more data. Interesting discoveries that emerge out of clever integration of data from different sources result in integrated data products that are bigger and more valuable than the original datasets. In other words, analysis and discovery from datasets, especially ‘Big Data’ only produces ‘Bigger Data’.