How Deep Data Becomes Big Data

Marcin Szczuka* and Dominik Ślęzak*†
* Institute of Mathematics, University of Warsaw, ul. Banacha 2, 02-097 Warsaw, Poland
† Infobright Inc., ul. Krzywickiego 34/219, 02-078 Warsaw, Poland
Email: szczuka@mimuw.edu.pl; slezak@infobright.com

Abstract—We present some problems and solutions for situations in which the compound and semantically rich nature of data records, such as scientific articles, creates challenges typical for big data processing. Using a case study of named entity matching in the SONCA system, we show how big data problems emerge and how they are solved by bringing together methods from database management and computational intelligence.

I. Introduction

Traditionally, when we think about the problems posed by BIG DATA, we have in mind either the number of records (data length) or the size of records (data width), or both. Other traditional issues include the speed of data acquisition and the volatility of data [1]. In this article, however, we would like to address the problems and challenges that arise when we are faced with data that may not be very large, but is very rich. This can be referred to as deep data, by analogy to the notion of the Deep Web [2], used to describe the vast amount of information that lies under the surface of Web pages.

If our task is to process and analyze a deep data set, we may quickly find that this creates various problems typically found in big data processing. The fact that there may be a plethora of relations within the scope of a single data entity (record), as well as between parts of different records, quickly leads to an "explosion" of demand for creating more and more entities in the data warehouse in order to preserve this information for further use. This paper shows how a relatively small collection of data may expand into a really big one once we attempt to represent its internal structure and meaning.
We use the example of a system created as part of our R&D project in the area of semantic search and analytics. We demonstrate the huge expansion in the size and variety of information that occurs while extracting the representation of semantic knowledge from the initial data (scientific articles). We also show how the needs for processing this data can be met.

Supported by grant SP/I/1/77065/10 within the strategic scientific research and experimental development program "Interdisciplinary System for Interactive Scientific and Scientific-Technical Information" funded by the Polish National Centre for Research and Development (NCBiR), as well as grants 2011/01/B/ST6/03867 and 2012/05/B/ST6/03215 from the Polish National Science Centre (NCN).

The exploration of the semantic knowledge contained in deep data leads to the emergence of various big data issues. One of particular interest to us (in the scope of this paper) is instance matching. While digging deep into the structure of semantic relationships in articles, more and more data instances corresponding to various entities (persons, institutions, publishers, journals, topics, etc.) are created. This whole process is asynchronous and may yield multiple, non-identical data instances that all correspond to one and the same real-life object. Matching, which may be viewed as a special kind of data deduplication and cleansing [3], [4], aims at finding and unifying such instances. As the original data is inherently imprecise, so is the resulting knowledge. This makes standard, on-load deduplication approaches infeasible due to the prohibitive cost of certain data transformations. The entities that we are dealing with are by definition compound, multi-level, noisy and incomplete. The amount and complexity of comparisons needed to sort out such data is enormous.
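To illustrate the kind of approximate pairwise comparison that instance matching involves, consider the following minimal sketch. It is not the actual SONCA procedure; the normalization steps and the similarity threshold are hypothetical, chosen only to show how two noisy, non-identical instances of the same author name can be recognized as one real-life object.

```python
from difflib import SequenceMatcher

def normalize(name):
    # Lowercase, strip punctuation, and sort tokens so that variants
    # such as "Slezak, D." and "D. Slezak" map to the same form.
    cleaned = "".join(c if c.isalnum() or c.isspace() else " "
                      for c in name.lower())
    return " ".join(sorted(cleaned.split()))

def similarity(a, b):
    # Edit-distance-based similarity on the normalized forms, in [0, 1].
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio()

def same_entity(a, b, threshold=0.8):
    # Approximate decision: above a (hypothetical) threshold, treat the
    # two data instances as referring to the same real-life object.
    return similarity(a, b) >= threshold
```

For instance, `same_entity("Slezak, D.", "D. Slezak")` returns True, while two unrelated names fall well below the threshold. In practice such pairwise tests must be combined with blocking or granulation to avoid the quadratic number of comparisons mentioned above.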
Hence, in order to match instances one has to resort to approximate, off-line methods originating in areas such as soft computing, granular computing, and approximate reasoning. The matching process results in the creation of data objects that are aggregations of the information contained in the original data instances. With these objects – rolling back, if necessary, to the instances that match them – we proceed to further steps of the creation of a complete semantic search system. The objects are the input for procedures tasked with the creation of semantic search indices, the analysis of data trends, and various types of classification. The collection of data objects that we create is by definition as deep as, or even deeper than, the raw data we started with. Thus, the problems of deep data becoming big data are with us all the way.

The paper is organized as follows. In Section II, we outline the basic ideas behind our system for semantic search and analytics. In Section III, using the example of scientific articles loaded into our system, we show how the search for semantic relationships may lead to big data problems. In particular, we discuss how we can cope with the task of named entity matching using approximate methods. In Section IV, we present a case study done on part of the PubMed database. Section V concludes our work.
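The aggregation step described above can be sketched as follows. This is not SONCA's actual algorithm; it is a generic illustration, under the assumption that pairwise match decisions are available, of how such decisions can be transitively closed so that each group of matched instances yields one aggregated object.

```python
def cluster_instances(instances, matches):
    # Union-find over instance indices: pairwise match decisions are
    # closed transitively, so each resulting cluster becomes one
    # aggregated data object.
    parent = list(range(len(instances)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for i, j in matches:
        parent[find(i)] = find(j)

    clusters = {}
    for idx, inst in enumerate(instances):
        clusters.setdefault(find(idx), []).append(inst)
    return list(clusters.values())
```

Given instances ["Slezak, D.", "D. Slezak", "Dominik Slezak", "M. Szczuka"] and matches {(0, 1), (1, 2)}, the sketch produces two objects: the three variants of the first name, and the singleton. Note that transitive closure makes the result sensitive to individual matching errors, which is one reason the approximate methods above must be tuned carefully.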