Unstructured information integration through data-driven similarity discovery

Rema Ananthanarayanan
IBM Research, India
arema@in.ibm.com

Sreeram Balakrishnan
IBM Software Group, US
sreevb@us.ibm.com

Berthold Reinwald
IBM Research, Almaden, US
reinwald@almaden.ibm.com

Yuen Yee
Nuance Communications, Inc
yuenyee.lo@nuance.com

Abstract

Information integration from multiple heterogeneous sources is one of the major challenges facing enterprises and service providers today, and one of the important problems in this domain is the integration of structured and unstructured (or text) data. In this paper we describe our work on a data-driven approach to integrating various sources of text data, without relying on the availability of schema information. To this end, we have used various existing tools from natural language processing, data mining and related areas in a novel manner. The tools are used at the 'preprocessing' stage to (a) characterise each set of unstructured information (or collection of text data), (b) identify the related sets of unstructured information and (c) relate these sets to various reference data sets. All these steps are based solely on the instance values of the data sets. Subsequently the information compiled in the preprocessing stage may be used at query time to query the structured and text data. We also present our results on applying our techniques for data integration across multiple unstructured data sources, relating to customer comments of a service provider.

1 Introduction

Most techniques developed today for data integration across heterogeneous data sources operate on structured or categorical data, usually made available in relational databases. However, a huge proportion of business data resides in unstructured documents spread across the enterprise, such as emails, spreadsheets, facsimiles and other sources (see footnote 1).
Non-conventional sources such as blogs and third-party review sites are also increasingly serving as rich sources of information on trends and opinions for business intelligence. One of the key challenges that enterprises face today is being able to automatically integrate the information from these various heterogeneous sources, and to query this information seamlessly across the structured and text data, for extracting business intelligence. In our work here, we look at the problem of integrating various sources of unstructured information, using only data-driven techniques. Current approaches to data integration based on instance values operate at entity level or record level, and we extend these approaches to data set linking.

Footnote 1: Some studies estimate that more than 80% of the data in enterprises is unstructured data.

Our focus is on being able to compare multiple sets of text data items, analogous to comparing across columns in database tables. Further, just as each column element in the database may be characterised by an 'attribute' (which could typically be the column name), each data set may also be characterised by one or more attributes. Multiple data sets across different data sources, or even within the same data source, may possess the same attributes. Further, each data set also has 'value' elements that correspond to the individual instance values comprising that set. When querying for an attribute across data sets from different sources, it is therefore necessary to ensure that all the data sets described by that attribute are included. In our work, we achieve this by clustering the related data sets based on the data contents rather than the column or data set names. Our motivation is:
1. Purely data-driven approaches appear more amenable to complete end-to-end automation; and
2. Where feasible, these methods may subsequently be supplemented with schema-based integration techniques to achieve better results.
Different techniques exist to measure the degree of similarity between two or more data sets, based on the metadata (or schema information) in most cases and, in a few cases, on the actual data values themselves. However, these techniques have mainly been restricted to structured or categorical data. Our overall goal here is to provide a means for data-driven similarity discovery across multiple sources of unstructured data, so that the discovered information may be integrated with the existing schema of the structured information, allowing querying across the structured and text data. Our solution comprises the following steps:
• Identify groups of related data sets based on various text-processing and data mining techniques;
• Identify attributes of the related data sets, based on comparison with domain-specific reference sets and keyword generation;
• Present a view of the various data sets as a single repository, for subsequent querying.
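
The first two steps above can be illustrated with a minimal sketch. It is not the method used in this paper: the text so far does not name the specific text-processing or data mining techniques, so bag-of-words cosine similarity, single-link grouping, and term overlap with a reference set are stand-ins chosen for illustration. The function names, the stopword list, the threshold of 0.3, the sample data sets and the reference sets are all hypothetical.

```python
import math
import re
from collections import Counter

# Assumption: a tiny illustrative stopword list, not a standard one.
STOPWORDS = {"the", "a", "is", "was", "i", "my", "on", "at", "and",
             "this", "too", "no", "please"}


def term_vector(texts):
    """Count non-stopword terms over all instance values of one data set."""
    counts = Counter()
    for text in texts:
        words = re.findall(r"[a-z]+", text.lower())
        counts.update(w for w in words if w not in STOPWORDS)
    return counts


def cosine(a, b):
    """Cosine similarity between two term-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


def group_data_sets(data_sets, threshold=0.3):
    """Single-link grouping: data sets with similarity >= threshold are merged."""
    names = list(data_sets)
    vectors = {n: term_vector(data_sets[n]) for n in names}
    parent = {n: n for n in names}

    def find(n):
        # Union-find lookup with path halving.
        while parent[n] != n:
            parent[n] = parent[parent[n]]
            n = parent[n]
        return n

    for i, a in enumerate(names):
        for b in names[i + 1:]:
            if cosine(vectors[a], vectors[b]) >= threshold:
                parent[find(a)] = find(b)

    groups = {}
    for n in names:
        groups.setdefault(find(n), []).append(n)
    return sorted(sorted(g) for g in groups.values())


def label_group(group, data_sets, reference_sets):
    """Return the reference-set name whose terms occur most often in the group."""
    terms = Counter()
    for n in group:
        terms.update(term_vector(data_sets[n]))
    scores = {attr: sum(terms[t] for t in ref)
              for attr, ref in reference_sets.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None


# Hypothetical customer-comment data sets from three sources.
data_sets = {
    "survey_comments": ["The bill was too high this month",
                        "I was charged twice on my bill"],
    "email_complaints": ["My bill shows a wrong charge",
                         "Please correct the bill amount"],
    "call_notes": ["Network coverage is poor at home",
                   "No signal and dropped calls"],
}
# Hypothetical domain-specific reference sets.
reference_sets = {
    "billing": {"bill", "charge", "charged", "amount"},
    "network": {"network", "signal", "coverage", "calls"},
}

for group in group_data_sets(data_sets):
    print(group, "->", label_group(group, data_sets, reference_sets))
```

Only instance values are consulted here; the data set names play no part in the grouping, which mirrors the data-driven premise of the approach. A query on an attribute could then be routed to every data set in the group carrying that label.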