Capturing Users’ Everyday, Implicit Information Integration Decisions

David W. Archer
Lois M. L. Delcambre
Department of Computer Science
Portland State University
Portland, OR 97207
{darcher, lmd}@cs.pdx.edu

Abstract

Integration of large databases by expert teams is only a small part of the data integration activities that take place. Users without data integration expertise very often gather, organize, reconcile, and use diverse information as a normal part of their jobs. Often, they do this by copying data into a text file or spreadsheet. In doing so, they make significant data integration decisions. They often express a mental model, or schema, over their data. They organize data to describe real-world entities. They reconcile redundancy and disagreements in their data. Such integration is both ubiquitous and not generally supported by experts and tools available for large integration efforts. We seek to capture and make explicit the user’s mental model, and the attribute and entity correspondences they express, during these activities. This paper contributes the definition of a set of functions that support this type of data integration, a conceptual model to support these functions, and an associated simple tool that supports data integration by end-users in an entity-centric way, with an extensible schema, that makes the user’s job easier.

Keywords: information integration, entity resolution, data correspondence, superimposed information.

1 Introduction

We often consider data integration to be the province of the DBA and the integration expert, aided by specialized tools. However, data integration is often performed as part of everyday user tasks. End users gather, organize, reconcile, and use information from diverse sources all the time. They gather information from office documents, the Web, e-mail messages, and other places in order to inform, to influence, and to make decisions. Sometimes users gather information using hardcopy.
Increasingly, users take advantage of computers to do this work, often cutting and pasting bits of information into a spreadsheet or text file.

The information gathered, and the semantics given it, are task-dependent. However, our experience shows that users often gather information in much the same way. They collect information about real-world objects or concepts: that is, information gathering tends to be entity-centric. They label information to keep track of its meaning (for the most part users know both the meaning of each piece of information and the entity which it describes). They tend to organize gathered data in simple, tabular form. They frequently make on-the-fly decisions about the data: they combine what initially appear to be disparate real-world objects or concepts into single objects; they combine what initially appear to be different characteristics into single attributes; and they resolve conflicting items of information. Each of these actions helps to superimpose the user’s conceptual model on the gathered information, and adds significant value to the raw data. We seek to capture the information labeling and information integration decisions that are expressed during these activities, in the form of schema definition, attribute correspondences, entity correspondences, and attribute value conflict resolutions.

Copyright (c) 2007, Australian Computer Society, Inc. This paper appeared at the Twenty-Sixth International Conference on Conceptual Modeling - ER 2007 - Tutorials, Posters, Panels and Industrial Contributions, Auckland, New Zealand. Conferences in Research and Practice in Information Technology, Vol. 83. John Grundy, Sven Hartmann, Alberto H. F. Laender, Leszek Maciaszek and John F. Roddick, Eds. Reproduction for academic, not-for-profit purposes permitted provided this text is included.
In this research, we seek to:
- make the user’s tasks of using and manipulating information drawn from various sources easier,
- provide direct support for tracking the lineage of information extraction, use, and re-packaging, and ultimately,
- exploit the collective user integration information to help solve the general problem of massive information integration.

In this paper, Section 2 describes our understanding of user data integration activities, our refinement of this understanding into a set of specific user actions, and our definition of a conceptual model to support these actions. Section 3 formalizes these user actions into a set of functions on our conceptual model, encompassing entity resolution, attribute resolution, and attribute value conflict resolution. Section 4 describes a simple tool for evaluating both how to support users in performing these tasks, and how to capture the integration metadata they express during these tasks. Section 5 discusses related work from the literature. Section 6 summarizes our contribution and discusses future efforts.

2 Conceptual Model

In order to examine how users gather and organize information, we rely on ten years of one author’s direct observation and participation in managing engineering organizations and running a successful software business.