Exploiting Data Semantics to Discover, Extract, and Model Web Sources

José Luis Ambite, Craig A. Knoblock, Kristina Lerman, Anon Plangprasopchok, Thomas Russ
USC Information Sciences Institute
4676 Admiralty Way, Marina del Rey, CA 90292, USA
{ambite,knoblock,lerman,plangpra,tar}@isi.edu

Cenk Gazen, Steven Minton
Fetch Technologies
2041 Rosecrans Ave, El Segundo, CA 90245
{gazen,minton}@fetch.com

Mark Carman
Faculty of Informatics, University of Lugano
Via Buffi 13, CH-6904 Lugano, Switzerland
mark.carman@lu.unisi.ch

Abstract

We describe DEIMOS, a system that automatically discovers and models new sources of information. The system exploits four core technologies developed by our group that make an end-to-end solution to this problem possible. First, given an example source, DEIMOS finds other similar sources online. Second, it invokes and extracts data from these sources. Third, given the syntactic structure of a source, DEIMOS maps its inputs and outputs to semantic types. Finally, it infers the source's semantic definition, i.e., the function that maps the inputs to the outputs. DEIMOS is able to successfully automate these steps by exploiting a combination of background knowledge and data semantics. We describe the challenges in integrating separate components into a unified approach to discovering, extracting, and modeling new online sources. We provide an end-to-end validation of the system in two information domains to show that it can successfully discover and model new data sources in those domains.

1. Introduction

An assumption in much of the work on data mining is that a person must first find and model the information from which an automated system would then perform the data mining. This first step can require significant effort and must be repeated for each new data source.
An alternative that we explore in this paper is to exploit a combination of background knowledge and data semantics to automatically discover and model new sources of information. In this work, we assume that we start with a set of example sources and semantic descriptions of those sources. These sources could be web services with well-defined inputs and outputs, or even Web forms that take a specific input and generate a result page as the output. The system is then given the task of finding additional sources that are similar, but not necessarily identical, to the known sources. For example, the system may already have knowledge about several weather services and then be given the task of finding additional weather services that provide additional coverage of the world, and building semantic descriptions of these new services that make it possible to exploit them for additional analysis.

This problem can be broken down into four subtasks. First, given an example source, how do we find other similar sources? Second, once we have found such a source, how do we extract the data from it? For a web service this is not an issue, but a web site with a form-based interface might simply return an HTML page from which the data needs to be extracted. Third, given the syntactic structure of a source (i.e., its inputs and outputs), what are the semantics of those inputs and outputs? Fourth, given the inputs and outputs, what is the function that maps the inputs to the outputs?

The core components that make an end-to-end solution to this problem possible have been developed in previous work. Lerman and Plangprasopchok [15] showed that social bookmarking sites, such as del.icio.us, can be used to identify sources similar to a given source.
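The four subtasks above can be read as a pipeline: discover, extract, type, and model. The following toy sketch illustrates that flow only; every function name, source name, and data value here is our own invention for illustration, not part of DEIMOS.

```python
# Toy sketch of the four-step pipeline: the "background knowledge" is a
# hand-made table, and the candidate source returns canned data.

KNOWN_TYPES = {                     # toy semantic-typing knowledge (step 3)
    "Marina del Rey": "City",
    "90292": "ZipCode",
    "33.98": "Latitude",
    "-118.44": "Longitude",
}

def discover_similar_sources(seed):
    # Step 1: DEIMOS mines social-bookmarking tags; here the result is canned.
    return ["geocoder-b.example.com"]

def invoke_and_extract(source, query):
    # Step 2: invoke the source and extract a record from its result page.
    return {"input": query, "output": ["33.98", "-118.44"]}

def assign_semantic_types(record):
    # Step 3: label each input and output field with a semantic type.
    return {field: [KNOWN_TYPES.get(v, "Unknown") for v in values]
            for field, values in record.items()}

def infer_definition(typed_io, seed_definition):
    # Step 4: accept the candidate if its typed signature matches the
    # seed source's definition (a stand-in for real definition induction).
    return seed_definition if typed_io == seed_definition["signature"] else None

seed_def = {
    "name": "geocode",
    "signature": {"input": ["City", "ZipCode"],
                  "output": ["Latitude", "Longitude"]},
}

for src in discover_similar_sources("geocoder-a.example.com"):
    record = invoke_and_extract(src, ["Marina del Rey", "90292"])
    typed = assign_semantic_types(record)
    print(src, "modeled:", infer_definition(typed, seed_def) is not None)
```

In the real system each stub is a substantial component in its own right; the sketch only shows how the output of each step becomes the input of the next.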
For example, given a geocoder, which maps a street address to its latitude and longitude coordinates, the system can identify other geocoders that are available online by exploiting the keywords used to describe such sources on a social bookmarking web site. Gazen and Minton [5] developed an approach to automatically structure web sources without any previous knowledge of the source. Lerman, Plangprasopchok, and Knoblock [9] developed an approach to semantic labeling of the online information. The system uses sources for which