Heterogeneous Web Data Extraction using Ontology

Hicham Snoussi
Centre de recherche informatique de Montréal
550 Sherbrooke Street West, Montréal, Canada H3A 1B9
hsnoussi@crim.ca

Laurent Magnin
Centre de recherche informatique de Montréal
550 Sherbrooke Street West, Montréal, Canada H3A 1B9
lmagnin@crim.ca

Jian-Yun Nie
Université de Montréal
C.P. 6128, succ. Centre-ville, Montréal, H3C 3J7 Canada
nie@iro.umontreal.ca

ABSTRACT

Multi-agent systems can be fully developed only when they have access to a large number of information sources. The latter are increasingly available on the Internet in the form of web pages. This paper does not deal with information retrieval, but rather with the extraction of data from HTML web pages in order to make them usable by autonomous agents. This problem is not trivial because of the heterogeneity of web pages. We describe our approach to facilitating the formalization, extraction and grouping of data from different sources. To this end, we developed a utility tool that assists in generating a uniform description for each information source, using a descriptive domain ontology. Users and agents can query the extracted data through a standard querying interface. The ultimate goal of this tool is to provide useful information to autonomous agents.

Keywords: data extraction, WEB, ontology, agent, XML.

1. INTRODUCTION

The Internet contains a huge number of information sources of different kinds. Even though a user can browse the Internet, finding relevant information remains a difficult task. Most search engines use keywords to identify possible answers to a user's query and return a list of links to documents. Many of the returned documents are not relevant to the query, and the user often has to browse the returned list to find the relevant information.
Although a search engine provides useful help for users to identify relevant information, it cannot be used by a software agent to obtain reliable data to fulfil its tasks. This is mainly due to the lack of precision and of a standard formalism in the returned answers. In addition, current search engines focus more on static data on the web than on dynamic data that change constantly, such as weather forecasts, stock exchange information, etc. Such data are increasingly required by software agents, and the need is growing for ways to extract data so that agents can fully exploit them. Our goal is to develop a method to extract reliable data from web pages that intelligent agents can use.

2. PROBLEM DESCRIPTION AND SUMMARY OF OUR APPROACH

Data on the Web are usually embedded in HTML pages and do not correspond to a pre-established schema. While a human user can understand the data in a page, a machine cannot do so reliably. Therefore, extracting data from web pages for agents requires knowledge of both the structure and the contents of the pages. Two main approaches have been used to deal with this data extraction problem:

• The first approach relies on natural language processing (NLP). Current NLP techniques are not accurate and powerful enough to recognize the contents of unrestricted web pages, so this approach has been applied only in limited domains;

• The second approach associates a web page with semantic markers (or tags) when it is created, for example personalized markers. The limitations of such an approach are well known: because the markers are personalized, they can hardly be generalized [2].

As the original data are structured in different ways, it is necessary to restructure them according to a common model that is independent of the information sources. Our approach is based on this idea.
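To illustrate the idea of a source-independent model, the following sketch maps the same weather datum, structured differently by two hypothetical sources, onto one common set of field names. The source records, field names, and mapping tables are all illustrative assumptions, not part of the tool described in this paper:

```python
# Two hypothetical sources expose the same datum under different structures.
source_a = {"city": "Montreal", "temp_c": "-5"}
source_b = {"location": "Montreal", "temperature": "-5", "unit": "C"}

# One hand-written mapping per source, from source fields to the
# fields of the common (source-independent) model.
MAPPINGS = {
    "a": {"city": "city", "temp_c": "temperature"},
    "b": {"location": "city", "temperature": "temperature"},
}

def to_common_model(record, source):
    """Restructure a source record according to the common model,
    dropping fields the model does not cover."""
    common = {}
    for field, value in record.items():
        if field in MAPPINGS[source]:
            common[MAPPINGS[source][field]] = value
    return common

# Both sources now yield the same structure.
assert to_common_model(source_a, "a") == to_common_model(source_b, "b")
```

Once both sources are expressed in the common model, an agent can query them uniformly without knowing how each page was originally structured.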
In particular, we focus on data extraction from semi-structured web pages that present constantly changing data with a fixed structure (e.g. stock exchange quotes). Our approach uses an ontology to model the data to be extracted. The data in a web page are first converted into XML, then mapped to the data model. The definition of the data model and the mapping are done manually; an automatic process then carries out the actual extraction. The final result is an XML document containing a standardized and queryable data set.
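The automatic extraction step can be sketched as follows, using only Python's standard library. The input page, the table layout, and the cell-to-concept mapping (`ONTOLOGY_MAP`) are illustrative assumptions standing in for the manually defined data model; the paper's tool is not described at this level of detail:

```python
from html.parser import HTMLParser
import xml.etree.ElementTree as ET

# Hypothetical source page: stock quotes with changing values
# but a fixed structure.
HTML_PAGE = """
<table>
  <tr><td>NTGY</td><td>12.40</td></tr>
  <tr><td>BCE</td><td>28.15</td></tr>
</table>
"""

# Manually defined mapping from table-cell position to ontology concept.
ONTOLOGY_MAP = {0: "symbol", 1: "price"}

class QuoteExtractor(HTMLParser):
    """Collect table rows as dictionaries keyed by ontology concepts."""
    def __init__(self):
        super().__init__()
        self.rows, self.cell, self.in_td = [], 0, False

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.rows.append({})   # start a new record
            self.cell = 0
        elif tag == "td":
            self.in_td = True

    def handle_endtag(self, tag):
        if tag == "td":
            self.in_td = False
            self.cell += 1

    def handle_data(self, data):
        if self.in_td and data.strip():
            self.rows[-1][ONTOLOGY_MAP[self.cell]] = data.strip()

def extract(html):
    """Run the extraction and emit a standardized XML document."""
    parser = QuoteExtractor()
    parser.feed(html)
    root = ET.Element("quotes")
    for row in parser.rows:
        quote = ET.SubElement(root, "quote")
        for concept, value in row.items():
            ET.SubElement(quote, concept).text = value
    return ET.tostring(root, encoding="unicode")

print(extract(HTML_PAGE))
```

The resulting XML names each value by its ontology concept (`symbol`, `price`) rather than by its position in the page, which is what makes the output queryable independently of the source's HTML layout.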