FiVaTech: Page-Level Web Data Extraction from Template Pages

Mohammed Kayed
Department of Computer Science and Information Engineering, National Central University, Taiwan
kayed@db.csie.ncu.edu.tw

Khaled Shaalan
Institute of Informatics, the British University in Dubai, United Arab Emirates
khaled.shaalan@buid.ac.ae

Chia-Hui Chang
Department of Computer Science and Information Engineering, National Central University, Taiwan
chia@csie.ncu.edu.tw

Moheb Ramzy Girgis
Department of Computer Science, Minia University, El-Minia, Egypt
mrgirgis@mailer.eun.eg

Abstract

In this paper, we propose a new approach, called FiVaTech, for the problem of Web data extraction. FiVaTech is a page-level data extraction system that deduces the data schema and templates for input pages generated by a CGI program. FiVaTech uses tree templates to model the generation of dynamic Web pages. It can deduce the schema and templates for each individual Deep Web site, whose pages contain either a singleton or multiple data records. FiVaTech applies tree matching, tree alignment, and mining techniques to achieve this challenging task. The experiments show encouraging results on the test pages used in many state-of-the-art Web data extraction works.

1. Introduction

The Deep Web is generally understood to contain magnitudes more valuable information than the surface Web. However, making use of such consolidated information requires substantial effort, since the pages are generated for visualization, not for data exchange. Thus, extracting information from the Web pages of searchable Web sites has been a key step for Web information integration. Generating an extraction program for a given search form is equivalent to wrapping a data source such that all extractor or wrapper programs return data in the same format for information integration.
An important characteristic of pages belonging to the same site is that they share the same template, since they are generated from a predefined template by plugging in data values. The extraction targets of these pages are almost equal to the data values embedded during page generation. Thus, there is no need to annotate the Web pages for extraction targets, as in non-template page information extraction (e.g., Softmealy [3]), and the key to automatic extraction is whether we can deduce the template automatically. Finding that template requires as input either multiple pages (e.g., EXALG [1]) or a single page containing multiple records (e.g., DEPTA [9]). In this paper, we focus on page-level extraction tasks and propose a new approach, called FiVaTech, to automatically detect the schema of a Web site.

The rest of the paper is organized as follows. Section 2 defines the data extraction problem. Section 3 provides the system framework as well as the detailed algorithm of FiVaTech. Section 4 gives the details of template and schema deduction. Section 5 describes our experiments. Finally, Section 6 concludes our work.

2. Problem formulation

In this section, we formulate the model for page creation, which describes how data is embedded using a template. A Web page is created by embedding a data instance x (taken from the database) into a predefined template. Usually, a CGI program executes the encoding function that combines a data instance with the template to form the Web page, where all data instances of the database conform to a common schema, which can be defined as follows.

Definition 2.1: (Structured data) A data schema can be of the following types.

1. A basic type (β) represents a string of tokens