Decomposition and Abstraction of Web Applications for Web Service Extraction and Composition

Michiaki Tatsubori
IBM Tokyo Research Laboratory
mich@acm.org

Kenichi Takahashi
Tokyo Denki University
kenichi@se.sie.dendai.ac.jp

Abstract

There is large demand for re-engineering human-oriented Web application systems for use as machine-oriented Web application systems, which are called Web Services. This paper describes a framework named H2W, which can be used for constructing Web Service wrappers from existing, multi-page Web applications. H2W's contribution lies mainly in service extraction, rather than in the widely studied problem of data extraction. For the framework, we propose a page-transition-based decomposition model and a page-access abstraction model with context propagation. With the proposed decomposition and abstraction, developers can flexibly compose a Web Service wrapper of their intent by describing a simple workflow program, incorporating the advantages of previous work on Web data extraction. We show three successful wrapper application examples with H2W for real-world Web applications.

1. Introduction

Developing Web Services (machine-oriented Web-based services) from scratch is often a costly task, especially when the published service is mostly processed automatically. This is partly because of the costly requirements for sufficient consistency and security in the service. Such requirements apply to many kinds of organizations, such as private enterprises and government entities. For example, a shopping service should not accept a request with a negative number for the quantity of items to buy, while a university course management service should not be broken into using invalid parameter values constructed by students or attackers.
In order to satisfy minimal requirements for consistency and security, constructing and publishing Web Services by "wrapping" existing Web applications (human-oriented Web-based services) can be a reasonable candidate solution. A typical wrapping technique is to provide a proxy server that serves the Web Service by accessing the original Web application. With this approach, the minimal requirements for consistency and security are naturally satisfied because the existing Web application has already been well tested and is known to run consistently and securely. For example, a shopping Web Service constructed upon an existing shopping Web application can detect errors in a request with an invalid quantity value for the items to be bought, at least at the level of the underlying Web application, which was already developed to check for this. Moreover, the wrapper approach can also be applied to use external Web applications as Web Services.

However, the existing high-level support for wrapper construction is not suitable for adapting practical Web applications. There are two essential challenges for such software in adapting human-oriented services to machine-oriented services with wrappers:

• extracting the logical data for machines from data decorated with HTML for human readers, and

• extracting a noninteractive service for machines from interactive services scattered over multiple webpages for humans.

According to Myllimaki [11], extracting structured data from websites requires solving five distinct problems: navigation, data extraction, structure synthesis, data mapping, and data integration. However, in order to clarify the contribution of this paper, we use a coarser categorization. The problems of data extraction, structure synthesis, and data mapping are mapped to the former category in our categorization, logical data extraction.
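To make the proxy-style wrapping idea concrete, the following sketch (our own illustration, not part of the H2W framework; the form fields, the error markup, and the canned reply are all hypothetical) shows a wrapper that submits an order request on behalf of a machine client and turns the human-oriented HTML reply into structured data. Validity checking is left to the underlying application; the wrapper only recognizes its error messages:

```python
from html.parser import HTMLParser
from urllib.parse import urlencode

# Hypothetical HTML reply from the underlying shopping application.
# A real wrapper would obtain this over HTTP from the live site.
SAMPLE_REPLY = """
<html><body>
  <p class="error">Quantity must be a positive number.</p>
</body></html>
"""

class ErrorExtractor(HTMLParser):
    """Collects the text of <p class="error"> elements."""
    def __init__(self):
        super().__init__()
        self.in_error = False
        self.errors = []

    def handle_starttag(self, tag, attrs):
        if tag == "p" and ("class", "error") in attrs:
            self.in_error = True

    def handle_endtag(self, tag):
        if tag == "p":
            self.in_error = False

    def handle_data(self, data):
        if self.in_error and data.strip():
            self.errors.append(data.strip())

def order_item(item_id: str, quantity: int) -> dict:
    """Machine-oriented wrapper around a human-oriented order form.
    The underlying application, not the wrapper, enforces validity."""
    query = urlencode({"item": item_id, "qty": quantity})
    # A real wrapper would fetch the page here, e.g.:
    #   html = urllib.request.urlopen(ORDER_URL + "?" + query).read().decode()
    html = SAMPLE_REPLY  # canned reply for illustration
    parser = ErrorExtractor()
    parser.feed(html)
    if parser.errors:
        return {"ok": False, "errors": parser.errors}
    return {"ok": True}
```

With the canned reply above, a call such as `order_item("A42", -1)` yields `{"ok": False, "errors": ["Quantity must be a positive number."]}`: the invalid quantity is rejected by the (simulated) underlying application, and the wrapper merely translates that rejection into a machine-readable result.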
The problems of navigation and data integration are mapped to the latter category, service extraction.

A large amount of work has addressed the former challenge of extracting logical data for machines. This research topic is often called Web data extraction or Web content extraction. For example, RoadRunner [6] automates wrapper generation and the data extraction process based on similarities and differences between HTML pages. Extensive survey papers in this area [9, 10] are available. The conclusion of these surveys is that no single extraction approach handles all of the applicable Web data formats. XML-based techniques [11, 8]