EXPERIENCES REGARDING AUTOMATIC DATA EXTRACTION FROM WEB PAGES

Mirel Coşulschi, Bogdan Udrescu, Nicolae Constantinescu and Mihai Gabroveanu
Dept. of Computer Science, University of Craiova, Romania
mirelc, nikyc, mihaiug@central.ucv.ro, udrescu_bogdan@yahoo.com

Adrian Giurcă
Dept. of Internet Technology, Brandenburg Technical University Cottbus, Germany
giurca@tu-cottbus.de

ABSTRACT

Existing methods of information extraction from HTML documents include the manual approach, supervised learning and automatic techniques. The manual method achieves high precision and recall, but it is difficult to apply to a large number of pages. Supervised learning involves human interaction to create positive and negative samples. Automatic techniques require less human effort, but they are less reliable regarding the information retrieved. Our experiments belong to this last class of methods: for this purpose we have developed a tool for automatic data extraction from HTML pages.

KEYWORDS

data extraction, wrapper, clustering

1. INTRODUCTION

The vast amount of information on the World Wide Web cannot be fully exploited because of one of its main characteristics: web pages are designed for human readers, who interact with the systems by browsing HTML pages, rather than for consumption by computer programs. The semantic structure of web pages is the principal element exploited by many web applications: one of the latest directions is the construction of wrappers that structure web data using regular languages and database techniques. A wrapper is a program whose purpose is to extract data from various web sources. In order to accomplish this task, the wrapper must identify the data of interest, put them into a suitable format, and eventually store them in a relational database.
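The wrapper idea described above can be sketched in a few lines of Python. This is a minimal, hypothetical example, not the tool developed in the paper: the sample page layout and the field names (title, price) are illustrative assumptions, and the extraction rule is a single regular expression over a repeated record template.

```python
import re

# Hypothetical sample page with a repetitive record structure
# (illustrative assumption, not the paper's actual test data).
SAMPLE_PAGE = """
<ul>
  <li><b>Compilers</b> - <i>49.90</i></li>
  <li><b>Databases</b> - <i>39.50</i></li>
</ul>
"""

# The extraction rule: one regular expression matching the repeated template.
RECORD = re.compile(r"<b>(?P<title>[^<]+)</b>\s*-\s*<i>(?P<price>[\d.]+)</i>")

def wrap(page: str) -> list[tuple[str, float]]:
    """Map a page S to a list of structured records (the repository R)."""
    return [(m.group("title"), float(m.group("price")))
            for m in RECORD.finditer(page)]

records = wrap(SAMPLE_PAGE)
```

In a full wrapper, the list of tuples produced here would then be inserted into a relational database; the hard part, which the rest of the paper addresses, is discovering the extraction rule automatically rather than writing it by hand.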
[14] states the problem of generating a wrapper for Web data extraction as follows: given a web page S containing a set of input objects, determine a mapping W that populates a data repository R with the objects in S. The mapping W must also be capable of recognizing and extracting data from other pages similar to S.

Any possible solution to the above problem must take into consideration at least the following two contexts:

1. An HTML page may contain many types of information presented in different forms, such as text, images or Java applets (programs written in Java and executed, or rather interpreted). The Hyper Text Markup Language (HTML) is a language designed for data presentation; it was not intended as a means of structuring information and easing the process of structured data extraction. Another problem of HTML pages is related to their poor construction, the language standards frequently being broken (i.e. improperly closed tags, wrongly nested tags, bad parameters and incorrect parameter values). Nevertheless, today there is a consistent effort of the World Wide Web Consortium (W3C) in this direction: XHTML 1.0, the W3C's first Recommendation for XHTML, reformulates HTML in XML and thus enforces well-formed markup.

1 http://www.w3c.org

IADIS International Conference WWW/Internet 2006
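The tolerance required when processing badly constructed HTML can be illustrated with Python's standard `html.parser` module, whose event-driven parser recovers data even from improperly closed and wrongly nested tags. The broken fragment below is an illustrative assumption, not taken from the paper's data set.

```python
from html.parser import HTMLParser

# Malformed fragment: the <li> items are never closed and <b> is left open.
BROKEN = "<ul><li><b>first item<li><b>second item</ul>"

class TextCollector(HTMLParser):
    """Collect the text content of each <li>, tolerating unclosed tags."""

    def __init__(self):
        super().__init__()
        self.items = []        # extracted item texts
        self._in_li = False    # are we currently inside a list item?

    def handle_starttag(self, tag, attrs):
        if tag == "li":
            # A new <li> implicitly ends the previous one.
            self._in_li = True
            self.items.append("")

    def handle_endtag(self, tag):
        if tag in ("li", "ul"):
            self._in_li = False

    def handle_data(self, data):
        if self._in_li:
            self.items[-1] += data

parser = TextCollector()
parser.feed(BROKEN)
```

Even though the markup violates the HTML standard, `parser.items` holds both item texts; a practical wrapper needs this kind of error recovery (or a tidy-up preprocessing step) before any extraction rule can be applied.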