Adapting Searchy to Extract Data using Evolved Wrappers

David F. Barrero (a), María D. R-Moreno (a), David Camacho (b)

(a) Universidad de Alcalá. Computer Engineering Department. Escuela Politécnica. Ctra. Madrid-Barcelona km 31,600 28871 Alcalá de Henares, Madrid, Spain
Phone: (+34) 91-885-69-20, fax: (+34) 885-66-41

(b) Universidad Autónoma de Madrid. Escuela Politécnica Superior. C/ Francisco Tomás y Valiente 11. Ciudad Universitaria de Cantoblanco. 28049 Madrid, Spain
Phone: (+34) 91-497-22-88, fax: (+34) 91-497-22-35

Abstract

Organisations need diverse information systems to cope with growing requirements in information storage and processing. This leads to the creation of information islands and, consequently, to an intrinsic difficulty in obtaining a global view. Providing such a unified view of the (likely heterogeneous) information available in an organisation adds value to its information systems and has been the subject of intense research. In this paper we present an extension of Searchy, an agent-based mediator system specialized in data extraction and integration. Through a set of wrappers, it integrates information from arbitrary sources and semantically translates it according to a mediated schema. Searchy is in effect a domain-independent wrapper container that eases wrapper development by providing, for example, semantic mapping. The extension proposed in this paper introduces an evolutionary wrapper that is able to evolve wrappers based on regular expressions. To achieve this, a Genetic Algorithm (GA) is used to learn a regex that extracts a set of positive samples while rejecting a set of negative samples.

Keywords: Wrappers, Genetic Algorithms, Information Extraction

1. Introduction

Organisations have to deal with an increasing need for process automation, which leads to growth in the number and size of software applications.
As a result, information becomes fragmented: it is spread across different databases, documents in different formats, and applications that hide valuable data. This gives rise to information islands within the organisation, which has a negative impact when users need a global view of the information and increases the complexity and development costs of applications. Usually ad-hoc applications are developed, despite their lack of generality and their maintenance costs. Information Integration [1] is a research area that addresses the problems that emerge in such a scenario.

When several organisations are involved in an integration process, the problems associated with integration grow. Some traditional integration problems, such as information heterogeneity, are amplified, and new problems arise, such as the lack of centralized control over the information systems. One of the most interesting problems in this context is how to ensure administrative autonomy, i.e., how to limit as much as possible the constraints that the integration might impose on data sources. We have developed a data integration solution called Searchy with the intention of addressing those constraints.

Email addresses: david@aut.uah.es (David F. Barrero), mdolores@aut.uah.es (María D. R-Moreno), david.camacho@uam.es (David Camacho)

Searchy [2] is a distributed mediator system that provides a virtual unified view of heterogeneous sources. It receives a query and maps it into one or more local queries, then translates the responses from the local schema to a mediated one defined by an ontology and integrates them. It separates the integration issues from the data extraction mechanism, and thus it can be seen as a wrapper container that eases wrapper development. It is based on Web standards such as RDF (Resource Description Framework) and OWL (Web Ontology Language).
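The mediation flow described above (map the query to local queries, translate the answers into the mediated schema, integrate them) can be sketched roughly as follows. This is only an illustrative Python sketch, not Searchy's implementation: the names `Wrapper`, `mediate` and `field_map`, as well as the two toy sources, are hypothetical, and plain dictionaries stand in for the ontology-defined mediated schema.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

# Hypothetical names throughout: nothing here is taken from Searchy itself.
Record = Dict[str, str]


@dataclass
class Wrapper:
    """A data source plus the mapping from its local schema to the mediated one."""
    name: str
    to_local_query: Callable[[str], str]             # mediated query -> local query
    run_local_query: Callable[[str], List[Record]]   # executes against the source
    field_map: Dict[str, str]                        # local field -> mediated field

    def answer(self, query: str) -> List[Record]:
        local_rows = self.run_local_query(self.to_local_query(query))
        # Translate each row from the local schema to the mediated schema.
        return [
            {self.field_map[k]: v for k, v in row.items() if k in self.field_map}
            for row in local_rows
        ]


def mediate(query: str, wrappers: List[Wrapper]) -> List[Record]:
    """Send the query to every wrapper and merge the translated answers."""
    merged: List[Record] = []
    for wrapper in wrappers:
        for record in wrapper.answer(query):
            if record not in merged:  # naive duplicate removal
                merged.append(record)
    return merged


# Two toy sources that store the same kind of data under different schemas.
hr_db = [{"nombre": "Ana", "correo": "ana@example.org"}]
ldap = [
    {"cn": "Ana", "mail": "ana@example.org"},
    {"cn": "Luis", "mail": "luis@example.org"},
]

wrappers = [
    Wrapper("hr", lambda q: q.lower(),
            lambda q: [r for r in hr_db if q in r["nombre"].lower()],
            {"nombre": "name", "correo": "email"}),
    Wrapper("ldap", lambda q: q.lower(),
            lambda q: [r for r in ldap if q in r["cn"].lower()],
            {"cn": "name", "mail": "email"}),
]

results = mediate("An", wrappers)
```

In this sketch both sources answer the query for "An", and the two answers collapse into a single record expressed in the mediated field names, which is the kind of unified view a mediator aims to provide. Real integration would need a more careful notion of record identity than exact equality.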
Because it builds on these standards, Searchy can be easily integrated into other platforms and systems based on the Semantic Web or SOA (Service Oriented Architecture).

Experience using Searchy in production environments has revealed areas for improvement. One of the most successful wrappers in Searchy was the regex wrapper, a wrapper that extracts data from unstructured documents using a regular expression (or simply regex). Regex is a powerful tool able to extract strings that match a given pattern. Two problems were found in the use of regex-based wrappers: the need for an engineer (or a specialized user, whom we usually call a wrapper engineer) with specific skills in regex programming, and the lack of an automatic way to handle errors in the extraction process. These problems led us to adapt the Searchy architecture to support evolved wrappers. That is, wrappers based on regex

Preprint submitted to Expert Systems with Applications, September 12, 2011