X-tract: Structure extraction from botanical textual descriptions

Rocío Abascal and J. Alfredo Sánchez
CENTIA (Center for Research in Information and Automation Technologies)
Laboratory of Interactive and Cooperative Technologies
Universidad de las Américas – Puebla
Cholula, Pue. 72820 México
{abascal,alfredo}@mail.udlap.mx

Abstract

Most information available today, both in printed books and in digital repositories, is in the form of free-format text. Retrieving information from these ever-growing repositories has become a challenge for information retrieval (IR) researchers. In some fields, such as Botany and Taxonomy, textual descriptions observe a set of rules and use a relatively limited vocabulary. This makes botanical textual descriptions an interesting area in which to explore IR techniques for finding structure and facilitating semantic analysis. This paper presents X-tract, a solution to the problem of text analysis and structure extraction in a specific application domain, namely floristic morphologic descriptions. The solution demonstrates the potential of using a grammar to determine information structure in a botanical digital library. We have developed a prototype based on this approach: given an HTML or plain-text document, X-tract analyzes it and presents the results to the user, who can verify the proposed structure before the database is updated. This transformation is also useful for storing morphologic descriptions in a database with a pre-established format. The solution is implemented in the context of the Floristic Digital Library (FDL), a large digital library project comprising a wide variety of botanical documents, formats and services.

Subject areas: information extraction, X-tract, botanical digital libraries, FDL

1 INTRODUCTION

Digital libraries continue to perform important functions such as collecting, organizing, presenting and finding information.
They also extend the services provided by conventional libraries by taking advantage of digital media [Lesk 1997]. One of the digital libraries currently being constructed, the Floristic Digital Library (FDL), is a virtual distributed space comprising botanical information and a variety of services offered to users to facilitate the use and extension of knowledge about plants [Schnase et al. 1997]. Several international research and development projects financed by the National Science Foundation (NSF), such as the Flora of North America (FNA), the Flora of China (FOC) and the Flora Mesoamericana (FM), participate in the FDL. The main objective of this project is to create a digital library with information about plants from various geographical areas. For example, FNA manages information on approximately 20,000 species of vascular plants and bryophytes of North America north of Mexico [Schnase et al. 1997]. This library will contain textual documents, maps and illustrations, and will provide services for the general public and for the more than 800 scientists who are contributing to this project.

One of the major problems faced by projects such as FNA and FOC is that most of the information they manage does not follow any specific format. However, botanical descriptions do regularly adhere to generally accepted rules and are based on a relatively limited vocabulary. The FDL is developing an object-relational model to store botanical descriptions. We therefore need to extract information that is available in non-structured documents so that it can be incorporated into the FDL's database. Among other benefits, on-line information can then be presented in a uniform format, and information can be produced in many formats for distribution on paper or via the Web.
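The observation that botanical descriptions follow accepted rules and draw on a limited vocabulary can be illustrated with a minimal sketch. It is not the X-tract grammar, which this section does not show. The organ vocabulary, the function name extract_structure and the sample description are all assumptions made for illustration. The sketch treats a known organ name at the start of a sentence as the structure being described and the comma-separated phrases that follow as its characters.

```python
import re
from typing import Dict, List

# Hypothetical mini-vocabulary of plant structures (an assumption for
# illustration; the vocabulary actually used by X-tract is not given here).
ORGANS = {"leaves", "stems", "flowers", "petals", "fruits"}


def extract_structure(description: str) -> Dict[str, List[str]]:
    """Map each known organ to the list of characters described for it."""
    structure: Dict[str, List[str]] = {}
    # Sentences in this toy format end with a period or a semicolon.
    for sentence in re.split(r"[.;]\s*", description):
        sentence = sentence.strip()
        if not sentence:
            continue
        # The first word names the organ; the rest lists its characters.
        head, _, rest = sentence.partition(" ")
        organ = head.lower()
        if organ not in ORGANS:
            continue  # outside the assumed vocabulary, so it is skipped
        characters = [c.strip() for c in rest.split(",") if c.strip()]
        structure.setdefault(organ, []).extend(characters)
    return structure


# Invented sample description, not taken from FNA or FOC.
example = "Leaves alternate, ovate, 3-5 cm long; flowers white, solitary. Stems erect."
result = extract_structure(example)
```

A record of this shape (organ, then a list of characters) is the kind of intermediate result a user could review before it is stored in a database with a pre-established format. A real description would need a far richer grammar, for example to handle decimal measurements, nested structures and sentences that do not begin with an organ name.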
1.1 The problem of information extraction

The collections maintained by a library represent the individual efforts of thousands of authors, working together and separately over hundreds or even thousands of years and using a tremendous range of composition tools to capture their thoughts [Furuta 1994]. The proliferation of on-line text motivates most current work in text interpretation. Although massive volumes of information are available at low cost in digital free text form, people cannot read and digest the information any faster than before; in fact, for the most part they can digest even less.

Information extraction (IE) systems analyze unrestricted text in order to extract specific types of information [Lehnert 1996]. IE systems do not attempt to understand all of the text in all input documents, but they do analyze those portions of each document that contain relevant information. Relevance is determined by pre-defined domain guidelines which must specify, as accurately as possible, exactly what types of information the system is expected to find. The problem of extracting information from data is not addressed by simply developing better classification schemes, organizing data collections using newer and better database schemata, nor simply making the data accessible to the entire world by