Combining Classifiers to Identify Online Databases

Luciano Barbosa
School of Computing
University of Utah
lbarbosa@cs.utah.edu

Juliana Freire
School of Computing
University of Utah
juliana@cs.utah.edu

ABSTRACT
We address the problem of identifying the domain of online databases. More precisely, given a set F of Web forms automatically gathered by a focused crawler and an online database domain D, our goal is to select from F only the forms that are entry points to databases in D. Having a set of Web forms that serve as entry points to similar online databases is a requirement for many applications and techniques that aim to extract and integrate hidden-Web information, such as meta-searchers, online database directories, hidden-Web crawlers, and form-schema matching and merging.

We propose a new strategy that automatically and accurately classifies online databases based on features that can be easily extracted from Web forms. By judiciously partitioning the space of form features, this strategy allows the use of simpler classifiers that can be constructed using learning techniques that are better suited for the features of each partition. Experiments using real Web data in a representative set of domains show that the use of different classifiers leads to high accuracy, precision and recall. This indicates that our modular classifier composition provides an effective and scalable solution for classifying online databases.

Categories and Subject Descriptors
H.3.3 [Information Search and Retrieval]: Selection process.

General Terms
Algorithms, Design, Experimentation.

Keywords
Hidden Web, learning classifiers, hierarchical classifiers, online database directories, Web crawlers.

1. INTRODUCTION
Due to the explosion in the number of online databases, there has been increased interest in leveraging the high-quality information present in these databases [2, 3, 11, 23, 33]. However, finding the right databases can be very challenging.
For example, if a biologist needs to locate databases related to molecular biology and searches on Google for the keywords “molecular biology database”, over 27 million documents are returned. Among these, she will find pages that contain databases, but the results also include a very large number of pages from journals, scientific articles, personal Web pages, etc.

Copyright is held by the International World Wide Web Conference Committee (IW3C2). Distribution of these papers is limited to classroom use, and personal use by others. WWW 2007, May 8–12, 2007, Banff, Alberta, Canada. ACM 978-1-59593-654-7/07/0005.

Recognizing the need for better mechanisms to locate online databases, people have started to create online database collections such as the Molecular Biology Database Collection [15], which lists databases of value to biologists. This collection has been manually created and is maintained by the National Library of Medicine. Since there are several million online databases [23], manual approaches to this problem are not practical. Besides, since new databases are constantly being added, the freshness of a manually maintained collection is greatly compromised.

In this paper, we describe a new approach to the problem of identifying online databases that belong to a given domain. There are a number of issues that make this problem particularly challenging. Since online databases are sparsely distributed on the Web, an efficient strategy is needed to locate the forms that serve as entry points to these databases. In addition, online databases do not publish their schemas and their contents are hard to retrieve. Thus, a scalable solution must determine the relevance of a form to a given database domain by examining information that can be automatically extracted from the forms and in their vicinity.

Web crawlers can be used to locate online databases [3, 9, 10, 13, 26, 29].
However, even a focused crawler invariably retrieves a diverse set of forms. Consider, for example, the Form-Focused Crawler (FFC) [3], which is optimized for locating searchable Web forms. For a set of representative database domains, on average, only 16% of the forms retrieved by the FFC are actually relevant; for some domains this percentage can be as low as 6.5%. These numbers are even lower for less focused crawlers, e.g., crawlers that focus only on a topic [9, 10, 13]. The problem is that a focus topic (or concept) may encompass pages that contain many different database domains. For example, while crawling to find airfare search interfaces, the FFC also retrieves a large number of forms for rental car and hotel reservation, since these are often co-located with airfare search interfaces in travel sites. The set of retrieved forms also includes many non-searchable forms that do not represent database queries, such as forms for login, mailing list subscriptions, and Web-based email forms.

Having a homogeneous set of forms that lead to databases in the same domain is useful, and sometimes required, for a number of applications. For example, whereas for constructing online database directories, such as BrightPlanet [8] and the Molecular Biology Database Collection [15], it is desirable that only relevant databases are listed, the effectiveness
WWW 2007 / Track: Search Session: Crawlers 431