SNPMiner: A Domain-Specific Deep Web Mining Tool

Fan Wang *, Gagan Agrawal *, Ruoming Jin †, Helen Piontkivska δ

* Department of Computer Science and Engineering, Ohio State University, Columbus OH 43210, {wangfa,agrawal}@cse.ohio-state.edu
† Department of Computer Science, Kent State University, Kent OH 44242, {jin}@cs.kent.edu
δ Department of Biological Sciences, Kent State University, Kent OH 44242, {opiontki}@kent.edu

Abstract: In this paper, we propose SNPMiner, a novel query-oriented, mediator-based tool for querying biological data. The system searches and queries Single Nucleotide Polymorphism (SNP) data from eight widely used web-accessible databases. It provides a domain-specific search utility that can access and collect data from the deep web. Because the system is web-based, any user can use it by accessing our server from their own computer. The system has three main components: the web server interface, the dynamic query planner, and the web page parser. The web server interface provides end users with a unified, friendly interface. The dynamic query planner automatically schedules an efficient query order over all available databases according to the user's query. The web page parser analyzes the layout of HTML files and extracts the desired data from them. The final query results are organized in a tabular format that a biological researcher can review.

I. INTRODUCTION

Recent advances in genomic technology have greatly impacted the practice of biological and medical research. A large volume of sequencing and structural data is being made available for further analysis. Further, with the rapid development of the Internet and the web, most biological data and information is now accessible online. However, making efficient and effective use of the growing number of such data sources is becoming a critical problem for biological and medical researchers.
Traditionally, biologists manually query each available database and combine the information gathered from these heterogeneous data sources. Unfortunately, the sheer volume and rapid growth of biological data make this process time-consuming, tedious, and error-prone. The biggest challenge is how to integrate these heterogeneous databases effectively and query them to extract relevant information.

In the past few years, there have been a number of efforts focused on the integration of biological data sources. The key challenges that most of these systems need to address are the variety of data types, representational heterogeneity, autonomous and web-based sources, and differing querying capabilities [1]. In terms of integration approach, the existing research can be largely divided into three categories: warehouse integration, mediator-based integration, and navigational integration.

One of the newer challenges in biological data integration, which has not been addressed by the above efforts, is associated with the emergence of deep-web data sources. Many data sources are online databases hidden behind query forms, which together form the deep web [2]. Unlike the surface web, where HTML pages are static and data is stored as document files, the deep web stores its data in databases. Dynamic HTML pages are generated only after a user submits a query by filling in an online form. Thus, standard search engines like Google are not able to crawl these web sites. At the same time, manually submitting online queries to numerous query forms, keeping track of the obtained results, and combining them is a tedious and error-prone process. Thus, there is a growing need for a tool that can extract meaningful and relevant information from deep-web biology data sources with minimal human intervention. This paper presents such a tool for deep-web mining.
While the underlying techniques are general, the specific implementation is driven by the problem of searching SNP databases.

A. Specific Motivating Problem: Searching SNP Databases

In the effort to explain the genetic contribution to complex diseases such as cancer and heart disease, Single Nucleotide Polymorphisms (SNPs), which designate sites in the genome that have two or more nucleotide variants segregating in a population [3], seem particularly promising because they are usually biallelic and thus easily assayed [4], [5], [6]. Because over seven million SNPs have been reported in public databases, it is desirable to develop methods for sifting through this information to find likely candidates for disease association. Furthermore, information on human SNPs is also useful for studying questions related to human evolutionary history [7] and the role of population genetic processes, such as natural selection, in shaping the human genome [8], [9], [10]. There are a number of publicly available databases that provide information on SNPs in humans. At the same time, biomedical