On-the-fly Integration and ad hoc Querying of Life Sciences Databases using LifeDB ⋆

Anupam Bhattacharjee ♯, Aminul Islam ♯, Mohammad Shafkat Amin ♯, Shahriyar Hossain ♯, Shazzad Hosain ♯, Hasan Jamil ♯, Leonard Lipovich ♭

♯ Department of Computer Science, Wayne State University, USA
♭ Center for Molecular Medicine and Genetics, Wayne State University, USA
{anupam, aminul, shafkat, shah h, shazzad, hmjamil, llipovich}@wayne.edu

⋆ Research supported in part by National Science Foundation grants CNS 0521454 and IIS 0612203, and National Institutes of Health NIDA grant 1R03DA026021-01.

Abstract. Data-intensive applications in the Life Sciences make extensive use of the Hidden Web as a platform for information sharing. Access to these heterogeneous Hidden Web resources is limited to predefined web forms and interactive interfaces that users must navigate manually. Users also bear the responsibility for reconciling schema heterogeneity, mediating missing information, extracting information and piping it to other resources, transforming formats, and so on, in order to implement the desired query sequences or scientific work flows. In this paper, we present a new data management system, called LifeDB, that supports currency without view materialization and autonomous reconciliation of schema heterogeneity in a single platform, through a declarative query language called BioFlow. In our approach, schema heterogeneity is resolved at run time by treating the Hidden Web resources as a virtual warehouse, and by supporting a set of primitives for integrating data on-the-fly, for extracting information and piping it to other resources, and for manipulating data in a way similar to traditional database systems in response to application demands. We also describe BioFlow's support for work flow design and application design using a visual interface called VizBuilder.

1 Introduction

Data and application integration plays an essential role in the Life Sciences. In traditional approaches, data from multiple sources, together with the tools for interpreting them, are warehoused on local machines, and applications are designed around these resources by manually resolving any existing schema heterogeneity. This approach is reliable and works well when neither the application's resource needs nor the data sources change often; when they do change, a partial or full overhaul is required. The disadvantage is that the warehouse must be synchronized constantly with the sources to stay current, leading to a huge maintenance overhead. The alternative has been to write applications that communicate directly with the data sources, again mediating the schema manually. While this approach removes the physical downloading of the source contents and buys currency, it still requires