Exploration of a Data Landscape using a Collaborative Linked Data Framework

Laurent Alquier
Janssen Pharmaceutical Companies of Johnson & Johnson
1000 Route 202
Raritan NJ 08869
+1 908 218-6800
lalquier@its.jnj.com

Tim Schultz
Janssen Pharmaceutical Companies of Johnson & Johnson
920 US Route 202
Raritan NJ 08869
+1 908 927-6812
tschult4@its.jnj.com

Susie Stephens
Janssen Pharmaceutical Companies of Johnson & Johnson
145 King of Prussia Road
Radnor PA 19087
+1 610 651-6206
sstephens1@its.jnj.com

ABSTRACT
Finding the most relevant data sources for answering translational research questions represents a significant challenge in a global and highly decentralized research organization. This challenge is likely to grow as more data becomes available from external collaborations. This paper presents an approach that enables scientists to find data sources of interest, and then to query and visualize the data. The solution consists of knowIT, a semantic wiki that provides a foundation for capturing explicit knowledge about sources of data in a collaborative fashion. Because the metadata about the data sources is exposed as Resource Description Framework (RDF), knowIT is able to provide knowledge of the biomedical resources to other applications as part of a Linked Data framework. Illustrations show how the solution can be used for search and visualization of the biomedical data landscape.

Categories and Subject Descriptors
K.4.3 [Computers and Society]: Organizational Impacts – computer-supported cooperative work.
H.5.3 [Information Systems]: Group and Organization Interfaces – computer-supported cooperative work, Organizational design, Web-based interaction.

General Terms
Management, Human Factors.

Keywords
Wikis, semantic web, user experiences, usability, intranets, internal communication, collaboration, organizational memory, repositories, knowledge management, knowledge transfer.

1. INTRODUCTION
Translational research requires the dissemination and integration of data spanning drug discovery to clinical practice in a flexible and timely manner. Clinical data is required in discovery to help ensure that research is relevant to humans, while preclinical data helps influence the design of patient studies. It is important to share research results and knowledge throughout the enterprise to enable enhanced decision-making. Improved sharing of data will increase the likelihood that poor programs fail early, when costs are relatively low, and will maximize the probability of success of expensive late-stage clinical programs. This in turn will lead to the faster development of better drugs at lower cost.

Finding sources of data with content of interest inside a large organization is challenging. Typically, scientists only become aware of a data source through word of mouth. Once a data source has been identified, scientists need to find the owner of the data in order to gain access to it, which may require an in-depth discussion of the terms and conditions of the license agreement and the installation of new software tools on their desktop. If a scientist were interested in an external data source, they would need to persuade IT to bring a copy of the data in-house and provide an interface, and to discuss the licensing terms and conditions with the legal department. It is expected that these challenges will become more prevalent as pharmaceutical companies increasingly embrace external innovation and many data sources become accessible from external collaborators.

At times, scientists are interested in querying or mining an individual source of data, but an equally relevant scenario is when they want to know everything about a particular entity. The latter case is challenging because the information is spread across many silos of data.
Data warehouses have commonly been created to help provide scientists with such an integrative view of data. However, this approach is facing challenges, as so many different sources of data need to be integrated for translational research. Other approaches have involved mapping all data to ontologies, but this has proven to be a heavyweight approach when multiple domains are involved. As a result, more flexible approaches that focus on semantic metadata have emerged [1][2].

Semantic wikis provide a platform for flexible and effective knowledge sharing by giving structure to wiki pages and turning them into a collaborative database [3][4][5]. The benefits of semantic wikis in life sciences have previously been described [6], with examples such as NeuroLex [7], SNPedia [8], WikiProtein [9] and WikiNeuron [10].

Our approach takes advantage of the flexibility of a semantic wiki to capture metadata about data sources in order to enable their discovery, and to support data retrieval and query as part of a Linked Data framework. The solution builds upon a Semantic MediaWiki (SMW) implementation called knowIT [11]. The system has to be easy to use, as both scientists and IT professionals help populate the site with metadata about data sources. However, it is equally important to expose that metadata programmatically to other systems for both search and query.

Copyright is held by the author/owner(s). FWCS 2010, April 26, 2010, Raleigh, USA.

2. COLLABORATIVE ANNOTATION OF SCIENTIFIC DATA SOURCES
As a collection of extensions to MediaWiki, SMW associates semantic properties with pages and stores these properties in MediaWiki’s database. One of the major strengths of this