Proposed Foundations for Evaluating Data Sharing and Reuse in the Biomedical Literature

Heather A. Piwowar
Department of Biomedical Informatics, University of Pittsburgh
200 Meyran Avenue
Pittsburgh, PA 15260
1.412.647.7113
hpiwowar@gmail.com

ABSTRACT
Science progresses by building upon previous research. Progress can be most rapid, efficient, and focused when raw datasets from previous studies are available for reuse. To facilitate this practice, funders and journals have begun to request and require that investigators share their primary datasets with other researchers. Unfortunately, it is difficult to evaluate the effectiveness of these policies. This study aims to develop foundations for evaluating data sharing and reuse decisions in the biomedical literature by developing tools to answer the following research questions, within the context of biomedical gene expression datasets:

What is the prevalence of biomedical research data sharing? Biomedical research data reuse?
What features are most associated with an investigator’s decision to share or reuse a biomedical research dataset?
Does sharing or reusing data contribute to the impact of a research article, independently of other factors?
What do the results suggest for developing efficient, effective policies, tools, and initiatives for promoting data sharing and reuse?

I suggest a novel approach to identifying publications that share and reuse datasets, through the application of natural language processing techniques to the full text of primary research articles. Using these classifications and extracted covariates, univariate and multivariate analysis will assess which features are most important to data sharing and reuse prevalence, and also estimate the contribution that sharing data and reusing data make to a publication’s research impact. I hope the results will inform the development of effective policies and tools to facilitate this important aspect of scientific research and information exchange.
Categories and Subject Descriptors
H.1.1 [Systems and Information Theory]: Value of information; H.3.5 [Online Information Services]: Data sharing; J.3 [Life and Medical Sciences]: Biology and genetics, Health

General Terms
Measurement, Human Factors

Keywords
data sharing, data reuse, evaluation, policy, bioinformatics, bibliometrics

1. INTRODUCTION
Sharing information facilitates science. Reusing previously collected data in new studies allows these valuable resources to contribute far beyond their original analysis.[1] In addition to being used to confirm original results, raw data can be used to explore related or new hypotheses, particularly when combined with other publicly available data sets. Real data is indispensable when investigating and developing study methods, analysis techniques, and software implementations. The larger scientific community also benefits: sharing data encourages multiple perspectives, helps to identify errors, discourages fraud, is useful for training new researchers, and increases efficient use of funding and patient population resources by avoiding duplicate data collection.

Believing that these benefits outweigh the costs of sharing research data, many initiatives actively encourage investigators to make their data available. Some journals require the submission of detailed biomedical data to publicly available databases as a condition of publication.[2,3] Since 2003, the NIH has required a data sharing plan for all large funding grants and has more recently introduced stronger requirements for genome-wide association studies[4,5]; other funders have similar policies. Several government whitepapers[1,6] and high-profile editorials[7-12] call for responsible data sharing and reuse. Large-scale collaborative science is providing the opportunity to share datasets within and outside of the original research projects[13,14], and tools, standards, and databases are developed and maintained to facilitate data sharing and reuse.
Despite these investments of time and money, we do not yet understand the prevalence and patterns of data sharing and reuse, the effectiveness of initiatives, or the costs, benefits, and impact of repurposing biomedical research data. The goal of this study is to build foundational tools, datasets, and analyses for identifying and evaluating data sharing and reuse decisions within the biomedical literature.