Examining the uses of shared data
Heather A. Piwowar and Douglas B. Fridsma

[Figure panels: "Odds Ratios by Disease" and "Odds Ratios by Area of Information Science". Axis: Odds Ratio, 0.0 to 10.0.]

Does your research area re-use shared datasets?
• Re-using data has many benefits, including research synergy and efficient resource use
• Some research areas have tools, communities, and practices which facilitate re-use
• Identifying these areas will allow us to learn from them, and apply the lessons to areas which underutilize the sharing and re-purposing of scientific data between investigators

Hope
Although not all research topics can be addressed by re-using existing data, many can. Identifying areas with frequent re-use can highlight best practices to be used when developing research agendas, tools, standards, repositories, and communities in areas which have yet to receive major benefits from shared data.

Future Work
We plan to refine our tool for identifying studies which re-use data, and continue studying and measuring re-use and reusability.

Acknowledgements
We sincerely thank our funders:
• USA NLM for a training grant
• USA NSF for a travel grant

For Further Information
Please contact hpiwowar@alumni.pitt.edu
Poster will be available at Nature Precedings.

FIGURE 1: Documents with re-used data have a different MeSH distribution than those with original data.

Which datasets?
This preliminary analysis examines the re-use of microarray gene expression datasets.
Thousands of microarray gene expression datasets have been deposited in publicly available databases. Many studies re-use these data, but the purposes of that re-use are not well understood. Here, we examined all publications found in PubMed Central on April 1, 2007 whose full text contained the phrases “microarray” and “gene expression” to find studies which re-used microarray data.

How did we identify re-use?
We developed prototype machine-learning classifiers to identify a) studies containing original microarray data (n=900) and b) studies which instead re-used microarray data (n=250). Preprocessing (Python NLTK) extracted the frequencies of manually selected keywords from the full-text publications as features for a Support Vector Machine (SVMlite). The classifier was trained and tested on a manually labeled set of documents (PLoS articles prior to January 2007 containing the word “microarray,” n=200).

How did we identify patterns of re-use?
We compared the Medical Subject Headings (MeSH) terms of the two classes. For each MeSH term, we estimated the odds that the term describes a study with re-used microarray data relative to the odds that it describes a study with original data. Terms were truncated to comparable levels in the MeSH hierarchy.

Results
• Publications with original vs. re-used microarray data have different distributions of MeSH terms (Figure 1), and occur in different proportions across various journals (Figure 2).
• Whether the microarray data were original or re-used did not affect the odds of a study focusing on humans, mice, or invertebrates. Compared to publications with original data, publications with re-used data included a relatively high proportion of studies involving fungi (odds ratio (OR)=2.4) and a relatively low proportion involving rats, bacteria, viruses, plants, or genetically altered or inbred animals (OR<0.5).
• Trends in odds ratios of MeSH terms for other attributes can be seen in Figure 3.
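The odds-ratio comparison described under "How did we identify patterns of re-use?" can be illustrated with a minimal Python sketch. It is not the authors' code. The function names and the per-term counts are hypothetical, chosen only to show the arithmetic; the class sizes (250 re-used, 900 original) are the ones reported above. The ratio is taken as re-used over original, matching the axis label of Figure 3.

```python
def odds(count_with_term, total):
    """Odds that a publication in one class carries a given MeSH term."""
    count_without_term = total - count_with_term
    if count_without_term <= 0:
        raise ValueError("odds are undefined when every publication has the term")
    return count_with_term / count_without_term


def odds_ratio(reused_with_term, reused_total, original_with_term, original_total):
    """Odds of the term among re-used-data publications over the odds
    among original-data publications."""
    original_odds = odds(original_with_term, original_total)
    if original_odds == 0:
        raise ValueError("odds ratio is undefined when no original-data study has the term")
    return odds(reused_with_term, reused_total) / original_odds


# Hypothetical counts for a single MeSH term (illustration only, not poster data):
# 25 of 250 re-used-data studies and 45 of 900 original-data studies carry the term.
example_or = odds_ratio(25, 250, 45, 900)
print(round(example_or, 2))
```

Under this convention, a value above 1 means the term is relatively more common among publications that re-used data, and a value below 1 means it is relatively more common among publications with original data.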
FIGURE 2: As expected, journals with a bioinformatics focus published the highest proportion of studies with re-used microarray data.
[Bar chart: Percent of microarray studies with re-used data by Journal. Axis: (Number of microarray studies with re-used data)/(All microarray studies), 0% to 40%.]

[Scatter plot: Publication MeSH vectors in PCA space. Axes: First Principal Component, Second Principal Component. Legend: Publications with original data; Publications which re-used data.]

FIGURE 3: The fold-change in odds that a specific MeSH term will describe a publication with re-used microarray data as compared to a publication with original data.
[Bar charts: Odds Ratios of MeSH term occurrence given studies with re-used vs. original data, by Organism, by Biological Phenomenon, by Biological Methods, and by Statistical Methods. Axis: Odds Ratio (odds of MeSH term given all publications with re-used microarray data over odds of MeSH term given all publications with original microarray data), 0.0 to 10.0.]

University of Pittsburgh Department of Biomedical Informatics
Nature Precedings : doi:10.1038/npre.2007.425.3 : Posted 18 Jul 2007