Integrating EMR-Linked and In Vivo Functional Genetic Data to Identify New Genotype-Phenotype Associations

Jonathan D. Mosley 1, Sara L. Van Driest 2, Peter E. Weeke 1, Jessica T. Delaney 1, Quinn S. Wells 1, Lisa Bastarache 3, Dan M. Roden 1, Josh C. Denny 1,3 *

1 Department of Medicine, Vanderbilt University, Nashville, Tennessee, United States of America, 2 Department of Pediatrics, Vanderbilt University, Nashville, Tennessee, United States of America, 3 Biomedical Informatics, Vanderbilt University, Nashville, Tennessee, United States of America

Abstract

The coupling of electronic medical records (EMR) with genetic data has created the potential for implementing reverse genetic approaches in humans, whereby the function of a gene is inferred from the shared pattern of morbidity among homozygotes of a genetic variant. We explored the feasibility of this approach to identify phenotypes associated with low frequency variants using Vanderbilt’s EMR-based BioVU resource. We analyzed 1,658 low frequency non-synonymous SNPs (nsSNPs) with a minor allele frequency (MAF) <10% collected on 8,546 subjects. For each nsSNP, we identified diagnoses shared by at least 2 minor allele homozygotes and with an association p<0.05. The diagnoses were reviewed by a clinician to ascertain whether they may share a common mechanistic basis. While a number of biologically compelling clinical patterns of association were observed, the frequency of these associations was identical to that observed using genotype-permuted data sets, indicating that the associations were likely due to chance. To refine our analysis, we then restricted it to 711 nsSNPs in genes with phenotypes in the On-line Mendelian Inheritance in Man (OMIM) or knock-out mouse phenotype databases.
An initial comparison of the EMR diagnoses to the known in vivo functions of the gene identified 25 candidate nsSNPs, 19 of which had significant genotype-phenotype associations when tested using matched controls. Twelve of the 19 nsSNP associations were confirmed by a detailed record review. Four of the 12 nsSNP-phenotype associations were successfully replicated in an independent data set: thrombosis (F5, rs6031), seizures/convulsions (GPR98, rs13157270), macular degeneration (CNGB3, rs3735972), and GI bleeding (HGFAC, rs16844401). These analyses demonstrate the feasibility and challenges of using reverse genetics approaches to identify novel gene-phenotype associations in human subjects using low frequency variants. As increasing amounts of rare variant data are generated from modern genotyping and sequence platforms, model organism data may be an important tool to enable discovery.

Citation: Mosley JD, Van Driest SL, Weeke PE, Delaney JT, Wells QS, et al. (2014) Integrating EMR-Linked and In Vivo Functional Genetic Data to Identify New Genotype-Phenotype Associations. PLoS ONE 9(6): e100322. doi:10.1371/journal.pone.0100322

Editor: Joseph Devaney, Children’s National Medical Center, Washington, United States of America

Received February 17, 2014; Accepted May 25, 2014; Published June 20, 2014

Copyright: © 2014 Mosley et al. This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

Funding: This work was supported by the Vanderbilt University Medical Center Clinical Pharmacology Training grant (T32 GM07569), the Vanderbilt site of the electronic MEdical Records and GEnomics (eMERGE) Network U01-HG006378, R01-LM-01685, and an ARRA grant RC2 GM092618, and the Vanderbilt CTSA grant UL1 TR000445 from National Center for Advancing Translational Sciences/National Institutes of Health.
The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

Competing Interests: The authors have declared that no competing interests exist.

* Email: joshua.denny@vanderbilt.edu

Introduction

Electronic medical record (EMR) systems store an increasing amount of clinical, laboratory and biometric data generated by health care systems. These data offer opportunities to explore risk factors for diseases, the inter-relationships among disease entities, and determinants of treatment response in large populations of individuals [1]. EMR data integrated with DNA repositories can also be utilized to identify genetic contributions to human disease risk and treatment response [2–7]. The spectrum of disease entities collected in EMRs has also enabled large-scale bioinformatics approaches such as Phenome-Wide Association Study (PheWAS), which searches in a disease-agnostic fashion for associations between common polymorphisms and hundreds of clinical diseases, identified using billing codes [8,9]. The success of PheWAS approaches for common variants suggests that similar EMR-based approaches may identify associations with low frequency or rare variants [4,10,11].

Experimental model systems such as mouse models have been successful in assigning functionality to genes through the use of reverse genetics approaches, which identify phenotypes associated with a known genetic lesion [12,13]. Structured data derived from mouse studies are increasingly available through large coordinated efforts such as the Knock-out Mouse Project (KOMP) [14] and the Mouse Phenome Database [15]. These data sources provide a rich resource for generating biologically-relevant clinical hypotheses based on observations of model organisms that can now be tested in a real life setting using large EMRs coupled with DNA repositories, such as the Vanderbilt BioVU resource [16].
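As a concrete illustration of the disease-agnostic screening idea, the homozygote-based screen summarized in the abstract (diagnoses shared by at least 2 minor allele homozygotes with an association p<0.05, then compared against genotype-permuted data) can be sketched in Python. This is a minimal sketch under stated assumptions, not the authors' pipeline. The source does not name the association test, so a one-sided Fisher's exact test is assumed here. The function names, subject IDs, and diagnosis codes (`DX_A`, `DX_B`, `DX_C`) are hypothetical toy values.

```python
import random
from math import comb


def fisher_one_sided(a, b, c, d):
    """Upper-tail hypergeometric p-value for the 2x2 table [[a, b], [c, d]].

    Rows: homozygote / non-homozygote. Columns: has diagnosis / lacks it.
    ASSUMPTION: the source does not specify the test; Fisher's exact is a stand-in.
    """
    n_hom, n_dx, n = a + b, a + c, a + b + c + d
    upper = min(n_hom, n_dx)
    tail = sum(comb(n_hom, k) * comb(n - n_hom, n_dx - k) for k in range(a, upper + 1))
    return tail / comb(n, n_dx)


def screen(homozygotes, diagnoses, min_shared=2, alpha=0.05):
    """Return (code, n_homozygotes_with_code, p) for codes passing both filters.

    homozygotes: set of subject IDs; diagnoses: dict of subject ID -> set of codes.
    """
    subjects = set(diagnoses)
    hom = homozygotes & subjects
    others = subjects - hom
    codes = set().union(*(diagnoses[s] for s in hom)) if hom else set()
    hits = []
    for code in sorted(codes):
        a = sum(1 for s in hom if code in diagnoses[s])
        if a < min_shared:  # diagnosis must be shared by >= min_shared homozygotes
            continue
        c = sum(1 for s in others if code in diagnoses[s])
        p = fisher_one_sided(a, len(hom) - a, c, len(others) - c)
        if p < alpha:
            hits.append((code, a, p))
    return hits


def permuted_hit_counts(homozygotes, diagnoses, n_perm=100, seed=0):
    """Number of hits per permutation when homozygote labels are reassigned at random."""
    rng = random.Random(seed)
    ids = sorted(diagnoses)
    k = len(homozygotes & set(ids))
    return [len(screen(set(rng.sample(ids, k)), diagnoses)) for _ in range(n_perm)]


# Hypothetical toy cohort: 40 subjects, 3 minor allele homozygotes.
diagnoses = {f"S{i}": set() for i in range(40)}
homozygotes = {"S0", "S1", "S2"}
diagnoses["S0"] |= {"DX_A", "DX_B"}
diagnoses["S1"] |= {"DX_A", "DX_B", "DX_C"}
diagnoses["S3"].add("DX_A")            # one non-homozygote also carries DX_A
for i in range(3, 23):                 # DX_B is common in the rest of the cohort
    diagnoses[f"S{i}"].add("DX_B")

hits = screen(homozygotes, diagnoses)
null_counts = permuted_hit_counts(homozygotes, diagnoses, n_perm=50)
print(hits)
print(sum(null_counts) / len(null_counts))
```

In this toy cohort only the rare diagnosis shared by two homozygotes passes both filters; the common diagnosis is shared by two homozygotes as well but is not enriched, and the diagnosis seen in a single homozygote is excluded by the sharing requirement. Comparing the observed number of hits with the permuted counts mirrors, in spirit, the chance comparison described in the abstract.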

Rare and low frequency single nucleotide polymorphisms (SNPs) are appealing candidates to explain much of the variation in human traits that cannot be accounted for by common polymorphisms [17]. However, associating rare variants with disease represents a considerable methodological challenge and remains an area of active research [18,19]. From an epidemiological standpoint, low frequency variants are of particular interest

PLOS ONE | www.plosone.org 1 June 2014 | Volume 9 | Issue 6 | e100322