Finding novel associations across domains using linked data: a case study on genetic variants disrupting transcription start sites

Eelke van der Horst* 1, Rajaram Kaliyaperumal* 1, Zuotian Tatum 1, Mark Thompson 1, Erik Schultes 1,2, Eleni Mina 1, Ivo F.A.C. Fokkema 1, Johan T. den Dunnen 1, Marco Roos 1, Jeroen F.J. Laros 1, Barend Mons 1, Kristina M. Hettne ##1, Peter A.C. 't Hoen ##1

1 Leiden University Medical Center, Leiden, the Netherlands
2 Leiden Institute of Advanced Computer Science, Leiden, the Netherlands

eelkevanderhorst@gmail.com, r.kaliyaperumal@lumc.nl, ztatum@live.com, {m.thompson, E.A.Schultes, E.Mina, I.F.A.C.Fokkema}@lumc.nl, ddunnen@humgen.nl, {m.roos, J.F.J.Laros, B.Mons, k.m.hettne, P.A.C._t_Hoen}@lumc.nl

## shared last author

Abstract. With the widespread use of Next Generation Sequencing technologies, the primary bottleneck of genetic research has shifted from data production to data analysis. However, heterogeneous data sets make comparison and integration challenging and time-consuming. Here, we apply a data interoperability approach that provides unambiguous (machine-readable) descriptions of genomic annotations based on nanopublications. We show that novel associations can be discovered by performing queries that span genomic annotations from the FANTOM5 consortium (a catalogue of transcription start sites (TSS)) and an instance of the Leiden Open Variation Database containing population-based variant frequency data. A correlation was established between the tissue specificity of a TSS and the frequency and size of genetic variants in the genomic region covering the TSS: TSS that are used in many tissues are less tolerant to disruption by insertions and deletions than those specific to just a few tissues.
Keywords: Linked Data, Nanopublication, Transcription start sites, Genetic variants

1 Introduction

Next Generation Sequencing (NGS) technologies produce huge amounts of data on genetic variation between organisms and between individuals. NGS data are reported and annotated in different formats that reflect the provenance and use case of the data.