Haplotype-based prediction of gene alleles using pedigrees and SNP genotypes

Yuri Pirola, Gianluca Della Vedova, Paola Bonizzoni
DISCo, Univ. degli Studi di Milano–Bicocca
Viale Sarca 336, Milan, Italy
{pirola,dellavedova,bonizzoni}@disco.unimib.it

Alessandra Stella, Filippo Biscarini
CeRSA, Parco Tecnologico Padano
Loc. Cascina Codazza, Lodi, Italy
{alessandra.stella,filippo.biscarini}@tecnoparco.org

ABSTRACT
Computational methods for gene allele prediction have been proposed to replace dedicated and expensive assays with cheaper in-silico analyses that operate on routinely collected data, such as SNP genotypes. Most of these methods are tailored to the needs and characteristics of human genetic studies, where they achieve good prediction accuracy. However, genomic analyses are becoming increasingly important in livestock species too. For livestock species, the underlying pedigree (usually quite large and complex) is generally known and available; this information is not fully exploited by current allele prediction methods. In this paper, we propose a new gene allele prediction method based on a simple, but robust, combinatorial formulation for the problem of discovering haplotype-allele associations. The inherent uncertainty of the haplotype inference process is reduced by taking into account the inheritance of gene alleles across the population pedigree while genotypes are phased. The accuracy of the method has been extensively evaluated on a representative real-world livestock dataset under several scenarios and choices of parameters. The median error rate ranged from 0.0537 to 0.0896, with an average of 0.0678; this is 21% better than another state-of-the-art prediction algorithm that does not use the pedigree information. The experimental results support the validity of the proposed approach and, in particular, of the use of pedigree information in gene allele predictions.
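To make the "inherent uncertainty of the haplotype inference process" concrete, the following minimal Python sketch enumerates the haplotype pairs that are compatible with a single unphased SNP genotype. It is only an illustration and not the method proposed in this paper: the 0/1/2 genotype encoding (number of copies of one allele at each SNP) and the function name `compatible_haplotype_pairs` are assumptions introduced here. A genotype with k heterozygous SNPs admits 2^(k-1) distinct unordered haplotype pairs when k ≥ 1, which is the ambiguity that additional information, such as a pedigree, can help to reduce.

```python
from itertools import product


def compatible_haplotype_pairs(genotype):
    """Enumerate unordered haplotype pairs consistent with an unphased genotype.

    `genotype` is a sequence over {0, 1, 2}: the number of copies of one
    allele at each SNP (an encoding assumed for this illustration only).
    Returns a set of frozensets, each holding the haplotypes of one pair.
    """
    # Heterozygous sites are the only ones whose phase is unknown.
    het_sites = [i for i, g in enumerate(genotype) if g == 1]
    # Homozygous sites are fixed: 0 -> allele 0 on both haplotypes, 2 -> allele 1.
    base = [g // 2 for g in genotype]

    pairs = set()
    for bits in product((0, 1), repeat=len(het_sites)):
        h1, h2 = list(base), list(base)
        for site, b in zip(het_sites, bits):
            h1[site] = b       # one haplotype carries allele b ...
            h2[site] = 1 - b   # ... and the other carries the complement
        # A frozenset makes the pair unordered, so (h1, h2) == (h2, h1).
        pairs.add(frozenset((tuple(h1), tuple(h2))))
    return pairs


if __name__ == "__main__":
    # Two heterozygous SNPs (positions 0 and 3): two possible phasings.
    for pair in compatible_haplotype_pairs([1, 0, 2, 1]):
        print(sorted(pair))
```

The number of candidate phasings doubles with every additional heterozygous SNP, so exhaustive enumeration like this is only practical for very short marker windows.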
Categories and Subject Descriptors
F.2.2 [Analysis of algorithms and problem complexity]: Nonnumerical algorithms and problems - Computations on discrete structures; J.3 [Life and medical sciences]: Biology and genetics

BCB '13, September 22-25, 2013, Washington, DC, USA. Copyright is held by the owner/author(s). Publication rights licensed to ACM. ACM 978-1-4503-2434-2/13/09 ...$15.00. http://dx.doi.org/10.1145/2506583.2506592

General Terms
Algorithms, Experimentation

Keywords
Computational biology, genotypes, SNP, pedigree, haplotype

1. INTRODUCTION
In recent years, the advent of Next-Generation Sequencing methods, which produce enormous volumes of short DNA sequences, coupled with significant bioinformatics advancements, has made de novo genome sequencing a task that can be completed in a few months. As a result, the genomes of many organisms have recently been sequenced: not only humans, mice, and micro-organisms, but also plants and animals of agricultural interest. For example, the genomes of all major livestock species, such as the cow [6] and the pig [7], are now available, and more are underway.

The availability of a reference genome facilitates the characterization of genomic loci, such as Single Nucleotide Polymorphisms (SNPs), where genetic variability among individuals of the same species is mostly concentrated.
Nowadays, high-density commercial chips make it possible to genotype tens of thousands of SNPs and, thanks to progress in biochemical technologies, their price is steadily decreasing: for example, genotyping a cow for 50k SNPs can now cost as little as 60 to 70 euros. Very high density SNP chips (800k, for cattle) are also available, and whole-genome sequences (millions of SNPs) are soon to come.

This wealth of genomic information has found practical applications in many fields of research, such as genome-wide association studies (GWAS) for disease risk in humans [16], genome-based prediction of reproductive values in farm animals and plants (“genomic selection”) [11], and the investigation of relationships between populations and their evolutionary history [8].

One such application is the prediction of gene alleles from marker genotypes. In humans, for instance, alleles of the HLA (Human Leukocyte Antigen) complex [14] play an important role in the evaluation of organ transplantation compatibility. Besides humans, gene allele prediction from marker genotypes is an important task in livestock