Jie Huang is a Research Professor at the Department of Global Health, School of Public Health, Peking University, Beijing, China. His main research interests
are bioinformatics and population genomics. He has published extensively in the field of phasing and imputation. The first chapter of his PhD
thesis covers imputation of the UK10K data and was also published as a Nature Communications paper in 2015.
Stefano Pallotti is a postdoctoral fellow at the Genetics and Animal Breeding Group, School of Pharmacy, University of Camerino, Italy. His current research
focuses on the molecular determinants of the hair follicle cycle in fibre-producing animals.
Qianling Zhou is an Assistant Professor at the Department of Maternal and Child Health, School of Public Health, Peking University, Beijing, China. Her main
interest is using mixed-methods strategies (a combination of quantitative and qualitative research) to investigate maternal and child nutrition and health
education and promotion.
Marcus Kleber is a Research Scientist at the Vth Department of Medicine, Medical Faculty of Mannheim, University of Heidelberg, Mannheim, Germany
and at SYNLAB MVZ Humangenetik Mannheim, Mannheim, Germany. His main interest is genomic research on complex common traits, including the
associations of fats and carbohydrates with cardiovascular disease and mortality.
Xiaomeng Xin is an undergraduate student at Skidmore College, Saratoga Springs, NY, USA, applying for graduate study in bioinformatics.
Daniel A. King is a clinical fellow at the Department of Medicine, Stanford University School of Medicine, Stanford, CA, USA. His main interest is mutation
hunting in large-scale genomic data. He was involved in an exome sequencing study of 12,000 children with rare disease and their parents, in which he
developed new computational tools to identify large genetic aberrations.
Valerio Napolioni is an Associate Professor of Molecular Biology at the School of Biosciences and Veterinary Medicine, University of Camerino, Camerino, Italy.
His main research interest is the genetic architecture of human neuropsychiatric complex traits.
Submitted: 22 July 2020; Received (in revised form): 11 October 2020
© The Author(s) 2020. Published by Oxford University Press. All rights reserved. For Permissions, please email: journals.permissions@oup.com
Briefings in Bioinformatics, 00(00), 2020, 1–13
doi: 10.1093/bib/bbaa320
Problem Solving Protocol
PERHAPS: Paired-End short Reads-based HAPlotyping
from next-generation Sequencing data
Jie Huang, Stefano Pallotti, Qianling Zhou, Marcus Kleber, Xiaomeng Xin,
Daniel A. King and Valerio Napolioni
Corresponding authors: Valerio Napolioni, Ph.D., School of Biosciences and Veterinary Medicine, University of Camerino, Via Gentile Da Varano III, 62032,
Camerino, Italy. Tel.: +39 0737403257; Fax: +39 0737636216; Email: valerio.napolioni@unicam.it; Jie Huang, M.D., M.P.H., Ph.D., Department of Global
Health, Peking University School of Public Health, 38 Xueyuan Rd, Haidian District, Beijing, China. Tel.: +86 15210081889; Email: jiehuang001@pku.edu
Abstract
The identification of rare haplotypes may greatly expand our knowledge of the genetic architecture of both complex and
monogenic traits. To this end, we developed PERHAPS (Paired-End short Reads-based HAPlotyping from next-generation
Sequencing data), a new and simple approach to directly call haplotypes from short-read, paired-end Next Generation
Sequencing (NGS) data. To benchmark this method, we considered the APOE classic polymorphism (*1/*2/*3/*4), since it
represents one of the best examples of functional polymorphism arising from the haplotype combination of two Single
Nucleotide Polymorphisms (SNPs). We leveraged the large Whole Exome Sequencing (WES) and SNP-array data obtained from
the multi-ethnic UK Biobank (UKBB, N=48,855). By applying PERHAPS, which pieces together the paired-end reads
according to their FASTQ labels, we extracted the haplotype data, along with their frequencies and the individual diplotypes.
Concordance rates between diplotypes called directly from WES and those generated through statistical pre-phasing and
imputation of SNP-array data were extremely high (>99%), whether the sample was stratified by SNP-array genotyping batch
or by self-reported ethnic group. Hardy-Weinberg Equilibrium tests and the comparison of the obtained haplotype frequencies
with those available from the 1000 Genomes Project further supported the reliability of PERHAPS. Notably, we were able to
determine the existence of the rare APOE*1 haplotype in two unrelated African subjects from UKBB, supporting its presence
at appreciable frequency (approximately 0.5%) in the African Yoruba population. Despite some technical
shortcomings, PERHAPS represents a novel and simple approach that will partly overcome the limitations in direct
haplotype calling from short read-based sequencing.
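
The read-pairing idea summarized in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: it assumes that per-read allele observations have already been extracted from the alignments as hypothetical `(read_name, snp_id, base)` tuples, where `read_name` is the FASTQ label shared by both mates of a pair. The allele table uses the commonly cited forward-strand definitions of the APOE alleles at rs429358 and rs7412, and `min_fraction` is an illustrative threshold, not a value taken from the paper.

```python
from collections import Counter, defaultdict

# Commonly cited APOE definitions as (rs429358 base, rs7412 base).
# Assumption: bases are reported on the forward strand.
APOE_ALLELES = {
    ("T", "T"): "*2",
    ("T", "C"): "*3",
    ("C", "C"): "*4",
    ("C", "T"): "*1",
}


def count_haplotypes(observations):
    """Count haplotypes supported by single fragments (read pairs).

    observations: iterable of (read_name, snp_id, base) tuples; this input
    layout is hypothetical and chosen only for illustration.
    """
    fragments = defaultdict(dict)
    for read_name, snp_id, base in observations:
        alleles = fragments[read_name]
        if snp_id in alleles and alleles[snp_id] != base:
            # Mates (or duplicate records) disagree: mark the site ambiguous.
            alleles[snp_id] = "N"
        else:
            alleles[snp_id] = base

    counts = Counter()
    for alleles in fragments.values():
        key = (alleles.get("rs429358"), alleles.get("rs7412"))
        # Only fragments covering both SNPs unambiguously are informative.
        if key in APOE_ALLELES:
            counts[APOE_ALLELES[key]] += 1
    return counts


def call_diplotype(counts, min_fraction=0.2):
    """Return a diplotype from haplotype counts, or None if unclear.

    min_fraction is a hypothetical noise filter: haplotypes supported by a
    smaller share of informative fragments are ignored.
    """
    total = sum(counts.values())
    if total == 0:
        return None
    kept = sorted(h for h, n in counts.items() if n / total >= min_fraction)
    if len(kept) == 1:
        return (kept[0], kept[0])  # homozygous
    if len(kept) == 2:
        return (kept[0], kept[1])  # heterozygous
    return None  # more than two haplotypes: not callable for a diploid sample


# Toy example: three informative fragments and one covering a single SNP.
observations = [
    ("frag1", "rs429358", "T"), ("frag1", "rs7412", "C"),
    ("frag2", "rs429358", "C"), ("frag2", "rs7412", "C"),
    ("frag3", "rs429358", "T"),
    ("frag4", "rs429358", "T"), ("frag4", "rs7412", "C"),
]
counts = count_haplotypes(observations)
diplotype = call_diplotype(counts)
print(dict(counts), diplotype)
```

Grouping by read name lets a fragment whose two mates each cover only one SNP still contribute a phased observation, which is the essential point of using paired-end information here.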