Jie Huang is a Research Professor at the Department of Global Health, School of Public Health, Peking University, Beijing, China. His main research interest is in bioinformatics and population genomics. He has previously published extensively in the field of phasing and imputation. The first chapter of his PhD thesis is about imputation of the UK10K data, which was also published as a Nature Communications paper in 2015.

Stefano Pallotti is a postdoctoral fellow at the Genetics and Animal Breeding Group, School of Pharmacy, University of Camerino, Italy. His current research focuses on the molecular determinants of the hair follicle cycle in fibre-producing animals.

Qianling Zhou is an Assistant Professor at the Department of Maternal and Child Health, School of Public Health, Peking University, Beijing, China. Her main interest is using mixed-methods strategies (a combination of quantitative and qualitative research) to investigate maternal and child nutrition, health education and promotion.

Marcus Kleber is a Research Scientist at the Vth Department of Medicine, Medical Faculty of Mannheim, University of Heidelberg, Mannheim, Germany and at SYNLAB MVZ Humangenetik Mannheim, Mannheim, Germany. His main interest is the genomic research of complex common traits, including the associations of fats and carbohydrates with cardiovascular disease and mortality.

Xiaomeng Xin is an undergraduate student at Skidmore College, Saratoga Springs, NY, USA, applying for graduate study in bioinformatics.

Daniel A. King is a Clinical Fellow at the Department of Medicine, Stanford University School of Medicine, Stanford, CA, USA. His main interest is mutation hunting in large-scale genomic data. He was involved in an exome sequencing study of 12,000 children with rare disease and their parents, in which he developed new computational tools to identify large genetic aberrations.
Valerio Napolioni is an Associate Professor of Molecular Biology at the School of Biosciences and Veterinary Medicine, University of Camerino, Camerino, Italy. His main research interest is the genetic architecture of human neuropsychiatric complex traits.

Submitted: 22 July 2020; Received (in revised form): 11 October 2020

© The Author(s) 2020. Published by Oxford University Press. All rights reserved. For Permissions, please email: journals.permissions@oup.com

Briefings in Bioinformatics, 00(00), 2020, 1–13
doi: 10.1093/bib/bbaa320
Problem Solving Protocol

PERHAPS: Paired-End short Reads-based HAPlotyping from next-generation Sequencing data

Jie Huang, Stefano Pallotti, Qianling Zhou, Marcus Kleber, Xiaomeng Xin, Daniel A. King and Valerio Napolioni

Corresponding authors: Valerio Napolioni, Ph.D., School of Biosciences and Veterinary Medicine, University of Camerino, Via Gentile Da Varano III, 62032, Camerino, Italy. Tel.: +39 0737403257; Fax: +39 0737636216; Email: valerio.napolioni@unicam.it; Jie Huang, M.D., M.P.H., Ph.D., Department of Global Health, Peking University School of Public Health, 38 Xueyuan Rd, Haidian District, Beijing, China. Tel.: +86 15210081889; Email: jiehuang001@pku.edu

Abstract

The identification of rare haplotypes may greatly expand our knowledge of the genetic architecture of both complex and monogenic traits. To this end, we developed PERHAPS (Paired-End short Reads-based HAPlotyping from next-generation Sequencing data), a new and simple approach to directly call haplotypes from short-read, paired-end Next Generation Sequencing (NGS) data. To benchmark this method, we considered the APOE classic polymorphism (∗1/∗2/∗3/∗4), since it represents one of the best examples of a functional polymorphism arising from the haplotype combination of two Single Nucleotide Polymorphisms (SNPs). We leveraged the large Whole Exome Sequencing (WES) and SNP-array data obtained from the multi-ethnic UK BioBank (UKBB, N = 48,855).
By applying PERHAPS, which pieces together paired-end reads according to their FASTQ labels, we extracted the haplotype data, along with the haplotype frequencies and the individual diplotypes. Concordance rates between diplotypes called directly from WES and those generated through statistical pre-phasing and imputation of SNP-array data are extremely high (>99%), whether the sample is stratified by SNP-array genotyping batch or by self-reported ethnic group. Hardy-Weinberg Equilibrium tests and the comparison of the obtained haplotype frequencies with those available from the 1000 Genomes Project further supported the reliability of PERHAPS. Notably, we were able to determine the existence of the rare APOE ∗1 haplotype in two unrelated African subjects from UKBB, supporting its presence at appreciable frequency (approximately 0.5%) in the African Yoruba population. Despite some technical shortcomings, PERHAPS represents a novel and simple approach that will partly overcome the limitations in direct haplotype calling from short-read sequencing.
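The read-based idea summarized in the abstract can be illustrated with a minimal Python sketch. It is not the PERHAPS implementation. It assumes the bases observed at the two APOE SNPs have already been extracted from aligned reads as (read_name, snp_id, base) tuples, with both mates of a pair sharing the same read name. The SNP identifiers rs429358 and rs7412 and the allele-to-haplotype table are the commonly cited APOE definitions and are not stated in this excerpt. The function names and the `min_fraction` threshold are hypothetical choices made for illustration.

```python
from collections import Counter, defaultdict

# Alleles at (rs429358, rs7412) -> APOE haplotype label.
# Commonly cited mapping; an assumption here, not taken from the excerpt.
APOE_HAPLOTYPES = {
    ("C", "T"): "*1",
    ("T", "T"): "*2",
    ("T", "C"): "*3",
    ("C", "C"): "*4",
}


def call_haplotypes(observations, snp_ids=("rs429358", "rs7412")):
    """Count haplotypes supported by fragments that cover both SNPs.

    observations: iterable of (read_name, snp_id, base) tuples. Mates of
    one pair share read_name, so their bases are merged into one fragment.
    If both mates report the same SNP, the later observation wins
    (a simplification for this sketch).
    """
    fragments = defaultdict(dict)
    for read_name, snp_id, base in observations:
        fragments[read_name][snp_id] = base

    counts = Counter()
    for alleles in fragments.values():
        # Only fragments that observe both SNPs are informative for phase.
        if all(snp in alleles for snp in snp_ids):
            label = APOE_HAPLOTYPES.get(tuple(alleles[snp] for snp in snp_ids))
            if label is not None:
                counts[label] += 1
    return counts


def call_diplotype(counts, min_fraction=0.2):
    """Return a diplotype from haplotype counts, or None if ambiguous.

    min_fraction is a hypothetical noise filter: a haplotype must be
    carried by at least this share of informative fragments.
    """
    total = sum(counts.values())
    if total == 0:
        return None
    supported = sorted(h for h, n in counts.items() if n / total >= min_fraction)
    if len(supported) == 1:
        return (supported[0], supported[0])  # homozygous
    if len(supported) == 2:
        return (supported[0], supported[1])  # heterozygous
    return None  # more than two haplotypes: not callable in this sketch


if __name__ == "__main__":
    example = [
        ("pair1", "rs429358", "T"), ("pair1", "rs7412", "C"),
        ("pair2", "rs429358", "C"), ("pair2", "rs7412", "C"),
        ("pair3", "rs429358", "T"),  # covers one SNP only, so it is skipped
    ]
    print(call_diplotype(call_haplotypes(example)))
```

Keying fragments by read name is what lets two mates that each cover a different SNP contribute a single phased observation, which is the property the abstract attributes to the paired-end design.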