Supplemental Materials to "Enhanced copy number variants detection from whole-exome sequencing data using EXCAVATOR2"

Romina D’Aurizio 1, Tommaso Pippucci 2, Lorenzo Tattini 3, Betti Giusti 3, Marco Pellegrini 1, Alberto Magi 3

1 LISM, Institute of Informatics and Telematics and Institute of Clinical Physiology, National Research Council, Pisa
2 Medical Genetics Unit, Sant’Orsola Malpighi Polyclinic, Bologna
3 Department of Computer Science, University of Pisa, Pisa
4 Department of Experimental and Clinical Medicine, University of Florence, Florence

August 8, 2016

Supplemental Methods and Data

Here we extended the read count (RC) approach (Magi et al., 2010) to the whole genome for the identification of CNVs from WES data. Indeed, we showed that, on average, around 30% of the reads produced by WES experiments align outside the targeted regions. For Off-target regions, as for In-target regions, if the sequencing process is uniform then the number of reads mapping to each genomic region is expected to be proportional to the number of times that region appears in the DNA sample. Under this assumption, the copy number of any genomic region can be estimated by counting the number of reads aligned to non-overlapping, contiguous genomic windows of predefined size L. Even though the sequencing processes producing In-target and Off-target reads are independent, the RC approach can therefore also be applied to Off-target regions. To this end, we extended the Exon Mean Read Count (EMRC), which we introduced in 2013 (Magi et al., 2013), by defining the Window Mean Read Count (WMRC), which accounts for both In-target exons and Off-target windows of fixed size.
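The windowed counting described above can be illustrated with a minimal Python sketch. It is not the EXCAVATOR2 implementation: the function name `window_read_counts`, the use of 0-based read start coordinates, and the rule of assigning each read to the window containing its start position are all assumptions made for illustration.

```python
def window_read_counts(read_starts, region_length, window_size):
    """Count reads in non-overlapping, contiguous windows of fixed size.

    Assumptions (hypothetical, for illustration only):
    - read_starts holds 0-based start coordinates of aligned reads;
    - a read is assigned to the window that contains its start position;
    - the last window may be shorter than window_size.
    """
    if window_size <= 0:
        raise ValueError("window_size must be positive")
    if region_length <= 0:
        raise ValueError("region_length must be positive")

    # Ceiling division: number of windows needed to tile the region.
    n_windows = -(-region_length // window_size)
    counts = [0] * n_windows

    for pos in read_starts:
        # Reads starting outside the region are ignored.
        if 0 <= pos < region_length:
            counts[pos // window_size] += 1
    return counts


# Toy example: a 30 bp region tiled by windows of size L = 10.
example_counts = window_read_counts([0, 5, 9, 10, 25, 99], 30, 10)
```

Under the uniformity assumption, a window whose count is markedly higher or lower than that of comparable windows would suggest a copy number gain or loss, once the biases discussed in the next section have been removed.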
1 GC-content and Mappability normalisation of WMRC

In our previous work (Magi et al., 2013) we demonstrated that EMRC data are affected by three sources of bias: GC content, genomic mappability and exon size. We removed these biases using the median normalisation approach introduced by Yoon et al. (2009) for GC content, which we extended to mappability and exon size. Here we show that WMRC data in Off-target windows are affected by similar GC content and mappability biases, which we corrected according to the following formula:

$$\widetilde{\mathrm{WMRC}}_w = \mathrm{WMRC}_w \cdot \frac{m}{m_X} \qquad (1)$$

where $\widetilde{\mathrm{WMRC}}_w$ is the corrected value and $\mathrm{WMRC}_w$ is the number of reads aligned to a genomic region of length $W$. $W$ varies according to the size of each targeted exon or, for Off-target regions, corresponds to the selected fixed size of the non-overlapping windows into which the intergenic regions of each chromosome are divided; $m_X$ is the median WMRC of all the windows that have the same $X$ value as window $w$ (where $X$ stands for GC content, mappability score and, only for In-target regions, exon size), and $m$ is the overall median of all the windows.
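As a hedged illustration of Equation (1), the following Python sketch applies the median normalisation for a single covariate X. The function name `median_normalise`, the representation of X as pre-binned labels (for example GC content rounded to the nearest percent), and the handling of groups with zero median are assumptions, not details taken from the EXCAVATOR2 code.

```python
from collections import defaultdict
from statistics import median


def median_normalise(wmrc, covariate):
    """Apply Equation (1): corrected = WMRC_w * m / m_X.

    wmrc      -- list of WMRC values, one per window
    covariate -- list of X values per window, assumed already binned
                 (hypothetical choice, e.g. GC content in percent)
    """
    if len(wmrc) != len(covariate):
        raise ValueError("wmrc and covariate must have the same length")

    # m: overall median across all windows.
    overall_median = median(wmrc)

    # m_X: median WMRC of the windows sharing the same X value.
    groups = defaultdict(list)
    for value, x in zip(wmrc, covariate):
        groups[x].append(value)
    group_median = {x: median(values) for x, values in groups.items()}

    corrected = []
    for value, x in zip(wmrc, covariate):
        m_x = group_median[x]
        # Assumption: windows in a group with zero median are left undefined.
        corrected.append(value * overall_median / m_x if m_x > 0 else float("nan"))
    return corrected


# Toy example: two GC bins (40% and 60%) with different typical coverage.
raw = [10.0, 20.0, 30.0, 40.0]
gc_bins = [40, 40, 60, 60]
normalised = median_normalise(raw, gc_bins)
```

After the correction, every group of windows sharing the same X value has the same median, which is what removes the dependence on X. The sketch handles one covariate at a time; how the corrections for GC content, mappability and exon size are combined is not specified in this section, so applying the function sequentially for each covariate would be a further assumption.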