We observed that BFAST generates alignments with too many mismatches. Removing alignments with more than two mismatches yielded high-quality BAM files (Fig. 3). In addition, contrary to the GATK recommendation for whole-exome experiments, we found that in experiments with low-to-medium coverage (~40x), a depth filter combined with the Best Practices V3 quality filters is essential to remove a large number of false positives. We also calculated the number of called SNVs as a function of sequencing depth (Fig. 4). Although the overall number of SNVs still grows slowly, the number of exonic SNVs appears to plateau, indicating that a further increase in sequencing depth would not significantly improve the results.

The MGP is sequencing 300 control individuals. So far, 118 samples have been sequenced, yielding a total of 204,935 variants, of which 69,181 are new (not previously reported in any public repository). Interestingly, as new individuals are sequenced, the number of new variants keeps growing at an almost constant rate. We expect to obtain more than 110,000 previously unreported variants with 300 control individuals (Fig. 5). Using control variants to filter out population variation significantly reduces the number of selected genes/variants reported by the analysis pipelines.

Abstract
Next Generation Sequencing (NGS) technologies have greatly improved our ability to mine variants across the entire genome. The reliability of variant calling depends strongly on the sequencing instrument used, owing to the sequencing chemistry and the intrinsic properties of each sequencing technology. Here, we focus on variants detected from color-space sequences generated by AB SOLiD 5500 XL sequencers. We systematically analyzed 120 human exomes from the Spanish population, identifying the main drivers of bias in SOLiD color-space data and, in turn, optimizing an analysis pipeline to obtain high-quality variants.
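The two filters discussed in the results above (discarding alignments with more than two mismatches, and applying a depth threshold to called variants) can be sketched as follows. This is a minimal illustration, not the pipeline's actual code. It assumes SAM-formatted text records that carry the standard `NM:i` tag (edit distance to the reference, used here as a proxy for the mismatch count) and VCF records whose INFO column contains a `DP=` entry. All function names are hypothetical. The 6x cutoff is the depth filter value listed in the Fig. 1 pipeline; treating it as an inclusive minimum is an assumption.

```python
MAX_MISMATCHES = 2  # alignments with more than two mismatches are discarded
MIN_DEPTH = 6       # depth filter (6x); the inclusive bound is an assumption


def edit_distance(sam_record):
    """Return the NM:i value of a SAM alignment record, or None if absent."""
    fields = sam_record.rstrip("\n").split("\t")
    for tag in fields[11:]:  # optional tags follow the 11 mandatory columns
        if tag.startswith("NM:i:"):
            return int(tag[len("NM:i:"):])
    return None


def keep_alignment(sam_record, max_mismatches=MAX_MISMATCHES):
    """Keep header lines and alignments with at most max_mismatches."""
    if sam_record.startswith("@"):
        return True
    nm = edit_distance(sam_record)
    # Records without an NM tag are dropped because they cannot be checked.
    return nm is not None and nm <= max_mismatches


def site_depth(vcf_record):
    """Return the DP value from the INFO column of a VCF record, or None."""
    fields = vcf_record.rstrip("\n").split("\t")
    for entry in fields[7].split(";"):
        if entry.startswith("DP="):
            return int(entry[len("DP="):])
    return None


def passes_depth(vcf_record, min_depth=MIN_DEPTH):
    """Keep header lines and variant sites covered by at least min_depth reads."""
    if vcf_record.startswith("#"):
        return True
    depth = site_depth(vcf_record)
    return depth is not None and depth >= min_depth
```

Applied line by line to a SAM stream and then to the resulting VCF, `keep_alignment` and `passes_depth` reproduce the order used in the text: alignments are cleaned before variant calling, and the depth filter acts on the called variants alongside the quality filters.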
Methodology
The Medical Genome Project (MGP) aims to characterize a large number of rare genetically-based diseases. From the MGP, we selected a set of individuals affected by several hereditary rare diseases, their healthy relatives, and a set of healthy control individuals from the Spanish population.

Results
After QC filtering, sequences were mapped with BFAST, generating BAM files with a mean coverage above 40x (Fig. 2 shows the mean coverage per sample).

Fig. 1 labels: Mendelian filter of deleterious variants; xsq file generated by Applied Biosystems SOLiD 5500 XL sequencer.

Detecting high quality variants from color-space data

Francisco J. López (1), Antonio Rueda (1), Javier Pérez-Florido (1), Pablo Arce (1), Luis-Miguel Cruz (1), José Carbonell (3), Jorge Jiménez-Almazán (3), Enrique Vidal (3), Guillermo Antiñolo (1,2), Joaquin Dopazo (3) and Javier Santoyo (1)

(1) Genomics and Bioinformatics Platform of Andalusia (GBPA), Medical Genome Project (MGP), INSUR building, Albert Einstein st., Cartuja 93 Scientific and Technology Park, 41092 Seville, Spain
(2) Unidad de gestión clínica de genética, reproducción y medicina fetal. Instituto de Biomedicina de Sevilla (IBIS), Hospital Universitario Virgen del Rocío-CSIC-University of Seville, Manuel Siurot Av., 41013, Seville, Spain
(3) Institute of Computational Genomics, Principe Felipe Research Centre (CIPF), Eduardo Primo Yúfera st. 3, 46012, Valencia, Spain
javier.santoyo@juntadeandalucia.es

The analysis of a variety of rare diseases revealed genes that are selected as potential disease-causing candidates regardless of the disease being analyzed. To shed light on this observation, we examined the number of variants found in each gene and the number of samples in which the gene appears mutated (Fig. 6). Interestingly, a number of genes carry many different variants and appear mutated in many samples. Identifying these genes may help prioritize selected genes by removing false candidates.
In addition, a number of genes host few variants but appear mutated in most of the samples, which may indicate population variants.

Sanger sequencing validation has shown that more than 90% of the variants reported by our pipeline are real. Moreover, insertions and deletions, which are usually difficult to analyze, are clearly detected. For example, in a family affected by a recessive disease, a heterozygous deletion in a gene with X-linked recessive inheritance is detected in the healthy mother and in the X chromosome of the two affected sons (Fig. 7).

Conclusions
The system that we have developed for color-space data provides high-quality variants with an extremely low rate of false positives. The critical aspects for achieving this performance are:
i. BAM filtering, since BFAST allows an unexpectedly high number of mismatches when mapping short reads in color space.
ii. The selection of variant filters and quality thresholds recommended by GATK Best Practices V3, combined with a depth threshold, which yields high-quality calls.
iii. The inclusion of control individuals in the analysis, which is critical because they remove population variants that can confound the interpretation of the final variant set.

Fig. 1. Pipeline. Stages: Primary analysis, Secondary analysis. Steps:
Fasta and Qual files generation
Duplicated reads removal
Read mapping (BFAST v0.7.0a)
BAM cleaning: duplicated alignments and mismatched reads removal
BAM realignment and SNV calling (GATK v1.4.14)
Variant quality filter (GATK Best Practices V3) and depth filter (6x)
• Variant annotation (ANNOVAR)
• Variant function impact prediction (SIFT and Polyphen)
• Assessment of variant frequency (1000 Genomes and dbSNP DBs)
Familywise analysis: cases vs. healthy filter of haplotypes
Candidate annotated variants related to the disease

Fig. 2. Mean coverage per sample. The dashed line indicates 40x.
Fig. 3. Screenshot of the non-filtered BAM (top) vs. the filtered BAM (bottom).
Fig. 4.
Number of discovered SNVs (total, exonic and others) as a function of the number of mapped reads.
Fig. 5. The graph shows how the number of (total and new) variants grows in our database as the number of control individuals increases.
Fig. 6. Each point represents a gene. The X axis represents the number of samples in which the gene is mutated. The Y axis represents the number of different variants found in that gene across all samples.
Fig. 7. BAM file screenshots for a given region of chromosome X. Panels: Mother (healthy but disease carrier), Father (healthy), Son and Son (both affected).

F1000 Posters: Use Permitted under Creative Commons License.
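The control-based filtering and the per-gene summary plotted in Fig. 6 can be illustrated with a short sketch. It is a simplified example under stated assumptions, not the pipeline's implementation: variants are represented as hypothetical `(chromosome, position, ref, alt)` tuples, calls as `(sample, gene, variant)` triples, and both function names are invented for illustration.

```python
from collections import defaultdict


def remove_control_variants(case_variants, control_variants):
    """Drop case variants that were also observed in control individuals."""
    seen_in_controls = set(control_variants)
    return [v for v in case_variants if v not in seen_in_controls]


def gene_recurrence(calls):
    """Summarize calls per gene as (mutated samples, distinct variants).

    The two numbers correspond to the X and Y axes described for Fig. 6.
    """
    samples = defaultdict(set)
    variants = defaultdict(set)
    for sample, gene, variant in calls:
        samples[gene].add(sample)
        variants[gene].add(variant)
    return {gene: (len(samples[gene]), len(variants[gene])) for gene in samples}
```

Genes with high values on both counts match the pattern the text describes as likely false candidates, while genes with few variants but many mutated samples match the pattern attributed to population variants.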