IJISET - International Journal of Innovative Science, Engineering & Technology, Vol. 2 Issue 3, March 2015. www.ijiset.com ISSN 2348 – 7968

Detection of Copy Number Variations in Genomic Regions

Rituparna Sinha 1, Prianka Kundu 1
1 Information Technology Department, Heritage Institute of Technology, Kolkata, West Bengal, India

Abstract
Copy number variation (CNV) is a form of structural variation caused by duplications or deletions of a large DNA segment. It may have a vital impact on human health, causing many neurological diseases as well as cancer. Detecting genomic regions with copy number variation is a challenging task. Our method is based on the read depth approach of next generation sequencing (NGS) technology. We use the Smith-Waterman algorithm to align the NGS short reads and calculate the read count per window. The read count data may suffer from mappability bias, which can lead to false detection of variants. To deal with this issue, we introduce a k-mer based normalization technique. The smoothed read count data is then transformed to its corresponding statistical measure. Finally, we apply a clustering technique to identify the regions with duplications and deletions. Our approach achieves higher accuracy than existing packages with a low false positive rate (FPR), and it is applicable to both small and large genomic datasets.

Keywords: CNV, NGS, Mappability bias, Structural Variations, Z-score, K-means.

1. Introduction
Copy number variation (CNV) [2] is a type of structural variation (SV) [3] in the mammalian genome. It is also called genomic variation and refers to the duplication or deletion of a DNA segment. The copy number is the number of copies of a given segment of genomic DNA. Such structural variations, i.e. duplications or deletions, affect DNA segments larger than 1 kbp.
Although there are other forms of structural variation, copy number variation is the major contributor to various diseases such as autism, schizophrenia, Alzheimer's disease and cancer [8,9]. Earlier, fluorescence in situ hybridization (FISH) [10] and array comparative genomic hybridization (aCGH) [1] were used to detect copy number variations. However, these methods suffer from certain limitations: they cannot identify regions with translocations and inversions (other forms of structural variation). The reason is that these are purely chromosomal rearrangements and do not result in any loss or gain of genetic material. In addition, aCGH [1] cannot detect single base pair changes in DNA. Furthermore, these techniques can be affected by noise due to cross-hybridization between probe and target sequences.

Nowadays, many high-throughput sequencing methods, collectively known as next generation sequencing (NGS) [11,12], have been developed. NGS technology has brought a revolutionary change in the field of life science. These methods rely on massively parallel sequencing of millions of DNA fragments in a single sequencing run.

Next, we discuss some of the existing methods to detect structural variations. SegSeq [4] is a method that compares tumor samples to a reference to detect copy number aberrations. SegSeq segments the chromosome, i.e. it partitions the genome into fixed-size windows. The ratio between the tumor sample read count and the reference read count is computed for every window. If the ratio is greater than 1.5, the segment is called a gain; if the ratio is below 0.5, the segment is called a loss. However, in window-based approaches the window size is fixed and must be defined in advance. This can be a major disadvantage because the performance of the algorithm depends on the size of the window.
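The fixed-window ratio test described above can be illustrated with a minimal sketch. It is not the SegSeq implementation: the function names `window_counts` and `call_windows`, the "undetermined" label for windows without reference reads, and the assumption that both samples were sequenced to comparable depth are ours, introduced only for illustration. Only the 1.5 and 0.5 thresholds come from the description above.

```python
from collections import Counter


def window_counts(read_starts, window_size, n_windows):
    """Count reads per fixed-size window from 0-based read start positions."""
    counts = Counter(pos // window_size for pos in read_starts)
    return [counts.get(w, 0) for w in range(n_windows)]


def call_windows(tumor_starts, ref_starts, genome_length, window_size,
                 gain=1.5, loss=0.5):
    """Label each window as gain, loss or neutral from the tumor/reference ratio.

    Assumption: both samples have comparable sequencing depth, so the raw
    read counts can be compared without library-size normalization.
    """
    n_windows = -(-genome_length // window_size)  # ceiling division
    tumor = window_counts(tumor_starts, window_size, n_windows)
    ref = window_counts(ref_starts, window_size, n_windows)

    calls = []
    for t, r in zip(tumor, ref):
        if r == 0:
            # The ratio is undefined without reference reads (our convention).
            calls.append("undetermined")
            continue
        ratio = t / r
        if ratio > gain:
            calls.append("gain")
        elif ratio < loss:
            calls.append("loss")
        else:
            calls.append("neutral")
    return calls


# Toy data: a 300 bp region with a duplication followed by a deletion.
ref_reads = [10, 20, 30, 40, 110, 120, 130, 140, 210, 220, 230, 240]
tumor_reads = [5, 15, 25, 35, 45, 55, 65, 150, 205, 215, 225, 235]

print(call_windows(tumor_reads, ref_reads, 300, 100))
print(call_windows(tumor_reads, ref_reads, 300, 300))
```

With 100 bp windows the toy data yields one gain, one loss and one neutral window. With a single 300 bp window the two events cancel out and the region is called neutral, which illustrates why the choice of window size affects the result.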
rSW-seq [5] is another method, based on a modification of the Smith-Waterman algorithm, which is why it is called recursive Smith-Waterman-seq. It observes the read ratio, i.e. (total number of tumor reads)/(total number of normal reads). A copy number alteration is called where the observed read ratio deviates from the expected read ratio. The method uses a moving window to generate the read ratios along the genome, so the window size does not need to be specified in advance. This is the main advantage of rSW-seq over SegSeq.

Another copy number alteration (CNA) detection method is CMDS [6], i.e. correlation matrix diagonal segmentation.