Biomedical Signal Processing and Control 13 (2014) 337–344
journal homepage: www.elsevier.com/locate/bspc
http://dx.doi.org/10.1016/j.bspc.2014.06.006
1746-8094/© 2014 Elsevier Ltd. All rights reserved.

Technical Note

Confidence masks for genome DNA copy number variations in applications to HR-CGH array measurements

Jorge Muñoz-Minjares, Jesús Cabal-Aragón, Yuriy S. Shmaliy ∗

Universidad de Guanajuato, Department of Electronics Engineering, Salamanca, 36885 Gto., Mexico

∗ Corresponding author. Tel.: +52 464 647 01 95; fax: +52 464 647 24 00. E-mail addresses: shmaliy@ugto.mx, shmaliy@ieee.org (Y.S. Shmaliy).

Article info

Article history: Received 21 November 2013; received in revised form 21 May 2014; accepted 19 June 2014

Keywords: Genome copy number variations; Confidence limit masks; HR-CGH microarray

Abstract

The array-comparative genomic hybridization (aCGH) and next-generation sequencing technologies enable cost-efficient, high-resolution detection of DNA copy number variations (CNVs). However, although the CNV estimates provided by different methods are often inconsistent with each other, little is known about the estimation errors. Based on our recent studies of the confidence limits for stepwise signals measured in noise, we develop an efficient algorithm for computing the upper and lower confidence boundary masks in order to guarantee the existence of genomic changes with a required probability. We suggest combining these masks with the estimates in order to give medical experts more information about the true CNV structures. Applications to high-resolution CGH microarray measurements show that, with a certain probability, some changes predicted by an estimator may not exist.

© 2014 Elsevier Ltd. All rights reserved.

1. Introduction

The deoxyribonucleic acid (DNA) of a genome is commonly recognized to be essential for human life. The DNA is usually double-stranded; therefore, the size of a gene or chromosome is often measured in base pairs (bp). A common unit of measurement is the kilobase (kb), equal to 1000 bp of DNA [1]. The DNA can demonstrate structural changes called copy number variations (CNVs), which are associated with diseases such as cancer [2]. A CNV can be defined as a DNA segment of 1 kb or larger that is present at a variable copy number in comparison with a reference genome [3]. The human genome, with 23 chromosome pairs, is estimated to be about 3.2 billion base pairs long and to contain 20,000–25,000 distinct genes [2]. It is known [4] that each CNV may range from about 1 kb to several megabases (Mb) in size.

To detect the CNVs at a resolution level of 10–25 kb [5], the array-comparative genomic hybridization (aCGH) technique was developed, employing chromosomal microarray analysis. Although it was reported that the high-resolution CGH (HR-CGH) arrays can accurately detect structural variations at a resolution of 5 kb [6] and even 200 bp [7], their regular resolution is still insufficient to detect short genomic changes. Progress has been achieved in the last few years following the emergence of the next-generation sequencing (NGS) technologies, which allow for the detection of CNVs with a resolution < 10 kb [8]. The NGS approach has generated extensive developments of the CNV detection methods [9–12], and several such methods, obtaining a resolution of 0.8–6 kb, were recently reviewed in [13].

Even though the conceptual steps in the aCGH and CNV-seq methods are different, the outputs are typically represented in the same scales [9]. The genomic location is often indexed by the probe number $n \in [1, M]$, where $M$ is the number of probes, counted with a unit step and ignoring “bad” or empty probes. The discrete point $n_l$ corresponds to the edge (breakpoint) $i_l$ on the genomic location scale, in kb or Mb, which is ultimately used to represent the CNVs. The breakpoints are placed as $0 < n_1 < \cdots < n_L < M$, where $n_l$, $l \in [1, L]$, is the $l$th breakpoint and $L$ is the number of breakpoints. The CNVs are represented with $L + 1$ segmental constant changes $a_j$, $j \in [1, L + 1]$, each characterizing a segment between $i_{j-1}$ and $i_j$ on the interval $[i_{j-1}, i_j - 1]$. Typically, the CNVs are normalized as $\log_2 (R/G) = \log_2 \mathrm{Ratio}$, where $R$ and $G$ are the fluorescent Red and Green intensities, respectively [14].

Genome CNVs are stepwise and sparse, with a limited number of breakpoints [15]. The detected structure is usually contaminated by intensive noise [16]. In the $\log_2 \mathrm{Ratio}$ scale, noise is commonly modeled as white Gaussian with equal or different segmental variances. Various statistical approaches have been developed in order to estimate CNVs in array CGH and NGS data. The two main goals are [17]: (1) to infer the number and statistical significance of the alterations and (2) to locate their boundaries accurately. A number of statistical methods have been tested on CNV measurements, as reported in [18], including wavelet-based, robust, adaptive kernel