Biomedical Signal Processing and Control 13 (2014) 337–344
Journal homepage: www.elsevier.com/locate/bspc
Technical Note
Confidence masks for genome DNA copy number variations in
applications to HR-CGH array measurements
Jorge Muñoz-Minjares, Jesús Cabal-Aragón, Yuriy S. Shmaliy∗
Universidad de Guanajuato, Department of Electronics Engineering, Salamanca, 36885 Gto., Mexico
Article info
Article history:
Received 21 November 2013
Received in revised form 21 May 2014
Accepted 19 June 2014
Keywords:
Genome copy number variations
Confidence limit masks
HR-CGH microarray
Abstract
Array-comparative genomic hybridization (aCGH) and next generation sequencing technologies enable
cost-efficient, high-resolution detection of DNA copy number variations (CNVs). However, while the CNV
estimates provided by different methods are often inconsistent with each other, little is known
about the estimation errors. Based on our recent studies of the confidence limits for stepwise signals
measured in noise, we develop an efficient algorithm for computing the upper and lower confidence
boundary masks in order to guarantee the existence of genomic changes with a required probability. We
suggest combining these masks with estimates in order to give medical experts more information about
true CNV structures. Applications to high-resolution CGH microarray measurements show that some
changes predicted by an estimator may, with a certain probability, not exist.
© 2014 Elsevier Ltd. All rights reserved.
1. Introduction
The deoxyribonucleic acid (DNA) of a genome is commonly
recognized to be essential for human life. DNA is usually
double-stranded; therefore, the size of a gene or chromosome is
often measured in base pairs (bp). A common unit of measurement is
the kilobase (kb), equal to 1000 bp of DNA [1]. DNA can exhibit
structural changes called copy number variations (CNVs), which are
associated with diseases such as cancer [2]. A CNV can be defined
as a DNA segment of 1 kb or larger that is present at a variable
copy number in comparison with a reference genome [3]. The human
genome, with 23 chromosomes, is estimated to be about 3.2 billion
base pairs long and to contain 20,000–25,000 distinct genes [2].
It is known [4] that each CNV may range from about 1 kb to several
megabases (Mb) in size. To detect CNVs at a resolution of
10–25 kb [5], the array-comparative genomic hybridization (aCGH)
technique was developed, employing chromosomal microarray analysis.
Although high-resolution CGH (HR-CGH) arrays were reported to
detect structural variations accurately at a resolution of 5 kb [6]
and even 200 bp [7], their regular resolution is still insufficient
to detect short genomic changes. Progress has been achieved in the
last few years following the emergence of next generation
sequencing (NGS) technologies, which allow for the detection of
CNVs with resolution < 10 kb [8]. The NGS approach has stimulated
extensive development of CNV detection methods [9–12], and several
such methods achieving a resolution of 0.8–6 kb were recently
reviewed in [13].
∗ Corresponding author. Tel.: +52 464 647 01 95; fax: +52 464 647 24 00.
E-mail addresses: shmaliy@ugto.mx, shmaliy@ieee.org (Y.S. Shmaliy).
Even though the conceptual steps in aCGH and CNV-seq methods
are different, the outputs are typically represented in the same
scales [9]. The genomic location is often given by the probe index
$n \in [1, M]$, where $M$ is the number of probes following with a
unit step, ignoring "bad" or empty probes. The $n_l$th discrete
point corresponds to the $i_l$th edge (breakpoint) in the genomic
location scale in kb or Mb, which is finally used to represent the
CNVs. The breakpoints are placed as $0 < n_1 < \cdots < n_L < M$,
where $n_l$, $l \in [1, L]$, is the $l$th breakpoint and $L$ is the
number of the breakpoints. The CNVs are represented with $L + 1$
segmental constant changes $a_j$, $j \in [1, L + 1]$, characterizing
a segment between $i_{j-1}$ and $i_j$ on an interval
$[i_{j-1}, i_j - 1]$. Typically, the CNVs are normalized as
$\log_2 (R/G) = \log_2 \mathrm{Ratio}$, where $R$ and $G$ are the
fluorescent Red and Green intensities, respectively [14].
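The segment representation above can be illustrated with a minimal Python sketch. It is not the authors' algorithm: the function names `log2_ratio` and `piecewise_constant` are hypothetical, and the edge conventions $i_0 = 1$ and $i_{L+1} = M + 1$ are assumptions made here so that the first and last segments are well defined. The sketch only shows how $L$ breakpoints and $L + 1$ constant levels expand into a stepwise profile over $M$ probes, and how a single probe is normalized to the $\log_2$ Ratio scale.

```python
import math


def log2_ratio(red, green):
    """Return log2(R/G) for one probe; both intensities must be positive."""
    if red <= 0 or green <= 0:
        raise ValueError("intensities must be positive")
    return math.log2(red / green)


def piecewise_constant(breakpoints, levels, num_probes):
    """Expand breakpoints and segment levels into a stepwise profile.

    breakpoints: [i_1, ..., i_L] with 0 < i_1 < ... < i_L < M
    levels:      [a_1, ..., a_{L+1}], one constant per segment
    num_probes:  M, the number of probes, indexed n = 1..M

    Segment j covers probes i_{j-1} .. i_j - 1. The conventions
    i_0 = 1 and i_{L+1} = M + 1 are assumptions of this sketch.
    """
    if len(levels) != len(breakpoints) + 1:
        raise ValueError("need L + 1 levels for L breakpoints")
    # Check the ordering 0 < i_1 < ... < i_L < M from the text.
    edges = [0] + list(breakpoints) + [num_probes]
    if any(lo >= hi for lo, hi in zip(edges, edges[1:])):
        raise ValueError("breakpoints must satisfy 0 < i_1 < ... < i_L < M")
    bounds = [1] + list(breakpoints) + [num_probes + 1]
    profile = []
    for j, level in enumerate(levels):
        # Segment j + 1 holds bounds[j + 1] - bounds[j] probes.
        profile.extend([level] * (bounds[j + 1] - bounds[j]))
    return profile


# Illustrative values only (not data from the paper): two breakpoints
# split M = 10 probes into three segments of lengths 3, 3, and 4.
profile = piecewise_constant([4, 7], [0.0, 0.58, -1.0], 10)
```

Under these assumptions, a probe with equal Red and Green intensities maps to 0 in the $\log_2$ Ratio scale, while doubling or halving $R$ relative to $G$ shifts the value by $+1$ or $-1$, respectively.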
Genome CNVs are stepwise and sparse, with a limited number of
breakpoints [15]. The detected structure is usually contaminated
by intense noise [16]. In the $\log_2$ Ratio scale, noise is
commonly modeled as white Gaussian with equal or different segmental
variances. Various statistical approaches have been developed in
order to estimate CNVs in array CGH and NGS data. The two main goals
are [17]: (1) to infer the number and statistical significance of the
alterations and (2) to locate their boundaries accurately. A number
of statistical methods have been tested on CNV measurements, as
observed in [18], including wavelet-based, robust, adaptive kernel
http://dx.doi.org/10.1016/j.bspc.2014.06.006
1746-8094/© 2014 Elsevier Ltd. All rights reserved.