Comparison of different normalization assumptions for analyses of DNA methylation
data from the cancer genome
Dong Wang a,⁎,1, Yuannv Zhang a,1, Yan Huang a,1, Pengfei Li a, Mingyue Wang a, Ruihong Wu a, Lixin Cheng a, Wenjing Zhang a, Yujing Zhang a, Bin Li a, Chenguang Wang a, Zheng Guo a,b,⁎
a College of Bioinformatics Science and Technology, Harbin Medical University, Harbin, China
b College of Life Science, University of Electronic Science and Technology of China, Chengdu, China
Article info
Article history: Accepted 22 June 2012; Available online 4 July 2012
Keywords: DNA methylation; Microarray; Data normalization; Cancer genome; Differential methylation; Reproducibility
Abstract
Some researchers normalize DNA methylation array data to remove technical artifacts
introduced by experimental differences in sample preparation, array processing and other
factors. Others analyze methylation arrays without normalization, on the grounds that
current normalization methods may distort real differences between normal and cancer
samples: cancer genomes may be extensively hypomethylated, and the total amount of CpG
methylation might differ substantially among samples. In this study, using eight datasets
generated with the Infinium HumanMethylation27 assay, we systematically analyzed the
global distribution of DNA methylation changes in cancer compared with normal controls
and its effect on data normalization for selecting differentially methylated (DM) genes.
More DM genes were found in the Quantile/Lowess-normalized data than in the
non-normalized data. The DM genes additionally selected in the Quantile/Lowess-normalized
data showed significantly consistent methylation states in an independent dataset for the
same cancer, indicating that these extra DM genes were effective biological signals
related to the disease. These results suggest that normalization can increase the power
to detect DM genes in the context of diagnostic markers, which are usually characterized
by relatively large effect sizes. We also evaluated the reproducibility of DM discoveries
for a particular cancer type: most of the DM genes additionally detected in one dataset
showed the same methylation directions in the other dataset for the same cancer type,
indicating that these DM genes were also effective biological signals in the other
dataset. Furthermore, some DM genes detected in different studies of a particular cancer
type were significantly reproducible at the functional level.
© 2012 Elsevier B.V. All rights reserved.
1. Introduction
DNA methylation, one of the major epigenetic mechanisms of DNA
modification, plays a significant role in gene expression regulation,
organism development, X chromosome inactivation and genetic imprinting in
vertebrates (Laird, 2003). Changes in methylation patterns and levels
have been shown to be associated with various diseases including cancers
(Chari et al., 2010; Ko et al., 2010; Teschendorff et al., 2010; Walker et al.,
2011). Genome-wide methylation arrays have been used to identify hundreds
of genes with differential DNA methylation in their promoters (hereafter
referred to as DM genes) in various types of cancers (Bediaga et al., 2010;
Lugthart et al., 2011; Melnikov et al., 2009; Taylor et al., 2007). These
genes provide insights into cancer biology, as well as useful biomarkers
for predicting cancer outcomes and drug targets (Lugthart et al.,
2011). However, DNA methylation data are subject to multiple sources
of technical artifacts as a result of experimental differences in sample
preparation, array processing and other factors (Leek et al., 2010;
Quackenbush, 2002). To remove the experimental variation introduced
by these technical artifacts, normalization algorithms such as the
Quantile (Bolstad et al., 2003) and Lowess (Quackenbush, 2002) algorithms,
initially developed for gene expression microarray analyses, have often
been applied directly to methylation array analyses. Lowess normalization
was designed for two-color gene expression arrays, where it compensates
for non-linear dye bias (Quackenbush, 2002). It applies Lowess smoothing
to the ‘MA-plot’, dividing the data domain into narrow intervals with a
sliding-window approach. Quantile normalization makes the distribution of
probe intensities identical across all arrays in a set: within each array,
every value is replaced by the mean, taken across arrays, of the values at
the same quantile (Bolstad et al., 2003). The
Gene 506 (2012) 36–42
Abbreviations: DM, differentially methylated; DE, differentially expressed; SNR, signal
to noise ratio; GEO, Gene Expression Omnibus; TCGA, The Cancer Genome Atlas;
U, unmethylated; M, methylated; POG, percentage of overlapping genes; PPI, protein–protein
interaction; HPRD, Human Protein Reference Database; FDR, false discovery rate.
⁎ Corresponding authors at: College of Bioinformatics Science and Technology,
Harbin Medical University, Harbin 150086, China. Tel./fax: +86 45186615933.
E-mail addresses: wangdong@ems.hrbmu.edu.cn (D. Wang),
guoz@ems.hrbmu.edu.cn (Z. Guo).
1 These authors contributed equally to this work.
doi:10.1016/j.gene.2012.06.075
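The quantile normalization described in the Introduction can be illustrated with a minimal sketch. It is not the authors' implementation: the function name `quantile_normalize`, the toy intensity values and the tie-handling rule (ties broken by probe position) are assumptions made for illustration only. It uses only the Python standard library.

```python
from statistics import mean


def quantile_normalize(arrays):
    """Quantile-normalize a list of equal-length arrays (one list per sample).

    Illustrative simplification: tied values are ranked by their original
    position rather than averaged.
    """
    n_probes = len(arrays[0])
    if any(len(a) != n_probes for a in arrays):
        raise ValueError("all arrays must have the same number of probes")

    # Sort each array independently, then average across arrays at each rank.
    sorted_arrays = [sorted(a) for a in arrays]
    rank_means = [mean(column) for column in zip(*sorted_arrays)]

    normalized = []
    for a in arrays:
        # order[r] is the index of the probe that holds rank r in this array.
        order = sorted(range(n_probes), key=lambda i: a[i])
        out = [0.0] * n_probes
        for rank, idx in enumerate(order):
            # Replace the value with the across-array mean for its rank.
            out[idx] = rank_means[rank]
        normalized.append(out)
    return normalized


# Hypothetical toy data: three arrays with four probes each.
toy = [
    [5.0, 2.0, 3.0, 4.0],
    [4.0, 1.0, 4.5, 2.0],
    [3.0, 4.0, 6.0, 8.0],
]
normalized = quantile_normalize(toy)
```

After normalization every array contains the same set of values, so all arrays share one empirical distribution, while the rank order of probes within each array is preserved. This forced equality of distributions is the assumption that may be questioned when global methylation levels truly differ between cancer and normal samples.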