Comparison of different normalization assumptions for analyses of DNA methylation data from the cancer genome

Dong Wang a,⁎,1, Yuannv Zhang a,1, Yan Huang a,1, Pengfei Li a, Mingyue Wang a, Ruihong Wu a, Lixin Cheng a, Wenjing Zhang a, Yujing Zhang a, Bin Li a, Chenguang Wang a, Zheng Guo a,b,⁎

a College of Bioinformatics Science and Technology, Harbin Medical University, Harbin, China
b College of Life Science, University of Electronic Science and Technology of China, Chengdu, China

Article info
Article history: Accepted 22 June 2012; Available online 4 July 2012
Keywords: DNA methylation; Microarray; Data normalization; Cancer genome; Differential methylation; Reproducibility

Abstract

Some researchers normalize DNA methylation array data to remove the technical artifacts introduced by experimental differences in sample preparation, array processing and other factors. Others analyze DNA methylation arrays without normalization, on the grounds that current normalization methods for methylation data may distort real differences between normal and cancer samples, because cancer genomes may be subject to extensive hypomethylation and the total amount of CpG methylation might differ substantially among samples. In this study, using eight datasets generated with the Infinium HumanMethylation27 assay, we systematically analyzed the global distribution of DNA methylation changes in cancer compared with normal controls and its effect on data normalization for selecting differentially methylated (DM) genes. We showed that more DM genes could be found in the Quantile/Lowess-normalized data than in the non-normalized data. The DM genes additionally selected in the Quantile/Lowess-normalized data showed significantly consistent methylation states in another independent dataset for the same cancer, indicating that these extra DM genes were effective biological signals related to the disease.
These results suggest that normalization can increase the power of detecting DM genes in the context of diagnostic markers, which are usually characterized by relatively large effect sizes. In addition, we evaluated the reproducibility of DM discoveries for a particular cancer type and found that most of the DM genes additionally detected in one dataset showed the same methylation directions in the other dataset for the same cancer type, indicating that these DM genes were effective biological signals in the other dataset. Furthermore, we showed that some DM genes detected in different studies of a particular cancer type were significantly reproducible at the functional level.

© 2012 Elsevier B.V. All rights reserved.

1. Introduction

DNA methylation, one of the major epigenetic mechanisms of DNA modification, plays a significant role in gene expression regulation, organism development, X chromosome inactivation and genetic imprinting in vertebrates (Laird, 2003). Changes in methylation patterns and levels have been shown to be associated with various diseases, including cancers (Chari et al., 2010; Ko et al., 2010; Teschendorff et al., 2010; Walker et al., 2011). Genome-wide methylation arrays have been used to identify hundreds of genes with differential DNA methylation in their promoters in various types of cancers (Bediaga et al., 2010; Lugthart et al., 2011; Melnikov et al., 2009; Taylor et al., 2007), hereafter referred to as DM genes, providing insights into cancer biology as well as useful biomarkers for predicting cancer outcomes and drug targets (Lugthart et al., 2011). However, DNA methylation data are subject to multiple sources of technical artifacts resulting from experimental differences in sample preparation, array processing and other factors (Leek et al., 2010; Quackenbush, 2002).
To remove the experimental variation introduced by these technical artifacts, normalization algorithms such as Quantile (Bolstad et al., 2003) and Lowess (Quackenbush, 2002), which were initially developed for gene expression microarray analyses, have often been applied directly to methylation array analyses. Lowess normalization is used to normalize two-color array gene expression data to compensate for non-linear dye bias (Quackenbush, 2002). Lowess smoothing based on the 'MA-plot' divides the data domain into narrow intervals using a sliding window approach. Quantile normalization makes the distribution of probe intensities the same for every array in a set of arrays by taking the mean of each quantile across arrays and substituting it for the value of the corresponding data item in the original dataset (Bolstad et al., 2003).

Gene 506 (2012) 36–42
doi:10.1016/j.gene.2012.06.075

Abbreviations: DM, differentially methylated; DE, differentially expressed; SNR, signal to noise ratio; GEO, Gene Expression Omnibus; TCGA, The Cancer Genome Atlas; U, unmethylated; M, methylated; POG, percentage of overlapping genes; PPI, protein–protein interaction; HPRD, Human Protein Reference Database; FDR, false discovery rate.
⁎ Corresponding authors at: College of Bioinformatics Science and Technology, Harbin Medical University, Harbin 150086, China. Tel./fax: +86 45186615933.
E-mail addresses: wangdong@ems.hrbmu.edu.cn (D. Wang), guoz@ems.hrbmu.edu.cn (Z. Guo).
1 These authors contributed equally to this work.

The