A Genome-Wide Analysis of Array-Based Comparative Genomic Hybridization (CGH) Data to Detect Intra-Species Variations and Evolutionary Relationships

Apratim Mitra 1, George Liu 2, Jiuzhou Song 1*

1 Department of Animal and Avian Sciences, University of Maryland, College Park, Maryland, United States of America
2 Bovine Functional Genomics Lab, Animal and Natural Resources Institute, Agricultural Research Service, United States Department of Agriculture, Beltsville, Maryland, United States of America

Abstract

Array-based comparative genomic hybridization (aCGH) has gained prevalence as an effective technique for measuring structural variations in the genome. Copy-number variations (CNVs) form a large source of genomic structural variation, but it is not known whether phenotypic differences between intra-species groups, such as divergent human populations or breeds of a domestic animal, can be attributed to CNVs. Several computational methods have been proposed to improve the detection of CNVs from array CGH data, but few population studies have used CGH data to identify intra-species differences. In this paper we propose a novel method of genome-wide comparison and classification using CGH data that condenses whole-genome information, aimed at quantifying intra-species variations and discovering shared ancestry. Our strategy comprises smoothing the CGH data with an appropriate denoising algorithm, extracting features via wavelets, quantifying the information via the wavelet power spectrum, and hierarchically clustering the resulting profiles. To evaluate the classification efficiency of our method, we used simulated data sets. We then applied it to aCGH data from human and bovine individuals and showed that it successfully detects existing intra-specific variations, with additional evolutionary implications.
Citation: Mitra A, Liu G, Song J (2009) A Genome-Wide Analysis of Array-Based Comparative Genomic Hybridization (CGH) Data to Detect Intra-Species Variations and Evolutionary Relationships. PLoS ONE 4(11): e7978. doi:10.1371/journal.pone.0007978

Editor: Joy Sturtevant, Louisiana State University, United States of America

Received June 15, 2009; Accepted October 13, 2009; Published November 24, 2009

Copyright: © 2009 Mitra et al. This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

Funding: The authors are grateful for the support by USDA-NRI 2008-00870 and Flagship program of University of Maryland. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

Competing Interests: The authors have declared that no competing interests exist.

* E-mail: songj88@umd.edu

Introduction

Because single nucleotide polymorphisms (SNPs) form the largest known source of genomic variation, extensive research has gone into characterizing SNPs in the human genome [1,2]. A great volume of ongoing work also seeks relationships between specific SNPs and human or animal disease. However, recent reviews [3] have suggested that structural variations, such as copy number variations (CNVs) and segmental duplications (SDs), may also be responsible, at least in part, for giving rise to complex disease. The advent of high-throughput technologies such as array-based comparative genomic hybridization (aCGH) has made it easier to probe such variations. aCGH is becoming a popular and cost-effective way of detecting and measuring structural variations of the genome, and could be used for phylogenetic research.
Although a wide range of computational methods exist [4,5,6,7], the accurate estimation of copy number from CGH data is still an open problem. Segmentation approaches attempt to partition the genome into regions of 'gain' or 'loss', while denoising methods use distributional assumptions about experimental error to smooth the signal [5,6,8,9,10]. The latter are often coupled with a thresholding step to define regions of CNV [4,7,11]. There have been recent attempts to compare existing algorithms and to quantify their performance in detecting CNVs from aCGH [12], but no efforts have been directed towards using aCGH information for a population study.

It is generally accepted that the ideal phylogenetic study should use genome-wide information from a large set of individuals, but this is impossible at present because of prohibitive cost, incomplete information and intensive computing requirements. The alternatives are to use housekeeping genes, the largest possible DNA region, or a concatenation of core genes, which frequently result in biased inference and systematic overestimation within species [13,14,15]. In addition, through decades of artificial and natural selection, as well as population divergence created by geographical isolation, individuals of the same species have come to differ in their DNA sequences. These differences include SNPs and structural variations (SVs), although they cannot fully explain existing phenotypic differences. A recent effort to quantify the effect of genetic variation on gene expression found that SNPs and CNVs contribute 83.6% and 17.7%, respectively, of the total variation found [16]. Therefore, the contribution of CNVs to genetic diversity is unquestionable [17].

In this paper, we propose a wavelet-based method to quantify structural variation profiles to enable comparisons between genomes of closely related individuals, which may pave the way for the use of genome-wide structural information in phylogenetic studies.
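The idea of comparing individuals through wavelet power profiles can be illustrated with a minimal sketch. It is not the implementation used in this study: it assumes a Haar wavelet, omits the denoising step, uses Euclidean distance with average linkage, and runs on synthetic log2-ratio profiles. All function names and parameter values are hypothetical.

```python
import math
import random

def haar_dwt(signal):
    """Full Haar decomposition; returns detail coefficients per scale, finest first."""
    n = len(signal)
    assert n > 0 and n & (n - 1) == 0, "length must be a power of two"
    approx = list(signal)
    details = []
    while len(approx) > 1:
        pairs = list(zip(approx[0::2], approx[1::2]))
        details.append([(a - b) / math.sqrt(2) for a, b in pairs])
        approx = [(a + b) / math.sqrt(2) for a, b in pairs]
    return details

def power_spectrum(signal):
    """Fraction of the total detail energy carried by each scale."""
    energies = [sum(c * c for c in level) for level in haar_dwt(signal)]
    total = sum(energies) or 1.0
    return [e / total for e in energies]

def euclidean(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def average_linkage(profiles, n_clusters):
    """Agglomerative clustering; returns clusters as lists of profile indices."""
    clusters = [[i] for i in range(len(profiles))]
    while len(clusters) > n_clusters:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                # Average pairwise distance between members of the two clusters.
                d = sum(euclidean(profiles[a], profiles[b])
                        for a in clusters[i] for b in clusters[j])
                d /= len(clusters[i]) * len(clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters

def simulate(kind, rng, n=256, noise=0.1):
    """Synthetic log2 ratios: one broad gain, or many short alternating gains."""
    profile = []
    for i in range(n):
        if kind == "broad":
            level = 0.8 if 64 <= i < 128 else 0.0
        else:
            level = 0.8 if (i // 4) % 2 == 1 else 0.0
        profile.append(level + rng.gauss(0.0, noise))
    return profile

rng = random.Random(0)
individuals = ([simulate("broad", rng) for _ in range(3)]
               + [simulate("short", rng) for _ in range(3)])
profiles = [power_spectrum(x) for x in individuals]
clusters = average_linkage(profiles, 2)
print(sorted(sorted(c) for c in clusters))
```

In this toy setting the broad gain concentrates energy at coarse scales and the short gains at a fine scale, so the two simulated groups have clearly different power profiles. Real aCGH data would first need denoising and a principled choice of wavelet and distance measure.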
We first use an appropriate denoising algorithm