A disequilibrium model for detecting genetic mutations for cancer

Yao Li a,c,1, Changxing Ma d,1, Zhong Wang a, Gang Chen a,b, Kwangmi Ahn a,b, Philip Lazarus b, Rongling Wu a,b,*

a Center for Statistical Genetics, Pennsylvania State University, Hershey, PA 17033, USA
b Penn State Cancer Institute, Pennsylvania State University, Hershey, PA 17033, USA
c Department of Statistics, West Virginia University, Morgantown, WV 26506, USA
d Department of Biostatistics, University at Buffalo School of Public Health and Health Professions, Buffalo, NY 14214, USA

Article info

Article history: Received 9 July 2009; received in revised form 7 April 2010; accepted 20 April 2010; available online 5 May 2010

Keywords: Somatic mutation; Cancer; Zygotic disequilibrium; EM algorithm

Abstract

It has been recognized that genetic mutations in specific nucleotides may give rise to cancer via the alteration of signaling pathways. Thus, the detection of those cancer-causing mutations has received considerable interest in cancer genetic research. Here, we propose a statistical model for characterizing genes that lead to cancer through point mutations using genome-wide single nucleotide polymorphism (SNP) data. The basic idea of the model is that mutated genes may be strongly associated with their nearby SNPs because of evolutionary forces. By genotyping SNPs in both normal and cancer cells, we formulate a polynomial likelihood to estimate the population genetic parameters related to cancer, such as allele frequencies of cancer-causing alleles, mutation rates of alleles derived from maternal or paternal parents, and zygotic linkage disequilibria between different loci after the mutation occurs. We implement the EM algorithm to estimate some of these parameters because of the missing information in the likelihood construction.
The model allows tests of significant associations between mutated cancer genes and genome-wide SNPs, thus providing a way to predict the occurrence and formation of cancer from genetic information. The model, validated through computer simulation, may help cancer geneticists design efficient experiments and formulate hypotheses for cancer gene identification.

© 2010 Elsevier Ltd. All rights reserved.

1. Introduction

By altering the behavior of cells, mutations in key regulatory genes (tumor suppressors and proto-oncogenes) can lead to unregulated growth that may develop into cancer. So far, more than 350 mutated genes that are causally implicated in cancer development have been identified (Futreal et al., 2004). These cancer genes have mostly been identified using physical and genetic mapping strategies. However, since each of these strategies can identify only a subset of cancer genes, the question of how many genes in total are involved in cancer development is largely unanswered. A recent study surveying the human cancer genome shows that more gene mutations may drive cancer than previously thought (Greenman et al., 2007). The advent of the human genome sequence and cancer genome sequence provides an unprecedented opportunity to reveal the full compendium of mutations in individual cancers (The Cancer Genome Atlas Research Network, 2008; Velculescu, 2008; Stratton et al., 2009) and thereby develop accurately targeted treatments using this information. The great majority of somatic mutations observed were single-base substitutions (Velculescu, 2008), although mutations may also encompass several other classes of DNA sequence change, such as insertions or deletions of small or large segments of DNA, DNA rearrangements, copy number increases, and copy number reductions (Stratton et al., 2009).
The first evidence of a single-point mutation for cancer came from two independent experiments that transformed normal cells with cancer DNA (Tabin et al., 1982; Reddy et al., 1982). The transformed normal cells become cancerous due to a single-base G→T substitution that leads to a glycine-to-valine substitution in codon 12 of the HRAS gene. This discovery prompted a growing number of studies associating genetic mutations with cancer. The substitution of C:G base pairs by T:A base pairs or by G:C base pairs is correlated with colorectal cancer and breast cancer and may be a cause of these cancers (Greenman et al., 2007). In addition, the chromosomal distribution and impact of these mutations were found to differ between these two types of cancer, suggesting that the mechanisms underlying mutagenesis and repair are cancer-dependent.

Journal of Theoretical Biology 265 (2010) 218–224
doi:10.1016/j.jtbi.2010.04.019
* Corresponding author at: Penn State Cancer Institute, Pennsylvania State University, Hershey, PA 17033, USA. Tel.: +1 717 531 2037. E-mail address: RWu@hes.hmc.psu.edu (R. Wu).
1 The two authors contributed to this work equally.

Somatic mutations in the genome occur through the