Autoregressive Higher-Order Hidden Markov Models: Exploiting Local Chromosomal Dependencies in the Analysis of Tumor Expression Profiles

Michael Seifert 1*, Khalil Abou-El-Ardat 2, Betty Friedrich 1, Barbara Klink 2, Andreas Deutsch 1

1 Center for Information Services and High Performance Computing, Dresden University of Technology, Dresden, Germany
2 Institute for Clinical Genetics, Faculty of Medicine Carl Gustav Carus, Dresden University of Technology, Dresden, Germany

Abstract

Changes in gene expression programs play a central role in cancer. Chromosomal aberrations such as deletions, duplications and translocations of DNA segments can lead to highly significant positive correlations of gene expression levels of neighboring genes. This should be utilized to improve the analysis of tumor expression profiles. Here, we develop a novel model class of autoregressive higher-order Hidden Markov Models (HMMs) that carefully exploit local data-dependent chromosomal dependencies to improve the identification of differentially expressed genes in tumor. Autoregressive higher-order HMMs overcome generally existing limitations of standard first-order HMMs in the modeling of dependencies between genes in close chromosomal proximity by the simultaneous usage of higher-order state-transitions and autoregressive emissions as novel model features. We apply autoregressive higher-order HMMs to the analysis of breast cancer and glioma gene expression data and perform in-depth model evaluation studies. We find that autoregressive higher-order HMMs clearly improve the identification of overexpressed genes with underlying gene copy number duplications in breast cancer in comparison to mixture models, standard first- and higher-order HMMs, and other related methods. The performance benefit is attributed to the simultaneous usage of higher-order state-transitions in combination with autoregressive emissions.
This benefit could not be reached by using each of these two features independently. We also find that autoregressive higher-order HMMs are better able to identify differentially expressed genes in tumors independent of the underlying gene copy number status in comparison to the majority of related methods. This is further supported by the identification of well-known and of previously unreported hotspots of differential expression in glioblastomas demonstrating the efficacy of autoregressive higher-order HMMs for the analysis of individual tumor expression profiles. Moreover, we reveal interesting novel details of systematic alterations of gene expression levels in known cancer signaling pathways distinguishing oligodendrogliomas, astrocytomas and glioblastomas. An implementation is available under www.jstacs.de/index.php/ARHMM.

Citation: Seifert M, Abou-El-Ardat K, Friedrich B, Klink B, Deutsch A (2014) Autoregressive Higher-Order Hidden Markov Models: Exploiting Local Chromosomal Dependencies in the Analysis of Tumor Expression Profiles. PLoS ONE 9(6): e100295. doi:10.1371/journal.pone.0100295

Editor: Joseph Najbauer, University of Pécs Medical School, Hungary

Received April 17, 2014; Accepted May 22, 2014; Published June 23, 2014

Copyright: © 2014 Seifert et al. This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

Data Availability: The authors confirm that all data underlying the findings are fully available without restriction. Pollack breast cancer data is available from Pollack JR, et al. (2002), PNAS 99: 12963–12968. Rembrandt glioma data is available from the Repository for Molecular Brain Neoplasia Data (Rembrandt, release 1.5.9, https://caintegrator.nci.nih.gov/rembrandt/). Tayrac glioma data is available from GEO (GSE10878).
Cancer signaling pathway data is available from ConsensusPathDB (http://cpdb.molgen.mpg.de/). An implementation of ARHMMs and considered gene expression data sets are available from http://www.jstacs.de/index.php/ARHMM.

Funding: This work was done in the frame of the GlioMath-Dresden project funded by the European Social Fund and the Free State of Saxony. We acknowledge support by the German Research Foundation and the Open Access Publication Funds of the TU Dresden.

Competing Interests: The authors have declared that no competing interests exist.

* Email: michael.seifert@zih.tu-dresden.de

Introduction

Copy number changes of genes are frequently found in different types of cancer [1]. Mutations such as duplications of oncogenes and deletions of tumor suppressor genes contribute together with single nucleotide polymorphisms, epigenetic alterations and other types of mutations to changes in gene expression programs triggering the development of cancer [2]. Broad and focal duplications and deletions of chromosomal regions are known to directly influence expression levels of underlying genes. Genes with increased copy numbers tend to show increased expression, whereas genes with reduced copy numbers tend to show reduced expression in tumors compared to healthy tissue (e.g. [3–6]). This coupling of gene copy numbers and gene expression levels leads to local chromosomal dependencies between gene expression levels providing the opportunity to develop improved methods for the analysis of individual tumor expression profiles.

Over the last years, several approaches have been developed for the analysis of tumor expression profiles in the context of chromosomal locations of genes. Methods like CGMA (comparative genomic microarray analysis) [7], MACAT (MicroArray Chromosome Analysis Tool) [8] or LAP (Locally Adaptive statistical Procedure) [9] require replicated measurements of tumor and normal reference samples for the identification of differentially expressed genes.
Such methods cannot be applied to the analysis of individual tumor expression profiles in large screenings for which repeated profiling of the same sample is PLOS ONE | www.plosone.org 1 June 2014 | Volume 9 | Issue 6 | e100295