Data-driven Networking Reveals 5-Genes Signature for Early Detection of Lung Cancer

Vladimir Kuznetsov, Bioinformatics Institute, Singapore, vladimirk@bii.a-star.edu.sg
Sterling Thomas, Virginia Commonwealth University, sthomas@vcu.edu
Danail Bonchev, Virginia Commonwealth University, dgbonchev@vcu.edu

Abstract

A new strategy is developed for a gene signature search that proceeds from a biological basis in analyzing microarray databases. The procedure combines known and original methods for correlation, statistical, and network analysis. Applying the strategy to lung adenocarcinoma resulted in a 5-gene signature, which included four genes not previously associated with lung adenocarcinoma. A classification accuracy of 93-96% for cancer vs. normal was achieved. The final stage of our procedure expanded the gene signature network to a 43-gene/protein network, which showed that the five genes are in the cross-talk of 24 pathways, thus providing information for mechanistic analysis.

1. Introduction

Lung cancer is characterized by the highest rate of cancer mortality in the United States, a rate much higher than that of prostate, colorectal and breast cancer (86% vs. 7%, 15% and 37%, respectively) [1]. The lack of reliable methods for early diagnosis results in lung cancer being detected at an advanced stage, when it is too late for effective treatment. Recent advances in technology have made large-scale screening of genes and proteins possible and have produced the first markers for early lung cancer screening and survival prognosis [2-8]. While producing encouraging results, these gene signature studies also have certain limitations. The large groups of genes that are offered as potential markers can hardly be applied directly for clinical purposes [9], which would ideally be served by a single highly discriminating gene. Gene signature classification and prognostic studies are also frequently biased by previous findings in the field.
In our study we proceeded from a different strategy in searching for a gene signature for early detection of lung adenocarcinoma. While also analyzing data from more than one microarray database, we search for gene ontology categories that would provide the best basis for identifying small, highly sensitive gene signatures. The genes from the selected categories are then subjected to correlation analysis, which strongly reduces the pool of potential candidates on the basis of a selected cut-off value for correlation significance. A consensus network is then built from the intersecting genes and the significant gene-gene statistical correlations from the databases used. The set of genes thus selected is analyzed to identify the subset of genes that provides the highest accuracy in classifying cancer vs. normal patients. The final step of the procedure is a search for a physical and biological equivalent of the network built. A network is constructed from the genes and proteins identified in the previous step and from the information on their interactions found in public databases. The network thus built necessarily includes other proteins, which cross-talk with those preliminarily selected. This provides important gene ontology feedback on the pathways and biological processes involved in the complex, integrated process of carcinogenesis.

2. Methods

2.1. Microarray and Gene Data

mRNA expression data from individuals diagnosed with adenocarcinoma of the lung were acquired from Oncomine [10] and from the Gene Expression Omnibus (GEO, 2007) [11]. From Oncomine we acquired U95A platform expression data (12,600 probes) from a classification study of patients with lung malignancies [2], including 62 adenocarcinomas and 17 normal samples. The 62 adenocarcinomas were selected based on agreement between the assessments of two independent pathologists.
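The correlation-filtering and consensus-network steps outlined in the Introduction can be illustrated with a minimal sketch. It is not the authors' implementation: the function names `significant_pairs` and `consensus_network`, the use of the Pearson coefficient, and the default cut-off of 0.8 are assumptions made for illustration only. The paper states only that a cut-off for correlation significance is selected and that the network is built from intersecting genes and significant gene-gene correlations.

```python
import numpy as np


def significant_pairs(expr, genes, cutoff):
    """Return the set of gene pairs whose |Pearson r| >= cutoff.

    expr  : array of shape (n_genes, n_samples), one row per gene
    genes : list of gene names, in the same order as the rows of expr
    """
    r = np.corrcoef(expr)  # rows are treated as variables (genes)
    pairs = set()
    for i in range(len(genes)):
        for j in range(i + 1, len(genes)):
            if abs(r[i, j]) >= cutoff:
                # Sort names so that (a, b) and (b, a) are the same edge.
                pairs.add(tuple(sorted((genes[i], genes[j]))))
    return pairs


def consensus_network(datasets, cutoff=0.8):
    """Edges among shared genes that pass the cut-off in every dataset.

    datasets : list of (expr, genes) tuples, one per microarray database
    cutoff   : hypothetical threshold on |r|; not a value from the paper
    """
    # Step 1: keep only genes present in all datasets (intersecting genes).
    shared = set(datasets[0][1])
    for _, genes in datasets[1:]:
        shared &= set(genes)

    # Step 2: find significant pairs separately in each dataset.
    edge_sets = []
    for expr, genes in datasets:
        idx = [k for k, g in enumerate(genes) if g in shared]
        sub_genes = [genes[k] for k in idx]
        edge_sets.append(
            significant_pairs(np.asarray(expr, dtype=float)[idx], sub_genes, cutoff)
        )

    # Step 3: the consensus network keeps edges found in every dataset.
    common = edge_sets[0]
    for edges in edge_sets[1:]:
        common = common & edges
    return sorted(common)
```

Each dataset is passed as an expression matrix plus its list of gene names, so datasets with different gene orders or different sample counts can be combined. An edge survives only if both genes occur in every dataset and the pair passes the cut-off in each of them.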
Samples for which one report did not indicate pure adenocarcinoma were excluded, as were data from patients with secondary metastasis of a different morphology. This produced a dataset of pure adenocarcinomas with no metastasis, tumor sizes of 1 to 8 cm, and all stages. The Bhattacharjee et al. expression data [2] were thus partitioned into five subsets, four of which contain 15 or 16 tumor samples each randomly distributed

2008 International Conference on BioMedical Engineering and Informatics 978-0-7695-3118-2/08 $25.00 © 2008 IEEE DOI 10.1109/BMEI.2008.258 413