Data-driven Networking Reveals 5-Genes Signature for Early Detection of
Lung Cancer
Vladimir Kuznetsov, Bioinformatics Institute, Singapore, vladimirk@bii.a-star.edu.sg
Sterling Thomas, Virginia Commonwealth University, sthomas@vcu.edu
Danail Bonchev, Virginia Commonwealth University, dgbonchev@vcu.edu
Abstract
A new strategy is developed for gene signature
searches that proceeds from a biological basis in
analyzing microarray databases. The procedure combines
known and original methods for correlation,
statistical, and network analysis. Applying the
strategy to lung adenocarcinoma yielded a 5-gene
signature that includes four genes not previously
associated with lung adenocarcinoma. Classification of
cancer vs. normal achieved 93-96% accuracy. The final
stage of our procedure expanded the gene signature
network to a 43 gene/protein network, which showed
that the five genes lie at the cross-talk of 24
pathways, thus providing information for mechanistic
analysis.
1. Introduction
Lung cancer has the highest cancer mortality rate in
the United States, a rate much higher than that of
prostate, colorectal and breast cancer: 86% vs. 7%,
15% and 37%, respectively [1]. Because reliable
methods for early diagnosis are lacking, lung cancer
is detected at an advanced stage, when it is too late
for effective treatment. Recent advances in technology
have made large-scale screening of genes and proteins
possible and have produced the first markers for early
lung cancer screening and survival prognosis [2-8].
While producing encouraging results, these gene
signature studies also have certain limitations. The
large groups of genes offered as potential markers can
hardly be applied directly to clinical purposes [9],
which would ideally be served by a single highly
discriminating gene. Gene signature classification and
prognostic studies are also frequently biased by
previous findings in the field.
In our study we followed a different strategy in
searching for a gene signature for early detection of
lung adenocarcinoma. While also analyzing data from
more than one microarray database, we search for gene
ontology categories that would provide the best basis
for identifying small, highly sensitive gene
signatures. The genes from the selected categories are
then subjected to correlation analysis, which strongly
reduces the pool of potential candidates on the basis
of a selected cut-off value for correlation
significance. A consensus network is then built from
the intersecting genes and the significant gene-gene
statistical correlations from the databases used. The
set of genes thus selected is analyzed to identify the
subset that provides the highest accuracy in
classifying cancer vs. normal patients. The final step
of the procedure is a search for a physical and
biological equivalent of the network built. A network
is constructed from the genes and proteins identified
in the previous step and the information on their
interactions found in public databases. The network
thus built necessarily includes other proteins that
cross-talk with those initially selected. This
provides important gene ontology feedback on the
pathways and biological processes involved in the
complex, integrated process of carcinogenesis.
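The correlation and consensus-network steps described above can be illustrated with a minimal sketch. It is not the authors' implementation: the use of Pearson correlation, the cut-off of 0.9, the function names (pearson, significant_edges, consensus_network) and the toy expression values are all assumptions made for illustration. Each dataset is represented as a mapping from gene name to a list of expression values; an edge is kept only if the gene pair is present in every dataset and its absolute correlation reaches the cut-off in every dataset.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length value lists (0.0 if either is constant)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / sqrt(sxx * syy)


def significant_edges(expr, cutoff):
    """Return gene pairs whose |r| meets the (assumed) cut-off in one dataset.

    expr maps gene name -> list of expression values across samples.
    Pairs are stored as name-sorted tuples so they compare equal across datasets.
    """
    edges = set()
    for g1, g2 in combinations(sorted(expr), 2):
        if abs(pearson(expr[g1], expr[g2])) >= cutoff:
            edges.add((g1, g2))
    return edges


def consensus_network(datasets, cutoff):
    """Keep only edges between shared genes that are significant in every dataset."""
    if not datasets:
        return set()
    shared = set.intersection(*(set(d) for d in datasets))
    edge_sets = [
        significant_edges({g: d[g] for g in shared}, cutoff) for d in datasets
    ]
    return set.intersection(*edge_sets)


# Toy data (hypothetical values): gene D is measured in only one dataset,
# and the A-C correlation is strong in the second dataset only.
dataset_1 = {"A": [1, 2, 3, 4], "B": [2, 4, 6, 8], "C": [4, 1, 3, 2], "D": [1, 1, 2, 2]}
dataset_2 = {"A": [1, 3, 2, 5], "B": [2, 6, 4, 10], "C": [1, 3, 2, 4]}

print(consensus_network([dataset_1, dataset_2], cutoff=0.9))  # {('A', 'B')}
```

In this toy case only the A-B edge survives, because it is the only pair that is correlated above the cut-off in both datasets. A real analysis would also need a significance test appropriate to the sample sizes, which this sketch omits.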
2. Methods
2.1. Microarray and Gene Data
mRNA expression data from individuals diagnosed
with adenocarcinoma of the lung were acquired from
Oncomine [10] and from the Gene Expression Omnibus
(GEO, 2007) [11]. From Oncomine we acquired U95A
platform expression data (12,600 probes) from a
classification study of patients with lung malignancies
[2], including 62 adenocarcinomas and 17 normal
samples. The 62 adenocarcinomas were selected based
on agreement between assessments of two independent
pathologists. Samples for which one report did not
indicate pure adenocarcinoma were excluded, as were
the data for patients with secondary metastasis of a
different morphology. This produced a dataset of pure
adenocarcinomas with no metastasis, tumor sizes of 1
to 8 cm, and all stages. The
Bhattacharjee et al. expression data [2] were thus
partitioned into five subsets, four of which contained
15 or 16 tumor samples each randomly distributed
2008 International Conference on BioMedical Engineering and Informatics
978-0-7695-3118-2/08 $25.00 © 2008 IEEE
DOI 10.1109/BMEI.2008.258
413