GeneX: An Open Source gene expression database and integrated tool set
by H. Mangalam, J. Stewart, J. Zhou, K. Schlauch, M. Waugh, G. Chen,
A. D. Farmer, G. Colello, and J. W. Weller
Because gene expression profiles are highly
sensitive to sample and processing conditions,
it is crucial to accurately represent these
conditions along with the numeric data in a way
that allows the conditions to be part of a query.
The GeneX™ project is intended to provide an
Open Source database and integrated tool set
that will allow researchers to store and evaluate
their gene expression data independently of the
technology used to obtain the data.
The genomics revolution is beginning to produce
data at a rate that will soon rival that of high-
energy physics. Biologists and informaticists are
faced with critical issues of how best to organize this
information and share it so that it provides the great-
est good to the greatest number of researchers.
The “genomic” data type, as produced by the various
genome projects, is a linear DNA (deoxyribonucleic
acid) polymer consisting of four basic nucleotides
(A, C, G, T) repeated nonrandomly in strings of up
to several million (the length of a chromosome).¹
In the human genome, there are approximately three
billion bases distributed among 23 chromosomes, but
despite the diversity observed in humans, this
sequence is 99.99 percent invariant between
individuals. Completion of the Human Genome Project²
will reveal the precise order of the sequence, both
where it is invariant and how it varies. This genomic
sequence information encodes the range of responses
that an organism can deploy to cope with its
environment but is itself largely static:³ the genome
itself does not change, but the information derived
from it does.
The number of genes in a genome ranges from about
20 in a virus to about 30000 in a human. The average
human gene is about 3000 bases in length, so the
fraction of the human genome that codes for genes is
quite small, approximately 1 percent. Each gene (a
contiguous section of DNA) can be transcribed to
produce mRNA (messenger ribonucleic acid), which in
turn is translated into a unique protein (a linear
polymer of amino acids). The protein can subsequently
be modified by a variety of mechanisms, such as
phosphorylation, glycosylation, and cleavage. Adding
to this complexity, one gene can code for several
unique mRNAs via a mechanism called splicing, in which
parts of the sequence (called intervening sequences or
introns) are removed and the remaining coding
sequences (exons) are joined in different combinations
to form several unique mRNAs. Regulation of gene
transcription occurs through a complex feedback
mechanism involving a number of pathways and is
ultimately mediated by transcription factors, protein
complexes that bind to short regulatory sequences of
DNA near the start of transcription.
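The transcription and splicing steps described above can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration and is not part of GeneX: the toy sequence, the exon coordinates, and the function names `transcribe` and `splice_variants` are hypothetical, and the only splicing pattern modeled is the skipping of internal exons, which is a deliberate simplification of the biology.

```python
from itertools import combinations


def transcribe(dna):
    """Return the RNA form of a coding-strand DNA string (T becomes U)."""
    return dna.upper().replace("T", "U")


def splice_variants(transcript, exons):
    """Enumerate transcripts made by joining exons in their original order.

    `exons` is a list of (start, end) half-open index pairs; at least two
    are required. Simplifying assumption: the first and last exons are
    always kept, and any subset of the internal exons may be skipped.
    Introns (everything outside the exon ranges) are always removed.
    """
    first, *internal, last = exons
    variants = []
    for k in range(len(internal) + 1):
        # combinations() preserves input order, so exons stay in sequence.
        for kept in combinations(internal, k):
            parts = [first, *kept, last]
            variants.append("".join(transcript[s:e] for s, e in parts))
    return variants


# Hypothetical toy gene: three exons separated by two introns.
GENE = "ATGGCC" + "GTAAGTAG" + "GAATTC" + "GTCCCAG" + "TGGTAA"
EXONS = [(0, 6), (14, 20), (27, 33)]

pre_mrna = transcribe(GENE)                  # unspliced transcript
variants = splice_variants(pre_mrna, EXONS)  # one gene, several mRNAs

for mrna in variants:
    print(mrna)
# AUGGCCUGGUAA
# AUGGCCGAAUUCUGGUAA
```

With a single internal exon the toy gene yields two distinct mRNAs; each additional internal exon doubles the number of candidates under this simplified model, which hints at why the mRNA population is so much more varied than the gene count alone suggests.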
In contrast to the static genomic DNA, gene expression
is the dynamic response of the genome to environmental
conditions. As such, gene expression data require more
descriptive information (metadata) to characterize
them accurately: the mRNA is ex-
Copyright 2001 by International Business Machines Corpora-
tion. Copying in printed form for private use is permitted with-
out payment of royalty provided that (1) each reproduction is done
without alteration and (2) the Journal reference and IBM copy-
right notice are included on the first page. The title and abstract,
but no other portions, of this paper may be copied or distributed
royalty free without further permission by computer-based and
other information-service systems. Permission to republish any
other portion of this paper must be obtained from the Editor.
MANGALAM ET AL. 0018-8670/01/$5.00 © 2001 IBM IBM SYSTEMS JOURNAL, VOL 40, NO 2, 2001