GeneX: An Open Source gene expression database and integrated tool set

by H. Mangalam, J. Stewart, J. Zhou, K. Schlauch, M. Waugh, G. Chen, A. D. Farmer, G. Colello, J. W. Weller

Because gene expression profiles are highly sensitive to sample and processing conditions, it is crucial to accurately represent these conditions along with the numeric data in a way that allows the conditions to be part of a query. The GeneX™ project is intended to provide an Open Source database and integrated tool set that will allow researchers to store and evaluate their gene expression data and, moreover, such evaluation will be independent of the technology used to obtain the data.

The genomics revolution is beginning to produce data at a rate that will soon rival that of high-energy physics. Biologists and informaticists are faced with critical issues of how best to organize this information and share it so that it provides the greatest good to the greatest number of researchers.

The “genomic” data type, as produced by the various genome projects, is a linear DNA (deoxyribonucleic acid) polymer consisting of four basic nucleotides (A, C, G, T) repeated nonrandomly in strings of up to several million (the length of a chromosome).¹ In the human genome, there are approximately three billion bases distributed among 23 chromosomes, but despite the diversity observed in humans, this sequence is 99.99 percent invariant between individuals. Completion of the Human Genome Project² will reveal the precise order of the sequence, both where it is invariant and how it varies. This genomic sequence information encodes the range of responses that an organism can deploy to cope with its environment but is itself largely static:³ the genome itself does not change, but the information derived from it does.

The number of genes in a genome ranges between 20, the number of genes in a virus, and 30000, the number of genes in a human.
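To make the earlier description of genomic data concrete (a string over the four nucleotides A, C, G, T that is largely invariant between individuals), the following is a minimal sketch only. The function names `validate` and `fraction_invariant` and the two toy sequences are hypothetical illustrations, not part of GeneX or of any genome project's tooling, and the sketch assumes the two sequences are already aligned and of equal length.

```python
# Hypothetical illustration: DNA as a string over {A, C, G, T}.
NUCLEOTIDES = frozenset("ACGT")


def validate(seq: str) -> str:
    """Return the sequence upper-cased, or raise if it contains non-ACGT symbols."""
    seq = seq.upper()
    bad = set(seq) - NUCLEOTIDES
    if bad:
        raise ValueError(f"unexpected symbols: {sorted(bad)}")
    return seq


def fraction_invariant(a: str, b: str) -> float:
    """Fraction of positions at which two equal-length, aligned sequences agree."""
    a, b = validate(a), validate(b)
    if len(a) != len(b):
        raise ValueError("sequences must have equal length")
    if not a:
        raise ValueError("sequences must be non-empty")
    matches = sum(x == y for x, y in zip(a, b))
    return matches / len(a)


# Toy, made-up sequences (not real genomic data); they differ at the last base.
individual_1 = "ACGTACGTAC"
individual_2 = "ACGTACGTAT"
print(fraction_invariant(individual_1, individual_2))  # 9 of 10 positions agree
```

Real comparisons between genomes require sequence alignment, because insertions and deletions shift positions; the position-by-position comparison here is a deliberate simplification.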
The average human gene is about 3000 bases in length, so the fraction of the human genome that is used for coding genes is quite small, approximately 1 percent. Each gene (a contiguous section of DNA) can be transcribed to produce mRNA (messenger ribonucleic acid), which in turn is translated into a unique protein (a linear polymer of amino acids). The protein can subsequently be further modified by a variety of mechanisms, such as phosphorylation, glycosylation, and cleavage. Adding to this complexity, one gene can code for several unique mRNAs via a mechanism called splicing, in which parts of the sequence (called intervening sequences or introns) are spliced out and the remaining coding sequences (exons) can be “spliced” together in different combinations to form several unique mRNAs. Regulation of gene transcription occurs through a complex feedback mechanism involving a number of pathways and is ultimately mediated by transcription factors, protein complexes that bind to short regulatory sequences of DNA near the start of transcription.

In contrast to the static genomic DNA, gene expression is the dynamic response of the genome to environmental conditions. As such, gene expression data require more descriptive information (metadata) to characterize them accurately: the mRNA is ex-

Copyright 2001 by International Business Machines Corporation. Copying in printed form for private use is permitted without payment of royalty provided that (1) each reproduction is done without alteration and (2) the Journal reference and IBM copyright notice are included on the first page. The title and abstract, but no other portions, of this paper may be copied or distributed royalty free without further permission by computer-based and other information-service systems. Permission to republish any other portion of this paper must be obtained from the Editor.

0018-8670/01/$5.00 © 2001 IBM
MANGALAM ET AL., IBM SYSTEMS JOURNAL, VOL 40, NO 2, 2001, p. 552