The Bioinformatics of Microarray Gene
Expression Profiling
John N. Weinstein*, Uwe Scherf, Jae K. Lee, Satoshi Nishizuka, Fuad Gwadry, Ajay, Kim Bussey,
S. Kim, Lawrence H. Smith, Lorraine Tanabe, Samuel Richman, Jessie Alexander,
Hosein Kouros-Mehr, Alika Maunakea, and William C. Reinhold
Genomics and Bioinformatics Group, Laboratory of Molecular Pharmacology, National Cancer Institute, Bethesda, Maryland
Key terms: gene expression profiling; microarray; biochip; cDNA; oligonucleotide; clustering; clustered image map
Cytometry 47:46–49, 2002. © 2001 Wiley-Liss, Inc.
Gene expression profiling will revolutionize biology.
That much is universally agreed. But it’s harder than it
looks. In part, the reasons can be technical—substandard
arrays, low signal:noise ratios for rare transcripts, variable
backgrounds, cross-hybridizations, the difficulty of
processing clinical materials, and so forth. But more often the
reasons relate to analysis and interpretation of the data.
Inevitably, more time and energy are spent after the
experiments are finished than before.
We can identify a number of necessary tasks in the
analysis of gene expression data, as summarized in Table
1. In the following capsule descriptions, we will focus for
concreteness on the two-color fluorescence technologies
(1), but analogous steps are pertinent to one-color
fluorescence and radioactive detection methods as well. With
apologies to the many scientists who have been
innovative in this field, we intend, in this short summary, to
indicate requirements and options rather than to give a
comprehensive review or to apportion credit for the
various contributions. The examples will focus primarily on
studies from our laboratory.
Task #1: To establish the computer hardware,
software, and personnel infrastructure for handling and
analyzing gigabyte or terabyte databases. There must be
somewhere to put the data, and there must be fluent
systems for pulling information into the stream of analysis.
As data have outgrown Excel (Microsoft, Redmond, WA)
spreadsheets, the most common, but by no means only,
answers have been database packages like Sybase (Sybase
Inc., Emeryville, CA) or Oracle (Oracle Corporation,
Redwood Shores, CA). Sometimes, however, flat file formats
suffice. For many of the highly multivariate analyses, to be
discussed later, hardware speed and memory become
significant issues. Most important, however, is the
human infrastructure. Applied bioinformatics, broadly
construed, is practiced by the biologist who is fluent in the
use of public and proprietary database resources or who
will perform data analyses, preferably under the
supervision of a statistically trained individual. Fluency with
database resources is something that every biologist should
have; microarray data analysis is more specialized. What
might be termed developmental bioinformatics involves
the generation of new algorithms (principally by
statisticians or those with expertise in machine learning) and the
creation of new software (principally on the basis of
expertise in computer science). Experience shows that
the best analytical developments arise from close
attention to needs arising from actual experimental data sets
and biologic questions.
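
For small studies, the flat-file option mentioned above can be as simple as a tab-delimited matrix of genes by samples. A minimal sketch of reading such a file into memory; the file layout, gene names, and cell-line column labels here are hypothetical, not taken from the article:

```python
import csv
import io

# Hypothetical tab-delimited layout: first column is a gene identifier,
# remaining columns are log-ratio expression values, one per sample.
raw = (
    "GENE\tA549\tMCF7\tHT29\n"
    "TP53\t0.42\t-1.10\t0.07\n"
    "MYC\t1.35\t0.88\t-0.21\n"
)

def read_expression_matrix(handle):
    """Parse a tab-delimited expression flat file into (samples, gene -> values)."""
    reader = csv.reader(handle, delimiter="\t")
    header = next(reader)
    samples = header[1:]
    matrix = {row[0]: [float(v) for v in row[1:]] for row in reader}
    return samples, matrix

samples, matrix = read_expression_matrix(io.StringIO(raw))
print(samples)         # ['A549', 'MCF7', 'HT29']
print(matrix["TP53"])  # [0.42, -1.1, 0.07]
```

A dictionary keyed by gene is enough here; once the matrix outgrows memory or needs concurrent access, the database packages named above become the natural next step.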
Task #2: To convert images in pixel form to raw
expression levels. Whether one is reading
radioisotopically tagged cDNA in a phosphorimager or measuring
fluorescent cDNA with a confocal scanner or CCD camera, it
is necessary to develop effective image processing
algorithms (see refs. 2, 3). The specifics depend on the type of
array and detection system used and the quality of the
images. As the technologies improve, uncertainties due to
such factors as inhomogeneity in the spots, irregular
background, scanner artifacts, photobleaching, and lack of
spatial registration between channels are diminishing.
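
One common quantitation convention is to take a spot's signal as the mean foreground pixel intensity minus the median of the local background pixels, then form the two-channel ratio. A minimal sketch with invented pixel values; the exact convention varies by array type and software package:

```python
import statistics

def spot_signal(fg_pixels, bg_pixels):
    """Mean foreground intensity minus median local background,
    floored at zero. One convention among several."""
    signal = statistics.fmean(fg_pixels) - statistics.median(bg_pixels)
    return max(signal, 0.0)

# Toy pixel lists for one spot in the two fluorescence channels.
cy5_fg, cy5_bg = [210, 230, 250, 240], [40, 42, 38, 44]
cy3_fg, cy3_bg = [120, 110, 130, 140], [39, 41, 40, 42]

red = spot_signal(cy5_fg, cy5_bg)    # 232.5 - 41.0 = 191.5
green = spot_signal(cy3_fg, cy3_bg)  # 125.0 - 40.5 = 84.5
ratio = red / green
print(round(ratio, 3))
```

Using the median, rather than the mean, of the background pixels makes the estimate robust to the scanner artifacts and irregular background mentioned above.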
Task #3: To examine the array images for quality
control. This important step is facilitated by software
packages that permit surveys of the array image at various
levels of resolution and permit individual spots to be
examined and compared visually.
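
Visual inspection can be pre-screened by simple automated flags that mark suspect spots for review. A hedged sketch; the thresholds (signal-to-background of 2, a 16-bit saturation ceiling) are illustrative assumptions, not values from the article:

```python
def qc_flags(fg_mean, bg_median, saturation=65535, min_snr=2.0):
    """Return quality-control flags for one spot.
    Thresholds are illustrative only."""
    flags = []
    # Flag spots whose foreground barely rises above local background.
    if bg_median <= 0 or fg_mean / bg_median < min_snr:
        flags.append("low signal-to-background")
    # Flag spots at the scanner's intensity ceiling (assumed 16-bit).
    if fg_mean >= saturation:
        flags.append("saturated")
    return flags

print(qc_flags(fg_mean=300.0, bg_median=200.0))   # ['low signal-to-background']
print(qc_flags(fg_mean=5000.0, bg_median=50.0))   # []
```

Flags like these only triage the array image; the final judgment on a borderline spot is still made by eye in the survey software.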
Task #4: To preprocess the expression-level data (i.e.,
filter, normalize, and/or standardize it). Generally, the
data must be filtered to eliminate flawed spots and genes
with insufficient patterns or differences among samples.
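
In minimal form, that filtering step might drop genes with too many flawed spots or too little variation, then fill the surviving gaps. The thresholds and the gene-mean imputation below are illustrative assumptions, not the authors' method:

```python
import statistics

def filter_and_impute(matrix, max_missing=1, min_range=0.5):
    """Drop genes with too many missing values (None) or too little
    variation across samples; impute remaining gaps with the gene's
    mean. Thresholds are illustrative only."""
    kept = {}
    for gene, values in matrix.items():
        present = [v for v in values if v is not None]
        if len(values) - len(present) > max_missing:
            continue  # too many flawed or missing spots
        if max(present) - min(present) < min_range:
            continue  # insufficient differences among samples
        mean = statistics.fmean(present)
        kept[gene] = [mean if v is None else v for v in values]
    return kept

toy = {
    "TP53": [0.4, None, 1.2, 0.8],   # one gap: kept, gap imputed
    "ACTB": [0.1, 0.2, 0.1, 0.15],   # nearly flat: dropped
    "MYC":  [None, None, 0.9, 1.6],  # too many gaps: dropped
}
result = filter_and_impute(toy)
print(sorted(result))  # ['TP53']
```

More sophisticated imputation (e.g., nearest-neighbor or model-based estimates) follows the same pattern but borrows information from genes with similar profiles.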
In the former case, it may be necessary, depending on the
nature of the intended analysis, to use statistical or
machine learning techniques to impute values for the missing
data. The next step is normalization, which usually has
been done in the case of two-color studies by tuning a
calibration factor, either on the basis of total gene
expression in the sample or on the basis of a housekeeping gene
*Correspondence to: John N. Weinstein, National Institutes of Health,
Bldg 37, Rm 4E-28, 9000 Rockville Pike, Bethesda, MD 20892
E-mail: weinstein@dtpax2.ncifcrf.gov
DOI 10.1002/cyto.10041