The Bioinformatics of Microarray Gene Expression Profiling

John N. Weinstein*, Uwe Scherf, Jae K. Lee, Satoshi Nishizuka, Fuad Gwadry, Ajay, Kim Bussey, S. Kim, Lawrence H. Smith, Lorraine Tanabe, Samuel Richman, Jessie Alexander, Hosein Kouros-Mehr, Alika Maunakea, and William C. Reinhold

Genomics and Bioinformatics Group, Laboratory of Molecular Pharmacology, National Cancer Institute, Bethesda, Maryland

Key terms: gene expression profiling; microarray; biochip; cDNA; oligonucleotide; clustering; clustered image map

Cytometry 47:46–49, 2002. © 2001 Wiley-Liss, Inc.

Gene expression profiling will revolutionize biology. That much is universally agreed. But it is harder than it looks. In part, the reasons are technical: substandard arrays, low signal-to-noise ratios for rare transcripts, variable backgrounds, cross-hybridizations, the difficulty of processing clinical materials, and so forth. More often, however, the reasons relate to analysis and interpretation of the data. Inevitably, more time and energy are spent after the experiments are finished than before.

We can identify a number of necessary tasks in the analysis of gene expression data, as summarized in Table 1. In the following capsule descriptions, we focus for concreteness on the two-color fluorescence technologies (1), but analogous steps are pertinent to one-color fluorescence and radioactive detection methods as well. With apologies to the many scientists who have been innovative in this field, we intend, in this short summary, to indicate requirements and options rather than to give a comprehensive review or to apportion credit for the various contributions. The examples focus primarily on studies from our laboratory.

Task #1: To establish the computer hardware, software, and personnel infrastructure for handling and analyzing gigabyte or terabyte databases. There must be somewhere to put the data, and there must be fluent systems for pulling information into the stream of analysis.
As data have outgrown Excel (Microsoft, Redmond, WA) spreadsheets, the most common, but by no means only, answers have been database packages such as Sybase (Sybase Inc., Emeryville, CA) or Oracle (Oracle Corporation, Redwood Shores, CA). Sometimes, however, flat file formats suffice. For many of the highly multivariate analyses to be discussed later, hardware speed and memory become significant issues. Most important, however, is the human infrastructure. Applied bioinformatics, broadly construed, is practiced by the biologist who is fluent in the use of public and proprietary database resources or who will perform data analyses, preferably under the supervision of a statistically trained individual. Fluency with database resources is something that every biologist should have; microarray data analysis is more specialized. What might be termed developmental bioinformatics involves the generation of new algorithms (principally by statisticians or those with expertise in machine learning) and the creation of new software (principally on the basis of expertise in computer science). Experience shows that the best analytical developments arise from close attention to needs arising from actual experimental data sets and biologic questions.

Task #2: To convert images in pixel form to raw expression levels. Whether one is reading radioisotopically tagged cDNA in a phosphorimager or measuring fluorescent cDNA with a confocal scanner or CCD camera, it is necessary to develop effective image processing algorithms (see 2,3). The specifics depend on the type of array and detection system used and the quality of the images. As the technologies improve, uncertainties due to such factors as inhomogeneity in the spots, irregular background, scanner artifacts, photobleaching, and lack of spatial registration between channels are diminishing.

Task #3: To examine the array images for quality control.
This important step is facilitated by software packages that permit surveys of the array image at various levels of resolution and permit individual spots to be examined and compared visually.

Task #4: To preprocess the expression-level data (i.e., filter, normalize, and/or standardize it). Generally, the data must be filtered to eliminate flawed spots and genes with insufficient patterns of difference among the samples. In the former case, it may be necessary, depending on the nature of the intended analysis, to use statistical or machine learning techniques to impute values for the missing data. The next step is normalization, which, in the case of two-color studies, has usually been done by tuning a calibration factor, either on the basis of total gene expression in the sample or on the basis of a housekeeping gene.

*Correspondence to: John N. Weinstein, National Institutes of Health, Bldg 37, Rm 4E-28, 9000 Rockville Pike, Bethesda, MD 20892. E-mail: weinstein@dtpax2.ncifcrf.gov

DOI 10.1002/cyto.10041
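As a concrete illustration of the filtering and normalization steps of Task #4, the following sketch computes normalized log-ratios for one two-color array. It is written in Python with NumPy; the intensity cutoff and the median-centering of log-ratios are illustrative assumptions standing in for the calibration-factor schemes (total gene expression or housekeeping genes) described above, not a specific published procedure.

```python
import numpy as np

def preprocess_two_color(red, green, flags, min_intensity=100.0):
    """Filter flawed spots and normalize a two-color microarray.

    red, green : raw per-spot channel intensities (1-D sequences)
    flags      : per-spot booleans, True where the spot is flawed
    Returns log2(red/green) per spot, median-centered; filtered
    spots come back as NaN (candidates for later imputation).
    The min_intensity cutoff is an illustrative choice.
    """
    red = np.asarray(red, dtype=float)
    green = np.asarray(green, dtype=float)
    flags = np.asarray(flags, dtype=bool)

    # Filtering: drop flagged spots and spots too dim in both channels.
    keep = ~flags & ((red > min_intensity) | (green > min_intensity))
    valid = keep & (red > 0) & (green > 0)

    log_ratio = np.full(red.shape, np.nan)
    log_ratio[valid] = np.log2(red[valid] / green[valid])

    # Normalization: center so the typical gene has log-ratio 0,
    # a simple stand-in for tuning a global calibration factor.
    log_ratio[valid] -= np.median(log_ratio[valid])
    return log_ratio
```

In this sketch, filtered spots are returned as NaN rather than dropped, so downstream code can decide whether to impute them (as discussed for Task #4) or exclude them gene by gene.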