The Bioinformatics of Microarray Gene Expression Profiling

John N. Weinstein*, Uwe Scherf, Jae K. Lee, Satoshi Nishizuka, Fuad Gwadry, Ajay, Kim Bussey, S. Kim, Lawrence H. Smith, Lorraine Tanabe, Samuel Richman, Jessie Alexander, Hosein Kouros-Mehr, Alika Maunakea, and William C. Reinhold

Genomics and Bioinformatics Group, Laboratory of Molecular Pharmacology, National Cancer Institute, Bethesda, Maryland

Key terms: gene expression profiling; microarray; biochip; cDNA; oligonucleotide; clustering; clustered image map

Cytometry 47:46–49, 2002. © 2001 Wiley-Liss, Inc.

Gene expression profiling will revolutionize biology. That much is universally agreed. But it is harder than it looks. In part, the reasons are technical: substandard arrays, low signal-to-noise ratios for rare transcripts, variable backgrounds, cross-hybridizations, the difficulty of processing clinical materials, and so forth. More often, however, the reasons relate to analysis and interpretation of the data. Inevitably, more time and energy are spent after the experiments are finished than before.

We can identify a number of necessary tasks in the analysis of gene expression data, as summarized in Table 1. In the following capsule descriptions, we focus for concreteness on the two-color fluorescence technologies (1), but analogous steps are pertinent to one-color fluorescence and radioactive detection methods as well. With apologies to the many scientists who have been innovative in this field, we intend, in this short summary, to indicate requirements and options rather than to give a comprehensive review or to apportion credit for the various contributions. The examples focus primarily on studies from our laboratory.

Task #1: To establish the computer hardware, software, and personnel infrastructure for handling and analyzing gigabyte or terabyte databases. There must be somewhere to put the data, and there must be fluent systems for pulling information into the stream of analysis.
As data have outgrown Excel (Microsoft, Redmond, WA) spreadsheets, the most common, but by no means only, answers have been database packages such as Sybase (Sybase Inc., Emeryville, CA) or Oracle (Oracle Corporation, Redwood Shores, CA). Sometimes, however, flat file formats suffice. For many of the highly multivariate analyses to be discussed later, hardware speed and memory become significant issues. Most important, however, is the human infrastructure. Applied bioinformatics, broadly construed, is practiced by the biologist who is fluent in the use of public and proprietary database resources or who will perform data analyses, preferably under the supervision of a statistically trained individual. Fluency with database resources is something that every biologist should have; microarray data analysis is more specialized. What might be termed developmental bioinformatics involves the generation of new algorithms (principally by statisticians or those with expertise in machine learning) and the creation of new software (principally on the basis of expertise in computer science). Experience shows that the best analytical developments arise from close attention to needs arising from actual experimental data sets and biologic questions.

Task #2: To convert images in pixel form to raw expression levels. Whether one is reading radioisotopically tagged cDNA in a phosphorimager or measuring fluorescent cDNA with a confocal scanner or CCD camera, it is necessary to develop effective image processing algorithms (see 2,3). The specifics depend on the type of array and detection system used and the quality of the images. As the technologies improve, uncertainties due to such factors as inhomogeneity in the spots, irregular background, scanner artifacts, photobleaching, and lack of spatial registration between channels are diminishing.

Task #3: To examine the array images for quality control.
This important step is facilitated by software packages that permit surveys of the array image at various levels of resolution and permit individual spots to be examined and compared visually.

Task #4: To preprocess the expression-level data (i.e., filter, normalize, and/or standardize it). Generally, the data must be filtered to eliminate flawed spots and genes with insufficient patterns of difference among the samples. In the former case, it may be necessary, depending on the nature of the intended analysis, to use statistical or machine learning techniques to impute values for the missing data. The next step is normalization, which, in the case of two-color studies, has usually been done by tuning a calibration factor, either on the basis of total gene expression in the sample or on the basis of a housekeeping gene.

*Correspondence to: John N. Weinstein, National Institutes of Health, Bldg 37, Rm 4E-28, 9000 Rockville Pike, Bethesda, MD 20892. E-mail: weinstein@dtpax2.ncifcrf.gov

DOI 10.1002/cyto.10041
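As a concrete illustration of the filtering and normalization steps of Task #4, the following sketch computes normalized log-ratios for one two-color array. It is written in Python with NumPy; the intensity cutoff and the median-centering of log-ratios are illustrative assumptions standing in for the calibration-factor schemes (total gene expression or housekeeping genes) described above, not a specific published procedure.

```python
import numpy as np

def preprocess_two_color(red, green, flags, min_intensity=100.0):
    """Filter flawed spots and normalize a two-color microarray.

    red, green : raw per-spot channel intensities (1-D sequences)
    flags      : per-spot booleans, True where the spot is flawed
    Returns log2(red/green) per spot, median-centered; filtered
    spots come back as NaN (candidates for later imputation).
    The min_intensity cutoff is an illustrative choice.
    """
    red = np.asarray(red, dtype=float)
    green = np.asarray(green, dtype=float)
    flags = np.asarray(flags, dtype=bool)

    # Filtering: drop flagged spots and spots too dim in both channels.
    keep = ~flags & ((red > min_intensity) | (green > min_intensity))
    valid = keep & (red > 0) & (green > 0)

    log_ratio = np.full(red.shape, np.nan)
    log_ratio[valid] = np.log2(red[valid] / green[valid])

    # Normalization: center so the typical gene has log-ratio 0,
    # a simple stand-in for tuning a global calibration factor.
    log_ratio[valid] -= np.median(log_ratio[valid])
    return log_ratio
```

In this sketch, filtered spots are returned as NaN rather than dropped, so downstream code can decide whether to impute them (as discussed for Task #4) or exclude them gene by gene.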