NORMALIZATION, BASELINE CORRECTION AND ALIGNMENT OF HIGH-THROUGHPUT MASS SPECTROMETRY DATA

Anne C. Sauve# and Terence P. Speed#*

#Department of Statistics, University of California, Berkeley
*Division of Genetics and Bioinformatics, The Walter and Eliza Hall Institute, Australia

ABSTRACT

We propose several preprocessing steps to be applied to high-throughput Mass Spectrometry (MS) data before biomarker clustering or classification. These preprocessing steps are multiple alignment of technical replicates, baseline correction, and normalization along the mass/charge axis. While the benefits of baseline correction and alignment seem obvious, we studied the benefit of normalization more carefully, using human prostate cancer SELDI-TOF MS data (obtained from the Virginia Prostate Center Tissue and Body Fluid Bank and approved by the Eastern Virginia Medical School). We show on these data that our global normalization by scaling helps to distinguish between different cancer groups as well as between cancer and non-cancer groups. We used the between-to-within sum of squares ratio introduced by Fisher, as well as visual inspection, to illustrate the improvement brought by the normalization.

1. INTRODUCTION

Retentate chromatography has been established as an important method for fractionating biological fluid samples. Chromatographic systems can be coupled to more sophisticated detection devices such as mass spectrometers. Together they can generate voluminous spectra, characterized by a retention axis and a mass axis and quantified by the measured intensities [11]. Today's high-throughput MS detectors can generate many such spectra per patient. Because of non-linearity in the detector response, ionization suppression, minor changes in the mobile phase composition, and interactions between analytes, undesirable variation may be introduced into the MS data.
To analyze the entire collected data set, the replicate profiles within each group must first be aligned. In this paper we present three preprocessing steps that seem particularly important for the further analysis of MS data: multiple alignment, baseline correction and normalization. The next sections provide more details about these three steps. These three steps have been addressed independently by [14].

2. DATA

The data are taken from [1], where they are fully described. To summarize: the serum samples were obtained from the Virginia Prostate Center Tissue and Body Fluid Bank. The patient cohort was drawn from the Department of Urology, Eastern Virginia Medical School, and the cohort of healthy men was obtained from free screening clinics open to the general public.

2.1. SELDI-TOF technology

The protein profiles were obtained by surface-enhanced laser desorption/ionization time-of-flight (SELDI-TOF) mass spectrometry, using the Ciphergen Biosystems, Inc. platform. Retentate chromatography is performed on the Ciphergen protein chips by varying the chromatographic properties through different affinity surface chemistries (e.g. anion exchange, cation exchange). For this study, only the IMAC-Cu (Immobilized Metal Affinity Capture) metal-binding protein chips were selected, because they give good serum profiles in terms of the number and resolution of proteins.

2.2. Quality control and data selection

Despite the use of the same IMAC-Cu metal-binding protein chip for all the profiles, we observed substantial differences among profiles, even within the same treatment group. We therefore removed the spectra that were visually too different from a previously chosen target spectrum, keeping only 5 spectra in each group. In a clinical study setting, stringent QC procedures have proven necessary for the SELDI-TOF IMAC ProteinChip [4].

3. MULTIPLE PAIRWISE ALIGNMENT OF MS SPECTRA USING DYNAMIC PROGRAMMING

As mentioned earlier, drifts along the mass axis that do not reflect any real sample variation are introduced between replicates. These drifts need to be corrected. Our proposed dynamic programming (DP) multiple alignment method addresses this problem.
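
To give a rough feel for how dynamic programming can absorb a drift along the mass axis, the following Python sketch aligns one replicate to a reference using classic dynamic time warping. It is only an illustration under simplifying assumptions and not the specific method proposed in this paper: the function names `dtw_align` and `warp_to_reference`, the absolute-difference cost, and the averaging of intensities mapped to the same reference index are all hypothetical choices made for the example.

```python
import numpy as np

def dtw_align(reference, sample):
    """Align two intensity vectors by dynamic time warping.

    Returns the total alignment cost and the warping path as a list of
    (reference_index, sample_index) pairs. Illustrative only.
    """
    ref = np.asarray(reference, dtype=float)
    smp = np.asarray(sample, dtype=float)
    n, m = len(ref), len(smp)

    # cost[i, j] = best cost of aligning ref[:i] with smp[:j]
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(ref[i - 1] - smp[j - 1])  # assumed local cost
            cost[i, j] = d + min(cost[i - 1, j - 1],  # match
                                 cost[i - 1, j],      # stretch sample
                                 cost[i, j - 1])      # stretch reference

    # Backtrack from the end of both spectra to recover the path.
    path = []
    i, j = n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        steps = (cost[i - 1, j - 1], cost[i - 1, j], cost[i, j - 1])
        k = int(np.argmin(steps))
        if k == 0:
            i, j = i - 1, j - 1
        elif k == 1:
            i -= 1
        else:
            j -= 1
    path.reverse()
    return cost[n, m], path

def warp_to_reference(reference, sample):
    """Map the sample onto the reference index grid along the DTW path,
    averaging sample intensities that land on the same reference index."""
    _, path = dtw_align(reference, sample)
    smp = np.asarray(sample, dtype=float)
    sums = np.zeros(len(reference))
    counts = np.zeros(len(reference))
    for i, j in path:
        sums[i] += smp[j]
        counts[i] += 1
    return sums / counts

# Toy replicate pair: the same peak, shifted by one position.
reference = [0, 0, 1, 5, 1, 0, 0, 0]
sample = [0, 0, 0, 1, 5, 1, 0, 0]
total_cost, path = dtw_align(reference, sample)
warped = warp_to_reference(reference, sample)
```

In this toy case the shifted peak is brought back onto the reference position. A multiple alignment of several replicates could, for instance, repeat such a pairwise step against a common target spectrum, but that is again an assumption of this sketch rather than a description of the proposed procedure.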