e-Science Environment for Objective Analysis of Flow Cytometry Data

Kaja M. Abbas, Yeonok Lee, Hulin Wu
University of Rochester, Rochester NY 14642
Email: {kaja abbas, yeonok lee}@urmc.rochester.edu, wu@bst.rochester.edu

Gregor von Laszewski
Rochester Institute of Technology, Rochester, NY 14623
Email: laszewski@gmail.com

Abstract

Flow cytometry is an analytical tool used in biomedical experiments of high-throughput assays. Polychromatic flow cytometers have improved in instrumentation to measure more than 20 parameters per cell. Gating analysis of sub-populations in the multivariate data is yet to match the progress in instrumentation. The subjective nature of gating by different analysts leads to varied results and statistical analysis for the same data sets. This bioinformatics problem needs to be addressed by formulating standards for analysis and statistical inferences. A web-based portal will allow access to methods that address the computational complexity of flow cytometry data analysis as well as provide a common framework for standardized analysis across multiple research labs at remote geographic locations. We are analyzing and comparing objective clustering methods for analyzing flow cytometry data. Cloud and grid computing provide the requisite high performance computing framework to support the analysis through the web portal and a cloudshell.

1 Introduction

Flow cytometry (FCM) is an important laboratory technique for studying cells and cellular processes in a spectrum of fields, including immunology and marine biology [1]. Flow cytometry provides both quantitative and qualitative information on fluorescently tagged cells. Fluorescence for each cell is measured as individual cells pass through multiple lasers in a fluid stream. This is useful to study the physical and chemical properties of single cells. Modern flow cytometers resolve 20 or more fluorescent parameters for millions of individual cells [2].
This power provides unparalleled ability to probe multiple parameters; for example, simultaneously determining different surface immunophenotypes.

Two major data processing functions are performed on the raw fluorescence and light scattering data, namely compensation and sequential gating. Compensation corrects for spectral and cross-laser overlap between different fluorophores in each channel. Automated compensation is managed adequately in current software packages. Sequential gating narrows down the analysis to specific cell sub-populations. However, state-of-the-art gating analysis is done by subjective manual decision making rather than standardized statistical analysis.

The subjective nature of manual gating leads to large variations between different analysts of variable skill levels. Manual gating is re-evaluated on each sample, and serious errors may result without review of the entire gating strategy for all samples. Manual gating on successive univariate or bivariate plots ignores the multivariate nature of the data and the corresponding joint distribution of multiple parameters.

The primary bottlenecks in FCM data analysis are the subjective and labor-intensive manual gating process and the non-reproducibility of results between different analysts. The field of flow cytometry needs objective and automated gating methods to establish standardized, reproducible results and to enable high-throughput data analysis, so that useful and meaningful information can be inferred efficiently from FCM data [3].

2 Background

Flow cytometry and microarrays generate large datasets of similar dimensions, though in a sense the dimensions are transposes of each other. Microarrays have tens of thousands of gene predictor variables and dozens of observations, whereas FCM measures dozens of parameters on millions of cells. However, statistical analysis of FCM data has had limited impact compared to microarrays.
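As a concrete illustration of the two processing steps named in the Introduction, the following minimal sketch applies compensation to a two-channel toy data set and then gates it sequentially. It is not the analysis pipeline of this paper: the linear spillover model, the spillover values, the event intensities, the gate thresholds, and all function names are assumptions chosen for illustration.

```python
# Illustrative sketch only. Spillover values, event intensities, gate
# thresholds, and function names are hypothetical, not from the paper.

def invert_2x2(m):
    """Return the inverse of a 2x2 matrix given as nested lists."""
    (a, b), (c, d) = m
    det = a * d - b * c
    if det == 0:
        raise ValueError("spillover matrix is singular")
    return [[d / det, -b / det], [-c / det, a / det]]


def compensate(event, spillover):
    """Recover per-fluorophore signal from observed detector values.

    Assumed linear model: observed = true @ spillover,
    hence true = observed @ inverse(spillover).
    """
    inv = invert_2x2(spillover)
    return [
        event[0] * inv[0][0] + event[1] * inv[1][0],
        event[0] * inv[0][1] + event[1] * inv[1][1],
    ]


def sequential_gate(events, gates):
    """Apply rectangular gates one after another.

    Each gate is (channel_index, low, high). Returns the indices of the
    events that fall inside every gate.
    """
    kept = list(range(len(events)))
    for channel, low, high in gates:
        kept = [i for i in kept if low <= events[i][channel] <= high]
    return kept


# spillover[i][j]: fraction of fluorophore i's signal seen in detector j
SPILLOVER = [[1.0, 0.2], [0.1, 1.0]]

# Observed (uncompensated) detector values for three toy events
observed_events = [
    [1000.0, 200.0],   # only fluorophore A present
    [50.0, 500.0],     # only fluorophore B present
    [1100.0, 1200.0],  # both fluorophores present
]

# Gate 1 selects A-positive events; gate 2 then selects B-negative events
GATES = [(0, 500.0, 2000.0), (1, -100.0, 100.0)]

compensated_events = [compensate(e, SPILLOVER) for e in observed_events]
gated_raw = sequential_gate(observed_events, GATES)
gated_compensated = sequential_gate(compensated_events, GATES)

print("A+B- events without compensation:", gated_raw)
print("A+B- events with compensation:", gated_compensated)
```

In this toy case the A-positive, B-negative event is recovered only after compensation, because spillover from fluorophore A inflates its raw B-channel reading. The gate thresholds are fixed by hand here, which is exactly the subjective choice that objective clustering methods aim to replace.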
The computing platform to conduct the necessary calculations has been limited to desktop computing for FCM data analysis, while