Comparing and Clustering Flow Cytometry Data

Lin Liu, Li Xiong, James J. Lu
Department of Math/CS
Emory University
lliu24, lxiong, jlu@mathcs.emory.edu

Kim M. Gernert
BimCore
Emory University
gernert@emory.edu

Vicki Hertzberg
Department of Biostatistics
Emory University
vhertzb@sph.emory.edu

Abstract

Flow cytometry technique produces large, multi-dimensional datasets of properties of individual cells that are helpful for biomedical science and clinical research. This paper explores an approach for comparing and clustering flow cytometry data. To overcome challenges posed by the irregularities and the high dimensions of the data, we develop a set of data preprocessing techniques to facilitate effective clustering of flow cytometry data files. We present a set of experiments using real data from the Protective Immunity Project (PIP) showing the effectiveness of the approach.

1. Introduction

Flow Cytometry (FCM) [7] is a technique used in clinical research for studying the immunological status of patients with vaccines or other immunotherapies, for characterizing cancer, HIV/AIDS infection and other diseases, as well as for research and therapy involving stem cell manipulations. The technique measures the characteristics of single cells, determined by visible and fluorescent light emissions from the markers on the cells. As the liquid flow moves the suspended, labeled cells past a laser that emits light at a particular wavelength, the specific markers attached to the cell fluoresce. The fluorescence emission from each cell is collected, and subsequent electrical events are analyzed on a computer that assigns a fluorescence intensity value to each signal in Flow Cytometry Standard (FCS) data files. Each FCS data file thus consists of multi-parametric descriptions of thousands to millions of individual cells.
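To make the shape of such a file concrete, the sketch below models one sample as a table with one row per cell and one column per channel. It is only an illustration under stated assumptions: the channel names, the cell count, and the log-normal intensity values are all made up, and no real FCS file is parsed.

```python
import random

# Hypothetical channel names, chosen for illustration only; a real FCS
# file declares its own parameters.
CHANNELS = ["FSC", "SSC", "FL1", "FL2"]


def make_synthetic_sample(n_cells, channels, seed=0):
    """Return a list of n_cells rows, each holding one value per channel.

    The log-normal values are a stand-in for measured intensities and
    carry no biological meaning.
    """
    rng = random.Random(seed)
    return [
        [rng.lognormvariate(5.0, 1.0) for _ in channels]
        for _ in range(n_cells)
    ]


# One synthetic "sample": 10,000 cells measured on 4 channels.
sample = make_synthetic_sample(10_000, CHANNELS)

# The multi-parametric description of a single cell, keyed by channel.
first_cell = dict(zip(CHANNELS, sample[0]))
```

A real sample has the same layout but can hold up to millions of rows, which is what makes direct inspection impractical.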
How such large sets of data points in a highly multidimensional space can be efficiently and systematically analyzed represents a basic yet important challenge. The classical method of analyzing cytometry data files is to plot the different combinations of channels (parameters) two at a time in a 2D scatter plot and then select subgroups of cells using gates. The gates are regions that can have any shape, but usually are rectangular. The cells within the gate are included for further analysis and viewed in another 2D scatter plot with different axes, i.e., other channels. Comparisons between different cytometry data files often depend on human inspection of such 2D visualizations of all the different combinations of channels [4]. The major disadvantage of this method is that it is not only tedious but can also miss potential subgroups of cells, because projecting higher-dimensional data down to two-dimensional spaces can make them indiscernible as separate clusters. In addition, the shape, size and position of the gate are largely dependent on the experience and expectations of the researcher.

* The research is partially supported by the Protective Immunity Project through the NIH grant NO1-AI-50025.

Recently, some clustering algorithms and tools have been developed for FCM data that cluster cells into cell groups based on their intensity patterns using all dimensions (channels) at once [11, 8, 1]. However, comparison between different cytometry data files still remains a challenge, as it is not straightforward how the clusters of cells can be directly compared across different data files.

In this paper, we explore an approach for comparing and clustering FCS data files, or samples. Such sample-based clustering presents a number of challenges due to the high dimensions and irregularities of the data.
First, while there may be only tens to hundreds of samples available, the space of potential features consists of thousands to millions of cell intensity data values at multiple channels. This induces an extraordinarily large search space for the parameters of the model. Second, the cells are not ordered uniformly across samples; they may be in any random order. This makes feature modeling a challenge as the data (cell intensity values) are not directly comparable across samples because of the unknown order of the cells.

To address the above challenges, we developed a set of data preprocessing techniques to facilitate effective clustering of FCS data files. The key is to summarize each FCS data file in a way so that they can be compared and clustered